AMG-UC Annotated Bibliography
From IntraLibrary
This annotated bibliography is maintained as part of the AMG-UC project.
AMeGA (Automatic Metadata Generation Applications) Project: Final Report
Authors: Jane Greenberg (Principal Investigator), Kristina Spurgin, and Abe Crystal
Appendix A Authors: Michelle Cronquist and Amanda Wilson
Submitted to the Library of Congress, February 17, 2005
Web source: [1]
Building Quality Assurance into Metadata Creation
an Analysis based on the Learning Objects and e-Prints Communities of Practice
Authors: Jane Barton, Sarah Currier, Jessie M. N. Hey
Presented at: International Conference on Dublin Core and Metadata Applications, Seattle, 2003
Web source: [2]
Annotations on Building Quality Assurance...
MetaTools Final Report
Author: Malcolm Polfreman
Submitted to JISC, 30 October, 2008
Web source: [3]
Metadata Generation for Resource Discovery
Authors: Malcolm Polfreman, Vanda Broughton, Andrew Wilson
Submitted to JISC, February 2008
Web source: [4]
Annotations on Metadata Generation for Resource Discovery
System for Computer-aided Metadata Creation
Authors: Marek Hatala, Steven Forth
Presented at WWW 2003
Web source: [5]
Annotations on Computer-aided Metadata Creation
Automated Digital Libraries
How Effectively Can Computers Be Used for the Skilled Tasks of Professional Librarianship?
Author: William Y. Arms
Published in D-Lib Magazine, July/August 2000
Web source: [6]
Annotations on Automated Digital Libraries
Toward a Metadata Generation Framework
A Case Study at Johns Hopkins University
Authors: Mark Patton, David Reynolds, G. Sayeed Choudhury, Tim DiLauro
Published in D-Lib Magazine, November 2004
Web source: [7]
Annotations on Toward a Metadata Generation Framework
Ontology mappings to improve learning resource search
Authors: Dragan Gašević, Marek Hatala
Published in British Journal of Educational Technology, May 2006
Web source: [8]
Annotations on Ontology Mappings...
Feasibility study into approaches to improve the consistency with which repositories share material
Authors: Andrew Charlesworth, Nicky Ferguson, Eric Lease Morgan, Seb Schmoller, Neil Smith, David Zeitlyn
Submitted to JISC, 7 November 2008
Web source: [9]
Annotations on Approaches to Improve the Consistency...
Automatic Metadata Generation: the Simple Indexing Interface
Authors: Kris Cardinaels, Michael Meire, Erik Duval
Presented at WWW 2005
Web source: [10]
Annotations on Automatic Metadata Generation: the Simple Indexing Interface
Automated Metadata
A review of existing and potential metadata automation within Jorum and an overview of other automation systems
Authors: Kenny Baird, Jorum Team
Submitted to JISC, July 2006
Web source: [11]
Annotations on Automated Metadata
Automatic Metadata Creation for Supporting Interoperability Levels of Spatial Data Infrastructures
Authors: M. Manso-Callejo, M. Wachowicz, M. Bernabé-Poveda
Presented at GSDI (Global Spatial Data Infrastructure) 11 World Conference, June 2009
Web source: [12]
Networking Names
Author: Karen Smith-Yoshimura
Published by OCLC Research, April 2009
Web source: [13]
Annotations on Networking Names
Some Books on Text Mining
I have found the following four books helpful. They have enabled me to learn about the principles of text mining -- a superset of functionality that includes automatic metadata generation (AMG).
- Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. - Of all the books listed here, this one includes the most Perl programming examples, and it is not as scholarly as the rest of the list. Much of the book is devoted to applying regular expressions to texts. Its strongest suit is the creation of terminal-based concordance scripts. Very nice. Lots of fun. The concordances return very interesting results. The book does describe clustering techniques too, but on the overall topic of automatic metadata generation the book is not very strong.
- Konchady, M. (2006). Text mining application programming. Charles River Media programming series. Boston, Mass: Charles River Media. - This book is a readable survey of text mining covering part-of-speech (POS) tagging, information extraction, search engines, clustering, classification, summarization, and question/answer processing. Many models for each aspect of text mining are described, compared, and contrasted. To put the author's knowledge into practice, the book comes with a CD containing a Perl library for text mining, sample applications, and CGI scripts. This library is freely available on the Web. The chapters on information extraction and summarization will be of most interest to readers of this wiki.
- Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. - Of the four books listed here, this one is probably the most dense. I found its Perl scripts for parsing text more useful than the ones in Bilisoly, but this one includes no concordance applications. I also found the description of n-grams, used here to extract two-word phrases, very interesting. I suspect the model it describes can be extended to phrases of any number of words. This book also discusses part-of-speech (POS) processing, but it is the only one of the four that describes how to really parse language. Think semantics, lexicons, discourse, and dialog. After the first couple of chapters the Perl examples disappear and give way to Prolog examples exclusively.
- Weiss, S. M. (2005). Text mining: Predictive methods for analyzing unstructured information. New York: Springer. - The complexity of this book lies between Konchady and Nugues; it includes a greater number of mathematical models than Konchady, but it is easier to read than Nugues. Broad topics include textual documents as numeric vectors, using text for prediction, information retrieval, clustering & classification, and looking for information in documents. Each chapter includes a section called "Historical and Bibliographical Remarks" which has proved to be very interesting reading.
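The terminal-based concordances mentioned in the Bilisoly entry can be illustrated with a minimal sketch. It is written in plain Python (standard library only) and is not taken from any of the books listed; the function name `concordance` and the `width` parameter are assumptions made for this example. It prints each occurrence of a keyword with a fixed number of characters of context on either side.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context lines: each whole-word match of `keyword`
    with up to `width` characters of context on either side."""
    flat = " ".join(text.split())  # collapse newlines and runs of spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines

if __name__ == "__main__":
    sample = (
        "Metadata describes a resource. Automatic metadata generation "
        "tries to create that metadata from the resource itself."
    )
    for line in concordance(sample, "metadata", width=25):
        print(line)
```

Because the keywords are aligned in a single column, recurring neighbours of a term are easy to spot by eye, which is much of what makes a concordance useful for judging what a document is about.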
When it comes to the process of automatic metadata generation (AMG) I found each of these books useful in its own right. Each provided me with ways of reading texts, parsing texts, counting words, counting phrases, and, through the application of statistical analysis, creating lists and readable summaries denoting the "aboutness" of given documents.
(Written by Eric Lease Morgan, June 2, 2009)
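The word and phrase counting described above can be sketched in a few lines. This is a minimal illustration in plain Python, not code from any of the four books; the tiny `STOP_WORDS` set and the function names are assumptions made for the example. It counts single words and n-word phrases (n-grams), with two-word phrases as the default.

```python
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "it", "of", "on", "or", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def count_words(tokens):
    """Count each token that is not a stop word."""
    return Counter(t for t in tokens if t not in STOP_WORDS)

def count_ngrams(tokens, n=2):
    """Count runs of n adjacent tokens, skipping runs that start or end
    with a stop word. Sentence boundaries are ignored in this sketch."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(
        " ".join(g) for g in grams
        if g[0] not in STOP_WORDS and g[-1] not in STOP_WORDS
    )

if __name__ == "__main__":
    tokens = tokenize("Text mining is fun. Text mining of the texts is text mining.")
    print(count_words(tokens).most_common(3))
    print(count_ngrams(tokens, 2).most_common(3))
```

Raising `n` extends the same approach from two-word phrases to longer ones. The most frequent words and phrases that survive the stop word filter are a crude first approximation of a document's "aboutness".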
Some Perl modules useful for automatic metadata generation
As a Perl hacker I am interested in writing Perl scripts that put into practice some of the things I learn. Listed here are a number of modules that have gotten me further along in regard to text mining, and all of them are to some degree or another useful for automatic metadata generation:
- Lingua::EN::Fathom - This library outputs interesting statistics regarding a given document: number of words and the number of times each occurs, number of sentences, complexity of words, number of paragraphs, etc. Of greatest interest are numbers (Fog, Flesch, and Flesch-Kincaid) denoting the readability of the text. Quick. Easy. Useful.
- Lingua::EN::Keywords - Given a text, this library outputs a list of what it thinks are the most significant individual words in a document, sans stop words. Not fancy.
- Lingua::EN::NamedEntity - I believe this library comes pre-trained to extract names, places, and organizations from a given text. It returns a Perl data structure listing the probabilities of a word or phrase being any particular entity. It may need to be re-trained to work for your corpus.
- Lingua::EN::Semtags::Engine - Given a text, this module will return words and phrases in relevance-ranked order. Initially, I have had some problems using this module because it seems to take a long time to return. On the other hand, it looks promising since it returns both individual words and phrases.
- Lingua::EN::Summarize - Given a text, this library returns sentences it thinks encapsulate the essence of the document. The result is readable -- grammatically correct. The process it uses to accomplish its task is self-proclaimed as unscientific.
- Lingua::EN::Tagger - This library marks up a document in pseudo-XML with tags denoting the parts of speech. To do this work it can also extract words, noun phrases, and sentences from a text. Zippy. Probability-based. Developers are expected to parse the tagged output and do analysis against it, such as counting the number of times particular parts of speech occur.
- Lingua::StopWords - Returns a simple list of stop words. Easy, but I can't figure out how customizable it is. "One person's stop word list is another person's research topic."
- Net::Dict - A network interface to DICT (dictionary) servers. While the DICT protocol is a bit long in the tooth, and not quite as cool as Web interfaces to things like Google or Wikipedia, this module does provide a handy way to look up definitions, a functionality complementary to WordNet.
- Text::Aspell - A Perl interface to GNU Aspell, which is great for spell-checking applications.
- TextMine - This is a set of modules written by Manu Konchady, the author of Text Mining Application Programming. It includes submodules named Cluster, Entity, Index, Pos, Quanda (Q & A), Summary, Tokens, and WordNet. While this set of modules is the most comprehensive I've seen, and while they are probably the most theoretically grounded (interfacing with things like WordNet to be thorough), my initial experience has been a bit frustrating since scripts written against the libraries do not return very quickly. Maybe I'm feeding them documents that are too large; if so, then the libraries are not necessarily scalable.
- WordNet - There is a bevy of modules providing functionality against WordNet -- a "lexical database of English... Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations." Any truly thorough text mining application will take advantage of WordNet.
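The readability numbers mentioned in the Lingua::EN::Fathom entry (Fog, Flesch, and Flesch-Kincaid) are simple functions of sentence length and word length. The following is a minimal sketch in plain Python using the standard published formulas; it is not Fathom's implementation, and the syllable counter is a rough vowel-group heuristic assumed for the example, so the scores will differ somewhat from the module's.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption of this sketch): count vowel groups,
    discounting a trailing silent 'e' except in words ending in 'le'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text):
    """Return Flesch Reading Ease, Flesch-Kincaid grade, and Gunning Fog."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text contains no words")
    syllables = [count_syllables(w) for w in words]
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    # "Complex" words are approximated as those with three or more syllables.
    complex_share = sum(1 for s in syllables if s >= 3) / len(words)
    return {
        "flesch": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        "fog": 0.4 * (words_per_sentence + 100 * complex_share),
    }

if __name__ == "__main__":
    for name, value in readability("The cat sat on the mat. The dog ran.").items():
        print(f"{name}: {value:.2f}")
```

Higher Flesch scores indicate easier text, while Flesch-Kincaid and Fog approximate a school grade level. Stored alongside a record, such a score could serve as a generated metadata element describing the intended audience.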
(Written by Eric Lease Morgan, June 2, 2009)
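Sentence-level summarization of the kind Lingua::EN::Summarize offers can be approximated with word frequencies alone. The sketch below, in plain Python, is a swapped-in technique for illustration and not the method that module uses: it scores each sentence by the average document-wide frequency of its non-stop words and returns the top-scoring sentences in their original order. The `STOP_WORDS` set and the `summarize` function are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption; real lists are far longer).
STOP_WORDS = {"a", "an", "and", "are", "as", "from", "in", "is", "it",
              "of", "on", "or", "the", "to", "was"}

def content_words(sentence):
    """Lowercased alphabetic words of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in document order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)

if __name__ == "__main__":
    text = ("Metadata describes documents. The weather was nice. "
            "Automatic metadata generation creates metadata from documents.")
    print(summarize(text, max_sentences=2))
```

Averaging rather than summing the frequencies keeps long sentences from winning on length alone. Because whole sentences are returned unchanged, the result stays grammatical, which matches the behavior described for the Perl module above.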
