AMG-UC Annotated Bibliography



This annotated bibliography is maintained as part of the AMG-UC project.


AMeGA (Automatic Metadata Generation Applications) Project: Final Report

Authors: Jane Greenberg (Principal Investigator), Kristina Spurgin, and Abe Crystal

Appendix A Authors: Michelle Cronquist and Amanda Wilson

Submitted to the Library of Congress, February 17, 2005

Web source: [1]

Annotations on AMeGA

Building Quality Assurance into Metadata Creation

an Analysis based on the Learning Objects and e-Prints Communities of Practice

Authors: Jane Barton, Sarah Currier, Jessie M. N. Hey

Presented at: International Conference on Dublin Core and Metadata Applications, Seattle, 2003

Web source: [2]

Annotations on Building Quality Assurance...

MetaTools Final Report

Author: Malcolm Polfreman

Submitted to JISC, 30 October 2008

Web source: [3]

Annotations on MetaTools

Metadata Generation for Resource Discovery

Authors: Malcolm Polfreman, Vanda Broughton, Andrew Wilson

Submitted to JISC, February 2008

Web source: [4]

Annotations on Metadata Generation for Resource Discovery

System for Computer-aided Metadata Creation

Authors: Marek Hatala, Steven Forth

Presented at WWW 2003

Web source: [5]

Annotations on Computer-aided Metadata Creation

Automated Digital Libraries

How Effectively Can Computers Be Used for the Skilled Tasks of Professional Librarianship?

Author: William Y. Arms

Published in D-Lib Magazine, July/August 2000

Web source: [6]

Annotations on Automated Digital Libraries

Toward a Metadata Generation Framework

A Case Study at Johns Hopkins University

Authors: Mark Patton, David Reynolds, G. Sayeed Choudhury, Tim DiLauro

Published in D-Lib Magazine, November 2004

Web source: [7]

Annotations on Toward a Metadata Generation Framework

Ontology mappings to improve learning resource search

Authors: Dragan Gašević, Marek Hatala

Published in British Journal of Educational Technology, May 2006

Web source: [8]

Annotations on Ontology Mappings...

Feasibility study into approaches to improve the consistency with which repositories share material

Authors: Andrew Charlesworth, Nicky Ferguson, Eric Lease Morgan, Seb Schmoller, Neil Smith, David Zeitlyn

Submitted to JISC, 7 November 2008

Web source: [9]

Annotations on Approaches to Improve the Consistency...

Automating Metadata Generation: the Simple Indexing Interface

Authors: Kris Cardinaels, Michael Meire, Erik Duval

Presented at WWW 2005

Web source: [10]

Annotations on Automating Metadata Generation: the Simple Indexing Interface

Automated Metadata

A review of existing and potential metadata automation within Jorum and an overview of other automation systems

Authors: Kenny Bair, Jorum Team

Submitted to JISC, July 2006

Web source: [11]

Annotations on Automated Metadata


Automatic Metadata Creation for Supporting Interoperability Levels of Spatial Data Infrastructures

Authors: M. Manso-Callejo, M. Wachowicz, M. Bernabé-Poveda

Presented at GSDI (Global Spatial Data Infrastructure) 11 World Conference, June 2009

Web source: [12]

Annotations on Automatic Metadata Creation for Supporting Interoperability Levels of Spatial Data Infrastructures

Networking Names

Author: Karen Smith-Yoshimura

Published by OCLC Research, April 2009

Web source: [13]

Annotations on Networking Names

Some Books on Text Mining

I have found the following four books helpful. They have enabled me to learn about the principles of text mining -- a superset of functionality that includes automatic metadata generation (AMG).

  • Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. - Of all the books listed here, this one includes the most Perl programming examples, and it is not as scholarly as the rest of the list. Much of the book is devoted to applying regular expressions to texts. Its strongest suit is the creation of terminal-based concordance scripts. Very nice. Lots of fun. The concordances return very interesting results. The book does describe clustering techniques too, but on the overall topic of automatic metadata generation the book is not very strong.
  • Konchady, M. (2006). Text mining application programming. Charles River Media programming series. Boston, Mass: Charles River Media. - This book is a readable survey of text mining covering parts of speech (POS) tagging, information extraction, search engines, clustering, classification, summarization, and question/answer processing. Many models for each aspect of text mining are described, compared, and contrasted. To put the author's knowledge into practice, the book comes with a CD containing a Perl library for text mining, sample applications, and CGI scripts. This library is freely available on the Web. The chapters on information extraction and summarization will be of most interest to readers of this wiki.
  • Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. - Of the four books listed here, this one is probably the most dense. I found its Perl scripts for parsing text more useful than the ones in Bilisoly, but this one includes no concordance applications. I also found the description of n-grams, as applied to the extraction of two-word phrases, to be very interesting. I suspect the model the book describes can be extended to any number of words. This book also discusses parts of speech (POS) processing, but it is the only one that describes how to really parse language. Think semantics, lexicons, discourse, and dialog. After the first couple of chapters the Perl examples disappear and give way to Prolog examples exclusively.
  • Weiss, S. M. (2005). Text mining: Predictive methods for analyzing unstructured information. New York: Springer. - The complexity of this book lies between Konchady and Nugues; it includes a greater number of mathematical models than Konchady, but it is easier to read than Nugues. Broad topics include textual documents as numeric vectors, using text for prediction, information retrieval, clustering & classification, and looking for information in documents. Each chapter includes a section called "Historical and Bibliographical Remarks" which has proved to be very interesting reading.
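
The concordance scripts mentioned under Bilisoly are easy to picture with a small example. The following is a minimal sketch in Python rather than Perl, and it is not taken from any of the books above; the function name `concordance`, the `width` parameter, and the sample text are all hypothetical. It shows the basic keyword-in-context idea: find each occurrence of a word with a regular expression and print it with a fixed amount of text on either side.

```python
import re


def concordance(text, keyword, width=30):
    """Return keyword-in-context lines: `width` characters either side of each match."""
    flat = " ".join(text.split())  # collapse newlines and runs of spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Hypothetical sample text, for illustration only.
sample = (
    "Metadata describes a resource. Good metadata helps people find "
    "the resource, and poor metadata hides it."
)

if __name__ == "__main__":
    for line in concordance(sample, "metadata", width=20):
        print(line)
```

Reading down the bracketed column gives a quick sense of how a term is used across a document, which is presumably why concordances "return very interesting results".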

When it comes to the process of automatic metadata generation (AMG), I found each of these books useful in its own right. Each provided me with ways of reading texts, parsing texts, counting words, counting phrases, and, through the application of statistical analysis, creating lists and readable summaries denoting the "aboutness" of given documents.
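
As an illustration of counting words and phrases, here is a minimal sketch in Python rather than Perl. It is an assumption about what such a script might look like, not code from any of the four books. The names `tokenize`, `ngrams`, and `top_terms`, the tiny stop word list, and the sample text are hypothetical. Frequent words and two-word phrases that survive stop word filtering serve as a crude hint at what a document is about.

```python
import re
from collections import Counter

# A tiny, hypothetical stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_terms(text, n=1, limit=5):
    """Count n-word phrases, skipping any that start or end with a stop word."""
    counts = Counter(
        gram for gram in ngrams(tokenize(text), n)
        if gram[0] not in STOP_WORDS and gram[-1] not in STOP_WORDS
    )
    return [(" ".join(gram), count) for gram, count in counts.most_common(limit)]


# Hypothetical sample text, for illustration only.
sample = (
    "Text mining supports metadata generation. Text mining counts words, "
    "and metadata generation turns the counts into a description of the text."
)

if __name__ == "__main__":
    print(top_terms(sample, n=1))  # single words
    print(top_terms(sample, n=2))  # two-word phrases
```

Raising `n` extends the same counting to longer phrases. Raw frequency is only a starting point; the statistical methods in the books above weigh terms more carefully.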

(Written by Eric Lease Morgan, June 2, 2009)

Some Perl modules useful for automatic metadata generation

As a Perl hacker I am interested in writing Perl scripts that put into practice some of the things I learn. Listed here are a number of modules that have gotten me further along in regard to text mining, and all of them are to some degree useful for automatic metadata generation:

  • Lingua::EN::Fathom - This library outputs interesting statistics regarding a given document: number of words and the number of times each occurs, number of sentences, complexity of words, number of paragraphs, etc. Of greatest interest are numbers (Fog, Flesch, and Flesch-Kincaid) denoting the readability of the text. Quick. Easy. Useful.
  • Lingua::EN::Keywords - Given a text, this library outputs a list of what it thinks are the most significant individual words in a document, sans stop words. Not fancy.
  • Lingua::EN::NamedEntity - I believe this library comes pre-trained to extract names, places, and organizations from a given text. It returns a Perl data structure listing the probabilities of a word or phrase being any particular entity. It may need to be re-trained to work for your corpus.
  • Lingua::EN::Semtags::Engine - Given a text, this module will return words and phrases in relevancy-ranked order. I have had some problems using this module because it seems to take a long time to return. On the other hand, it looks promising since it returns individual words as well as phrases.
  • Lingua::EN::Summarize - Given a text, this library returns sentences it thinks encapsulate the essence of the document. The result is readable -- grammatically correct. The process it uses to accomplish its task is self-proclaimed as unscientific.
  • Lingua::EN::Tagger - This library marks up a given document in pseudo XML with tags denoting parts of speech. To do this work it also can extract words, noun phrases, and sentences from a text. Zippy. Probability-based. Developers are expected to parse the tagged output and do analysis against it, such as counting the number of times particular parts of speech occur.
  • Lingua::StopWords - Returns a simple list of stop words. Easy, but I can't figure out how customizable it is. "One person's stop word list is another person's research topic."
  • Net::Dict - A network interface to DICT (dictionary) servers. While the DICT protocol is a bit long in the tooth, and not quite as cool as Web interfaces to things like Google or Wikipedia, this module does provide a handy way to look up definitions, a functionality complementary to WordNet.
  • Text::Aspell - A Perl interface to GNU Aspell which is great for spell-checking applications.
  • TextMine - This is a set of modules written by Manu Konchady, the author of Text Mining Application Programming. It includes submodules named Cluster, Entity, Index, Pos, Quanda (Q & A), Summary, Tokens, and WordNet. This set of modules is the most comprehensive I've seen, and probably the most theoretically grounded, since it interfaces with things like WordNet to be thorough. Even so, my initial experience has been a bit frustrating because scripts written against the libraries do not return results very quickly. Maybe I'm feeding them documents that are too large, and if so, then the libraries are not necessarily scalable.
  • WordNet - There is a bevy of modules providing functionality against WordNet -- a "lexical database of English... Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations." Any truly thorough text mining application will take advantage of WordNet.
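
The three readability numbers named under Lingua::EN::Fathom (Fog, Flesch, and Flesch-Kincaid) are simple formulas over word, sentence, and syllable counts. The following is a minimal sketch of those formulas in Python rather than Perl. It does not use or reproduce the Fathom module's API, and its syllable counter is a rough heuristic assumed here for illustration, so its scores will differ somewhat from those of a real library. The names `count_syllables` and `readability` are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, minus a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return Flesch, Flesch-Kincaid, and Gunning Fog scores for an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for s in syllables if s >= 3)  # three or more syllables
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(syllables) / len(words)
    return {
        # Flesch Reading Ease: higher means easier to read.
        "flesch": 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        # Flesch-Kincaid: approximate U.S. school grade level.
        "flesch_kincaid": 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
        # Gunning Fog: approximate years of schooling needed.
        "fog": 0.4 * (words_per_sentence + 100 * complex_words / len(words)),
    }


if __name__ == "__main__":
    for name, score in readability("The cat sat. The dog ran.").items():
        print(f"{name}: {score:.2f}")
```

A score like this could be stored as a metadata field in its own right, for example to flag the intended audience of a document.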

(Written by Eric Lease Morgan, June 2, 2009)
