February 12, 1999
Department of Computer Science
Cornell University, Ithaca, NY
Machine Learning for Information Extraction Systems
A major obstacle to building robust systems that can read, summarize, and extract information from text is the need for large amounts of linguistic knowledge to handle the myriad syntactic, semantic, and pragmatic ambiguities that pervade virtually all aspects of text analysis. This talk will first briefly summarize existing work that addresses this knowledge engineering bottleneck for information extraction systems. We will then present a new approach to partial parsing of natural language texts that supports large-scale information extraction applications and relies on machine learning methods. The approach combines corpus-based grammar induction with a very simple pattern-matching algorithm and an optional constituent verification step. We will show that, in spite of its simplicity, the approach performs surprisingly well for applications that require or prefer fairly simple constituent bracketing.