November 12, 1999
Extracting Information from the World Wide Web
Although your computer workstation can now retrieve any of some 600,000,000 pages on the World Wide Web, it cannot understand their content. This is, of course, because web pages are written to be understood by people, not computers.
The goal of our research is to automatically extract a very large database of facts that mirrors the content of the Web and can be manipulated by computer. Achieving this goal would enable the Web to serve as a gargantuan database and knowledge base supporting a rich variety of applications. Our approach is to use machine learning algorithms to train a system to automatically extract information from web hypertext. For example, in one set of experiments our system was trained to extract descriptions of faculty, students, research projects, and courses from the web sites of computer science departments. It then used these learned extraction routines to build a database containing thousands of new entries by automatically browsing new university web sites. The system is currently running 24 hours per day, and over the past eight months it has built a knowledge base containing over 100,000 assertions, with an accuracy of roughly 70%.
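To make the idea of an extraction routine concrete, here is a minimal, hypothetical sketch of pattern-based extraction of course listings from a web page. The pattern here is hand-written for illustration only; the system described in the talk learns its extraction rules automatically from labeled training pages, and the page fragment and function names below are invented for this example.

```python
import re

# Toy illustration only: a hand-written pattern that pulls course
# numbers and titles out of hyperlinks on a department page. The
# trained system described in the talk learns rules of this general
# shape from example pages rather than relying on a fixed pattern.
COURSE_PATTERN = re.compile(
    r'<a href="[^"]*">\s*(CS\s*\d{3})\s*:\s*([^<]+)</a>',
    re.IGNORECASE,
)

def extract_courses(html):
    """Return (course number, course title) pairs found in a page."""
    return [(num.strip(), title.strip())
            for num, title in COURSE_PATTERN.findall(html)]

# A fabricated page fragment, standing in for a real department listing.
page = '''
<ul>
  <li><a href="cs101.html">CS 101: Introduction to Programming</a></li>
  <li><a href="cs441.html">CS 441: Machine Learning</a></li>
</ul>
'''

print(extract_courses(page))
```

Running such an extractor over many department sites yields tuples that can be loaded directly into a database, which is the sense in which the extracted facts become manipulable by computer.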
This talk will present the machine learning algorithms we have developed to date, along with experimental results suggesting these methods can be quite effective for information extraction in certain domains.
Tom M. Mitchell is the Fredkin Professor of Artificial Intelligence and Learning in the School of Computer Science, Carnegie Mellon University. He is also the Founding Director of CMU's Center for Automated Learning and Discovery, an interdisciplinary center for research on data mining. Mitchell is best known for his research on machine learning, in which he has developed applications such as online calendars that learn their users' scheduling preferences, web browsers that learn to extract information from hypertext, and systems that predict birth risks in new pregnancies based on hospital records of previous pregnancies. Mitchell is the author of the widely used textbook "Machine Learning" (McGraw Hill, 1997), President-Elect of the American Association for Artificial Intelligence, and a member of the Computer Science and Telecommunications Board of the National Academy of Sciences' National Research Council. Mitchell received his B.S. degree from the Massachusetts Institute of Technology in 1973, and his Ph.D. in electrical engineering from Stanford University in 1979.