October 16, 1998

Sam Scott
Institute of Interdisciplinary Studies (Cognitive Science), Carleton University, Ottawa

Forming Better Features for Text Categorization

Any attempt to use the standard techniques of Machine Learning to categorize text must involve some method of converting natural language documents to lists of feature-value pairs. The most common method is known as the "bag of words" approach. Each word in the language is a "feature", and each document is assigned a value for that feature based on how often the word appears in it. This reduces each document to an unordered list of tokens: syntactic structures are broken up and semantic relationships are ignored. The "bag of words" representation performs fairly well in some domains, but does quite poorly in others. Intuitively, it should be possible to do better.
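As a rough illustration of the "bag of words" representation described above, the following minimal Python sketch maps a document to word frequencies. The function name `bag_of_words`, the simple lowercase tokenizer, and the two example sentences are assumptions made for illustration; they are not taken from the work presented in the talk.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a document to {word: frequency}, discarding word order."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two sentences with different meanings but the same words.
features_a = bag_of_words("The dog bit the man")
features_b = bag_of_words("The man bit the dog")

# Word order is lost, so the two feature sets are identical.
same_features = features_a == features_b
```

Because the two sentences reduce to the same feature-value pairs, a learner given only this representation cannot tell them apart, which is the kind of lost information the abstract refers to.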

The work presented in this talk explores various methods for forming features using some of the information left out of the "bag of words" model. Word ordering is partially preserved by forming a "bag of phrases" feature set, where phrases can be identified syntactically using a simple noun phrase grammar or statistically using a keyphrase Extractor. Semantic relationships are partially preserved by forming meta-features of generalized word meanings from the knowledge contained in an on-line lexical reference system (WordNet). The new feature sets are tested with a symbolic rule-based learner on two major collections of English texts, and the results are compared to those obtained using the "bag of words" model.
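The two kinds of features described above can be sketched in a few lines of Python. Everything here is a toy stand-in: `TOY_TAGS` replaces a real part-of-speech tagger, the pattern "a run of adjectives and nouns ending in a noun" is only one guess at what a simple noun phrase grammar might look like, and `TOY_HYPERNYMS` is a hand-written table standing in for WordNet lookups. None of these names or rules come from the talk itself.

```python
from collections import Counter

# Hypothetical hand-written part-of-speech tags (a real system would use a tagger).
TOY_TAGS = {
    "the": "DET", "big": "ADJ", "stock": "NOUN", "market": "NOUN",
    "fell": "VERB", "sharply": "ADV", "wheat": "NOUN", "prices": "NOUN",
    "rose": "VERB",
}

# Hypothetical word -> more general concept table (a stand-in for WordNet).
TOY_HYPERNYMS = {"wheat": "grain", "corn": "grain", "grain": "food"}


def noun_phrases(tokens):
    """Return multi-word runs of adjectives/nouns that end in a noun."""
    phrases, run = [], []
    for tok in list(tokens) + [None]:  # the None sentinel flushes the last run
        if tok is not None and TOY_TAGS.get(tok) in ("ADJ", "NOUN"):
            run.append(tok)
            continue
        # Drop trailing adjectives so that the phrase ends in a noun.
        while run and TOY_TAGS[run[-1]] != "NOUN":
            run.pop()
        if len(run) >= 2:
            phrases.append(" ".join(run))
        run = []
    return phrases


def hypernym_features(tokens):
    """Count every generalized meaning reachable from each token."""
    counts = Counter()
    for tok in tokens:
        current = tok
        while current in TOY_HYPERNYMS:  # walk up the (acyclic) toy table
            current = TOY_HYPERNYMS[current]
            counts[current] += 1
    return counts


phrase_features = Counter(noun_phrases("the stock market fell sharply".split()))
meta_features = hypernym_features(["wheat", "corn"])
```

In this sketch "stock market" survives as a single phrase feature, and "wheat" and "corn" both contribute to the shared meta-features "grain" and "food", so documents that use different words for related concepts end up with overlapping features.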
