January 15, 1999

Joel Martin
Institute for Information Technology
National Research Council, Ottawa

Clustering Documents in Any Language?
Clustering Sequences based on Frequent Subsequences

A collection of documents can be more useful if it is organized into clusters, but most automatic clustering techniques rely on a preprocessing step that identifies words or stems and discards a known list of irrelevant words (a stoplist). Designing this step for a new language is usually time-consuming because a speaker of that language must choose the stemming rules and stop words by hand, largely by trial and error.

I will present work in progress that would allow clustering in an arbitrary language without requiring a speaker of that language to identify stems and a stoplist. The system learns a suffix tree description (essentially a grammar) of frequent subsequences in a large collection of documents, then uses that knowledge to cluster the documents and produce descriptive labels for them. Initial results suggest that, in English, the automatic technique is comparable to using hand-generated stemming rules and hand-picked stoplists.
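
The general idea of clustering on frequent subsequences, with no stemmer or stoplist, can be illustrated with a small Python sketch. This is not the system described in the talk: it replaces the suffix tree with brute-force counting of character substrings, uses Jaccard similarity with single-link merging, and labels each cluster with its longest shared substring. All function names, the length bounds, and the threshold are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def substrings(text, min_len=3, max_len=8):
    """All character substrings of text with length min_len..max_len."""
    return {text[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(text) - n + 1)}


def frequent_substrings(docs, min_docs=2):
    """Substrings that occur in at least min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(substrings(doc))  # a set, so each doc counts once
    return {s for s, count in doc_freq.items() if count >= min_docs}


def cluster(docs, threshold=0.3):
    """Group documents whose frequent-substring sets overlap enough.

    Returns a list of (member_indices, label) pairs. The threshold is
    an arbitrary illustrative value, not a tuned parameter.
    """
    vocab = frequent_substrings(docs)
    feats = [substrings(d) & vocab for d in docs]

    # Union-find over document indices (single-link clustering).
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        union = feats[i] | feats[j]
        if union and len(feats[i] & feats[j]) / len(union) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)

    result = []
    for members in groups.values():
        # Label: the longest substring shared by every member
        # (ties broken alphabetically).
        shared = set.intersection(*(feats[i] for i in members))
        label = min(shared, key=lambda s: (-len(s), s)) if shared else ""
        result.append((members, label))
    return result


if __name__ == "__main__":
    docs = [
        "clustering documents",
        "document clusters",
        "suffix tree grammar",
        "grammars and suffix trees",
    ]
    for members, label in cluster(docs):
        print(members, repr(label))
```

Because the features are raw character substrings, nothing in the sketch depends on knowing where words begin or end, which is the property that makes this family of approaches language-independent. A suffix tree would find the same frequent substrings without enumerating every candidate, which matters for a large collection.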
