February 4, 2000

Martin Fontaine

Structural Identification of Unintelligible Documents

This presentation describes techniques for solving the document identification problem in a specific context: classifying documents by their structure rather than by their content. We therefore do not assume that the documents are written in a natural language. The only assumption is that the target concept to be learned (with machine learning techniques) can be expressed as a regular expression. The following topics will be discussed:

  1. A new approach to textual feature extraction and feature pruning, inspired by compression techniques and developed specifically to extract features efficiently from large training sets containing very large documents.
  2. The utility of grammatical inference techniques for text classification.
  3. A hybrid approach combining grammatical inference and decision trees.
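
As a rough illustration of the setting only, and not the method presented in the talk, the sketch below shows what classifying by structure can look like. It assumes documents are plain strings, uses shared character n-grams as a simple stand-in for the compression-inspired feature extraction, and uses a hand-written regular expression as a hypothetical target concept. All names (ngram_features, featurize, TARGET, matches_concept) are invented for this example.

```python
import re
from collections import Counter


def ngram_features(docs, n=3, min_docs=2):
    """Keep the character n-grams that occur in at least min_docs documents.

    This is a simple stand-in for feature extraction and pruning; it is
    NOT the compression-inspired technique described in the talk.
    """
    doc_freq = Counter()
    for doc in docs:
        # Count each n-gram once per document (document frequency).
        doc_freq.update({doc[i:i + n] for i in range(len(doc) - n + 1)})
    return sorted(g for g, count in doc_freq.items() if count >= min_docs)


def featurize(doc, features):
    """Binary vector: 1 if the feature substring occurs in doc, else 0."""
    return [1 if f in doc else 0 for f in features]


# Hypothetical target concept: the literal header "HDR" followed by one or
# more records, each made of two digits and a semicolon.
TARGET = re.compile(r"HDR(\d\d;)+")


def matches_concept(doc):
    """True if the whole document matches the target regular expression."""
    return TARGET.fullmatch(doc) is not None


docs = ["HDR12;34;", "HDR56;", "XX12;"]
features = ngram_features(docs)
vectors = [featurize(d, features) for d in docs]
labels = [matches_concept(d) for d in docs]
```

The binary vectors could be passed to any standard learner, such as a decision tree, while the regular expression plays the role of the concept a grammatical inference method would try to recover from labeled examples.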
