February 18, 2000
An Evaluation of a Rule-Based Parser of English Sentences
Parsers are essential components of many Natural Language Processing applications. In such applications, parsing is the process of analyzing a sentence and assigning a grammatical type to each word, phrase, and clause in it. A broad-coverage parser attempts to analyze any grammatical sentence of a natural language. To analyze sentence structure, a parser depends upon a grammar, which is a set of codified rules of the language. In such systems, the quality of the parser and its grammar often has a direct effect on the quality of the application's output. Achieving good results therefore rests on reliable evaluation methods that identify weaknesses in the parser or grammar so that they can be eliminated.
DIPETT (Domain Independent Parser of English Technical Text) is a broad-coverage parser of English technical text that was developed for a Ph.D. research project (Delisle, 1994), and is used primarily in the TANKA (Text Analysis for Knowledge Acquisition) project. The TANKA project seeks to build a model of a technical domain by semi-automatically processing written text that describes the domain. No other source of domain-specific knowledge is available. The accuracy and completeness of a semantic representation generated by TANKA are partly determined by the accuracy of DIPETT's syntactic analysis of the text.
Test suites have long been accepted in NLP because they provide controlled data that is systematically organized and documented. I will discuss the use of test suites for evaluating the performance of parsers of English sentences, and present two test suites used to evaluate DIPETT. The evaluation results were used to make significant improvements to DIPETT, and the test suites were also used, in part, to compare DIPETT's performance to that of several other publicly available broad-coverage parsers.
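
To make the idea of a systematically organized test suite concrete, here is a minimal sketch in Python. It is purely illustrative and does not reproduce DIPETT or either of the test suites presented in the talk: the names SuiteItem, toy_parser_accepts, and evaluate, the sample sentences, and the phenomenon labels are all assumptions made for this example. Each item pairs a sentence with the phenomenon it exercises and the expected grammaticality judgment, and results are tallied per phenomenon so that weaknesses can be localized.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteItem:
    """One documented test-suite entry (hypothetical layout)."""
    sentence: str
    phenomenon: str      # the construction this item is meant to exercise
    grammatical: bool    # expected judgment: should a parser accept it?


# Hypothetical stand-in for a real parser: it "accepts" a sentence if a
# known verb appears anywhere after the first word. It is deliberately
# naive so that the evaluation has a weakness to expose.
VERBS = {"parses", "fails"}


def toy_parser_accepts(sentence: str) -> bool:
    tokens = sentence.lower().rstrip(".").split()
    return any(token in VERBS for token in tokens[1:])


def evaluate(suite, accepts):
    """Return {phenomenon: (correct, total)} for an accept/reject function."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in suite:
        total[item.phenomenon] += 1
        if accepts(item.sentence) == item.grammatical:
            correct[item.phenomenon] += 1
    return {name: (correct[name], total[name]) for name in total}


SUITE = [
    SuiteItem("The parser parses the sentence.", "transitive", True),
    SuiteItem("The parser the sentence.", "transitive", False),
    SuiteItem("The parser fails.", "intransitive", True),
    SuiteItem("Fails the parser.", "intransitive", False),
    SuiteItem("The parser that fails the sentence.", "relative clause", False),
]

if __name__ == "__main__":
    for name, (ok, n) in evaluate(SUITE, toy_parser_accepts).items():
        print(f"{name}: {ok}/{n} correct")
```

Grouping results by phenomenon, rather than reporting a single overall score, is what lets an evaluation of this kind point at a specific gap: in this toy setup, the stand-in parser wrongly accepts the ill-formed relative-clause item.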