December 11, 1998

Mario Jarmasz
School of Information Technology and Engineering
University of Ottawa

Corpus Linguistics: a paradigm for solving NLP problems

Development of large electronic corpora for use in Computational Linguistics started in the late 1970s. Advances in software and Natural Language Processing (NLP) technologies have facilitated the transformation of text archives into electronic corpora. Over the past decade, many researchers have turned to Corpus Linguistics to develop large-scale linguistic applications. The use of large corpora is not a new concept in Linguistics, but the richness of today's corpora, their growing size and the fact that many are easily accessible make Corpus Linguistics attractive today.

In this talk I will present the different aspects of Corpus Linguistics. I will define what a corpus is and describe the various types of corpora that are currently available. I will then give an overview of the fields interested in corpora and of possible applications, such as the construction of an electronic thesaurus, information retrieval systems and machine translation systems. I will also present some statistical methods for empirical investigations of corpora as well as the steps involved in creating an electronic corpus.

This presentation is based on the book Les linguistiques de corpus (Habert, Nazarenko, Salem, 1997).

