October 22, 1999

Jérôme Tétreault
School of Information Technology and Engineering
University of Ottawa

Bilingual Text Alignment Based on Word Occurrence Information

Bilingual parallel corpora are among the most valuable sources of information for the development of translation resources. Aligned corpora, which are obtained by aligning corresponding segments (usually sentences) of texts, have proved very useful in many tasks, such as statistical machine translation, bilingual lexicography, and word sense disambiguation.

In this talk, I will give a brief overview of published work on parallel text alignment, outlining the different approaches and their domains of application. I will then present in more detail an algorithm that uses dynamic programming to compare word occurrence vectors. The algorithm is based on previous work by Fung, with some modifications. It aims to extract approximate bilingual lexicons from bilingual corpora, assuming no knowledge of either language and no prior sentence-level or paragraph-level alignment. I will present bilingual lexicons extracted from the Hansard corpus. We envisage that the extracted lexicon could then be used to produce a set of anchor points between the texts, allowing alignment at a finer level with high accuracy.
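As a rough illustration of the idea of comparing word occurrence vectors with dynamic programming, here is a minimal sketch. It is not the algorithm presented in the talk: the function names `gap_vector` and `dtw_distance` are hypothetical, the vector is assumed to hold the distances between successive occurrences of a word, and the comparison is assumed to be plain dynamic time warping with an absolute-difference cost.

```python
def gap_vector(tokens, word):
    """Distances (in tokens) between successive occurrences of `word`.

    Assumption: this stands in for a "word occurrence vector".
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def dtw_distance(u, v):
    """Dynamic time warping cost between two numeric sequences.

    Assumption: the local cost is the absolute difference of the two values.
    Returns infinity if either sequence is empty.
    """
    inf = float("inf")
    if not u or not v:
        return inf
    # prev[j] holds the best cost of aligning the rows seen so far with v[:j].
    prev = [inf] * (len(v) + 1)
    prev[0] = 0.0
    for x in u:
        curr = [inf] * (len(v) + 1)
        for j, y in enumerate(v, start=1):
            # Extend the cheapest of the three neighbouring alignments.
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(v)]
```

Under these assumptions, a word in one text and its translation in the other should yield similar gap vectors and therefore a low warping cost, so candidate word pairs could be ranked by `dtw_distance` without any prior sentence alignment.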
