Evaluating taggers

Bill Teahan asks how he can evaluate how good his tagger is. I'm 
sure people who have experience in building, training and testing 
taggers will be able to help him. I'd just like to make a couple 
of observations as a tagged-corpus _user_.

The first (prosaic) point is simply to mention the British National 
Corpus, which provides 100 million words of material tagged for 
POS. But of course it is not yet licensed for use outside Europe, 
which will be frustrating for Bill.

The second point is that the use of other tagged corpora merely 
means that you're testing one tagger against another. There is no 
way of avoiding, at some point, manual checking and correction 
of at least a subset of a corpus. That is, I take it, how 
existing taggers are 'trained' - by resubmitting a corrected 
version of their output.
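
As a rough sketch of what that checking amounts to in practice: once a 
subset has been corrected by hand, the tagger's output can be compared 
with it token by token. The tag names, the toy data and the function 
name evaluate_tagger below are all invented for illustration; they do 
not come from the BNC or from any particular tagger.

```python
from collections import Counter


def evaluate_tagger(gold, predicted):
    """Compare tagger output with hand-corrected tags, token by token.

    Both arguments are lists of (word, tag) pairs over the same tokens.
    Returns the proportion of tokens tagged correctly, plus a count of
    each (gold_tag, predicted_tag) disagreement.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")

    correct = 0
    confusions = Counter()
    for (g_word, g_tag), (p_word, p_tag) in zip(gold, predicted):
        if g_word != p_word:
            raise ValueError(f"token mismatch: {g_word!r} vs {p_word!r}")
        if g_tag == p_tag:
            correct += 1
        else:
            confusions[(g_tag, p_tag)] += 1

    accuracy = correct / len(gold) if gold else 0.0
    return accuracy, confusions


# Invented toy data: a hand-corrected subset and the tagger's own output.
# The tag labels are hypothetical, not those of any real tagset.
gold = [("they", "PRON"), ("'re", "VERB"), ("over", "ADV"), ("there", "ADV")]
predicted = [("they", "PRON"), ("'re", "VERB"), ("over", "PREP"), ("there", "ADV")]

accuracy, confusions = evaluate_tagger(gold, predicted)
print(f"accuracy: {accuracy:.2f}")
for (g_tag, p_tag), n in confusions.most_common():
    print(f"  gold {g_tag} tagged as {p_tag}: {n}")
```

The confusion counts are probably more useful than the single accuracy 
figure, since they show which categories the tagger muddles. Note that 
the score is only as good as the manual correction behind it.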

But what interests me more is the _status_ of POS tagging. As 
taggers become more freely available, and POS tagging more robust, 
so the kind of data getting submitted to them becomes more 
heterogeneous. For example, tagging works tolerably well now for 
printed text (which has been designed to conform to limited 
'standards' of construction). But most taggers fall over pretty 
dreadfully when they're asked to deal with transcripts of spoken 
language. There are four problems, as far as I can see.

(1) Spoken transcripts may contain orthographically 'distressed' 
    material which the tagger will not recognise in its look-up 
    tables. The same element may appear in different parts of the 
    transcript in different surface forms. Transcripts also typically 
    contain interpolated material (such as that describing context or 
    the speech event).

(2) Spoken transcripts are often inaccurate, which leads to 
    incorrect assignment of POS. Sometimes this is as simple as a 
    transcriber writing 'their' instead of 'they're'. This problem 
    applies to the spoken material in the first release of the BNC.

(3) Spoken language often contains fragmentary utterances, 
    restarts, and so on, leading to sequences which the tagger may 
    regard as anomalous.

(4) Spoken language does interesting things not comprehended by 
    traditional linguistic analysis. The traditional concept of POS 
    can itself look threatened.
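
Problem (1), at least, can be partly eased by cleaning a transcript 
before it reaches the tagger. The sketch below assumes, purely for 
illustration, that interpolated material is marked with square brackets 
and that a hand-built table of variant spellings is available; real 
transcription conventions differ from project to project, and the names 
used here are hypothetical. It does nothing for problems (2) to (4).

```python
import re

# Assumption: contextual notes are enclosed in square brackets,
# e.g. "[door slams]". Real transcription schemes vary.
BRACKETED = re.compile(r"\[[^\]]*\]")

# Hypothetical hand-built table mapping variant surface forms of the
# same element to the one form the tagger's look-up table knows.
VARIANTS = {
    "cos": "because",
    "coz": "because",
    "'cause": "because",
}


def prepare_for_tagging(line):
    """Strip interpolated notes and normalise known variant spellings."""
    without_notes = BRACKETED.sub(" ", line)
    tokens = without_notes.split()
    return [VARIANTS.get(token.lower(), token) for token in tokens]


sample = "I left cos it was late [door slams] coz I was tired"
print(prepare_for_tagging(sample))
```

Normalising in this way discards information that may itself be of 
interest, so it would be wise to keep the original transcript alongside 
the cleaned version.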

Spoken language is only one kind of orthographically non-standard 
material which linguists are currently interested in. International
corpora of varieties of English must be testing POS tagging to the 
limits. And I have had problems with historical material (Early Modern 
English) and texts generated by email and computer conferencing. I rather 
hope that POS tagging software has now reached a stage of sophistication 
which will allow us to ask interesting questions about the status of 
'part of speech' categories in English (or indeed, other languages).

David Graddol