Re: Evaluating taggers

While I agree with Mark Johnson about the need to keep task firmly in
mind while evaluating syntactic analysis, I believe that there are
aspects of syntactic analysis about which a very reasonable
consensus reigns.  Mark complained:

> I suppose one could try to ask the harder question ``What category
> distinctions must any adequate account make?'', but I am skeptical
> that there is a theory-independent answer.  For one thing, theoretical
> assumptions about the interaction of morphology, syntax and semantics
> would probably influence the kinds and structure of the category
> distinctions.
> The same thing seems to be true about bracketting, despite the
> putative psychological claims made by some theorists.  Claims that a
> certain style of bracketting is `theory-neutral' seem to me to be more
> sociological rather than scientific in nature: i.e., a
> `theory-neutral' bracketting is one which hopefully a majority of
> contemporary linguists would more or less assent to.

The point is not that there are theory-neutral systems of bracketing,
but rather that all systems agree on some of the bracketings in
practically all cases, e.g., NP and PP constituents (also subordinate
finite clauses).  We clearly can evaluate grammars/parsers on their
accuracy here.  The point about part of speech categories is parallel,
I believe.  So there may be theories in which, say, finite verbs are
not "fundamental".  But it is reasonable to ask whether a concrete
grammar can identify the finite verbs in a given sample of tokens.

Moreover, we should.  Recognizing such structures is a reasonable
test of empirical accuracy in a grammar, and it could be of practical 
use in some applications.  
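
To make the kind of restricted evaluation I have in mind concrete,
here is a minimal sketch in Python.  It is only an illustration under
my own assumptions, not an existing tool: the function name, the
(label, start, end) span encoding, and the toy sentence are all
hypothetical.  It scores a parser only on the bracket types about
which there is consensus (here NP and PP) and ignores everything else.

```python
from collections import Counter


def restricted_bracket_scores(gold, predicted, labels=("NP", "PP")):
    """Precision, recall and F1 over brackets whose label is in `labels`.

    Each bracket is a (label, start, end) tuple over token positions.
    Brackets with any other label are ignored on both sides, so
    theory-specific constituents do not affect the score.
    """
    gold_counts = Counter(b for b in gold if b[0] in labels)
    pred_counts = Counter(b for b in predicted if b[0] in labels)

    # Multiset intersection: a predicted bracket matches at most one gold bracket.
    matched = sum((gold_counts & pred_counts).values())
    n_gold = sum(gold_counts.values())
    n_pred = sum(pred_counts.values())

    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical analyses of "the dog slept on the mat").
gold = [("S", 0, 6), ("NP", 0, 2), ("VP", 2, 6), ("PP", 3, 6), ("NP", 4, 6)]
predicted = [("NP", 0, 2), ("VP", 2, 6), ("PP", 3, 6), ("NP", 5, 6), ("ADVP", 2, 3)]

precision, recall, f1 = restricted_bracket_scores(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

The same function could be applied to the part of speech case by
encoding each token as a one-token span, e.g. ("VFIN", i, i+1), and
passing labels=("VFIN",), where VFIN is a made-up tag name for finite
verbs.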

I think there's a parallel in the speech community, where the
insistence on practical evaluation drove a good deal of improvement in
the early 1980's.  They too had to agree on evaluating part of the
recognition data, namely that represented by text, ignoring accent,
minor mispronunciation, tempo, pitch patterns, and intensity, even
when these clearly contributed to meaning.  What was left was of
practical and scientific interest, and that's what counts.

TSNLP is a current EU project investigating evaluation along 
the lines I mention.  I'm NOT a member of the project, so please
don't ask me more about it, but there's a home page at:


