comparing corpora: sum&thanks

Some time ago I posted the following query:

I've been comparing corpora by contrasting their
respective word frequency lists using a program
that reads both lists and returns chi-square
statistics. The program extracts those words 
whose frequencies are different between the two 
corpora and presents them as 'keywords'. The
keywords are therefore words which are used
significantly more or less often than expected
in one corpus than in the other.

From what you can gather from this quick
description, do you think this is an acceptable 
procedure for comparing two corpora? I have never
seen this approach being referred to in the
literature. Any ideas?
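For concreteness, the procedure described in the query can be
sketched roughly as follows (an illustrative Python sketch, not the
actual program; all names are mine):

```python
def chi2_keywords(freq_a, freq_b):
    """For each word, build the 2x2 table (word vs all other words,
    corpus A vs corpus B), compute chi-square, and note whether the
    word is over- ('>') or under-represented ('<') in corpus A."""
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    results = []
    for w in set(freq_a) | set(freq_b):
        a = freq_a.get(w, 0)      # word in corpus A
        b = freq_b.get(w, 0)      # word in corpus B
        c = total_a - a           # all other words in A
        d = total_b - b           # all other words in B
        n = a + b + c + d
        chi2 = 0.0
        for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                              (c, c + d, a + c), (d, c + d, b + d)):
            exp = row * col / n   # expected count for this cell
            if exp:
                chi2 += (obs - exp) ** 2 / exp
        # '>' when the word's relative frequency is higher in A
        sign = '>' if a * total_b > b * total_a else '<'
        results.append((w, chi2, sign))
    return sorted(results, key=lambda r: -r[1])
```

Words at the top of the returned list are the 'keywords'.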

The posting has attracted considerable attention,
so I've received several replies, all very illuminating.
A summarized version is appended below. I want to thank
all respondents for the valuable info they provided.

The full versions of all replies are available at:


In broad terms, the prevailing opinions are:
1. Differences in topic would cause the frequencies to differ,
producing spurious keywords;
2. Chi-square is not an adequate test.

Tony Berber Sardinha     | tony1@liverpool.ac.uk
AELSU                    | Fax 44-51-794-2739
University of Liverpool  |
PO Box 147               | 
Liverpool L69 3BX        |
UK                       |

Carlos McEvilly:

It depends on the nature of the corpora, and your
goal.  If they are on the same topic I would expect
that most of the keywords extracted by the process
you describe would be of little value; ...

One interesting way to extract keywords from a 
corpus with a diverse range of contents would be
to save statistics for the entire corpus, then
compare normalized statistics for each document
within the corpus to the normalized statistics
for the entire corpus ...
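A hedged sketch of that document-versus-corpus comparison (the
ratio scoring below is one illustrative choice; the post does not
fix a particular statistic):

```python
from collections import Counter

def document_keywords(docs, top_n=5):
    """Score each word in each document by the ratio of its relative
    frequency in the document to its relative frequency in the whole
    corpus; return the top_n highest-ratio words per document."""
    corpus = Counter(w for doc in docs for w in doc)
    corpus_total = sum(corpus.values())
    keywords = []
    for doc in docs:
        counts = Counter(doc)
        scored = sorted(
            counts,
            key=lambda w: (counts[w] / len(doc)) / (corpus[w] / corpus_total),
            reverse=True)
        keywords.append(scored[:top_n])
    return keywords
```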

For a more sophisticated approach see Cohen, J.
D., "Highlights: Language- and Domain-Independent
Automatic Indexing Terms for Abstracting" in
the Journal of the American Society for
Information Science (JASIS), v46 n3, pp 162-174, 1995.


I can think of a few reasons why the simple frequency comparison with X2 you
describe hasn't been widely used:

(a) it assumes a 1:1 match between form and function, so doesn't take account
of polysemes and homographs, and (unless it's a little more sophisticated than
you describe) acts only on single wordforms rather than lemmas.

(b) most of the corpora we use are of mixed origin; it only takes a small
difference in the distribution of topics between corpora to produce a large
number of statistically significant, but ultimately meaningless, differences in
wordform frequencies.

(c) X2 is notoriously unreliable when there is a high degree of 'clumping' in
the data - as when the corpus consists of 500 2000-word texts, each with a
single topic. (Ie, significance will be exaggerated, because you aren't dealing
with independently chosen words!) However, most linguists haven't yet realised
that X2 is inherently unsuited to many linguistic applications, and continue to
use it because it's relatively easy to calculate. (The alternative being a
rather hideous Poisson probability model.)

Not to say that your method doesn't have its good points - but it needs to be
used with a fair degree of caution. 

Rob Freeman

The idea of comparing the frequencies of words in texts as a way of
characterizing the texts is something we have been working on most of 
this year in Hong Kong. This is part of a project to compare the English
of ESL students here with that of native speakers. In the comparisons 
we have done, our biggest problem has been assessing the significance of the 
differences which we find. I should say we're not helped in that by my limited 
statistical sophistication... A chi-squared test matches the data to a 
particular distribution, doesn't it? Won't you need to divide your
sample to calculate the parameters of the distribution for each word?
If our experience is anything to judge by, you will need very large
text samples to do this; most of the words will occur so infrequently 
that you will have difficulty splitting your sample.



I have been doing a similar thing, as one step in my overall  
development of the theoretical notion of context.  I did not know of  
any others in the field who are using such a technique.  


Paul Rayson

Re your message on comparing corpora.  I am an RA working in UCREL at
Lancaster. We have developed a system which builds text norms from a corpus
(basically its frequency profile) and compares that against smaller samples
using a chi-squared analysis.  We've published in the proceedings of ICAME91
(which didn't appear until 1993!), although it was work in progress at that
time.  The project involved a market research company which wanted to do a
combination of qualitative and quantitative research for its clients.  We are
continuing with a follow-up at the moment.  The chi-squared statistic is very
useful for this, although some people prefer a t-test.


Lindsay Evett

We have done something along those lines for topic identification; simply,
compare word frequencies for a domain corpus and a general corpus; those
words which occur very frequently in the domain corpus compared to
the general corpus are considered to be significant for the domain. 
When processing later documents, if a document has a significant 
number of those words, it is judged to be about that domain.
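A rough sketch of this scheme, with illustrative thresholds (the
actual cutoffs used are not given in the post):

```python
def domain_keywords(domain_freq, general_freq, ratio=5.0, min_count=3):
    """Words whose relative frequency in the domain corpus is at least
    `ratio` times their relative frequency in the general corpus.
    The thresholds and the 0.5 smoothing count are illustrative."""
    dom_total = sum(domain_freq.values())
    gen_total = sum(general_freq.values())
    keywords = set()
    for w, count in domain_freq.items():
        if count < min_count:
            continue
        dom_rel = count / dom_total
        gen_rel = general_freq.get(w, 0.5) / gen_total  # smooth unseen words
        if dom_rel / gen_rel >= ratio:
            keywords.add(w)
    return keywords

def is_about_domain(doc_words, keywords, threshold=0.1):
    """Judge a document to be about the domain when at least
    `threshold` of its tokens are domain keywords."""
    hits = sum(1 for w in doc_words if w in keywords)
    return hits / len(doc_words) >= threshold
```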



if you are using pearson's chi^2, then this is not a good method.

my article in computational linguistics, volume 19 (Accurate methods for
the statistics of surprise and coincidence) covers this sort of
comparison.  the specific example in the article is analysis of
bigrams, but the same method applies without significant change to
word frequency comparison.

in a nutshell, the problem with using the normal chi^2 test is that
the expected frequencies in some of the cells of the contingency table
are much too small for the gaussian approximation to apply.  the
effect is that the significance of some difference is radically
overstated (by 200 orders of magnitude in an example in the paper).

there is a much more effective test which does not depend on
approximate normality.  based on the exact form of a likelihood ratio
test, this test is much less sensitive to the problems of overstated
significance.  this test has been called the G^2 statistic and is
closely related to the kullback-leibler measure of divergence of two
distributions.  i can provide source code to whoever is interested in
doing this sort of analysis.
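a minimal sketch of this kind of G^2 comparison (the table
construction and names here are mine, not the offered source code):

```python
import math

def g2(a, b, total_a, total_b):
    """G^2 (log-likelihood ratio) for one word: `a` occurrences in a
    corpus of `total_a` tokens vs `b` occurrences in `total_b` tokens,
    via the usual 2x2 contingency table."""
    c, d = total_a - a, total_b - b
    n = a + b + c + d

    def term(obs, row, col):
        exp = row * col / n           # expected count for this cell
        return obs * math.log(obs / exp) if obs else 0.0

    return 2 * (term(a, a + b, a + c) + term(b, a + b, b + d)
                + term(c, c + d, a + c) + term(d, c + d, b + d))

def compare(freq_1, freq_2):
    """rank words by G^2, marking '>' when the first corpus has the
    higher relative frequency and '<' otherwise."""
    t1, t2 = sum(freq_1.values()), sum(freq_2.values())
    rows = []
    for w in set(freq_1) | set(freq_2):
        a, b = freq_1.get(w, 0), freq_2.get(w, 0)
        sign = '>' if a * t2 > b * t1 else '<'
        rows.append((g2(a, b, t1, t2), sign, w))
    return sorted(rows, reverse=True)
```

unlike pearson's chi^2, the zero and near-zero cells contribute
nothing pathological here, since 0 * log(0) is taken as 0.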

as an example, here is a comparison of two (tiny) corpora.  one is a
record of 2400 words of email regarding cmu common lisp.  the second
is about 1700 words of email regarding the tipster document manager.

these are only the 100 most significant differences.  

[NB: The example was truncated]

each line of output contains a G^2 score, a < or > and the word in
question.  if there is a >, then the first corpus (cmucl) has more
occurrences than the second corpus.  i have added comments which all
begin with a # to many of the lines.

bash$ compare cmucl.counts muffin.counts | sort -nr | head -100
51.999 < document		# muffin is about document managers
49.458 > _$			# cmucl talks about assembly language
47.020 < crl.nmsu.edu		# people at crl are cc'ed on muffin mail
46.846 > dlopen			# dynamic loading is a big deal regarding lisp
40.080 < tdm			# acronym for tipster document manager
27.286 > rld			# rld_* occurs lots in listings
25.984 > mikemac		# a frequent contributor recently
25.209 < attribute		# the document manager manages attributes
24.682 > Mike			# see mikemac above
23.723 < bbn.com		# due to cc lists
23.381 > cs.cmu.edu		# likewise, but for cmucl
23.381 > McDonald		# see mikemac



This procedure sounds similar to work done on trying
to determine the authorship of articles. Various statistical
distributions, such as the Poisson distribution, were
used to find the more-than-chance occurrence of words
used by the authors, which serves to classify the texts
by authorship. One pioneering work
done in 1964 is extensively described in the second edition
of the book:

author = 	{Frederick Mosteller
           and  David L. Wallace},
title = 	{Applied Bayesian and Classical Inference - The Case
		  of The Federalist Papers},
year = 		1984,
publisher = 	{Springer Series in Statistics, Springer-Verlag},
keywords =	{Bayesian and classical inference, authorship,
		  statistical NLP}


Matthias Romppel

Some points concerning your request occur to me:
I have seen a similar approach at least in one philological study 
analyzing German poet Gottfried Benn's changes in vocabulary during 
life course (exact reference not at hand).
Actually I used this approach in a different context (differential/
political psychology) comparing the word frequency lists of two sets 
of political speeches which differed in degree
of a specific content analytical measure. I think there might be other
applications of this approach in these fields.
As far as the appropriateness of this approach is concerned, IMHO some points 
have to be considered: 
Generally, correct interpretation of chi-square values requires expected 
frequencies to be greater than 5 (which might be no problem here).
What are the word lists like? In particular, has there been any suffix removal?
Depending on what you are aiming at, an effect size measure might be more
appropriate than a significance test alone.


Adam Kilgarriff

	I tried doing this.  Unfortunately it doesn't work.
Chi-square tells you whether the two corpora might possibly be two
samples drawn randomly from the same population.  In fact this is
NEVER the case.  Words in texts just aren't random.  Even when I
look at two halves of the same text, the first half and the second
half, their freq lists were significantly different.  The problem
doesn't really change if you look at very high-frequency words either
- although it appears at first glance that the relative frequencies
converge to the same proportion, the chi-square test tells you they
don't quite: if the two subcorpora were truly random samples, then,
for a very common item with hundreds of thousands of occurrences, the
proportions would be very very nearly the same for the two.  Very
nearly isn't good enough.

	If you look at the printed LOB freq tables, in the LOB book
(Johanssen and ?) you'll see they have a column saying whether words
are significantly different in their British vs American frequencies.
A lot are, and ALL high-frequency words are.  It would appear the
authors did not fully take on board the arguments above, as then they
would have realised that this is not a characteristic of British vs. American
differences, but of the statistics of large corpora.

	What we need to find is a test a bit like chi-square but which
is not statistically 'pure', but rather, includes some constant which
can be fixed empirically so that we can say "these two corpora represent
the same sort of text to degree X".  This is one of the bits of
research I have on the go.


Gunnel Kallgren

Frequency lists may be a good way of finding genre differences between
corpora. Many of the larger corpora contain just one or two text types
and I think it is a bit hazardous to draw far-reaching conclusions from
such material, be it on lexical, syntactic, or other matters. (As for me,
I am working with a balanced Swedish corpus and am well aware of how
surprisingly much text types can differ along various parameters.) By 
comparing single-genre corpora with balanced ones, one might get an idea
of how much they differ from the average and thus whether they can be
expected to say something about language in general or just their own
text type. Perhaps.


Tony Rose

I'm most interested to hear of other people looking at this approach.
It's an idea we have also investigated, so far with regard to text recognition
applications. We didn't use quite the same statistical method, but the
basic idea was the same. We took a set of domain-specific corpora,
and a large, undifferentiated or "general" corpus, and compared each set of
domain-specific word frequencies with those of the general corpus.
This gave us a set of domain-specific keywords for each domain.
We then represented these associations as codes attached to each entry in 
our lexicon, analogous to the domain codes present in LDOCE.
The difference was that our statistical process left you with not only 
a set of codes for each word, but also a measure of strength of association
for each code (LDOCE just gives binary info, i.e. code/no code.)

We tested it by using it as a further source of knowledge for lattice
disambiguation experiments, with data obtained from various OCR / script
recognition systems. Results were reasonably encouraging; I've attached
some references below, which describe our methodology in greater detail.
So at least there's some empirical support for the approach!

TG Rose & LJ Evett (forthcoming) Handwriting Recognition using Domain
Information. Proc. 7th Int. Conference of the Graphonomics Society,
London, Ontario, August 1995.

TG Rose & LJ Evett (1993) Text recognition using collocations and
domain codes. Proc. 1st Annual Workshop on Very Large Corpora,
Ohio State University, USA, 65-73.

TG Rose & LJ Evett (1993) Semantic analysis for large vocabulary cursive
script recognition. Proc. 2nd Conference on Document Analysis & Recognition,
Tsukuba Science City, Japan, 236-239.



You could start by reading:

Dunning, Ted (1993) "Accurate Methods for the Statistics
of Surprise and Coincidence" Computational Linguistics 19(1)

However, I know Ted has done some work closer to what you
describe (some descriptions were posted to comp.ai.nat-lang
a while back).  So I would think he could give you better advice
than I can.


Stephanie W. Haas

Regarding your request for articles on corpora word frequencies,
some colleagues and I have been doing somewhat similar work.
The following papers may be helpful:

Haas, Stephanie W. (1995). Domain terminology patterns in different
disciplines:  Evidence from abstracts.  _Proceedings of the Fourth
Annual Symposium on Document Analysis and Information Retrieval_,

Losee, Robert M., Jr. & Haas, Stephanie W. (1995).  Sublanguage terms:
Dictionaries, usage, and automatic classification.  _Journal of the
American Society for Information Science_ (in press; may be in the
next issue to come out).

Haas, Stephanie W. & He, Shaoyi. (1993).  Toward the automatic
identification of sublanguage vocabulary.  _Information Processing
& Management_, 29, 6, 721-732.



We have developed our own programs.

We are also doing some work on cohesion, trying to identify text structure,
though not using the frequencies.