PARSING, LINGUISTIC RESOURCES AND SEMANTIC ANALYSIS FOR ABSTRACTING AND CATEGORISATION

William J. Black
Centre for Computational Linguistics
UMIST, Manchester
January 17th, 1994 (*)

1 Introduction

This paper discusses the linguistic aspects of text processing, based on experience with two related applications: the abstracting of technical papers and text categorisation. Although systems for both applications can be, and have been, developed without formal linguistic analysis, we have preferred an architecture in which re-usable linguistic resources and analysers play a part.

2 Abstracting by extracting

The extraction approach to abstracting seeks robustness in simplicity, acknowledging that natural language processing is not yet mature enough for such tasks. Essentially, an abstract is made up by concatenating sentences extracted from the source text by a mechanism that selects them as content-indicative. That mechanism can be based on pattern-matching, as described by Paice (1981) and Black and Johnson (1988), or it can be statistical, as described by Luhn (1958), Edmundson (1969) or Earl (1970).

However, merely concatenating text sentences risks incoherence, which is particularly noticeable when the extraction criteria take no account of chains of reference in the texts. A graphic example encountered by Black and Johnson (op cit) was a paper describing a sequence of three experiments, from which three sentences were extracted, each containing the phrase "the experiment"; each such token referred to a different experiment. In this case, the presence of anaphora renders the extract insufficiently coherent for use as an abstract. Liddy et al (1987) also report on the effect of the occurrence of anaphora on the statistical basis for numerical measures of concept occurrence. In that case, the presence of anaphora has a deleterious effect on the selection part of the extracting process.
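The statistical route can be illustrated with a minimal, Luhn-style sketch: score each sentence by its density of frequent content words and emit the top scorers in document order. The scoring details below (the stopword list, the "occurs more than once" significance test) are illustrative assumptions, not the exact method of any of the cited authors.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "to", "is", "are", "and", "that",
             "this", "it", "for", "on", "with", "as", "by", "was", "were"}

def extract(text, n_sentences=2):
    """Return the n top-scoring sentences, concatenated in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    content = [w for w in re.findall(r"[a-z]+", text.lower())
               if w not in STOPWORDS]
    freq = Counter(content)
    # "significant" words: content words occurring more than once in the source
    significant = {w for w, c in freq.items() if c > 1}

    def score(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        hits = sum(1 for t in tokens if t in significant)
        return hits / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:n_sentences]
    return " ".join(sentences[i] for i in sorted(ranked))
```

Because the selected indices are re-sorted before concatenation, the extract preserves source order, but, as the paper notes, nothing here protects against dangling references.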
For both reasons, an important refinement of extraction-based abstracting is to attempt to control for the use of pronominal anaphora and other referring expressions. Paice and Husk (1987) reported a relatively small rulebase which discriminates referring from non-referring uses of the pronoun "it" with a high degree of accuracy, and Liddy et al (op cit) reported six rules to do the same for "that". Armed with such rules, an extracting program can assimilate the sentences preceding those in which a referring pronoun occurs. Pronominal anaphora do not, with very rare exceptions, refer further back in a text than the preceding sentence.

However, the situation is much less straightforward when non-pronominal reference is taken into account. The BLAB project was constituted to study the extracting process in such a way that definite noun-phrase referring expressions could occur in extracts without undermining coherence. An alternative method of extracting was developed which took logical aspects of discourse structure as its theoretical basis. Sentences containing referring expressions cannot be interpreted independently; another way of saying this is that they are not propositional. In the same spirit as the earlier projects, BLAB did not seek to resolve the referring expressions, but to discriminate propositional from non-propositional sentences. Details of this approach are given in Paice et al (1993) and Johnson et al (1993). A relatively small set of rules proved effective in discriminating between referring and non-referring uses of "the"; in implementation, these rules were reduced to six in total. This produced an alternative to the selection-based method used in the previous work, based instead on the elimination of sentences that would, taken on their own, render the extract incoherent. The resulting extracts were much longer than those produced under the selection methods, containing around 20% of the original.
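The flavour of such a rulebase can be conveyed by a small sketch for "it". The patterns below are invented for illustration and are far cruder than the published rules of Paice and Husk (1987); they classify an occurrence as non-referring (pleonastic) when it matches a known extrapositional or weather-verb frame.

```python
import re

# Illustrative patterns only, not the Paice and Husk (1987) rulebase.
NON_REFERRING_PATTERNS = [
    r"\bit\s+(is|was|seems|appears|follows)\s+\w+\s+(that|to)\b",
    r"\bit\s+(is|was)\s+(possible|necessary|clear|likely|important)\b",
    r"\bit\s+(seems|appears)\b",
]

def it_is_referring(sentence):
    """Return True/False for the use of 'it', or None if 'it' is absent."""
    s = sentence.lower()
    if "it" not in re.findall(r"[a-z]+", s):
        return None  # nothing to classify
    return not any(re.search(p, s) for p in NON_REFERRING_PATTERNS)
```

An extracting program would consult such a classifier and, on a True result, assimilate the preceding sentence into the extract.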
The elimination method provided no basis for tailoring the length of the extract (unlike the indicator-phrase selection method, which weighted and ranked sentences by their content-indicativeness). In this work, evaluation methods were developed for comparing content-indicativeness with expert-tagged extracts, and for evaluating coherence. Descriptions of these methods and the results are given in Paice et al (1993) and in a paper in preparation. Another outcome was that the extracting and coherence-preserving rules were implemented within a modular architecture and could be interfaced to different preprocessing and surface analysis components, allowing the components to be evaluated separately.

2.1 Linguistic resources for abstracting

At the outset of the project, we did not know what the solution to the problem posed by definite referring expressions would be, and hence what input data the rules would work on. It was suspected, however, that a more sophisticated linguistic analysis might be needed than had sufficed for pronominal anaphora. In any case, a new team would work on this problem, and the previous specially-developed rule language, "GARP", was not thought easily maintainable. A surface linguistic analyser was therefore developed to provide linguistic descriptions on which discourse-level rules could operate.

The syntactic analyser for such a system must be first and foremost robust: it should produce some result on any input. This seems to force some basic choices. For example, bottom-up processing is to be preferred, and it is also a requirement that the analyser can abduce the syntactic description of much of the vocabulary from the local syntactic context. Another requirement is that it should not produce too many analyses; one is ideal. A syntactic level of description at least avoids generating some quantifier scope ambiguities, but there remain many potential structural ambiguities that one might take into account.
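The requirement that the analyser abduce categories for out-of-lexicon words from local context can be sketched as follows. The toy lexicon, the suffix cues, and the context rules are all assumptions for illustration; a real analyser would abduce full syntactic descriptions, not just part-of-speech tags.

```python
import re

# Toy lexicon; a robust analyser must expect most words to be absent from it.
LEXICON = {"the": "DET", "a": "DET", "is": "VERB", "robust": "ADJ",
           "analyser": "NOUN", "parses": "VERB"}

def guess_category(word, prev_tag):
    """Abduce a plausible category for an unknown word from simple cues."""
    if word in LEXICON:
        return LEXICON[word]
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if prev_tag in ("DET", "ADJ"):
        return "NOUN"  # determiners and adjectives typically precede nouns
    return "NOUN"      # default: open-class vocabulary is mostly nominal

def tag(sentence):
    tags, prev = [], None
    for w in re.findall(r"[a-z]+", sentence.lower()):
        t = guess_category(w, prev)
        tags.append((w, t))
        prev = t
    return tags
```

Even an unknown word like "zorple" receives a usable category here, so the parse can proceed rather than fail, which is the robustness property the text demands.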
The solution taken in BLAB to the problem of prepositional phrase attachment was not to attempt to attach prepositional phrases to an antecedent during parsing at all; sub-second "parsing" is then possible for quite long sentences. A second set of attachment heuristics operates after the initial parse to deal with such problems (1).

There is certainly a limit to the potential of purely structural approaches to abstracting, as exemplified by BLAB, but the problems of discourse structure that this work raised are also relevant to approaches depending more on subject knowledge. However, the linguistic resources needed for a more truly knowledge-based approach to abstracting would almost certainly have to permit lexical semantic analysis as well as structural analysis. This is also true of the new application in which we are working on text processing, namely categorisation.

3 Categorisation

Text categorisation is more like indexing for current awareness than abstracting. The task is to assign texts from a source such as a newswire to categories related to the job functions or interests of the members of the consuming organisation. Nevertheless, many characteristics of the application are the same: robustness of analysis and speed are probably as important as richness of analysis, at least for the time being. Given a linguistic analysis and the semantic processing needed for categorisation, it should also be possible to provide a text summarisation service by analysis and generation. Nonetheless, one clear difference between abstracting as done by BLAB and categorisation as being done by COBALT is that in the latter case, semantic processing is of the essence.

The COBALT project is based on the adaptation and integration of components from previous text-processing projects. For surface syntactic analysis, which we describe below, the basis can be traced from the BLAB analyser through the linguistic resources developed for generation in a dialogue project, PLUS.
For the semantic analysis, the antecedent is the NOMOS project, whose objective was knowledge-base construction via text analysis. In NOMOS, semantic analysis involved a series of processes driven by heuristic rules, operating on syntactic trees to perform conceptual disambiguation, collation of analysis fragments, resolution of attachment ambiguities, and so on. It was, however, directed at a quite different text genre, legislative texts, from the newswire data of COBALT, and this imposes at the very least changes to the content of the semantic heuristics.

3.1 Linguistic Resources for Categorisation

It is possible to approach categorisation in such a way that linguistic analysis is hardly required, or at least need not be based on conventional syntactic analysis. The CONSTRUE application developed by Reuters and Carnegie Group uses a facilitating software shell known as TCS. This embodies a pattern-matching language in which parts of the syntactic context are represented by gaps, specified only by length, between pairs of words assumed to stand in a semantic relation. This is very much the approach taken earlier to extracting by pattern-matching (Paice, 1981), and it also shares some characteristics of the approach to NLP described as "semantic grammar".

It is our contention that such an approach has several defects, despite an initially impressive performance on unseen texts. Although an "amateur linguist" can develop a TCS rule-base, the non-linguistic approach fails to capture generalisations; is bound to produce lower precision than equivalent semantic discrimination rules operating on analysed text; provides little in the way of knowledge engineering methodological support; requires more to be redone in porting the generic application to new concrete cases; and is incapable of sustaining an evolution of the application requirements beyond simple categorisation.
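The gap-based pattern idea can be made concrete with a small sketch. The rule format and the example rules below are invented for illustration; they are not the TCS language itself, but they show how a word pair with a bounded gap stands in for syntactic context.

```python
import re

def gap_match(text, left, right, gap):
    """True if `left` occurs at most `gap` tokens before `right`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lefts = [i for i, t in enumerate(tokens) if t == left]
    rights = [j for j, t in enumerate(tokens) if t == right]
    return any(0 < j - i <= gap + 1 for i in lefts for j in rights)

# Invented example rules: (left word, right word, max gap, category).
RULES = [("acquire", "company", 4, "MERGERS"),
         ("interest", "rate", 2, "MONEY-MARKETS")]

def categorise(text):
    return {cat for l, r, g, cat in RULES if gap_match(text, l, r, g)}
```

The defects argued above are visible even here: the rule fires on "acquire ... company" regardless of which noun phrase is subject and which is object, a distinction that rules operating on analysed text could make.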
Of course, these contentions remain just that at present, since this is ongoing work, so the remainder of this abstract describes the approach being taken to linguistic processing for categorisation and the rationale for various choices that have been made.

3.2 The COBALT Linguistic Analysis Module

Like the BLAB analyser, this is bottom-up. Like its antecedent in PLUS, it produces a quasi-logical form (2) as output. This seems to us a reasonable level of initial description for genuine lexical semantic analysis, for several reasons: the interface with the semantic component is easier to specify independently of the linguistic resources used than it would be for a syntactic tree, whose topography depends on the linguistic rules; it is a better alternative than a fully-scoped logical form, since it is easier to minimise the number of competing analyses at this level; and it may never be important for this application to resolve quantifier scopes.

The linguistic analysis uses a unification-based categorial grammar (3), augmented by function composition and type-raising rules, and supported by derivational equivalence-based methods for eliminating spurious ambiguity, as described by Barry (1988) and Hepple and Morrill (1989). Within this framework, it is easy to experiment with different approaches to such important choices as whether prepositional phrases subcategorise for their attachments or whether heads subcategorise for their optional modifiers. This is very much current work and will be elaborated on in the presentation.

The linguistic resources are being developed with the aid of a corpus of newswire texts that has been made available to us, which is also being used for the development of semantic rules and application-based categorisation rules. Special effort is being directed to the analysis of proper names and other "sublanguage" features. The design of the analyser intentionally assumes an incomplete lexicon.
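The core of any categorial grammar is function application, which a minimal sketch can show (the composition and type-raising rules mentioned above, and the unification machinery, are omitted). The string encoding of categories, the tiny lexicon, and the naive bracket handling are assumptions adequate only for this demonstration, not COBALT's implementation.

```python
def _strip(cat):
    """Remove one layer of outer brackets, e.g. '(S\\NP)' -> 'S\\NP'.
    Naive: only adequate for the simple categories in this demo."""
    return cat[1:-1] if cat.startswith("(") and cat.endswith(")") else cat

def forward_apply(fn, arg):
    """X/Y followed by Y yields X (forward application)."""
    fn = _strip(fn)
    if "/" in fn:
        result, wanted = fn.rsplit("/", 1)
        if wanted == arg:
            return result
    return None

def backward_apply(arg, fn):
    """Y followed by X\\Y yields X (backward application)."""
    fn = _strip(fn)
    if "\\" in fn:
        result, wanted = fn.rsplit("\\", 1)
        if wanted == arg:
            return result
    return None

# Invented demo lexicon: a transitive verb seeks an NP to its right,
# then an NP to its left, yielding S.
lexicon = {"john": "NP", "mary": "NP", "loves": "(S\\NP)/NP"}

def parse(words):
    """Greedy left-to-right reduction; sufficient for this tiny demo."""
    cats = [lexicon[w] for w in words]
    changed = True
    while changed and len(cats) > 1:
        changed = False
        for i in range(len(cats) - 1):
            r = (forward_apply(cats[i], cats[i + 1])
                 or backward_apply(cats[i], cats[i + 1]))
            if r is not None:
                cats[i:i + 2] = [r]
                changed = True
                break
    return cats
```

The derivational-equivalence methods cited above matter precisely because adding composition and type-raising to such a system multiplies the derivations of a single reading; Barry (1988) and Hepple and Morrill (1989) address eliminating those spurious duplicates.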
Footnotes

(*) In automatic abstracting research, we have been collaborating with Paice at Lancaster on refining extraction-based approaches by taking more account of linguistic discourse structure. This has been done in the context of a British Library-funded project, BLAB. In categorisation, we are working with two AI-oriented companies, Quinary SpA of Milan and Step Informatique of Paris, in a CEC-funded project, COBALT, within the Linguistic Research and Engineering programme. The author gratefully acknowledges the financial support of the two grant-awarding bodies and Brother International PLC.

(1) To the extent that it is necessary to deal with such ambiguities, since the referring/non-referring discrimination does not often depend on postmodifiers except for "of".

(2) "Quasi-logical form" is intentionally indefinite here.

(3) Again, the indefiniteness is intentional.

References

Barry, G.D. (1988) Parsing Strategies for Categorial Grammars. MSc Dissertation, University of Manchester.

Black, W.J. and Johnson, F.C. (1988) A Practical Evaluation of Two Rule-Based Automatic Abstracting Techniques. Expert Systems for Information Management 1(3), 159-177.

Earl, L.L. (1970) Experiments in Automatic Abstracting and Indexing. Information Storage and Retrieval 6(4), 313-334.

Edmundson, H.P. (1969) New Methods in Automatic Extracting. JACM 16(2), 264-285.

Hepple, M. and Morrill, G. (1989) Parsing and Derivational Equivalence. Proc. 4th EACL, Manchester, 10-18.

Johnson, F.C., Paice, C.D., Black, W.J. and Neal, A.P. (1993) The application of linguistic processing to automatic abstract generation. Journal of Document and Text Management 1(3), to appear.

Liddy, E.D. et al. (1987) (reference is missing here).

Luhn, H.P. (1958) The Automatic Creation of Literature Abstracts. IBM J. R & D 2(2), 156-165.

Paice, C.D. (1981) The Automatic Generation of Literature Abstracts: An Approach Based on the Identification of Self-Indicating Phrases. In: R.N. Oddy et al. (eds.), Information Retrieval Research, London: Butterworths, 172-191.

Paice, C.D. and Husk, G.D. (1987) Towards the automatic recognition of anaphoric features in English text: The impersonal pronoun "it". Computer Speech and Language, 2, 109-132.

Paice, C.D., Black, W.J., Johnson, F.C. and Neal, A.P. (1993) The construction of literature abstracts by computer. Final Report to the British Library R & D Division. University of Lancaster, Department of Computing.