The linguistic aspects of our work have been discussed in (Barker and Szpakowicz, 1995; Copeck, et al., 1992; Delisle, et al., 1996; Delisle, et al., 1993; Delisle and Szpakowicz, 1995). Word-sense disambiguation is presented in (Feng, et al., 1994; Li, et al., 1995). Work on the application of advanced ML techniques to the refinement of a symbolic representation acquired from a technical text, including Horn clause representation for linguistic structures, appears in (Delannoy, et al., 1993; Delannoy and Rios, 1994; Delisle, et al., 1994; Matwin and Szpakowicz, 1993a; Matwin and Szpakowicz, 1993b).
We have developed several tools for natural language processing (NLP), including the DIPETT parser, the HAIKU semantic analyzer, and utilities to access public-domain lexical sources. These tools will serve as starting points for our work on the project proposed here, and will be developed into new tools that we describe in Section 4. In the proposed approach to text processing, linguistic clues of the kind we have investigated in detail will be combined with frequency analyses and with an extensive use of available lexical and terminological repositories. We envisage quick application of our past and proposed work in practice. This expectation is supported by the rapidly growing need for information filtering on the Internet, and by considerable commercial interest in our research plans.
We propose to implement a system that generates summaries of English technical texts. The system will automatically produce a shorter version of the text, which retains the main points of the original. We will not employ knowledge-intensive NLP techniques such as discourse analysis. We will focus instead on surface NLP combined with ML techniques to produce an efficient and robust system. Summary generation will proceed in three steps: (i) identify keywords in the summarized text, using machine-learned keyword identification rules--see Section 4.2.2; (ii) select salient sentences, based on occurrences of keywords; (iii) produce rough summaries from sentences returned by (ii).
Most of the existing summarization techniques do not attempt even superficial parsing, and they are non-adaptive: they cannot learn from examples of adequately chosen keywords and good summaries. We believe that the task of summarization will greatly benefit from the use of ML and linguistic techniques. Our short-term objective is to build a stand-alone system that extracts the central concepts from a text. Extraction will be based on surface linguistic properties and on simple statistics, with the emphasis on the former. We will take advantage of public-domain lexical resources, lexical and surface-syntactic information present in the text, and frequency analysis. We have designed a detailed approach and validated it, partially manually and partially using our existing tools; the results are encouraging (Barker, et al., 1996).
Interest in text summarization extends outside the research community to the broad sector of the economy concerned with on-line communication in volume (Economist, 1994). The system will be used to good advantage as part of a World Wide Web search engine. We have looked at a number of Web search engines; our conclusions are summarized in Section 3.2. A Web search usually shows only a few initial sentences from a retrieved document, or a few in which the exact search key appears (along with a link to a complete document). The user has to browse the selected documents, and read or at least skim them all. This may involve a mass of text. A system like ours gives the user an option that will make this task easier: ask for the retrieved text to be summarized if it is longer than some threshold. Integration of our system with a search engine will require preparation more complex than just skipping graphics and sound, because pure text seems quite infrequent. Since naive separation of text from markup ignores useful clues on text structure, we will try to preserve as many of these clues as possible.
At the end of the project, we will attempt to commercialize the prototype system that arises from the proposed work, in association with one of the companies that support this research.
Certain systems that participated in the competitions associated with several Message Understanding Conferences also belong in this category, though their overt purpose is linguistically motivated information retrieval into templates rather than summarization. Such systems usually create a symbolic representation of the contents of the text, which is then used in the inference required for question-answering, translation, summarization, and so on. Inference makes use not only of the representation of the text, but also of domain knowledge in the form of scripts, rules and other similar knowledge structures, which are specialized and expensive to build.
Brittleness is the main drawback of systems employing deep language processing. They may fail when confronted with a text that reaches even slightly beyond their linguistic or domain knowledge. This approach to language processing is also labour-intensive and computationally costly, and knowledge acquisition for such systems cannot be automated until a non-trivial critical mass of knowledge has been accumulated.
The other approach to text summarization avoids constructing a representation of the knowledge contained in a text. Instead, it selects and modifies elements of the input text. It can fairly be labelled shallow text processing. It employs corpus-based, statistical techniques, surface linguistic analysis, and the use of large, public domain linguistic resources such as on-line text corpora and machine-readable lexicons. (Adams and Neufeld, 1993) discuss a robust tagger based on the statistical approach.
The use of heuristics derived from many texts in many genres (for example, based on occurrence, co-occurrence and exclusion of noun-phrases in a given text) makes systems of this second class robust despite their lack of profound semantic insight. Work in information retrieval, text skimming and the generation of paraphrases successfully employs shallow text processing (Hammond, et al., 1995; Lewis and Sparck Jones, 1996).
We propose a text summarization system based on shallow text processing. Once the system has been implemented and tested, we will address the limitations of the shallow approach by judiciously augmenting it with the techniques of text processing which we have mastered over many years of continuous research in this and related areas.
Summarization is relevant to all five types. While not entirely accurate and by definition not comprehensive, it serves to filter a text, indicating what it is about. A good summary will tell a reader whether he or she wants to read the whole document. A wide variety of texts can benefit from summarization, including newspapers and journals, press releases, scientific reports and organizational memos.
Summarization differs significantly from the keyword-based search used in information and document retrieval systems. Given a query composed of keywords, such systems return complete documents or perhaps canned abstracts of complete documents. A new document must be indexed by a set of keywords to become accessible. Summarization produces a custom-made summary "on the fly" for every new text.
Summarization would also be helpful when a query composed of a relaxed set of keywords returns a large number of hits (that is, retrieved documents). Summarization would compress the volume of text to a manageable size, making it easier to filter out irrelevant or weakly relevant documents.
The following very brief literature review leaves out numerous relevant but less recent publications.
(Bookman, 1994) presents a system based on a "structured connectionist model" of memory. This model interacts with a working memory model built from an input text to identify the "basic conceptual roots" of a text. These are in turn used by a natural language generator to produce a summary. The model handles several difficult issues, such as computing the closeness between two concepts.
The task of template filling, as done by the systems presented at Message Understanding Conferences (MUC4, 1992; MUC5, 1994), is knowledge-intensive: much knowledge must be encoded, usually manually. The SHOGUN system (Jacobs, et al., 1993) relies on domain-specific lexical knowledge to annotate input text with type and role information prior to semantic interpretation and discourse processing. The Proteus System (Grishman and Sterling, 1993) makes use of a domain-specific concept hierarchy as well as "lexico-semantic models" (generalized multi-word textual patterns) during semantic analysis. (Lehnert, et al., 1993) describes a text processing system that uses concept node definitions, semantic features, and other knowledge (much of which is obtained automatically or semi-automatically). The system employs both corpus-based techniques and ML techniques. (Cowie and Lehnert, 1996) makes a case for the use of shallow knowledge in information extraction.
Summarizing large texts (or a large number of short texts) can hardly depend on such painstaking preparation. Domain-related preparation as we see it should be limited to very basic parametrization--at most, selecting a thesaurus.
At the other end of the spectrum, knowledge-scant approaches to summary generation rely on such criteria as (Paice and Jones, 1993):
* frequency and distribution of words in the text,
* sentence position (in paragraphs, or relative to a text structure, expressed for example in a markup language),
* presence of keywords,
* presence of positive surface cues, or absence of negative cues,
* presence of surface discourse indicators.
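As a rough illustration of how surface criteria of this kind might be combined, the following Python sketch scores each sentence by the average frequency of its content words, its position, the keywords it contains, and the surface cues it carries. The cue lists, the stopword list, the weights and the function names are all assumptions made for the example; they are not taken from (Paice and Jones, 1993) or from any existing system.

```python
import re
from collections import Counter

# Hypothetical cue lists and stopwords, chosen only for illustration.
POSITIVE_CUES = ("in conclusion", "we propose", "the main")
NEGATIVE_CUES = ("for example", "incidentally")
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "we", "for", "that", "it"}


def tokenize(text):
    """Lower-case the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences, keywords=()):
    """Return one score per sentence, computed from simple surface criteria."""
    freq = Counter(w for s in sentences for w in tokenize(s) if w not in STOPWORDS)
    keywords = {k.lower() for k in keywords}
    scores = []
    for i, sentence in enumerate(sentences):
        words = [w for w in tokenize(sentence) if w not in STOPWORDS]
        # Criterion: average text-wide frequency of the sentence's content words.
        score = sum(freq[w] for w in words) / len(words) if words else 0.0
        # Criterion: position (the first and last sentences get a bonus).
        if i == 0 or i == len(sentences) - 1:
            score += 1.0
        # Criterion: presence of keywords (assumed weight of 2 per distinct keyword).
        score += 2.0 * sum(1 for w in set(words) if w in keywords)
        # Criterion: positive cues raise the score, negative cues lower it.
        lowered = sentence.lower()
        score += sum(1.0 for cue in POSITIVE_CUES if cue in lowered)
        score -= sum(1.0 for cue in NEGATIVE_CUES if cue in lowered)
        scores.append(score)
    return scores
```

In practice the weights would have to be tuned, or learned from examples, rather than fixed by hand as they are here.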
Statistical techniques in information retrieval are based on computing differential rates of occurrence and rates of adjacency of words. This allows detection of domain terms and possible keywords for a given text (Salton and McGill, 1983). Once function words (articles, conjunctions and so on) have been eliminated, the method assumes that relevant words occur in the text more often than a previously computed table of frequencies would predict.
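The differential-frequency idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of (Salton and McGill, 1983): the function-word list, the `min_ratio` threshold, the `default_ref` value used for words missing from the table, and the function name are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of function words.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "are"}


def candidate_keywords(text, reference_freq, min_ratio=2.0, default_ref=1e-6):
    """Return words whose relative frequency in `text` exceeds the reference rate.

    `reference_freq` maps a word to its relative frequency in a background
    corpus (the "previously computed table of frequencies"). Words absent
    from the table get the assumed rate `default_ref`.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in FUNCTION_WORDS]
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    ratios = {}
    for word, count in counts.items():
        ratio = (count / total) / reference_freq.get(word, default_ref)
        if ratio >= min_ratio:
            ratios[word] = ratio
    # Highest ratio first; ties are broken alphabetically.
    return sorted(ratios, key=lambda w: (-ratios[w], w))
```

Because unknown words receive a very low reference rate, rare or misspelled words will rank highly; a real system would need to smooth the table or filter such words.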
NetSumm (http://www.labs.bt.com/innovate/informat/netsumm/index.htm), an experimental system from British Telecom Laboratories, proposes online summarization on the Web, with user-adjustable compression rate, and can display a highlighted text or selected passages only. The latter exemplifies a shortcoming of selection methods: dangling pronoun references, when the sentence containing the referent has not been selected. Selection sometimes seems arbitrary; details of the heuristics, if any, are not indicated by the developer. Our tests on a collection of short texts also show that a summary too often consists only of the title and the first sentence.
We conclude that a system based on partial surface-syntactic analysis, dictionary lookup, simple keyword identification methods and simple word disambiguation techniques can bring a substantial quality improvement to summarization without an unacceptably high processing cost.
More recent work in information retrieval and document summarization has tried to incorporate linguistic knowledge: syntax, relation, and discourse. (Jacquemin and Royauté, 1994) presents a coupling of syntax-based term extraction with more traditional information retrieval methods.
(Miike, et al., 1994) summarizes Japanese text based on partial syntactic analysis and on the presence of discourse markers--such as the equivalents of thus, previously, besides--within paragraphs and for the first sentence of each paragraph. Semantic roles such as topic, purpose, background, feature, or conclusion are thus assigned sentence by sentence. The system also tracks structural links between sentences in serial constructions marked by "the feature ... as follows" or "first ... second". Criteria relying on a combination of user-defined keywords, discourse and textual structure are used to select the most relevant passages.
Despite the concern for text cohesion (what makes the input text coherent, and what it takes to deliver a coherent output), the question of reference across sentences is one of the least explained in the literature. It is sometimes unclear whether references would be resolved at all. A case in point is the article (Economist, 1994). It is illustrated by a summary that, upon examination, appears to consist of half of the sentences with no external reference (themselves just 30% of the total number of sentences), and so craftily avoids the problem of reference resolution.
Most existing techniques for dealing with pronouns rely on a large amount of precoded semantic knowledge, making them inappropriate in the context of this proposal. In the first phase of the project, we will adapt our existing linguistic analysis tools to find candidate referents for pronouns. A pronoun explicitization tool described in (Delisle, 1994) needs only minimal surface-syntactic information and no semantic knowledge to suggest potential referents. These referents--perhaps all of them--would be inserted and marked as tentative (for example, Smith? proposed the idea at the January meeting.). In the second phase, we will seek a more comprehensive treatment: apply linguistically motivated heuristics and try machine learning on data gathered during the first phase.
It would be unrealistic to tackle even one, let alone several, difficult NLP problems at once. We are proposing instead a goal-oriented integration of circumscribed, existing NLP techniques.
1. Linguistically preprocess the text to be summarized.
2. Identify the keywords using rules described in Section 4.2.2.
3. Select sentences with a heuristically significant occurrence of keywords.
4. Perform simple postprocessing to turn a sequence of sentences into a rough summary.
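The four steps above can be read as a simple pipeline. The Python sketch below shows only that control flow; every stage is a deliberately naive stand-in (sentence splitting on punctuation for preprocessing, raw word frequency in place of learned keyword rules, plain concatenation for postprocessing), and all function names and parameters are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list for the stand-in keyword step.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "from"}


def preprocess(text):
    """Step 1 (stand-in): split the text into sentences at end punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def identify_keywords(sentences, top_n=3):
    """Step 2 (stand-in): take the most frequent content words as keywords."""
    counts = Counter(
        w for s in sentences for w in re.findall(r"[a-z]+", s.lower())
        if w not in STOPWORDS
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


def select_sentences(sentences, keywords, min_hits=1):
    """Step 3: keep sentences that mention at least `min_hits` distinct keywords."""
    selected = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if sum(1 for k in keywords if k in words) >= min_hits:
            selected.append(sentence)
    return selected


def postprocess(selected):
    """Step 4 (stand-in): join the selected sentences in their original order."""
    return " ".join(selected)


def summarize(text, top_n=3, min_hits=1):
    sentences = preprocess(text)
    keywords = identify_keywords(sentences, top_n)
    return postprocess(select_sentences(sentences, keywords, min_hits))
```

Keeping the stages as separate functions means that any one of them could later be replaced, for example by a parser-based preprocessor or by learned keyword rules, without touching the others.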
There is a pragmatic, alternative way of identifying concept names. If the domain of the text is known, such names can be fetched from an existing terminological database for this domain. This solution is particularly applicable in Canada. Due to the requirements of bilingualism, many organizations already maintain on-line dictionaries of the names of objects and activities in their domain. We intend to tap those of these resources that are publicly available, such as WordNet or the TERMIUM database (TERMIUM, 1993). Since it too tends to be slow, we would only turn to it when other sources of information fail.
Two methods for keyword identification are considered at present. Keywords may be identified by considering the frequency of candidate noun groups in a large corpus. We will experiment with Cornell University's publicly available SMART system to establish its suitability for our project.
Alternatively, if a collection of texts annotated with keywords exists in a given domain, keywords might be determined using the new approach developed by P. Turney at NRC [personal communication; in progress]. His method gives a large number of candidate rules to an inductive learning system. Examples of such rules that may help identify keywords are "select noun-phrases that occur frequently in the first paragraph" or "select phrases that occur in titles of sections". Next, the system is trained on a body of summarized texts, and only the rules that perform well are retained. The quality of performance is defined as the ratio of the number of keywords proposed by the system to the number of keywords with which the text was originally annotated manually. To be fair, this comparison will focus only on keywords that actually occur in the text, and not on "synthetic" keywords assigned by human readers based on their deep understanding of the domain of the text. This approach has the advantage of adaptability to different kinds of texts, developed with different writing rules and styles.
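The retain-what-performs-well step can be illustrated with a small Python sketch. It is not the method under development at NRC, whose details are not given here: the two candidate rules, the document layout (a dictionary with `titles`, `paragraphs` and `keywords`), the threshold and all function names are assumptions. The quality measure used is the share of manually assigned keywords, restricted to those that actually occur in the text, that a rule proposes; this is one possible reading of the ratio described above.

```python
def rule_first_paragraph(doc):
    """Hypothetical rule: words that occur at least twice in the first paragraph."""
    words = doc["paragraphs"][0].lower().split()
    return {w for w in words if words.count(w) >= 2}


def rule_section_titles(doc):
    """Hypothetical rule: words that occur in section titles."""
    return {w for title in doc["titles"] for w in title.lower().split()}


def rule_quality(rule, documents):
    """Share of in-text annotated keywords that the rule proposes."""
    found = expected = 0
    for doc in documents:
        text_words = set(" ".join(doc["paragraphs"] + doc["titles"]).lower().split())
        # Ignore "synthetic" keywords that never occur in the text itself.
        gold = {k for k in doc["keywords"] if k in text_words}
        expected += len(gold)
        found += len(gold & rule(doc))
    return found / expected if expected else 0.0


def select_rules(rules, documents, threshold=0.5):
    """Keep only the rules whose quality on the training texts meets the threshold."""
    return [rule for rule in rules if rule_quality(rule, documents) >= threshold]
```

The sketch handles single-word, lower-case keywords only, and it evaluates each rule in isolation; an inductive learner could also retain combinations of rules.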
Salient sentences will be determined by first identifying the activities and objects relevant to the subject matter communicated by the text. These will be essentially keywords, determined in the previous phase. The sentences that convey the most information about the subject matter will then be selected. Selection will be based on a relevancy measure, obtained by counting referential links between the salient activities and objects. Several such measures have been proposed in the literature (Aretoulaki, 1994; Bookman, 1994; Ono, et al., 1994). Those measures of referential links are at least partly based on surface linguistic clues. Rules for ranking relevancy could themselves be learned as the summarization system is developed. A number of ranked examples could be provided, and the ranking rules could be acquired inductively by one of the several learning algorithms at our disposal.
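One very crude way to approximate such a measure is to treat a keyword shared by two sentences as a link between them, and to score each sentence by the number of links it has. The Python sketch below does this; it is an assumption-laden illustration rather than any of the cited measures, and the function names and the tie-breaking rule are hypothetical.

```python
import re


def keyword_links(sentences, keywords):
    """For each sentence, count the keywords it shares with every other sentence."""
    keywords = {k.lower() for k in keywords}
    mentions = [keywords & set(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    scores = []
    for i, own in enumerate(mentions):
        links = sum(len(own & other) for j, other in enumerate(mentions) if j != i)
        scores.append(links)
    return scores


def most_relevant(sentences, keywords, n=2):
    """Return the n best-linked sentences, kept in their original order."""
    scores = keyword_links(sentences, keywords)
    # Ties are broken in favour of earlier sentences.
    best = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:n]
    return [sentences[i] for i in sorted(best)]
```

Exact word matching misses links expressed through pronouns, synonyms or inflected forms, which is where surface linguistic analysis would have to come in.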
Relevancy also depends on the discourse structure. (Miike, et al., 1994) shows how it is possible to consider the role of text passages: exposition, justification, illustration, development or conclusion. Many of their heuristics are based on simple linguistic clues which do not require deep NLP.
(Sumita, et al., 1993) presents an interesting approach to the extraction of rhetorical structure (Mann and Thompson, 1988) from surface-linguistic cues. This structure groups sentences into a tree of relations. For instance, one can distinguish the structure of examples, parallel argument (Firstly, ... Secondly, ... Then ...), specialization (This is particularly true of...), and so on. The rhetorical structure acquired in this manner can then be compressed by removing entire sentences belonging to a certain rhetorical category, such as specialization or example. We want to use this approach to arrange the selected sentences into a rough summary.
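The compression step can be sketched as follows in Python. For brevity the sketch assigns each sentence a flat category from a sentence-initial cue phrase instead of building a tree of relations, so it is far simpler than the approach of (Sumita, et al., 1993); the cue lists, the category names and the function names are assumptions.

```python
# Hypothetical cue phrases, checked at the start of each sentence.
CUES = {
    "example": ("for example", "for instance"),
    "specialization": ("this is particularly true of", "in particular"),
    "parallel": ("firstly", "secondly", "then"),
}


def rhetorical_category(sentence):
    """Return the category of the first matching cue, or "main" if none matches."""
    lowered = sentence.lower()
    for category, cues in CUES.items():
        if any(lowered.startswith(cue) for cue in cues):
            return category
    return "main"


def compress(sentences, drop=("example", "specialization")):
    """Remove every sentence that belongs to one of the dropped categories."""
    return [s for s in sentences if rhetorical_category(s) not in drop]
```

Varying the `drop` tuple gives a simple handle on how aggressively the text is shortened.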
Our system would stop here. Now a user with a good command of the written language, but not necessarily a domain expert, could polish the resulting summary linguistically, should this post-editing step be required.
In effect, we will measure four characteristics: recall, precision, brevity, ease. It should be noted that brevity will be a function of parameter-setting, in a manner that is possible even in some of today's summarization systems. We will, then, actually measure the level at which our system meets the goal of producing a summary at a requested percentage of the original text. The four characteristics will also be weighted and combined into a single quality measure. At the end of the project we will compare our system with selected summarization systems available at that time.
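A minimal sketch of how such a combined measure might be computed is given below in Python. Recall and precision are computed over sets of selected sentence identifiers against a reference selection; brevity is scored by how closely the summary length matches the requested fraction of the original; ease is taken as an externally supplied score between 0 and 1, since its definition is not fixed here. The linear brevity penalty, the equal default weights and the function names are all assumptions.

```python
def recall_precision(selected, reference):
    """Recall and precision of the selected items against a reference selection."""
    selected, reference = set(selected), set(reference)
    hits = len(selected & reference)
    recall = hits / len(reference) if reference else 0.0
    precision = hits / len(selected) if selected else 0.0
    return recall, precision


def brevity(summary_len, original_len, requested_ratio):
    """1.0 when the summary has the requested relative length, falling linearly to 0."""
    actual = summary_len / original_len
    return max(0.0, 1.0 - abs(actual - requested_ratio) / requested_ratio)


def quality(recall, precision, brevity_score, ease, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted average of the four characteristics (weights are assumed)."""
    parts = (recall, precision, brevity_score, ease)
    return sum(w * p for w, p in zip(weights, parts)) / sum(weights)
```

For instance, a summary that recovers half of the reference sentences with a precision of 0.5, meets the requested length exactly and receives a full ease score would obtain a combined quality of 0.75 under equal weights.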
Aretoulaki, M., "Towards a Hybrid Abstract Generation System," New Methods of Language Processing Conference, UMIST (Manchester, UK), 1994, 220-227.
Barker, K., J.-F. Delannoy, S. Matwin and S. Szpakowicz, "Preliminary validation of a text summarization algorithm," University of Ottawa, Department of Computer Science, TR-96-04, 1996.
Barker, K. and S. Szpakowicz, "Interactive Semantic Analysis of Clause-Level Relationships," PACLING'95, Brisbane, 1995, 22-30.
Bookman, L. A., Trajectories Through Knowledge Space (A Dynamic Framework for Machine Comprehension), Kluwer, 1994.
Copeck, T., S. Delisle and S. Szpakowicz, "Parsing and Case Analysis in TANKA," 15th Intl Conf on Computational Linguistics COLING-92, Nantes, 1992, 1008-1012.
Cowie, J. and W. Lehnert, "Information Extraction," Comm. ACM, vol. 39, 1996, 80-91.
Delannoy, J. F., C. Feng, S. Matwin and S. Szpakowicz, "Knowledge Extraction from Text: Machine Learning for Text-to-rule Translation," European Conference on Machine Learning ECML-93, Workshop on Machine Learning Techniques and Text Analysis, Vienna, 1993, 1-7.
Delannoy, J. F. and R. Rios, "Translating A Detailed Linguistic Semantic Representation into Horn-clause logic," XI Brazilian Symposium on Artificial Intelligence (SBIA), Fortaleza, 1994, 257-267.
Delisle, S., "Text Processing without A-Priori Domain Knowledge: Semi-Automatic Linguistic Analysis for Incremental Knowledge Acquisition," Department of Computer Science, University of Ottawa, PhD Thesis, 1994.
Delisle, S., K. Barker, T. Copeck and S. Szpakowicz, "Interactive Semantic Analysis of Technical Texts: Case Pattern Acquisition," Computational Intelligence, vol. 12, 1996, (in print).
Delisle, S., K. Barker, J.-F. Delannoy, S. Matwin and S. Szpakowicz, "From Text to Horn Clauses: Combining Linguistic Analysis and Machine Learning," Tenth Canadian Conf on AI, CSCSI, Banff, 1994, 9-16.
Delisle, S., T. Copeck, S. Szpakowicz and S. Barker, "Pattern Matching for Case Analysis: A Computational Definition of Closeness," ICCI-93, 1993, 310-315.
Delisle, S. and S. Szpakowicz, "Realistic Parsing: Practical Solutions of Difficult Problems," PACLING'95, Brisbane, 1995, 59-68.
Economist, "Short Cuts," in The Economist, 1994, 85-86.
Feng, C., T. Copeck, S. Szpakowicz and S. Matwin, "Semantic Clustering. Acquisition of Partial Ontologies from Public Domain Lexical Sources," AAAI Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, 1994, 1-1--1-16.
Grishman, R. and J. Sterling, "Description of the Proteus System as Used for MUC-5," MUC-5, 1993, 181-194.
Hammond, K., R. Burke, C. Martin and S. Lytinen, "FAQ Finder: A Case-Based Approach to Knowledge Navigation," 11th Conf on AI for Applications, Los Angeles, 1995, 80-86.
Jacobs, P. S., G. Krupka, L. Rau, M. Mauldin, T. Mitamura, T. Kitani, I. Sider and L. Childs, "Description of the SHOGUN System Used for MUC-5," MUC-5, 1993, 109-120.
Jacobs, P. S. and L. F. Rau, "Innovations in Text Interpretation," Artificial Intelligence, vol. 63(1-2), 1993, 143-191.
Jacquemin, C. and J. Royauté, "Retrieving Terms and their Variants in a Lexicalised Unification-Based Framework," SIGIR94, 1994, 132-141.
Laurendeau, C., "Automated Acquisition of Technical Concepts from Unrestricted English Text Using Noun Phrase Classification," Department of Computer Science, University of Ottawa, Master's thesis, 1992.
Lehnert, W., J. McCarthy, S. Soderland, E. Riloff, C. Cardie, J. Peterson, F. Feng, C. Dolan and S. Goldman, "Description of the CIRCUS system Used for MUC-5," MUC-5, 1993, 277-291.
Lehnert, W. G., "Plot Units: A Narrative Summarization Strategy," in Strategies for Natural Language Processing, W. G. Lehnert and M. H. Ringle, (ed.). LEA, 1982, 375-412.
Lewis, D. D. and K. Sparck Jones, "Natural Language Processing for Information Retrieval," Comm. ACM, vol. 39, 1996, 92-101.
Li, X., S. Matwin and S. Szpakowicz, "A WordNet-based Algorithm for Word Sense Disambiguation," IJCAI-95, Montreal, 1995, 1368-1374.
Mann, W. C. and S. A. Thompson, "Rhetorical Structure Theory: Toward a Functional Theory of Text Organization," Text, vol. 8(3), 1988, 243-281.
Matwin, S. and S. Szpakowicz, "Machine Learning Techniques in Knowledge Acquisition from Text," THINK, vol. 1(2), 1993a, 37-50.
Matwin, S. and S. Szpakowicz, "Text Analysis: How Can Machine Learning Help?," First Conference of the Pacific Association for Computational Linguistics (PACLING), Vancouver, 1993b, 33-42.
Miike, S., E. Itoh, K. Ono and K. Sumita, "A Full-Text Retrieval System with a Dynamic Abstract Generation Function," SIGIR, 1994, 152-161.
Miller, G. A., "WordNet: An On-Line Lexical Database," International J of Lexicography, vol. 3(4), 1990, 235-312.
MUC4, Proceedings MUC-4, 4th Message Understanding Conference, Morgan Kaufmann, 1992.
MUC5, Proceedings MUC-5, 5th Message Understanding Conference, Morgan Kaufmann, 1994.
Ono, K., K. Sumita and S. Miike, "Abstract Generation Based on Rhetorical Structure Extraction," COLING-94, 1994, 1-5.
Paice, C. D. and P. A. Jones, "A 'Select and Generate' Approach to Automatic Abstracting," in Proc 14th Information Retrieval Colloquium, Lancaster 1992, Series 'Workshops in Computing', T. McEnery and C. Paice, (ed.). Springer-Verlag, 1993, 114-154.
Rino, L. and D. Scott, "Content Selection in Summary Generation," XI Brazilian Symposium on Artificial Intelligence (SBIA), Fortaleza, 1994, 411-423.
Salton, G., J. Allan, C. Buckley and A. Singhal, "Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts," Science, vol. 264, 1994, 1421-1426.
Salton, G. and M. J. McGill, Introduction to Modern Information Retrieval, McGraw Hill, 1983.
Sumita, K., K. Ono and S. Miike, "Document Structure Extraction for Interactive Document Retrieval Systems," ACM SIGDOC Newsletter, 1993, 301-306.
TERMIUM, "TERMIUM," Terminology and Linguistic Services Directorate, Translation Bureau, Department of the Secretary of State, 1993.