Subjects:

Abstracting

Discourse

Information Extraction

Information Retrieval

Machine Learning for Natural Language Processing

Presentation of Summaries

Psychology and Summarization

Summarization of Multimedia Documents

Summarization Systems

Summarization System Evaluation

Manual Abstracting

Abstract: Abstracting assistance features are being prototyped in the TEXNET text network management system. Sentence weighting methods available include: weighting negatively or positively on the stems in a selected passage; weighting on general lists of cue words; adjusting weights of selected segments; and weighting on occurrences of frequent stems. The user may adjust a number of parameters: the minimum length of extracts; the threshold for a ''frequent'' word/stem; and the amount a sentence weight is to be adjusted for each weighting type.
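Below is a minimal Python sketch of this kind of parameterised sentence weighting (cue-word bonus, frequent-stem bonus, adjustable thresholds); the function and parameter names and default values are hypothetical, not taken from TEXNET.

    from collections import Counter

    def weight_sentences(sentences, cue_words, freq_threshold=5,
                         cue_adjust=1.0, freq_adjust=0.5):
        """sentences: list of lists of stems; returns one weight per sentence."""
        stem_counts = Counter(stem for sent in sentences for stem in sent)
        frequent = {s for s, c in stem_counts.items() if c >= freq_threshold}
        weights = []
        for sent in sentences:
            w = cue_adjust * sum(1 for stem in sent if stem in cue_words)
            w += freq_adjust * sum(1 for stem in sent if stem in frequent)
            weights.append(w)
        return weights
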
Abstract: Experimental subjects wrote abstracts of an article using a simplified version of the TEXNET abstracting assistance software. In addition to the full text, the 35 subjects were presented with either keywords or phrases extracted automatically. The resulting abstracts, and the times taken, were recorded automatically; some additional information was gathered by oral questionnaire. Results showed considerable variation among subjects, but 37% found the keywords or phrases ''quite'' or ''very'' useful in writing their abstracts. Statistical analysis failed to support several hypothesized relations: phrases were not viewed as significantly more helpful than keywords; and abstracting experience did not correlate with originality of wording, approximation of the author abstract, or greater conciseness. Results also suggested possible modifications to the software.
Abstract: Four working steps taken from a comprehensive empirical model of expert abstracting are studied in order to prepare an explorative implementation of a simulation model. It aims at explaining the knowledge processing activities during professional summarizing. Following the case-based and holistic strategy of qualitative empirical research, we develop the main features of the simulation system by investigating in detail a small but central test case: four working steps where an expert abstractor discovers what the paper is about and drafts the topic sentence of the abstract. Following the KADS methodology of knowledge engineering, our discussion begins with the empirical model (a conceptual model in KADS terms) and aims at a computational model which is implementable without determining the concrete implementation tools (the design model according to KADS). The envisaged solution uses a blackboard system architecture with cooperating object-oriented agents representing cognitive strategies and a dynamic text representation which borrows its conceptual relations in particular from RST (Rhetorical Structure Theory). As a result of the discussion we feel that a small simulation model of professional summarizing is feasible.
This paper takes the view that an abstract is itself a text which is subject to general and specific conditions of text production. It is assumed that the goal, namely the forming of the abstract as a text, controls the whole process of abstracting. This goal-oriented view contrasts with most approaches in this domain, which are source-text oriented. Further, production strategies are described in terms of text structure building processes which are re-constructed with modelling methods from text linguistics and computational linguistics. This leads to a close relationship between the representation of the model and the resulting text. In this view, examples are given in which authentic abstract material is analysed according to the model. The model itself integrates three text levels (content, function, form) which are combined and represented in terms of the writer's activities.
Abstract: Documents of the journal ''Nachrichten für Dokumentation'' written over a twenty-year period (1969-1989) by 50 different authors have been used as the text corpus. The analysis of the abstracts revealed that only 15 out of 50 abstracts consist exclusively of ''standard'' abstract sentences and that no abstract satisfies all requirements of the abstracting guidelines. In this respect, the abstracting guidelines appear to be ''wishful thinking'', which supports the idea of machine-supported abstracting based on linguistic features. CONNY is an interactive linguistic abstracting model for technical texts offering the abstractor general abstracting guidelines operating on the surface structure. It condenses the parts of the source text assessed as abstract-relevant at the source text, sentence and abstract levels with regard to lexis, syntax and semantics.

Anaphora resolution

The authors propose an extension to DRT for pronominal anaphora and ellipsis resolution integrating focusing theory. They give rules, based on focusing theory, to keep track of the focus of attention along the text and bind pronouns preferentially to focused entities. The choice of antecedents is based on pragmatic constraints which impose an ordering of preferences among antecedent candidates. The scope of the framework covers sentences containing restrictive relative clauses and subject ellipsis.
The authors present an algorithm for anaphora resolution which is a modified and extended version of that developed by Lappin and Leass. The resolution process works from the output of a part-of-speech tagger, enriched only with annotations of the grammatical function of lexical items in the input stream. The algorithm constructs coreference classes, each associated with a salience value determined by the status of the members of the class with respect to contextual, grammatical and syntactic constraints.
Given the less than fully robust status of syntactic parsers, the authors outline an algorithm for anaphora resolution applicable to open-ended text types, styles and genres. The algorithm compensates for the shallower level of analysis with mechanisms for identifying different text forms for each discourse referent, and assigns a salience measure to each discourse referent over the text.

Anchor/link identification

The authors discuss the problem of identifying relations between bridging descriptions and their antecedents. Using WordNet, the authors propose methods to identify anchors and links in a collection of Wall Street Journal articles according to their types: synonymy/hyponymy/meronymy, names, events, compound nouns, discourse topic, and inference.

Discourse segmentation

The paper describes an algorithm that uses lexical cohesion relations for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. Contrary to many discourse models that assume a hierarchical segmentation of the discourse, the author chooses to cast expository text into a linear sequence of segments. The algorithm discovers subtopic structure using term repetition as a lexical cohesion indicator. The author presents two methods: the first one compares, for a given window size, each pair of adjacent blocks of text according to how similar they are lexically. This method assumes that the more similar two blocks of text are, the more likely it is that the current subtopic continues, and, conversely, if two adjacent blocks of text are dissimilar, this implies a change in subtopic flow. The second method keeps track of active chains of repeated terms, where membership in a chain is determined by location in the text. The method determines subtopic flow by recording where in the discourse the bulk of one set of chains ends and a new set of chains begins. The core algorithm has three main parts: tokenization, similarity determination and boundary identification.
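As an illustration of the first (block-comparison) method, here is a hedged Python sketch: adjacent fixed-size blocks of tokens are compared by cosine similarity over term counts, and gaps whose similarity falls below a cutoff are proposed as subtopic boundaries. The block size and threshold are illustrative, not the paper's values.

    import math
    from collections import Counter

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in a if t in b)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def block_similarities(tokens, block_size=100):
        """Cosine similarity between each pair of adjacent fixed-size token blocks."""
        blocks = [Counter(tokens[i:i + block_size])
                  for i in range(0, len(tokens), block_size)]
        return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

    def boundaries(similarities, threshold=0.1):
        """Gap indices whose similarity drops below an (illustrative) threshold."""
        return [i for i, s in enumerate(similarities) if s < threshold]
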
This paper proposes an indicator of text structure called the lexical cohesion profile, which locates segment boundaries in a text. The approach assumes that the words in a segment are linked together via lexical cohesion relations, such as semantic similarities between words, which are measured using a semantic network. The lexical cohesion profile of a text is the sequence of lexical cohesiveness values of the word lists seen through a fixed-width window.
The paper presents passage-retrieval techniques which are based on chronological decomposition into text segments and semantic decomposition into text themes in order to characterize text structure and then to replace texts by important text excerpts. Roughly, assuming that each text or text excerpt is represented by a vector of weighted terms which reflects the occurrence characteristics of terms (e.g. words, phrases), pairwise similarity coefficients, showing the similarity between pairs of texts based on coincidences in the term assignments, can be calculated. Graph structures are used to represent relationships between text components, i.e. the vertices are documents or text excerpts and a link appears between two nodes when they are similar. Various elements of text structure are immediately derivable from a text-relationship map; for example, the importance of a paragraph might be related to the number of incident branches of the corresponding node on the map, or a central node might be characterized as one with a large number of associated paragraphs.
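A rough Python sketch of such a text-relationship map, assuming paragraphs as input: paragraphs become term-count vectors, sufficiently similar pairs are linked, and a paragraph's importance is approximated by the number of incident links of its node (the similarity threshold is illustrative).

    import math
    from collections import Counter

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in a if t in b)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def rank_paragraphs(paragraphs, threshold=0.2):
        """Return paragraph indices ordered by the number of incident links."""
        vectors = [Counter(p.lower().split()) for p in paragraphs]
        degree = [0] * len(paragraphs)
        for i in range(len(vectors)):
            for j in range(i + 1, len(vectors)):
                if cosine(vectors[i], vectors[j]) >= threshold:
                    degree[i] += 1
                    degree[j] += 1
        return sorted(range(len(paragraphs)), key=lambda i: degree[i], reverse=True)
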

Discourse-based summarization (Coherence-based & Cohesion-based)

The authors investigate a technique to produce a summary of a text which relies on a model of the topic progression in the text derived from lexical chains. Summarization proceeds in four steps: the original text is first segmented, lexical chains are constructed, strong chains are identified, and significant sentences are extracted from the text. The text segmentation is obtained from Hearst's algorithm. The procedure for constructing lexical chains follows three steps: select a set of candidate words; for each candidate word, find an appropriate chain relying on a relatedness criterion among members of the chain; if one is found, insert the word in the chain and update it accordingly. Summaries are built using a scoring function based on chain length, and the extraction of significant sentences is based on heuristics using chain distribution, for example, choosing the sentence that contains the first appearance of a chain member in the text.
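A hedged Python sketch of the scoring and extraction steps: chains are scored from their length and homogeneity, strong chains are those scoring well above the mean, and for each strong chain the sentence containing the first occurrence of one of its members is selected. The exact formulas are indicative rather than the paper's.

    import statistics

    def chain_score(chain):
        """chain: list of word occurrences (with repetitions)."""
        length = len(chain)
        homogeneity = 1.0 - len(set(chain)) / length
        return length * homogeneity

    def strong_chains(chains):
        if not chains:
            return []
        scores = [chain_score(c) for c in chains]
        cutoff = statistics.mean(scores) + 2 * statistics.pstdev(scores)
        return [c for c, s in zip(chains, scores) if s > cutoff]

    def extract_sentences(sentences, chains):
        """sentences: list of token lists; pick, per strong chain, the first
        sentence containing a chain member."""
        chosen = set()
        for chain in strong_chains(chains):
            members = set(chain)
            for i, sent in enumerate(sentences):
                if members & set(sent):
                    chosen.add(i)
                    break
        return [sentences[i] for i in sorted(chosen)]
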
The authors identify a set of salient phrasal units, referred to as topic stamps, which they organize into a capsule overview of the document. This set is identified by reducing the phrase set to a referent set using a procedure of anaphora resolution. The assumption is as follows: every phrase constitutes a "mention" of a participant in the discourse, and anaphora resolution makes it possible to determine which expressions constitute mentions of the same referent. The expressions that are coreferential are grouped into equivalence classes; each class corresponds to a unique referent, and the whole set of equivalence classes constitutes the referent set. An importance-based ranking of referents is required for the identification of topic stamps. The topic stamps are organised in order of appearance and assigned to discourse segments, which are defined using a similarity-based algorithm that detects changes in topic by means of a lexical similarity measure.
The author presents a discourse analyzer which computes representations of the structure of discourse using information available in syntactic and logical form analyses. The system uses heuristics which score the rhetorical relations that it hypothesizes in order to guide it in producing the most plausible discourse representation. The heuristics integrate considerations about clauses (i.e. arguments of relations) such as subordinate and coordinate structures, conjunction, subject continuity, adverbs, and some specific clue words. She then gives an algorithm that constructs the RST trees. Because the rhetorical relations between the same two terminal nodes are scored and sorted in descending order of heuristic score, the algorithm avoids constructing multiple, less plausible RST trees.
The paper focuses on the development of a discourse model used in a text structuring module that recognizes discourse-level structure within a large-scale information retrieval system. The author developed a news text schema that is used in automatically structuring texts. The process of decomposing texts to assign a component label to each sentence uses sources of evidence such as lexical clues, order of components, tense distribution, syntactic sources and continuation clues. Dempster-Shafer theory is used to coordinate information from the various evidence sources. She then developed an attribute model of the news text in which pieces of text are evaluated for their specific value on each of eight dimensions: time of event, tense, importance, attribution, objectivity, definiteness, completion and causality. Finally, she revised the news text schema by adding some of these distinguishing attributes to the earlier components.
The paper presents a method for summarizing similarities and differences in a pair of related documents using a graph representation for text. Entities, denoted by text items such as words, phrases and proper names, are represented positionally as nodes in a graph, along with edges corresponding to semantic and topological relations between concepts. The algorithm takes as input a topic, which corresponds to a set of nodes in the graph. To determine which items are salient, the graph is searched for nodes semantically related to the topic using a spreading activation technique. The problem of finding similarities and differences then becomes one of comparing graphs which have been activated by a common topic. The system makes use of a sentence and a paragraph tagger.
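An illustrative Python sketch of the spreading-activation step over such a graph: the topic nodes receive an initial activation, which propagates along weighted edges with decay, and nodes whose activation exceeds a threshold are treated as salient. The parameters and graph layout are hypothetical.

    def spread_activation(graph, topic_nodes, decay=0.5, iterations=3, threshold=0.1):
        """graph: {node: [(neighbour, edge_weight), ...]}; returns the salient nodes."""
        activation = {n: 0.0 for n in graph}
        for n in topic_nodes:
            activation[n] = 1.0
        for _ in range(iterations):
            updated = dict(activation)
            for node, neighbours in graph.items():
                for neighbour, weight in neighbours:
                    updated[neighbour] = max(updated.get(neighbour, 0.0),
                                             activation[node] * weight * decay)
            activation = updated
        return {n for n, a in activation.items() if a >= threshold}
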
Several techniques have been used in text summarization to determine what is salient. This paper focuses on two classes of techniques based respectively on a representation of text structure in terms of text cohesion and text coherence. Automatic text summarization can be characterized as involving three phases of processing: analysis, refinement and synthesis. The cohesion relations considered are proper names, anaphora, reiteration, synonymy and hypernymy. A text is represented by a graph whose nodes represent word instances at different positions and whose typed links represent cohesion relations. The salience based on cohesion is computed using a tf*idf method, a spreading activation method, or a local weighting method. The coherence of the text is represented using Marcu's parser, and the salience based on coherence corresponds to the nuclearity function. They conduct experiments based on the two techniques and conclude that the cohesion methods are less accurate than the coherence methods.
Starting from a segmentation of text into minimal units and the set of relations holding among these units, the author provides a first-order formalization of rhetorical structure trees using the distinction between the nuclei and the satellites that pertain to discourse relations. He gives an algorithm based on a set of constraints to construct the possible rhetorical trees.
Approaches to RS-tree pruning agree that the nuclei of a rhetorical structure tree constitute an adequate summarization of the text. The summarization program takes the RS-tree produced by the rhetorical parser and selects the textual units that are most salient in that text. The longer the summary one wants to generate, the farther the selected salient units will be from the root.
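A hedged Python sketch of this selection principle (the tree encoding is hypothetical): units reachable from the root through nucleus links are the most salient, and each step into a satellite pushes a unit one level farther from the root, so longer summaries draw on deeper units.

    class RSNode:
        """Leaf: a textual unit; internal node: a nucleus subtree and a satellite subtree."""
        def __init__(self, unit=None, nucleus=None, satellite=None):
            self.unit = unit
            self.nucleus = nucleus
            self.satellite = satellite

    def unit_depths(node, depth=0, depths=None):
        """Depth of promotion for each textual unit (smaller = more salient)."""
        if depths is None:
            depths = {}
        if node.unit is not None:
            depths.setdefault(node.unit, depth)
            return depths
        unit_depths(node.nucleus, depth, depths)          # nuclei keep the parent's depth
        unit_depths(node.satellite, depth + 1, depths)    # satellites fall one level deeper
        return depths

    def select_units(root, length):
        depths = unit_depths(root)
        return sorted(depths, key=depths.get)[:length]
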
Given a set of semantic units among which a set of rhetorical relations hold, the author gives a bottom-up approach to text planning based on a composition of discourse trees.
The author derives the rhetorical structure of texts using discourse usages of cue words (he uses a list of 1253 occurrences of cue phrases). The system determines the set of all discourse markers and the set of elementary textual units, hypothesizes a set of relations between the elements, uses a constraint satisfaction procedure to determine all the discourse trees, assigns a weight to each discourse tree, and determines the tree with maximal weight.
The author discusses some ways to improve discourse-based summarization programs and some kinds of weighting to better exploit the nuclearity function, and conducts experiments showing the recall and precision results of the discourse-based method.
Event summaries are generated from data (e.g. weather, financial and medical knowledge bases) rather than from text reduction, so the main process consists in selecting and presenting summaries of events. The paper outlines tactics for these processes. The process of selecting events can be based on semantic patterns (which are domain-dependent), link analysis (the importance of events is determined by the amount and type of links between events), and statistical analysis. Presentational techniques can help shorten the length of information. For example, by exploiting the context set in previous portions, subsequent references are realized using the notions of temporal, spatial and topic focus (e.g. linguistic constructs such as tense and aspect and temporal and spatial adverbs). Also, selecting a particular medium in which to realize information can result in a saving in the amount of time required to present a given set of content; for example, movement events can be more rapidly displayed and perceived graphically than textually. This approach has an application in a battle simulator.
The authors describe an automatic abstract generation system for Japanese based on rhetorical structure extraction. The system first extracts the rhetorical structure using connective expressions, then generates the abstract of each section of the document by examining its rhetorical structure. In order to determine important text segments, the system imposes penalties on the nodes of each rhetorical relation according to its relative importance (e.g. penalties are imposed on the satellite of the relation). The system then recursively cuts out, from the terminal nodes, those with the highest penalty. The list of terminal nodes of the final structure becomes an abstract for the original document.
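A rough Python sketch of the penalty mechanism (the tree encoding and penalty values are hypothetical): each relation adds a penalty to its satellite subtree, penalties accumulate down to the terminal nodes, and the terminal nodes with the highest accumulated penalty are cut first.

    def accumulate_penalties(node, inherited=0, leaves=None):
        """node is ('leaf', text) or ('rel', nucleus, satellite, satellite_penalty)."""
        if leaves is None:
            leaves = []
        if node[0] == 'leaf':
            leaves.append((node[1], inherited))
            return leaves
        _, nucleus, satellite, satellite_penalty = node
        accumulate_penalties(nucleus, inherited, leaves)
        accumulate_penalties(satellite, inherited + satellite_penalty, leaves)
        return leaves

    def abstract(tree, keep):
        """Keep the `keep` terminal nodes with the lowest accumulated penalty, in text order."""
        leaves = accumulate_penalties(tree)
        kept = set(text for text, _ in sorted(leaves, key=lambda item: item[1])[:keep])
        return [text for text, penalty in leaves if text in kept]
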
The author discusses the problem of text summarization and argues about its difficulties. In her view, these stem from the interaction among syntactic and semantic knowledge, knowledge of discourse, and world knowledge. She argues that an automatic summarization engine would need access to the information necessary to construct an adequate semantic representation, which would necessarily require a complex model of world knowledge.
The authors developed a discourse model based upon the theory of discourse structure and the analysis of data corresponding to naturally produced summaries written by domain-expert writers. Since summarisation involves, among other things, the selective choice of key information units, the paper focuses on techniques for automatic content selection. The authors give some heuristics for compression of summaries which take into account discourse relations for pruning discourse relation trees, such as deleting the particular part of general-particular relations, deleting any optional discourse segment, etc.
Summaries are generated according to pipelined processes of selection and organization: first the knowledge base is pruned, and then the discourse of the summary is organized according to the resulting knowledge base, the communicative goal and the central proposition. The process of pruning takes into account not only the rhetorical relations holding between text segments but also the communicative goals holding between them, so gist preservation is addressed at all levels of discourse processing. This is done according to the mapping between intentions and semantic relations.
The author introduces the idea that text summarization depends not only on sentence interpretation and the local context representation but also on the recognition and use of large-scale discourse structure. She discusses different approaches to discourse representation and their value for summarising.
The paper establishes a framework for text summarisation and presents strategies adopted in automatic summarising. The author discusses similarities and differences between summarising and indexing. She gives the factors affecting summarising (i.e. the nature of the input, the purpose of the summary and the output of the summary) and the structure of the process of summarising (i.e. whether summarising requires some meaning representation of the source text or works from the surface text alone). She also discusses some strategies: linguistic approaches, domain approaches and communicative approaches.
The authors propose a method that generates summaries of news based on the discourse macro structure (DMS). Their approach rests on the observation that certain types of text conform to a set of style and organization constraints; for news text, for example, the DMS is: background, and what is the news. Summarization is then based on DMS template filling. The extraction of DMS components is based on scoring paragraphs using metrics. The metrics integrate a weighting of paragraphs based on term frequency, terms occurring in the title and in the paragraphs, noun phrases, words occurring only in some paragraphs, certain cue phrases, and other indicators. They classify their approach as summarization-based query expansion.
The authors present an extension of Kupiec et al.'s methodology for trainable statistical sentence extraction. The extension concerns the use of knowledge about discourse-level structure. They are interested in the identification of argumentative units such as background, topic, related work, purpose, solution, result and conclusion. The presented system uses heuristics based on indicator phrase quality, indicator phrase identity, location, sentence length, thematic words, title and header.

Formal models

The authors present an approach to text summarization that is entirely embedded in the formal description of a classification-based model of terminological knowledge representation and reasoning. Text summarization is considered a formally guided transformation process on knowledge representation structures as derived by a natural language text parser. The system uses a language that distinguishes between properties and conceptual relationships. The text condensation process examines the text knowledge base generated by the parser to determine the thematic descriptions. Only the most significant concepts, relationships and properties are considered as part of a topic description; this is done using operators. Analyzing a text paragraph by paragraph yields a set of consecutive topic descriptions, each characterizing the topic of one or more adjacent paragraphs. Summaries are represented by a text graph. The construction of a text graph proceeds from the examination of every pair of basic topic descriptions and takes their conceptual commonalities to generate more generic thematic characterizations.

Topic identification

The authors investigate a technique to produce a summary of a text which relies on a model of the topic progression in the text derived from lexical chains. Summarization proceeds in four steps: the original text is first segmented, lexical chains are constructed, strong chains are identified, and significant sentences are extracted from the text. The text segmentation is obtained from Hearst's algorithm. The procedure for constructing lexical chains follows three steps: select a set of candidate words; for each candidate word, find an appropriate chain relying on a relatedness criterion among members of the chain; if one is found, insert the word in the chain and update it accordingly. Summaries are built using a scoring function based on chain length, and the extraction of significant sentences is based on heuristics using chain distribution, for example, choosing the sentence that contains the first appearance of a chain member in the text.
The authors identify a set of salient phrasal units, referred to as topic stamps, which they organize into a capsule overview of the document. This set is identified by reducing the phrase set to a referent set using a procedure of anaphora resolution. The assumption is as follows: every phrase constitutes a "mention" of a participant in the discourse, and anaphora resolution makes it possible to determine which expressions constitute mentions of the same referent. The expressions that are coreferential are grouped into equivalence classes; each class corresponds to a unique referent, and the whole set of equivalence classes constitutes the referent set. An importance-based ranking of referents is required for the identification of topic stamps. The topic stamps are organised in order of appearance and assigned to discourse segments, which are defined using a similarity-based algorithm that detects changes in topic by means of a lexical similarity measure.
The paper describes an algorithm that uses lexical cohesion relations for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. Contrary to many discourse models that assume a hierarchical segmentation of the discourse, the author chooses to cast expository text into a linear sequence of segments. The algorithm discovers subtopic structure using term repetition as a lexical cohesion indicator. The author presents two methods: the first one compares, for a given window size, each pair of adjacent blocks of text according to how similar they are lexically. This method assumes that the more similar two blocks of text are, the more likely it is that the current subtopic continues, and, conversely, if two adjacent blocks of text are dissimilar, this implies a change in subtopic flow. The second method keeps track of active chains of repeated terms, where membership in a chain is determined by location in the text. The method determines subtopic flow by recording where in the discourse the bulk of one set of chains ends and a new set of chains begins. The core algorithm has three main parts: tokenization, similarity determination and boundary identification.
The authors argue that the process of summarization consists of topic identification, topic interpretation and generation processes. They describe the system's architecture and some details of its processes. Topic identification is based on the optimal position policy, a list that indicates in which ordinal positions in the text high topic-bearing sentences occur. This policy is obtained by training, given a collection of genre-related texts with keywords. Topic interpretation is based on concept fusion using WordNet and the notion of concept signatures. The system proceeds by concept counting instead of word counting. The concept signatures identify the most pertinent signatures subsuming the topic words, and the signature head concepts are then used as the summarizing fuser concepts.
This paper proposes an indicator of text structure called the lexical cohesion profile, which locates segment boundaries in a text. The approach assumes that the words in a segment are linked together via lexical cohesion relations, such as semantic similarities between words, which are measured using a semantic network. The lexical cohesion profile of a text is the sequence of lexical cohesiveness values of the word lists seen through a fixed-width window.
Several methods have been tried to perform topic identification. Some involve parsing and semantic analysis of the text and are less robust. Others, such as the cue word and position methods, are more robust but less accurate. The position methods are based on the intuition that sentences of greater topic centrality tend to occur in certain specifiable locations (e.g. the text's title, the first and last sentences of each paragraph, etc.). However, this intuition holds only in restricted cases: texts in a genre generally observe a predictable discourse structure, and discourse structure differs significantly across text genres and subject domains. The position method therefore cannot be defined once for all texts but must be tailored to genre and domain using training. The authors conduct experiments, based on the corpus of the TIPSTER program, in which they empirically determine the yield of each sentence position in the corpus, measuring against the topic keywords, then rank the sentence positions by their average yield to produce an optimal position policy for topic positions for the genre. Finally, comparing against the abstracts accompanying the texts, they measure the coverage of sentences extracted according to the policy, cumulatively in the position order specified by the policy. The high degree of coverage indicates the effectiveness of the position method.
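A hedged Python sketch of deriving such an optimal position policy from a training corpus: the yield of a sentence position is its average overlap with the topic keywords across the corpus, and positions are ranked by that average. The data layout is hypothetical.

    from collections import defaultdict

    def optimal_position_policy(corpus):
        """corpus: list of (sentences, topic_keywords) pairs,
        where sentences is a list of token lists. Returns positions ranked by yield."""
        totals = defaultdict(float)
        counts = defaultdict(int)
        for sentences, keywords in corpus:
            keyword_set = set(keywords)
            for position, sentence in enumerate(sentences):
                totals[position] += len(keyword_set & set(sentence))
                counts[position] += 1
        average_yield = {p: totals[p] / counts[p] for p in totals}
        return sorted(average_yield, key=average_yield.get, reverse=True)
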
The author presents a method for identifying the central ideas in a text based on a representation-based concept counting paradigm. The method is based not on word counting but on concept counting. To represent and generalize concepts, the method uses the hierarchical concept taxonomy WordNet.
The author presents methods to identify topics using a positional method based on an optimal position policy which identifies important sentence positions, cue phrases, and topic signatures which provide a way to represent concept co-occurrence patterns (i.e. a list of key-concept/weight pairs, for example (earthquake, (Richter scale, w1), (death toll, w2), ...)).
The author describes an approach to text summarization based on a thematic representation of a text. The construction of the thematic representation is based on comparing terms of a thesaurus with a morphological representation of the text and its terms. A thesaurus projection is a set of text descriptors together with relations to related text descriptors. In this structure, thematic nodes, which correspond to topics or subtopics discussed in a text, are determined using descriptor frequency in the text. Summaries are generated from expressions of the main thematic nodes picked from the text.
The paper presents passage-retrieval techniques which are based on chronological decomposition into text segments and semantic decomposition into text themes in order to characterize text structure and then to replace texts by important text excerpts. Roughly, assuming that each text or text excerpt is represented by a vector of weighted terms which reflects the occurrence characteristics of terms (e.g. words, phrases), pairwise similarity coefficients, showing the similarity between pairs of texts based on coincidences in the term assignments, can be calculated. Graph structures are used to represent relationships between text components, i.e. the vertices are documents or text excerpts and a link appears between two nodes when they are similar. Various elements of text structure are immediately derivable from a text-relationship map; for example, the importance of a paragraph might be related to the number of incident branches of the corresponding node on the map, or a central node might be characterized as one with a large number of associated paragraphs.

Statistical-based summarization

The authors propose a method, given a base corpus and word co-occurrences with high resolving power, to establish links between the paragraphs of an article. The paragraph which presents the largest number of links to other paragraphs is considered the most significant one. Briefly, the steps of the proposed method are: in a base corpus, compute the frequency of each word and of each co-occurrence within a window spanning from -5 to +5, and do the same for each document; the pairs occurring repeatedly in different paragraphs establish links between them; the central, most important paragraph is the one presenting the largest number of links to other paragraphs.
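A rough Python sketch of the linking idea (the base-corpus filtering of high-resolving-power pairs is omitted): word pairs co-occurring within a +/-5 token window are collected per paragraph, paragraphs sharing such pairs are linked, and the paragraph with the most links is returned.

    from itertools import combinations

    def window_pairs(tokens, span=5):
        pairs = set()
        for i, word in enumerate(tokens):
            for other in tokens[i + 1:i + 1 + span]:
                pairs.add(tuple(sorted((word, other))))
        return pairs

    def most_significant_paragraph(paragraphs):
        """Return the index of the paragraph with the most links to other paragraphs."""
        pair_sets = [window_pairs(p.lower().split()) for p in paragraphs]
        links = [0] * len(paragraphs)
        for i, j in combinations(range(len(paragraphs)), 2):
            if pair_sets[i] & pair_sets[j]:
                links[i] += 1
                links[j] += 1
        return max(range(len(paragraphs)), key=lambda i: links[i])
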
The authors propose an approach for extracting key paragraphs based on the idea that whether a word is a key in an article or not depends on the domain to which the article belongs. The context is structured into domain, article and paragraph. A keyword satisfies two conditions: its deviation value in the paragraph is smaller than that in the article, and its deviation value in the article is smaller than that in the domain. The authors apply a term weighting method to extract keywords based on the chi-square method. Then, for extracting key paragraphs, they represent every paragraph as a vector of keywords; a clustering algorithm, based on a semantic similarity value between paragraphs, is applied and produces a set of semantic clusters which are ordered in descending order of their semantic similarity values.
Abstract: To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. This paper focusses on document extracts, a particular kind of computed document summary. Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries. The trends in our results are in agreement with those of Edmundson who used a subjectively weighted combination of features as opposed to training the feature weights using a corpus. We have developed a trainable summarization program that is grounded in a sound statistical framework.
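The feature combination behind this kind of trainable extraction is usually a naive-Bayes-style product, P(s in summary | F1..Fk) proportional to P(s in summary) * prod_j P(Fj | s in summary) / P(Fj), assuming independent features. A hedged Python sketch with a hypothetical data layout and crude smoothing:

    from collections import defaultdict

    def train(examples):
        """examples: list of (feature_dict, in_summary_bool) pairs."""
        n = len(examples)
        n_positive = sum(1 for _, label in examples if label)
        joint = defaultdict(int)      # (feature, value) -> count among summary sentences
        marginal = defaultdict(int)   # (feature, value) -> count over all sentences
        for features, label in examples:
            for f, v in features.items():
                marginal[(f, v)] += 1
                if label:
                    joint[(f, v)] += 1
        return n, n_positive, joint, marginal

    def score(features, model):
        n, n_positive, joint, marginal = model
        s = n_positive / n                                           # P(s in summary)
        for f, v in features.items():
            p_given_summary = (joint[(f, v)] + 1) / (n_positive + 2) # smoothed estimate
            p_feature = (marginal[(f, v)] + 1) / (n + 2)
            s *= p_given_summary / p_feature
        return s
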
The authors describe the system ANES (Automatic News Extraction System). The system comprises two devices: the Reader (which converts the input into tokens, sentences and paragraphs and counts word occurrences and word weights) and the Extractor (which performs sentence weighting and determines the particular sentences to be included in the summary). The process of summary generation has four major constituents: training (to determine the typical frequency of occurrence of words averaged across the represented publications), signature word selection (segregating out a list of signature words using tf*idf), sentence weighting (based on the sum of the weights of the individual signature words), and sentence selection (based on sentence weighting, location, etc.).
The authors report on some experiments in document summarisation based on Kupiec et al.'s method. Kupiec et al. use supervised learning to automatically adjust feature weights, using a corpus of research papers and corresponding summaries generated by professional abstractors. In these experiments, the gold standard sentences are those summary sentences that can be aligned with sentences in the source texts. Once the alignment has been carried out, the system tries to determine the characteristic properties of aligned sentences according to a number of features, e.g. presence of particular cue phrases, location in the text, sentence length, occurrence of thematic words, and occurrence of proper names. Each document sentence receives a score for each of the features, resulting in an estimate of the sentence's probability of also occurring in the summary. In the authors' experiments, the summaries are written by the authors of the documents to be summarized rather than by professional abstractors.
The authors present an extension of Kupiec et al.'s methodology for trainable statistical sentence extraction. The extension concerns the use of knowledge about discourse-level structure. They are interested in the identification of argumentative units such as background, topic, related work, purpose, solution, result and conclusion. The presented system uses heuristics based on indicator phrase quality, indicator phrase identity, location, sentence length, thematic words, title and header.
The paper describes a system for generating text abstracts which relies on a purely statistical principle, i.e. a combination of tf*idf weights of words in a sentence. The system takes an article from the corpus, builds a word weight matrix for all content words across all sentences (tf*idf values), determines the sentence weights for all sentences (sum over tf*idf values), sorts the sentences according to their weights, and extracts the N highest-weighted sentences.
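A direct Python sketch of this pipeline, assuming tokenised input; computing idf over the corpus documents is one of several plausible choices and not necessarily the paper's.

    import math
    from collections import Counter

    def summarize(article_sentences, corpus_documents, n=3):
        """article_sentences: list of token lists; corpus_documents: list of token lists."""
        document_frequency = Counter()
        for document in corpus_documents:
            document_frequency.update(set(document))
        num_documents = len(corpus_documents)

        def sentence_weight(sentence):
            term_frequency = Counter(sentence)
            return sum(term_frequency[w] *
                       math.log((num_documents + 1) / (document_frequency[w] + 1))
                       for w in term_frequency)

        ranked = sorted(range(len(article_sentences)),
                        key=lambda i: sentence_weight(article_sentences[i]), reverse=True)
        return [article_sentences[i] for i in sorted(ranked[:n])]
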

Paragraph extraction

The authors propose a method, given a base corpus and word co-occurrences with high resolving power, to establish links between the paragraphs of an article. The paragraph which presents the largest number of links to other paragraphs is considered the most significant one. Briefly, the steps of the proposed method are: in a base corpus, compute the frequency of each word and of each co-occurrence within a window spanning from -5 to +5, and do the same for each document; the pairs occurring repeatedly in different paragraphs establish links between them; the central, most important paragraph is the one presenting the largest number of links to other paragraphs.
The authors propose an approach for extracting key paragraphs based on the idea that whether a word is a key in an article or not depends on the domain to which the article belongs. The context is structured into domain, article and paragraph. A keyword satisfies two conditions: its deviation value in the paragraph is smaller than that in the article, and its deviation value in the article is smaller than that in the domain. The authors apply a term weighting method to extract keywords based on the chi-square method. Then, for extracting key paragraphs, they represent every paragraph as a vector of keywords; a clustering algorithm, based on a semantic similarity value between paragraphs, is applied and produces a set of semantic clusters which are ordered in descending order of their semantic similarity values.
The paper presents passage-retrieval techniques which are based on chronological decomposition into text segments and semantic decomposition into text themes in order to characterize text structure and then to replace texts by important text excerpts. Roughly, assuming that each text or text excerpt is represented by a vector of weighted terms which reflects the occurrence characteristics of terms (e.g. words, phrases), pairwise similarity coefficients, showing the similarity between pairs of texts based on coincidences in the term assignments, can be calculated. Graph structures are used to represent relationships between text components, i.e. the vertices are documents or text excerpts and a link appears between two nodes when they are similar. Various elements of text structure are immediately derivable from a text-relationship map; for example, the importance of a paragraph might be related to the number of incident branches of the corresponding node on the map, or a central node might be characterized as one with a large number of associated paragraphs.

Sentence extraction

The authors present a system for text summarization that combines frequency-based, knowledge-based and discourse-based techniques. Summarization proceeds by extracting features using term frequency, signature words, subsequent references to full names and aliases, WordNet, and morphological analysis of variants that refer to the same word. Then, in order to select sentences for the summary, each sentence in the document is scored using different combinations of signature word features. Finally, the top n highest-scoring sentences are chosen as a summary of the content of the document.
The paper presents an architecture for a hybrid connectionist-symbolic machine for text summarisation. The main process is content selection, and in order to identify generic content selection features, an extensive corpus analysis was carried out on a variety of real-world texts. The process of content selection is based on mappings between surface cues (i.e. lexical items with a semantic/rhetorical load), intermediary features (i.e. rhetorical/semantic criteria) and pragmatic features (i.e. theories about communicating agents).
The author presents a telegraphic text reduction system which works on the sentence level rather than the document level. The general principle of reduction is based on general linguistic intuitions, for example, proper nouns are generally more important than common nouns, nouns are more important than adjectives, adjectives are more important than articles, subclauses are less important than clauses, etc. The input text is marked up with linguistic structural annotations, and the output is generated according to the level of reduction requested. This is done (tokenisation, annotation with grammatical tags, part-of-speech disambiguation, part-of-speech tagging, syntactic dependencies, etc.) using finite-state techniques. The reduction levels are then applied, retaining more or fewer words around the skeletal parts of the sentence. The output is a telegraphic version of the text which can then be fed into a speech synthesizer.
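An illustrative Python sketch of reduction by part-of-speech importance, operating on (token, tag) pairs; the tag classes, importance values and level boundaries are hypothetical, not those of the described system.

    # Hypothetical importance scale per coarse part-of-speech tag.
    IMPORTANCE = {
        'PROPN': 5, 'NOUN': 4, 'VERB': 3, 'ADJ': 2, 'ADV': 2,
        'NUM': 2, 'PRON': 1, 'ADP': 1, 'DET': 0, 'CONJ': 0, 'PART': 0,
    }

    def reduce_sentence(tagged_tokens, level=3):
        """Keep only tokens whose tag importance meets the requested reduction level."""
        return [token for token, tag in tagged_tokens if IMPORTANCE.get(tag, 0) >= level]

    # Example:
    # reduce_sentence([('The', 'DET'), ('engine', 'NOUN'), ('overheats', 'VERB'),
    #                  ('quickly', 'ADV')], level=3)  ->  ['engine', 'overheats']
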
The paper describes an approach which is intended to be the basic architecture for extracting a set of concise sentences that are indicated or predicted by goals and contexts. The sentence selection algorithm measures the informativeness of each sentence by comparing it with the determined goals. The measurement takes into account the number of different sentence expressions related to the goals, the total number of sentence expressions related to the goals, and the total number of sentence expressions not related to the goals.
The authors describe the system ANES (Automatic News Extraction System). The system comprises two devices: the Reader (which converts the input into tokens, sentences and paragraphs and counts word occurrences and word weights) and the Extractor (which performs sentence weighting and determines the particular sentences to be included in the summary). The process of summary generation has four major constituents: training (to determine the typical frequency of occurrence of words averaged across the represented publications), signature word selection (segregating out a list of signature words using tf*idf), sentence weighting (based on the sum of the weights of the individual signature words), and sentence selection (based on sentence weighting, location, etc.).
The authors discuss text summarization in an automated editing system for question-and-answer packages. Summarization is used to construct the node page of a question-and-answer package, which contains the question or problem discussed in the thread and a summary that should be as short as possible. Summary extraction consists of feature detection and sentence extraction. Feature detection corresponds to string-pattern matching between regular expressions and text portions. In order to condense the extracted texts, the authors propose some rewriting rules.
The authors report on some experiments in document summarisation based on Kupiec et al.'s method. Kupiec et al. use supervised learning to automatically adjust feature weights, using a corpus of research papers and corresponding summaries generated by professional abstractors. In these experiments, the gold standard sentences are those summary sentences that can be aligned with sentences in the source texts. Once the alignment has been carried out, the system tries to determine the characteristic properties of aligned sentences according to a number of features, e.g. presence of particular cue phrases, location in the text, sentence length, occurrence of thematic words, and occurrence of proper names. Each document sentence receives a score for each of the features, resulting in an estimate of the sentence's probability of also occurring in the summary. In the authors' experiments, the summaries are written by the authors of the documents to be summarized rather than by professional abstractors.
The authors present an extension of Kupiec et al.'s methodology for trainable statistical sentence extraction. The extension concerns the use of knowledge about discourse-level structure. They are interested in the identification of argumentative units such as background, topic, related work, purpose, solution, result and conclusion. The presented system uses heuristics based on indicator phrase quality, indicator phrase identity, location, sentence length, thematic words, title and header.

Template-filling-based information extraction

FASTUS is a system for extracting information from real-world text based on finite-state machines. It employs a nondeterministic finite-state language model that produces a phrasal decomposition of a sentence into noun groups, verb groups and particles. The paper presents the nuances between information extraction (the FASTUS system) and text understanding (the TACITUS system). FASTUS participated in MUC-4, where summarization is based on filling in template slots with the content of newspaper articles on Latin American terrorism. The template filling requires identifying the perpetrators and victims of a terrorist act, the occupations of the victims, the type of physical entity attacked or destroyed, the date, the location, and the effect on the targets. The operation of FASTUS is composed of four steps: triggering, recognizing phrases, recognizing patterns and merging incidents. The recall and precision of FASTUS are respectively 44% and 55%. It can read 2375 words per minute and can analyze one text in an average of 9.6 seconds.
The authors classify and review current approaches to the software infrastructure of NLP systems, and then present the system GATE. The language engineering systems that provide software infrastructure for NLP can be classified as: additive or markup-based (e.g. SGML), referential or annotation-based (e.g. TIPSTER), and abstraction-based (i.e. the original text is preserved in processing only, e.g. ALEP). GATE adopts a hybrid approach based mainly on TIPSTER. An application, LaSIE, has been realised in GATE and participated in MUC-6.
Following the sixth in the series of Message Understanding Conferences, the authors give the history of the series together with an evaluation. We note that the systems participating in MUC tasks extract information based on named entity, coreference, template element and scenario template tasks.
From their experience of template design for information extraction (MUC, TIPSTER and TREC), the authors discuss the problem of template design as a problem of knowledge representation: what are the essential facts about the situations described in a text? Essential facts are determined by a semantic model of the domain. They give an ontology of template design based on the basic entities, their properties and the relations among them, and the kinds of changes in such properties and relations.
The content of this report is quite similar to the previous paper.
In the healthcare domain, medical professionals use online resources to find journal articles that discuss results pertaining to patients currently under their care. The authors present a design for generating summaries that are tailored to the characteristics of the patient under consideration. After a text analysis phase, the authors observe that journal articles in medicine use a standard format which includes formally marked sections: introduction, methods, statistical analysis, results, discussion and previous work. Within a single section, certain types of information are found; for example, the methods section includes descriptions of the patients in the study. The summary design for retrieving matching information from the articles comprises: matching patient characteristics using the standard structure of the article; categorizing the article as either a prognosis, diagnosis or treatment article using specific phrases that indicate the category; identifying patient stratification and extracting results using the standard structure; and merging extracted sentence fragments by post-processing the sentences with symbolic techniques to group them together. Experiments have been carried out by presenting a prototype of the system to professionals.
The authors present a core system for information extraction based on generic linguistic knowledge sources. The inputs of the system are ASCII texts and the outputs are templates. The processing of the data comprises: tokenization, morphological and lexical processing, fragment processing, and fragment combination with template generation. Three application systems have been implemented: appointment scheduling via email, classification of event announcements sent via email, and extraction of company information from newspaper articles.
Authors' abstract:

We present a methodology for summarization of news about current events in the form of briefings that include appropriate background (historical) information. The system that we developed, SUMMONS, uses the output of systems developed for the DARPA Message Understanding Conferences to generate summaries of multiple documents on the same or related events, presenting similarities and differences, contradictions, and generalizations among sources of information. We describe the various components of the system, showing how information from multiple articles is combined, organized into a paragraph, and finally, realized as English sentences. A feature of our work is the extraction of the descriptions of entities such as people and places for reuse to enhance a briefing.

Comments:

SUMMONS, developed at Columbia University, is a prototype summarizer -- briefing generator, to be exact -- for a sequence of news items on closely related incidents. The system has been restricted to the domain of terrorist events, and there are good reasons for this narrow focus. SUMMONS feeds on the rich resources of the Columbia NLP group, especially language generation tools, and on the University of Massachusetts' information extraction system that participated in MUC-4. That is, the system could be assembled with relatively little effort. The authors also do not shy away from manual intervention into data when it helps highlight their system's achievements.

A set of news items for summarization is preprocessed (!) by the message understanding system, among other preliminary, largely manual, steps. SUMMONS' main task is a manipulation of the templates produced by this preprocessor, so that the language generator can receive its data. The process is incremental; every new template may add to the summary if one of eight content planning operators discovers a difference that must be accounted for.

The project is, generally speaking, firmly in the language generation area, and its concerns are far from ours. In particular, the very interesting subproject on extracting descriptions of entities from newswire data seems inapplicable in our work. The narrowness of the application domain, while clearly necessary for the success of this project, is not an option in IIA. SUMMONS is, nonetheless, an impressive system even in its present stage, and the ensuing research will be worth watching.

The authors present the system PROFILE, which combines the extraction of entity names and the generation of descriptions using FUF/SURGE, which is based on functional descriptions. The extraction component can use an on-line newswire browser or descriptions stored from older newswire, so PROFILE maintains a database of descriptions. This proceeds as follows: extraction of descriptions, categorization of descriptions, and organization of descriptions in a database of profiles.
The authors propose a method that generates summaries of news based on the discourse macro structure (DMS). Their approach rests on the observation that certain types of text conform to a set of style and organization constraints; for news text, for example, the DMS is: background, and what is the news. Summarization is then based on DMS template filling. The extraction of DMS components is based on scoring paragraphs using metrics. The metrics integrate a weighting of paragraphs based on term frequency, terms occurring in the title and in the paragraphs, noun phrases, words occurring only in some paragraphs, certain cue phrases, and other indicators. They classify their approach as summarization-based query expansion.

Text categorization

The authors first describe algorithms that classify texts using extraction patterns and semantic features associated with role fillers in the MUC domain (i.e. the terrorism domain). Second, the authors describe the automatic generation of extraction patterns using preclassified texts as input. This is achieved by the word-augmented relevancy signatures algorithm, which uses lexical items to represent domain-specific role relationships instead of semantic features. The system proceeds in two stages: it uses heuristic rules to generate an extraction pattern for every noun phrase in the corpus, the result of this stage being a giant dictionary of extraction patterns. In the second stage, they process the training corpus a second time using the new extraction patterns. For each pattern, they estimate its relevance rate and then rank the patterns. The algorithm is used in a text categorization system which generates classification terms.
The paper presents a corpus-based method that can be used to build semantic lexicons. The input to the system is a set of words for a category and a representative text corpus. The output is a ranked list of other words that also belong to the category. The algorithm uses simple statistics and a bootstrapping mechanism to generate a ranked list of potential category words. A human then reviews the top words and selects the best ones for the dictionary.
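A hedged Python sketch of the bootstrapping idea: words co-occurring with the current category words inside a small context window are scored by the proportion of their occurrences that fall near a category word, and the top candidates are added (after human review in the actual method) before the next round. The window size, thresholds and scoring are illustrative simplifications.

    from collections import Counter

    def bootstrap(tokens, seed_words, window=2, rounds=3, per_round=5):
        category = set(seed_words)
        for _ in range(rounds):
            near_category = Counter()
            total = Counter()
            for i, word in enumerate(tokens):
                if word in category:
                    continue
                total[word] += 1
                context = tokens[max(0, i - window):i + window + 1]
                if any(c in category for c in context):
                    near_category[word] += 1
            scores = {w: near_category[w] / total[w]
                      for w in near_category if total[w] >= 3}
            best = sorted(scores, key=scores.get, reverse=True)[:per_round]
            category.update(best)   # the real method has a human review this list
        return category
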

Information retrieval

The paper puts information retrieval in the context of an overall information-seeking task, as a problem of users querying large volumes of text. It gives an overview of how natural language processing and natural language resources like dictionaries and thesauri can be incorporated into information retrieval tasks. It then looks at some developments in natural language processing and how they are used in information retrieval, such as corpus indexing to determine a vocabulary of words and phrases using phrase-recognition techniques. The author concludes that natural language processing and information retrieval, as they both currently stand, do not sit comfortably together, and indicates where he believes natural language processing will continue to help information retrieval.
The paper presents experiments in information retrieval using natural language processing. The experiments are based on using syntactic analysis to derive term dependencies and structured representations of term-term relationships which encode syntactic ambiguities due to prepositional phrase attachment, conjunction, the scope of modifiers, etc. This representation can be used to postpone the interpretation of document text: if a text fragment has multiple syntactic interpretations, an algorithm weighting the various interpretations of the fragment can be used. The author then focuses on his own experiments using natural language processing resources rather than natural language processing tools in information retrieval. The proposed approach uses WordNet (nouns only) as a basis for measuring the semantic similarity between pairs of nouns (words used in queries and documents). The WordNet network is partitioned into hierarchical concept graphs, and a word-word distance estimator is used, based on computing probabilities of occurrences of nouns within a corpus of text.
The author gives an overview of information retrieval, comparing it with information extraction. He then gives some techniques for representing and matching a user's query against a set of documents. Finally, he gives a brief overview of natural language processing and some techniques that can be used for information retrieval, such as indexing by base forms, indexing by word senses, and indexing by phrases, as well as the use of linguistic resources such as dictionaries and thesauri.
The authors report on their natural language information retrieval project as related to TREC-5. Their system encompasses several statistical and natural language processing techniques for text analysis. These have been organized into a stream model. Stream indexes are built using a mixture of different indexing approaches, term indexing and weighting strategies. The statistical retrieval engine is assisted by natural language techniques in selecting appropriate indexing terms and assigning them validated weights. Each method corresponds to an indexing stream, such as: stopword elimination, morphological stemming, part-of-speech tagging, phrase boundary detection, word co-occurrence metrics, head-modifier pairs, and proper names. In their approach, queries are not simple statements specifying the semantic criteria of relevance but are expanded by pasting in entire sentences, paragraphs and other sequences directly from any text document.
The authors present a system used in the context of information retrieval. The summarization component is based on extracting relevant sentences, weighted using the title method, the location method, and term-occurrence information. The summary for each document is then generated by outputting the top-scoring sentences until a desired summary length is reached. The authors conduct experiments indicating that IR using their system is both more accurate and faster than a typical IR system.
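The weighting scheme can be sketched roughly as below; the feature definitions and the equal weighting of the three components are illustrative guesses, not the authors' formulas.

    import re
    from collections import Counter

    def score_sentences(title, sentences):
        """Rank sentences by title-word overlap, position, and term frequency."""
        title_words = set(re.findall(r"\w+", title.lower()))
        term_freq = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
        ranked = []
        for position, sentence in enumerate(sentences):
            words = re.findall(r"\w+", sentence.lower())
            title_score = len(title_words & set(words))            # title method
            location_score = 1.0 if position < 2 else 0.0          # location method
            tf_score = sum(term_freq[w] for w in words) / max(len(words), 1)
            ranked.append((title_score + location_score + tf_score, sentence))
        return [s for _, s in sorted(ranked, reverse=True)]        # best first; cut at desired length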

Text indexing

The authors describe a multilingual (English and Japanese) information browsing and retrieval system. The system consists of an indexing module, a client module, a term translation module and a web crawler. The indexing module creates and loads indices into a database, while the client module allows browsing and retrieval of information in the database through a web-browser-based graphical user interface. The term translation module is bi-directional (user terms into foreign languages and indexed terms into the user's language). The web crawler can be used to add textual information from the WWW. The system indexes names of people, entities and locations, and scientific and technical terms.
Abstract: A method of drawing index terms from text is presented. The approach uses no stop list, stemmer, or other language- and domain-specific component, allowing operation in any language or domain with only trivial modification. The method uses n-gram counts, achieving a function similar to, but more general than, a stemmer. The generated index terms, which the author calls ''highlights,'' are suitable for identifying the topic for perusal and selection. An extension is also described and demonstrated which selects index terms to represent a subset of documents, distinguishing them from the corpus. Some experimental results are presented, showing operation in English, Spanish, German, Georgian, Russian, and Japanese.
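A rough sketch of how n-gram counts can play a stemmer-like role when ranking candidate index terms (this is only an illustration of the general idea, not the author's algorithm):

    from collections import Counter

    def char_ngrams(word, n=4):
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    def highlight_terms(words, n=4, top_k=10):
        """Rank words by the corpus frequency of the character n-grams they contain;
        needs no stop list or stemmer, so it carries over to other languages."""
        gram_counts = Counter(g for w in words for g in char_ngrams(w.lower(), n))

        def score(word):
            grams = char_ngrams(word.lower(), n)
            return sum(gram_counts[g] for g in grams) / max(len(grams), 1)

        return sorted(set(words), key=score, reverse=True)[:top_k]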

ML in NLP

The paper presents a case-based learning approach to natural language learning tasks. In the training phase, the goal is to collect a set of cases that describe ambiguity-resolution episodes for a particular problem in text analysis. As cases are created, they are stored in a case base, and after training the system can use the case base to resolve ambiguities in novel instances of the particular problem. To improve the performance of the case-based learning algorithm, a technique for feature-set selection for case-based learning of natural language is presented. The technique is applied to the task of relative pronoun disambiguation.
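The retrieval step can be sketched as nearest-neighbour matching over the stored episodes; the feature encoding below is invented for illustration.

    def retrieve_resolution(case_base, probe, selected_features):
        """case_base: list of (feature_dict, resolution) pairs stored during training.
        probe: feature_dict describing a new ambiguity.
        selected_features: the feature subset chosen by feature-set selection.
        Returns the resolution of the most similar stored case."""
        def similarity(case_features):
            return sum(1 for f in selected_features
                       if case_features.get(f) == probe.get(f))
        _, resolution = max(case_base, key=lambda case: similarity(case[0]))
        return resolution
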
The paper presents a framework for knowledge acquisition for natural language processing systems. The system relies on three major components: a corpus of text, a robust sentence analyzer and an inductive learning module. There are two phases to the framework: a training (or acquisition) phase and an application phase. The framework is applied to the problem of sentence analysis.
Authors' abstract:

A key problem in text summarization is finding a salience function which determines what information in the source should be included in the summary. This paper describes the use of machine learning on a training corpus of documents and their abstracts to discover salience functions which describe what combination of features is optimal for a given summarization task. The method addresses both "generic" and user-focused summaries.

Comments:

Learning is based on an eclectic set of eleven features (location, thematic and cohesion features). The raw values of the latter two are presumably produced automatically and discretized into {1, 0} in a manner not described in the paper. The training data come from 198 articles in cmp-lg, in SGML form, with figures, captions, references, and cross-references (!?) replaced by place-holders. Authors' abstracts are used in learning generic summarization. Learning user-focused summarization is based on a clever generation of abstracts from sample articles. The sentences most similar to an abstract (the top c%, where c is the compression rate) are classified as positive examples.

The system learns readable, intuitively appealing rules. Only two examples of such rules are shown in the paper.
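As a rough sketch of this kind of learning (using scikit-learn's decision-tree learner as a stand-in for the learner actually used, with invented toy feature vectors over the location/thematic/cohesion features), one can train on labelled sentences and print the resulting rules:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented toy data: [location, thematic, cohesion], discretized to {1, 0};
    # label 1 means the sentence matched the abstract (positive example).
    X = [[1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0], [1, 1, 1], [0, 0, 1]]
    y = [1, 1, 0, 0, 1, 0]

    clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(clf, feature_names=["location", "thematic", "cohesion"]))
    # The printed tree plays the role of the readable salience rules the system learns.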

I will leave practical conclusions to the ML people on our team.

Graphical user interface

The paper discusses the notion of dynamic document abstractions as a viewing metaphor of the information contained in the capsule overviews of documents. The authors use visualisation techniques to present the sequence of topically salient phrases of the capsule overview analysis.
The author describes a system based on the view that summarization is essentially the task of synthesizing a hypertext structure in a document, so that the parts of the document important to the user are accessible up front while other parts are hidden in multiple layers of increasing detail. In effect, the system shows the main summary along with automatically generated keywords for the document and several labels in between paragraphs that have hypertext links to parts of the document not included in the summary. The main modules of the system are document structure analysis (the paragraphs of an HTML document are broken into sentences by looking for sentence boundaries), sentence selection (word-frequency analysis, corpus statistics, and keyword and keyword-pattern analysis), sentence simplification (a phrase-tree pruning algorithm based on phrase-structure heuristics), summary construction (the summary is built by extracting the parts of the source document corresponding to the simplified parts of the selected sentences, with labels for the hidden parts generated from the section heading of the hidden part or its highest-scoring sentence), and user customization (setting the length of the summary, specifying keywords, specifying the number of frequent or title words to find, and controlling the sentence-ranking heuristics).
The authors discuss text summarization within an automated editing system for question-and-answer packages. Summarization is used to construct the node page of a question-and-answer package, which contains the question or problem discussed in the thread and a summary that should be as short as possible. Summary extraction consists of feature detection and sentence extraction; feature detection amounts to string-pattern matching between regular expressions and portions of text. In order to condense the extracted text, the authors propose a set of rewriting rules.
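A minimal sketch of this kind of regex-driven feature detection and sentence extraction (the feature names and patterns are invented examples, not the ones in the paper):

    import re

    # Invented example patterns; each named feature fires on matching sentences.
    FEATURES = {
        "question": re.compile(r"\?\s*$"),
        "problem": re.compile(r"\b(error|fails?|problem|cannot)\b", re.I),
        "solution": re.compile(r"\b(try|fix(ed)?|works?|solved)\b", re.I),
    }

    def extract(sentences, wanted=("question", "problem", "solution")):
        """Keep sentences on which at least one wanted feature pattern fires."""
        kept = []
        for sentence in sentences:
            fired = [name for name, pattern in FEATURES.items() if pattern.search(sentence)]
            if any(name in wanted for name in fired):
                kept.append(sentence)
        return kept
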
The paper presents a prototype system for key-term manipulation and visualization in a real-world commercial environment. The system consists of two components: one generates and organizes a set of key terms into a hierarchical structure using lexical-statistical analysis, and the other feeds them into a graphical user interface.

Psychological-approach-based summarization

For the author, modelling the abstracting process means developing a grounded theory and a naturalistic model, i.e. a conceptual model of abstracting. 36 abstracting processes of 6 experts were recorded on tape via thinking-aloud protocols, transcribed and interpreted. As a result, she determined how expert abstractors organize their working processes, and which intellectual tools and standard strategies they use. She identified a set of abstracting tools grouped as planning, control and general literacy, information acquisition, and relevance assessment. She also gives an empirical design for a simulation system based on an implementation-oriented blackboard architecture.
The work described in the paper is a continuation of the previous work. The authors give more details about their method of narrative summarization, called ''plot units'', and conduct experiments with it, comparing the results with Rumelhart's story grammars.
This paper is one of the precursor works on text summarization. It deals with a psychological approach to text summarization. The author argues that in order to summarize a text we must have access to a high-level analysis that highlights the story's central concepts. She describes a technique of memory representation based on conceptual structures called ''affect units'' or ''plot units'', which are an abstraction over ''affect states'' (i.e. positive events, negative events or mental states) and ''affect links'' (i.e. motivation, actualization, termination, equivalence). These affect units overlap with each other when a narrative is cohesive, and the overlapping intersections are interpreted as arcs in a graph of affect units. Summarization is based on structural features of the graph; for example, a pivotal unit (i.e. a node of maximal degree) encodes the gist of the story.
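The structural criterion can be sketched as plain graph analysis: if affect units are nodes and their overlaps are arcs, the pivotal unit is simply the node of maximal degree (the toy plot below is invented):

    from collections import defaultdict

    def pivotal_unit(arcs):
        """arcs: (unit_a, unit_b) pairs where two affect units overlap.
        Returns the unit of maximal degree, i.e. the pivotal unit carrying the gist."""
        degree = defaultdict(int)
        for a, b in arcs:
            degree[a] += 1
            degree[b] += 1
        return max(degree, key=degree.get)

    # Invented toy plot: "retaliation" touches the most other units, so it is pivotal.
    print(pivotal_unit([("loss", "retaliation"), ("retaliation", "success"),
                        ("retaliation", "regret"), ("loss", "regret")]))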

Multimedia summarization

The author presents approaches to graphics summarization. These procedures have to consider the content and the relations of the graphical elements within figures. The summarization is therefore based on metadata, i.e. a given representation of the content of the figures.
The paper reports on the extension of a broadcast news access system to provide multimedia summaries. Automated video summarization entails four basic steps: analysis of the source video, selection of the key information from that video, condensation of that information into a compact form, and generation of a summary tailored to the interests of a particular user. Since video is a multistream artifact, the authors expect analysis to occur concurrently, and possibly cooperatively, on the audio, imagery, and any associated text sources. The output of video analysis is an annotated version of the source. Content selection is based on counting, exploiting clues, or statistical analysis of words, images and sounds. The content is condensed by aggregating similar content. The generation of summaries encompasses planning the structure and order of the content to be presented, selecting the appropriate media and environment, and then realizing and laying this out.
The paper reports on techniques to segment news video using anchor/reporter cues, topic shifts identified in closed-caption text, and correlation of multiple streams of analysis to improve story segmentation, as well as on the extraction of facts from the linguistic stream and the visualization of the extracted information. The authors report story segmentation and proper-name extraction results using these techniques.

Summarization from data bases (data bases considered as filled templates)

Event summaries are generated from data (e.g. weather, financial and medical knowledge bases) rather than by text reduction, so the main process consists in selecting and presenting summaries of events. The paper outlines tactics for both processes. The selection of events can be based on semantic patterns (which are domain-dependent), link analysis (the importance of an event is determined by the amount and type of links between events), and statistical analysis. Presentational techniques can help shorten the presentation: by exploiting the context set in previous portions, for example, subsequent references can be related using notions of temporal, spatial and topic focus (e.g. linguistic constructs such as tense and aspect, and temporal and spatial adverbs). Also, selecting a particular medium in which to realize information can save presentation time; for example, movement events can be displayed and perceived more rapidly graphically than textually. The approach has an application in a battle simulator.
Text summarisation usually reduces texts into summaries; the approach presented in this paper constructs summaries from databases instead. Two systems were developed using this approach: STREAK (summaries of basketball games) and PLANDOC (summaries of telephone network planning activity). By analogy with natural language generation, the problem of summary generation is viewed as falling into two separate classes: conceptual summarization (i.e. what information should be included in a summary) and linguistic summarization (i.e. determining how to convey as much information as possible in a short amount of text). Their approach focuses on the latter: the system uses syntactic and lexical devices to convey more information. The approach is based on revision rules applied to an initial draft (i.e. a basic sentence pattern); these rules include adjunctization, conjoin, absorb, nominalization and adjoin.
Here also, multimedia briefings that include coordinated speech, text and graphics are generated from healthcare databases rather than by text reduction. The MAGIC system takes as input online data collected during a surgical operation as well as information stored in the main databases; the output is a multimedia briefing providing an update on patient status. The system exploits the extensive online data and uses the FUF/SURGE sentence generator, which produces sentences annotated with prosodic information and pause durations. This output is sent to a speech synthesizer to produce the final speech.
The authors present an approach to text summarization that is entirely embedded in the formal description of a classification-based model of terminological knowledge representation and reasoning. Text summarization is considered a formally guided transformation process on knowledge representation structures derived by a natural language text parser. The system uses a language that distinguishes between properties and conceptual relationships. The text condensation process examines the text knowledge base generated by the parser to determine thematic descriptions: only the most significant concepts, relationships and properties are kept as part of a topic description, using a set of operators. Analyzing a text paragraph by paragraph yields a set of consecutive topic descriptions, each characterizing the topic of one or more adjacent paragraphs. Summaries are represented by a text graph. The construction of a text graph proceeds from the examination of every pair of basic topic descriptions, taking their conceptual commonalities to generate more generic thematic characterizations.
The author presents a system which generates spoken Dutch soccer reports. The input to the generation algorithm is a typed data structure automatically derived from the information on teletext pages stating the main events in a particular soccer match. The paper then focuses on two aspects: first, the modelling of accentuation by taking contrastive information into account, and second, the choice of referring expressions.

Summarization overview

The author gives a broad historical view of work in natural language processing, organized into four main phases, and in particular an overview of automatic summarising. She stresses the need for methods that have the source text supply its own important content, going beyond both surface sentence extraction and message-understanding-type methods.
The author describes the basic process of text summarising through three stages: I (interpretation of the source text into a source representation), T (transformation of the source representation into a summary representation) and G (generation of the summary text from the summary representation). She gives the pros and cons of text extraction and fact extraction, and stresses factors related to methodologies and strategies already presented in her paper ''What might be in a summary''. She concludes by advising a shallow approach, described as follows: I (parse to logical form, decompose into simple predications, derive a predication cohesion graph using common predicates, common arguments within a sentence and similar arguments across sentences); T (node-set selection via weights for edge types, a scoring function seeking centrality, representativeness and coherence, or a greedy algorithm); G (synthesise text from the selected predications).
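A minimal sketch of the T stage over such a predication cohesion graph, using summed edge weights as a stand-in for the centrality and representativeness criteria and selecting greedily:

    from collections import defaultdict

    def select_predications(edges, k=3):
        """edges: (pred_i, pred_j, weight) links between predications that share a
        predicate or an argument. Greedily pick the k most strongly connected nodes."""
        strength = defaultdict(float)
        for i, j, w in edges:
            strength[i] += w
            strength[j] += w
        selected = []
        for _ in range(min(k, len(strength))):
            best = max(strength, key=strength.get)
            selected.append(best)                  # take the most central remaining node
            strength.pop(best)
        return selected                            # these predications feed the G stage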

Summarization systems

Abstract: In our document understanding project ALV we analyse incoming paper mail in the domain of single-sided German business letters. These letters are scanned and after several analysis steps the text is recognized. The result may contain gaps, word alternatives, and even illegal words. The subject of this paper is the subsequent phase which concerns the extraction of important information predefined in our "message type model". An expectation driven partial text skimming analysis is proposed focussing on the kernel module, the so-called "predictor". In contrast to traditional text skimming the following aspects are important in our approach. Basically, the input data are fragmentary texts. Rather than having one text analysis module ("substantiator") only, our predictor controls a set of different and partially alternative substantiators.
With respect to the usually proposed three working phases of a predictor -- start, discrimination, and instantiation -- the following differences are remarkable. The starting problem of text skimming is solved by applying specialized substantiators for classifying a business letter into message types. In order to select appropriate expectations within the message type hypotheses a twofold discrimination is performed. A coarse discrimination reduces the number of message type alternatives, and a fine discrimination chooses one expectation within one or a few previously selected message types. According to the expectation selected substantiators are activated. Several rules are applied both for the verification of the substantiator results and for error recovery if the results are insufficient.
The authors argue that the process of summarization consists of topic identification, topic interpretation and generation. They describe the system's architecture and some details of its processes. Topic identification is based on the optimal position policy, a list indicating in which ordinal positions in the text high topic-bearing sentences occur; this list is obtained by training on a collection of genre-related texts with keywords. Topic interpretation is based on concept fusion using WordNet and the notion of concept signatures: the system counts concepts instead of words, the concept signatures identify the most pertinent signatures subsuming the topic words, and the signature heads are then used as the summarizing fuser concepts.
The author presents methods for identifying topics using a positional method based on the optimal position policy (which identifies important sentence positions), cue phrases, and topic signatures, which provide a way to represent concept co-occurrence patterns (i.e. a head concept paired with a list of (key concept, weight) pairs, for example (earthquake, ((Richter scale, w1), (death toll, w2), ...))).
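A topic signature of that form can be applied as a weighted lexical lookup over the text; a minimal sketch with invented weights:

    # Invented weights for the example signature given above.
    SIGNATURES = {
        "earthquake": {"richter scale": 0.9, "death toll": 0.7, "aftershock": 0.6},
    }

    def signature_scores(text):
        """Score each topic signature by the weighted presence of its key concepts."""
        text = text.lower()
        return {head: sum(w for term, w in concepts.items() if term in text)
                for head, concepts in SIGNATURES.items()}

    # signature_scores("The death toll rose after a 6.1 on the Richter scale.")
    # -> {"earthquake": 1.6}, so "earthquake" would be proposed as the topic.
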
Abstract: Most information retrieval systems today are word based. But simple word searches and frequency distributions do not provide these systems with an understanding of their texts. Full natural language parsers are capable of deep understanding within limited domains, but are too brittle and slow for general information retrieval.
My dissertation is an attempt to bridge this gap by using a text skimming parser as the basis for an information retrieval system that partially understands the texts stored in it. The objective is to develop a system capable of retrieving a significantly greater fraction of relevant documents than is possible with a keyword based approach, without retrieving a larger fraction of irrelevant documents. As part of my dissertation, I am implementing a full-text information retrieval system called FERRET (Flexible Expert Retrieval of Relevant English Texts). FERRET will provide information retrieval for the UseNet News system, a collection of 247 news groups covering a wide variety of topics. Currently FERRET reads SCI.ASTRO, the Astronomy news group, and part of my investigation will be to demonstrate the addition of new domains with only minimal hand coding of domain knowledge. FERRET will acquire the details of a domain automatically using a script learning component.
Abstract: We present a natural language system which summarizes a series of news articles on the same event. It uses summarization operators, identified through empirical analysis of a corpus of news summaries, to group together templates from the output of the systems developed for ARPA's Message Understanding Conferences. Depending on the available resources (e.g., space), summaries of different length can be produced. Our research also provides a methodological framework for future work on the summarization task and on the evaluation of news summarization systems.

System portability

The paper presents an architecture for a hybrid connectionist-symbolic machine for text summarisation. The main process is content selection; in order to identify generic content-selection features, an extensive corpus analysis was carried out on a variety of real-world texts. The process of content selection is based on mappings between surface cues (i.e. lexical items with a semantic/rhetorical load) and intermediary (i.e. rhetorical-semantic) and pragmatic (i.e. theories about communicating agents) features.
This paper discusses the portability of the previously described STREAK system, which was initially dedicated to generating newswire sports summaries in the basketball domain, to the stock market domain. It reports that 59% of the revision rules are fully portable, with at least another 7% partially portable.

System evaluation

The author examines the evaluation of text summarization systems according to two tasks: categorization, which evaluates generic summaries, and ad hoc retrieval, which evaluates user-directed summaries. The evaluation criteria can be based on quantitative measures (i.e. categorization/relevance decisions, time required, summary length) and qualitative measures (i.e. user preference). Some further issues can also be addressed, including the cohesiveness of a summary, the optimal length of a summary, and multi-document summaries.
The authors conduct experiments for evaluating summaries. Summaries can be evaluated by their quality or by their performance in a particular task. In the ideal-summary-based evaluation, the authors find that the results are strongly influenced by summary length and that precision and recall are not a good measure for evaluating summaries. In the task-based evaluation, precision and recall are again sensitive to summary length, the time required to perform the task is not proportional to summary length, and there is no correlation between length and improvement in the task. A summarization system that helps more in the task and in less time is the most suitable, regardless of how long its summaries are.
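For the ideal-summary part, sentence-level precision and recall can be computed as in the sketch below; the authors' observation is that both numbers track summary length, which is why they favour task-based measures.

    def extract_precision_recall(system_sentences, ideal_sentences):
        """Precision/recall of an extracted summary against an ideal extract."""
        system, ideal = set(system_sentences), set(ideal_sentences)
        overlap = len(system & ideal)
        precision = overlap / len(system) if system else 0.0
        recall = overlap / len(ideal) if ideal else 0.0
        return precision, recall

    # Longer extracts tend to raise recall and lower precision, so the scores
    # reflect summary length at least as much as summary usefulness.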