Subjects:

Abstracting

Discourse

Information Extraction

Information Retrieval

Machine Learning for Natural Language Processing

Presentation of Summaries

Psychology and Summarization

Summarization of Multimedia Documents

Summarization Systems

Summarization System Evaluation

Miscellaneous

Manual Abstracting

Abstract: Abstracting assistance features are being prototyped in the TEXNET text network management system. Sentence weighting methods available include: weighting negatively or positively on the stems in a selected passage; weighting on general lists of cue words; adjusting weights of selected segments; and weighting on occurrences of frequent stems. The user may adjust a number of parameters: the minimum length of extracts; the threshold for a ''frequent'' word/stem; and the amount a sentence weight is to be adjusted for each weighting type.
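By way of illustration, here is a minimal Python sketch of this kind of adjustable sentence weighting (our own simplification: the crude prefix ''stemming'', the cue-word handling and the default parameter values are placeholders, not TEXNET's actual mechanisms):

    import re
    from collections import Counter

    def extract(text, cue_words, extract_length=3, freq_threshold=3, cue_bonus=1.0, freq_bonus=0.5):
        """Score sentences on cue words and frequent stems, then return the top-weighted ones in text order."""
        sentences = re.split(r'(?<=[.!?])\s+', text.strip())
        stems = lambda s: [w.lower()[:6] for w in re.findall(r"[A-Za-z]+", s)]  # crude stand-in for stemming
        counts = Counter(st for s in sentences for st in stems(s))
        frequent = {st for st, c in counts.items() if c >= freq_threshold}
        def weight(sentence):
            w = sum(freq_bonus for st in stems(sentence) if st in frequent)      # frequent-stem weighting
            w += sum(cue_bonus for cue in cue_words if cue in sentence.lower())  # cue-word weighting
            return w
        ranked = sorted(range(len(sentences)), key=lambda i: weight(sentences[i]), reverse=True)
        chosen = sorted(ranked[:extract_length])
        return [sentences[i] for i in chosen]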
Abstract: Experimental subjects wrote abstracts of an article using a simplified version of the TEXNET abstracting assistance software. In addition to the full text, the 35 subjects were presented with either keywords or phrases extracted automatically. The resulting abstracts, and the times taken, were recorded automatically; some additional information was gathered by oral questionnaire. Results showed considerable variation among subjects, but 37% found the keywords or phrases ''quite'' or ''very'' useful in writing their abstracts. Statistical analysis failed to support several hypothesized relations: phrases were not viewed as significantly more helpful than keywords; and abstracting experience did not correlate with originality of wording, approximation of the author abstract, or greater conciseness. Results also suggested possible modifications to the software.
Abstract: Four working steps taken from a comprehensive empirical model of expert abstracting are studied in order to prepare an explorative implementation of a simulation model. It aims at explaining the knowledge processing activities during professional summarizing. Following the case-based and holistic strategy of qualitative empirical research, we develop the main features of the simulation system by investigating in detail a small but central test case: four working steps where an expert abstractor discovers what the paper is about and drafts the topic sentence of the abstract. Following the KADS methodology of knowledge engineering, our discussion begins with the empirical model (a conceptual model in KADS terms) and aims at a computational model which is implementable without determining the concrete implementation tools (the design model according to KADS). The envisaged solution uses a blackboard system architecture with cooperating object-oriented agents representing cognitive strategies and a dynamic text representation which borrows its conceptual relations in particular from RST (Rhetorical Structure Theory). As a result of the discussion we feel that a small simulation model of professional summarizing is feasible.
This paper takes the view that an abstract is itself a text which is subject to general and specific conditions of text production. It is assumed that the goal, namely the forming of the abstract as a text, controls the whole process of abstracting. This goal-oriented view contrasts with most approaches in this domain, which are source-text oriented. Further, production strategies are described in terms of text structure building processes which are reconstructed with modelling methods from text linguistics and computational linguistics. This leads to a close relationship between the representation of the model and the resulting text. Within this view, examples are given in which authentic abstract material is analysed according to the model. The model itself integrates three text levels (content, function, form) which are combined and represented in terms of the writer's activities.
Abstract: Documents of the journal ''Nachrichten für Dokumentation'' written over a twenty-year period (1969-1989) by 50 different authors have been used as the text corpus. The analysis of the abstracts revealed that only 15 out of 50 abstracts consist exclusively of ''standard'' abstract sentences and that no abstract satisfies all requirements of the abstracting guidelines. In this respect, the abstracting guidelines appear as ''wishful thinking'', which supports the idea of machine-supported abstracting based on linguistic features. CONNY is an interactive linguistic abstracting model for technical texts offering the abstractor general abstracting guidelines operating on the surface structure. It condenses the parts of the source text assessed as abstract-relevant at the source-text, sentence and abstract levels with regard to lexis, syntax and semantics.

Anaphora resolution

The authors propose an extension to DRT (Discourse Representation Theory) for pronominal anaphora and ellipsis resolution integrating focusing theory. They give rules to keep track of the focus of attention along the text and to bind pronouns preferentially to focused entities. The choice of antecedents is based on pragmatic constraints which impose an ordering on preferences between antecedent candidates. The scope of the framework covers sentences containing restrictive relative clauses and subject ellipsis.
The authors present an algorithm for anaphora resolution which is a modified and extended version of the one developed by Lappin and Leass. The resolution process works from the output of a part-of-speech tagger, enriched only with annotations of the grammatical function of lexical items in the input stream. The algorithm constructs coreference classes, each associated with a salience value determined by the status of the members of the class with respect to contextual, grammatical and syntactic constraints.
Given the less than fully robust status of syntactic parsers, the authors outline an algorithm for anaphora resolution applicable to open-ended text types, styles and genres. The algorithm compensates for the shallower level of analysis with mechanisms for identifying different text forms for each discourse referent, and assigns a salience measure to each discourse referent over the text.

Anchor/link identification

The authors discuss the problem of identifying relations between bridging descriptions and their antecedents. Using WordNet, they propose methods to identify anchor/link pairs in a collection of Wall Street Journal articles according to their types: synonymy/hyponymy/meronymy, names, events, compound nouns, discourse topic, and inference.

Discourse segmentation

The paper describes an algorithm that uses lexical cohesion relations for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. Contrary to many discourse models that assume a hierarchical segmentation of the discourse, the author chooses to cast expository text into a linear sequence of segments. The algorithm discovers subtopic structure using term repetition as a lexical cohesion indicator. The author presents two methods: the first compares, for a given window size, each pair of adjacent blocks of text according to how similar they are lexically. This method assumes that the more similar two blocks of text are, the more likely it is that the current subtopic continues, and, conversely, if two adjacent blocks of text are dissimilar, this implies a change in subtopic flow. The second method keeps track of active chains of repeated terms, where membership in a chain is determined by location in the text. The method determines subtopic flow by recording where in the discourse the bulk of one set of chains ends and a new set of chains begins. The core algorithm has three main parts: tokenization, similarity determination and boundary identification.
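A toy Python sketch of the block-comparison idea (our simplification: fixed-size token blocks, cosine similarity and an arbitrary depth cutoff rather than the actual TextTiling parameters):

    import re, math
    from collections import Counter

    def block_similarities(text, block_size=20):
        """Cosine similarity between adjacent fixed-size token blocks; dips suggest subtopic boundaries."""
        tokens = re.findall(r"[a-z]+", text.lower())
        blocks = [Counter(tokens[i:i + block_size]) for i in range(0, len(tokens), block_size)]
        def cosine(a, b):
            num = sum(a[t] * b[t] for t in a)
            den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
            return num / den if den else 0.0
        return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]

    def boundaries(similarities, depth_cutoff=0.1):
        """Place a boundary at gaps whose similarity is a local minimum with enough depth."""
        cuts = []
        for i in range(1, len(similarities) - 1):
            depth = (similarities[i - 1] - similarities[i]) + (similarities[i + 1] - similarities[i])
            if depth > depth_cutoff:
                cuts.append(i)
        return cuts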
This paper proposes an indicator of text structure called the lexical cohesion profile, which locates segment boundaries in a text. The approach assumes that the words in a segment are linked together via lexical cohesion relations, approximated as semantic similarities between words measured on a semantic network. The lexical cohesion profile of a text is a sequence of lexical cohesiveness values of the word lists seen through a fixed-width window.
The paper presents passage-retrieval techniques which are based on chronological decomposition into text segments and semantic decomposition into text themes in order to characterize text structure and then to replace texts by important text excerpts. Roughly, assuming that each text or text excerpt is represented by a vector of weighted terms which captures the occurrence characteristics of terms (e.g. words, phrases), pairwise similarity coefficients, showing the similarity between pairs of texts based on coincidences in the term assignments, can be calculated. Graph structures are used to represent relationships between text components, i.e. the vertices are documents and a link appears between two nodes when they are similar. Various elements of text structure are immediately derivable from such a text-relationship map; for example, the importance of a paragraph might be related to the number of incident branches of the corresponding node on the map, or a central node might be characterized as one with a large number of associated paragraphs.
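The text-relationship map can be sketched as follows (our simplification: plain term-frequency vectors, cosine similarity and an arbitrary linking threshold; paragraph importance is read off as node degree):

    import re, math
    from collections import Counter

    def tf_vector(paragraph):
        return Counter(re.findall(r"[a-z]+", paragraph.lower()))

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in a)
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def relationship_map(paragraphs, threshold=0.3):
        """Link paragraphs whose term vectors are similar enough; return adjacency lists."""
        vecs = [tf_vector(p) for p in paragraphs]
        links = {i: set() for i in range(len(paragraphs))}
        for i in range(len(paragraphs)):
            for j in range(i + 1, len(paragraphs)):
                if cosine(vecs[i], vecs[j]) >= threshold:
                    links[i].add(j)
                    links[j].add(i)
        return links

    def central_paragraphs(paragraphs, threshold=0.3, k=3):
        """Paragraph importance approximated by the number of incident links (node degree)."""
        links = relationship_map(paragraphs, threshold)
        return sorted(range(len(paragraphs)), key=lambda i: len(links[i]), reverse=True)[:k]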

Discourse-based summarization (Coherence-based & Cohesion-based)

The authors investigate a technique to produce a summary of a text which relies on a model of the topic progression in the text derived from lexical chains. Summarization proceeds in three steps: the original text is first segmented; lexical chains are constructed and strong chains are identified; and significant sentences are extracted from the text. The text segmentation is obtained from Hearst's algorithm. The procedure for constructing lexical chains follows three steps: select a set of candidate words; for each candidate word, find an appropriate chain relying on a relatedness criterion among members of the chain; if one is found, insert the word in the chain and update it accordingly. Summaries are built using a score based on chain length, and the extraction of significant sentences is based on heuristics using chain distribution, for example, choosing the sentence that contains the first appearance of a chain member in the text.
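A toy sketch of the chain-length scoring and the first-appearance extraction heuristic (our simplification: chains are approximated by simple word repetition rather than WordNet relatedness):

    import re
    from collections import defaultdict

    def build_chains(sentences):
        """Crude stand-in for lexical chains: group repeated content words (real systems use WordNet relations)."""
        chains = defaultdict(list)                           # word -> sentence indices where it occurs
        for i, s in enumerate(sentences):
            for w in re.findall(r"[a-z]{4,}", s.lower()):    # content-word proxy: length >= 4
                chains[w].append(i)
        return {w: idxs for w, idxs in chains.items() if len(idxs) > 1}

    def summarize(sentences, n=3):
        """Score chains by length, then pick the sentence containing the first appearance of each strong chain."""
        chains = build_chains(sentences)
        strong = sorted(chains, key=lambda w: len(chains[w]), reverse=True)[:n]
        picked = sorted({chains[w][0] for w in strong})      # first-appearance heuristic
        return [sentences[i] for i in picked]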
The authors identify a set of salient phrasal units referred to as topic stamps, which they organize into a capsule overview of the document. This set is identified by reducing the phrase set to a referent set using a procedure of anaphora resolution. The assumption is as follows: every phrase constitutes a "mention" of a participant in the discourse, and anaphora resolution makes it possible to determine which expressions constitute mentions of the same referent. The expressions that are coreferential are grouped into equivalence classes, each of which corresponds to a unique referent, and the whole set of equivalence classes constitutes the referent set. An importance-based ranking of referents is required for the identification of topic stamps. The topic stamps are organised in order of appearance and assigned to discourse segments, which are defined using a similarity-based algorithm that detects changes in topic by means of a lexical similarity measure.
The author presents a discourse analyzer which computes representations of the structure of discourse using information available in syntactic and logical form analyses. The system uses heuristics which score the rhetorical relations that it hypothesizes in order to guide it in producing a plausible discourse representation. The heuristics integrate considerations about clauses (i.e. arguments of relations) such as subordinate and coordinate structures, conjunction, subject continuity, adverbs, and some specific clue words. She then gives an algorithm that constructs the RST trees. Since the rhetorical relations between the same two terminal nodes are scored and sorted in descending order of heuristic score, the problem of constructing multiple, less plausible RST trees is avoided.
The paper focuses on the development of a discourse model used in a text structuring module that recognizes discourse-level structure within a large-scale information retrieval system. The author developed a news text schema that is used in automatically structuring texts. The process of decomposing texts to assign a component label to each sentence uses sources of evidence such as lexical clues, order of components, tense distribution, syntactic sources and continuation clues. Dempster-Shafer theory is used to coordinate information from the various evidence sources. She then developed an attribute model of the news text in which pieces of text are evaluated for their specific value on each of eight dimensions: time of event, tense, importance, attribution, objectivity, definiteness, completion and causality. Finally, she revised the news text schema by adding some of these distinguishing attributes to the earlier components.
The paper presents a method for summarizing similarities and differences in a pair of related documents using a graph representation for text. Entities, denoted by text items such as words, phrases and proper names, are represented positionally as nodes in a graph, along with edges corresponding to semantic and topological relations between concepts. The algorithm takes the topic as input, represented as a set of nodes in the graph. To determine which items are salient, the graph is searched for nodes semantically related to the topic using a spreading activation technique. The problem of finding similarities and differences then becomes one of comparing graphs which have been activated by a common topic. The system makes use of a sentence and a paragraph tagger.
Several techniques have been used in text summarization to determine what is salient. This paper focuses on two classes of techniques based respectively on a representation of text structure in terms of text cohesion and text coherence. Automatic text summarization can be characterized as involving three phases of processing: analysis, refinement and synthesis. The cohesion relations considered are proper names, anaphora, reiteration, synonymy and hypernymy. A text is represented by a graph whose nodes represent word instances at different positions and whose links are typed and represent cohesion relations. The salience based on cohesion is computed using a tf*idf method, a spreading method, or a local weighting method. The coherence of the text is represented using Marcu's parser, and the salience based on coherence corresponds to the nuclearity function. The authors conduct experiments based on the two techniques and conclude that cohesion methods are less accurate than coherence methods.
Starting from a segmentation of text into minimal units and the set of relations holding among these units, the author provides a first-order formalization of rhetorical structure trees using the distinction between the nuclei and the satellites that pertain to discourse relations. He gives an algorithm based on a set of constraints to construct the possible rhetorical trees.
Approaches to RS-tree pruning agree that the nuclei of a rhetorical structure tree constitute an adequate summarization of the text. The summarization program takes the RS-tree produced by the rhetorical parser and selects the textual units that are most salient in that text. The longer the summary one wants to generate, the farther from the root the selected salient units will be.
Given a set of semantic units among which a set of rhetorical relations hold, the author gives a bottom-up approach to text planning based on a composition of discourse trees.
The author derives the rhetorical structure of texts using discourse usages of cue words (he uses a list of 1253 occurrences of cue phrases). The system determines the set of all discourse markers and the set of elementary textual units, hypothesizes a set of relations between the elements, uses a constraint satisfaction procedure to determine all the discourse trees, assigns a weight to each discourse tree, and determines the tree with maximal weight.
The author discusses some ways to improve discourse-based summarization programs and some kinds of weighting to better exploit the nuclearity function, and conducts experiments showing the recall and precision results of the discourse-based method.
Event summaries are generated from data (e.g. weather, financial and medical knowledge bases) rather than from text reduction, so the main process consists in selecting and presenting summaries of events. The paper outlines tactics for these processes. The process of selecting events can be based on semantic patterns (which are domain-dependent), link analysis (the importance of events is determined by the amount and type of links between events), and statistical analysis. Presentational techniques can help shorten the length of information: exploiting the context set in previous portions, for example, subsequent references are related using the notions of temporal, spatial and topic focus (e.g. linguistic constructs such as tense and aspect and temporal and spatial adverbs). Also, selecting a particular medium in which to realize information can result in a saving in the amount of time required to present a given set of content; for example, movement events can be displayed and perceived more rapidly graphically than textually. This approach has an application in a battle simulator.
The authors describe an automatic abstract generation system for Japanese based on rhetorical structure extraction. The system first extracts the rhetorical structure using connective expressions, then generates the abstract of each section of the document by examining its rhetorical structure. In order to determine important text segments, the system imposes penalties on both nodes of each rhetorical relation according to its relative importance (e.g. penalties are imposed on the satellite of the relation). The system then recursively cuts out, starting from the terminal nodes, the nodes with the highest penalty. The list of terminal nodes of the final structure becomes an abstract for the original document.
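The penalty-and-prune idea can be sketched as follows (our simplified, non-incremental approximation: a binary nucleus/satellite tree, a uniform satellite penalty, and a global sort instead of recursive cutting):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        text: str = ""                        # non-empty only for terminal nodes
        nucleus: Optional["Node"] = None
        satellite: Optional["Node"] = None
        penalty: int = 0                      # filled in by propagate()

    def propagate(node, inherited=0, satellite_penalty=1):
        """Push penalties down the tree: the satellite side of each relation receives an extra penalty."""
        node.penalty = inherited
        if node.nucleus:
            propagate(node.nucleus, inherited, satellite_penalty)
        if node.satellite:
            propagate(node.satellite, inherited + satellite_penalty, satellite_penalty)

    def terminals(node):
        if node.nucleus is None and node.satellite is None:
            return [node]
        leaves = []
        if node.nucleus:
            leaves += terminals(node.nucleus)
        if node.satellite:
            leaves += terminals(node.satellite)
        return leaves

    def abstract(root, keep):
        """Drop the highest-penalty terminal nodes until `keep` remain; return the survivors in text order."""
        propagate(root)
        leaves = terminals(root)
        order = {id(n): i for i, n in enumerate(leaves)}
        kept = sorted(sorted(leaves, key=lambda n: n.penalty)[:keep], key=lambda n: order[id(n)])
        return [leaf.text for leaf in kept]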
The author relates the problem of text summarization and argues about its difficulties. In her view, this is because of the interaction among syntactic and semantic knowledge, knowledge of discourse, and world knowledge. She argues that an automatic summarization engine would need access to the information necessary to construct an adequate semantic representation, which would necessarily require a complex model of world knowledge.
The authors developed a discourse model based upon the theory of discourse structure and the analysis of data corresponding to naturally produced summaries written by domain-expert writers. Since summarisation involves, among other things, the selective choice of key information units, the paper focuses on techniques for automatic content selection. The authors give heuristics for the compression of summaries which take into account discourse relations for pruning discourse relation trees, such as deleting the particular from general-particular relations, deleting any optional discourse segment, etc.
Summaries are generated according to pipelined processes of selection and organization: first the knowledge base is pruned, and then the discourse of the summary is organized according to the resulting knowledge base, the communicative goal and the central proposition. The pruning process takes into account not only the rhetorical relations holding between text segments but also the communicative goals holding between them, so gist preservation is addressed at all levels of discourse processing. This is done according to the mapping between intentions and semantic relations.
The author introduces the idea that text summarization depends not only on sentence interpretation and the local context representation but also on the recognition and use of large-scale discourse structure. She discusses different approaches to discourse representation and their value for summarising.
The paper establishes a framework for text summarisation and presents strategies adopted in automatic summarising. The author discusses similarities and differences between summarising and indexing. She gives the factors affecting summarising (i.e. the nature of the input, the purpose of the summary and the output of the summary) and the structure of the summarising process (i.e. whether summarising requires some meaning representation of the source text or works from the surface text alone). She also discusses some strategies: linguistic approaches, domain approaches and communicative approaches.
The authors propose a method that generates summaries of news based on a discourse macro structure (DMS). Their approach rests on the observation that certain types of text conform to a set of style and organization constraints; for example, for news text the DMS is: background, and what is the news. Summarization is then based on DMS template filling. The extraction of DMS components is based on scoring paragraphs using metrics. The metrics integrate a weighting of paragraphs based on term frequency, terms occurring in the title and in the paragraphs, noun phrases, words occurring only in some paragraphs, certain cue phrases, some indications, etc. They classify their approach as summarization-based query expansion.
The authors present an extension of Kupiec et al.'s methodology for trainable statistical sentence extraction. The extension concerns the use of knowledge about discourse-level structure. They are interested in the identification of argumentative units such as background, topic, related work, purpose, solution, result and conclusion. The presented system uses heuristics based on indicator phrase quality, indicator phrase identity, location, sentence length, thematic words, title and headers.

Formal models

The authors present an approach to text summarization that is entirely embedded in the formal description of a classification-based model of terminological knowledge representation and reasoning. Text summarization is considered a formally guided transformation process on knowledge representation structures as derived by a natural language text parser. The system uses a language that distinguishes between properties and conceptual relationships. The text condensation process examines the text knowledge base generated by the parser to determine the thematic descriptions. Only the most significant concepts, relationships and properties are considered as part of a topic description; this is done using operators. Analyzing a text paragraph by paragraph yields a set of consecutive topic descriptions, each characterizing the topic of one or more adjacent paragraphs. Summaries are represented by a text graph. The construction of a text graph proceeds from the examination of every pair of basic topic descriptions and takes their conceptual commonalities to generate more generic thematic characterizations.

Topic identification

The authors investigate a technique to produce a summary of a text which relies on a model of the topic progression in the text derived from lexical chains. Summarization proceeds in three steps: the original text is first segmented; lexical chains are constructed and strong chains are identified; and significant sentences are extracted from the text. The text segmentation is obtained from Hearst's algorithm. The procedure for constructing lexical chains follows three steps: select a set of candidate words; for each candidate word, find an appropriate chain relying on a relatedness criterion among members of the chain; if one is found, insert the word in the chain and update it accordingly. Summaries are built using a score based on chain length, and the extraction of significant sentences is based on heuristics using chain distribution, for example, choosing the sentence that contains the first appearance of a chain member in the text.
The authors identify a set of salient phrasal units referred to as topic stamps, which they organize into a capsule overview of the document. This set is identified by reducing the phrase set to a referent set using a procedure of anaphora resolution. The assumption is as follows: every phrase constitutes a "mention" of a participant in the discourse, and anaphora resolution makes it possible to determine which expressions constitute mentions of the same referent. The expressions that are coreferential are grouped into equivalence classes, each of which corresponds to a unique referent, and the whole set of equivalence classes constitutes the referent set. An importance-based ranking of referents is required for the identification of topic stamps. The topic stamps are organised in order of appearance and assigned to discourse segments, which are defined using a similarity-based algorithm that detects changes in topic by means of a lexical similarity measure.
The paper describes an algorithm that uses lexical cohesion relations for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. Contrary to many discourse models that assume a hierarchical segmentation of the discourse, the author chooses to cast expository text into a linear sequence of segments. The algorithm discovers subtopic structure using term repetition as a lexical cohesion indicator. The author presents two methods: the first compares, for a given window size, each pair of adjacent blocks of text according to how similar they are lexically. This method assumes that the more similar two blocks of text are, the more likely it is that the current subtopic continues, and, conversely, if two adjacent blocks of text are dissimilar, this implies a change in subtopic flow. The second method keeps track of active chains of repeated terms, where membership in a chain is determined by location in the text. The method determines subtopic flow by recording where in the discourse the bulk of one set of chains ends and a new set of chains begins. The core algorithm has three main parts: tokenization, similarity determination and boundary identification.
The authors argue that the process of summarization consists of topic identification, topic interpretation and generation processes. They describe the system's architecture and some details about its processes. Topic identification is based on the optimal position policy, a list that indicates in which ordinal positions in the text high topic-bearing sentences occur. This policy is obtained by training, given a collection of genre-related texts with keywords. Topic interpretation is based on concept fusion using WordNet and the notion of concept signature; the system proceeds by concept counting instead of word counting. The concept signatures identify the most pertinent signatures subsuming the topic words, and the signature head's concept is then used as the summarizing fuser concept.
This paper proposes an indicator of text structure called the lexical cohesion profile, which locates segment boundaries in a text. The approach assumes that the words in a segment are linked together via lexical cohesion relations, approximated as semantic similarities between words measured on a semantic network. The lexical cohesion profile of a text is a sequence of lexical cohesiveness values of the word lists seen through a fixed-width window.
Several methods have been tried to perform topic identification. Some involve parsing and semantic analysis of the text and are less robust. Others, such as the cue word and position methods, are more robust but less accurate. The position methods are based on the intuition that sentences of greater topic centrality tend to occur in certain specifiable locations (e.g. the text's title, the first and last sentences of each paragraph, etc.). However, this intuition does not hold in general, except in some restricted cases; rather, the texts in a genre generally observe a predictable discourse structure, and discourse structure differs significantly over text genres and subject domains. So the position method cannot be defined for any text, but must be tailored to genre and domain using training. The authors conduct experiments, based on the corpus of the TIPSTER program, in which they empirically determined the yield of each sentence position in the corpus, measuring against the topic keywords, and then ranked the sentence positions by their average yield to produce an optimal position policy for topic positions for the genre. Finally, comparing against the abstracts accompanying the texts, they measured the coverage of sentences extracted according to the policy, cumulatively in the position order specified by the policy. The high degree of coverage indicated the effectiveness of the position method.
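A small sketch of how such an optimal position policy could be derived and applied (our simplification: keyword yield is measured by bag-of-words overlap, and the training data format is invented):

    from collections import defaultdict

    def position_policy(training_docs, top_k=5):
        """training_docs: list of (sentences, topic_keywords) pairs.
        Rank sentence positions by their average keyword yield to obtain an optimal position policy."""
        yield_sum = defaultdict(float)
        count = defaultdict(int)
        for sentences, keywords in training_docs:
            kw = {k.lower() for k in keywords}
            for pos, sent in enumerate(sentences):
                yield_sum[pos] += len(set(sent.lower().split()) & kw)
                count[pos] += 1
        avg = {pos: yield_sum[pos] / count[pos] for pos in count}
        return sorted(avg, key=avg.get, reverse=True)[:top_k]   # positions, best first

    def extract_by_policy(sentences, policy):
        """Select sentences in the position order specified by the policy."""
        return [sentences[p] for p in policy if p < len(sentences)]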
The author presents a method for identifying the central ideas in a text based on a representation-based concept counting paradigm. The method is based not on word counting but on concept counting. To represent and generalize concepts, the method uses the hierarchical concept taxonomy WordNet.
The author presents methods to identify topics using: a positional method based on an optimal position policy which identifies important sentence positions; cue phrases; and topic signatures, which provide a way to represent concept co-occurrence patterns (i.e. a head concept with a list of (key concept, weight) pairs, for example (earthquake, ((Richter scale, w1), (death toll, w2), ...))).
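As a data structure, such a topic signature can be sketched very simply (the terms and weights below are invented for illustration):

    # A topic signature: a head concept paired with weighted co-occurring key concepts (illustrative values).
    topic_signature = ("earthquake", [("richter scale", 0.9), ("death toll", 0.7), ("aftershock", 0.6)])

    def signature_score(sentence, signature):
        """Score a sentence by the summed weights of the signature terms it contains."""
        head, terms = signature
        text = sentence.lower()
        return sum(weight for term, weight in terms if term in text)

    print(signature_score("The death toll rose after a 7.1 on the Richter scale.", topic_signature))  # 1.6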
The author describes an approach to text summarization based on a thematic representation of a text. The construction of the thematic representation is based on comparing terms of a thesaurus using a morphological representation of the text and terms. A thesaurus projection is a set of text descriptors together with relations to related text descriptors. In this structure, thematic nodes, which correspond to topics or subtopics discussed in the text, are determined using descriptor frequency in the text. Summaries are generated from expressions of the main thematic nodes picked from the text.
The paper presents passage-retrieval techniques which are based on chronological decomposition into text segments and semantic decomposition into text themes in order to characterize text structure and then to replace texts by important text excerpts. Roughly, assuming that each text or text excerpt is represented by a vector of weighted terms which captures the occurrence characteristics of terms (e.g. words, phrases), pairwise similarity coefficients, showing the similarity between pairs of texts based on coincidences in the term assignments, can be calculated. Graph structures are used to represent relationships between text components, i.e. the vertices are documents and a link appears between two nodes when they are similar. Various elements of text structure are immediately derivable from such a text-relationship map; for example, the importance of a paragraph might be related to the number of incident branches of the corresponding node on the map, or a central node might be characterized as one with a large number of associated paragraphs.

Statistical-based summarization

The authors propose a method, given a base corpus and word co-occurrences with high resolving power, to establish links between the paragraphs of an article. The paragraph which presents the largest number of links to other paragraphs is considered the most significant one. Briefly, the steps of the proposed method are: in a base corpus, compute the frequency of each word and of each co-occurrence, considering a window spanning from -5 to +5, and do the same for each document; the pairs occurring repeatedly in different paragraphs establish links between them; the most important paragraph is the one presenting the largest number of links to other paragraphs.
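A toy sketch of the paragraph-linking step (our simplification: the base-corpus filtering of pairs by resolving power is omitted, and any shared co-occurring pair links two paragraphs):

    import re
    from collections import defaultdict
    from itertools import combinations

    def cooccurring_pairs(tokens, span=5):
        """Word pairs co-occurring within a +/- `span` token window."""
        pairs = set()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + span]:
                if w != v:
                    pairs.add(tuple(sorted((w, v))))
        return pairs

    def central_paragraph(paragraphs):
        """Link paragraphs sharing co-occurring pairs; the paragraph with most links is most significant."""
        tokenize = lambda p: re.findall(r"[a-z]+", p.lower())
        pair_sets = [cooccurring_pairs(tokenize(p)) for p in paragraphs]
        links = defaultdict(set)
        for i, j in combinations(range(len(paragraphs)), 2):
            if pair_sets[i] & pair_sets[j]:            # a repeated pair establishes a link
                links[i].add(j)
                links[j].add(i)
        return max(range(len(paragraphs)), key=lambda i: len(links[i]))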
The authors propose an approach for extracting key paragraphs based on the idea that whether a word is a key in an article or not depends on the domain to which the article belongs. The context is structured into domain, article and paragraph. A keyword satisfies two conditions: its deviation value in the paragraph is smaller than that in the article, and its deviation value in the article is smaller than that in the domain. The authors apply a term weighting method to extract keywords based on the χ² method. Then, for extracting key paragraphs, they represent every paragraph as a vector of keywords; a clustering algorithm, based on a semantic similarity value between paragraphs, is applied to the sets and produces a set of semantic clusters, which are ordered in descending order of their semantic similarity values.
Abstract: To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. This paper focusses on document extracts, a particular kind of computed document summary. Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries. The trends in our results are in agreement with those of Edmundson who used a subjectively weighted combination of features as opposed to training the feature weights using a corpus. We have developed a trainable summarization program that is grounded in a sound statistical framework.
The authors describe the system ANES (Automatic News Extraction System). The system comprises two devices: a Reader (which converts the input into tokens, sentences and paragraphs and counts word occurrences and word weights) and an Extractor (which performs sentence weighting and determines the particular sentences to be included in the summary). The process of summary generation has four major constituents: training (to determine the typical frequency of occurrence of words averaged across the represented publications), tf*idf (segregating out a list of signature words using tf*idf), sentence weighting (based on the sum of the weights of the individual signature words), and sentence selection (based on sentence weighting, location, etc.).
The authors report on some experiments in document summarisation based on Kupiec et al.'s method. Kupiec et al. use supervised learning to automatically adjust feature weights, using a corpus of research papers and corresponding summaries generated by professional abstractors. In these experiments, the gold-standard sentences are those summary sentences that can be aligned with sentences in the source texts. Once the alignment has been carried out, the system tries to determine the characteristic properties of aligned sentences according to a number of features, e.g. presence of particular cue phrases, location in the text, sentence length, occurrence of thematic words, and occurrence of proper names. Each document sentence receives a score for each of the features, resulting in an estimate of the sentence's probability of also occurring in the summary. In the authors' experiments, the summaries are written by the authors of the documents to be summarized rather than by professional abstractors.
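The Kupiec-style feature combination amounts to a naive-Bayes score; here is a sketch under the assumption of binary features (the feature names and probability estimates below are invented):

    def sentence_score(features, p_in_summary, p_feature_given_summary, p_feature):
        """Naive-Bayes style combination of binary features (after Kupiec et al.):
        P(s in summary | F1..Fk) is proportional to P(s in summary) * prod_j P(Fj | in summary) / P(Fj)."""
        score = p_in_summary
        for f, present in features.items():
            if present:
                score *= p_feature_given_summary[f] / p_feature[f]
        return score

    # hypothetical usage with made-up probability estimates
    features = {"cue_phrase": True, "first_paragraph": True, "long_sentence": False}
    p_f_given_s = {"cue_phrase": 0.30, "first_paragraph": 0.40, "long_sentence": 0.70}
    p_f = {"cue_phrase": 0.05, "first_paragraph": 0.10, "long_sentence": 0.60}
    print(sentence_score(features, 0.03, p_f_given_s, p_f))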
The authors present an extension of Kupiec et al.'s methodology for trainable statistical sentence extraction. The extension concerns the use of knowledge about discourse-level structure. They are interested in the identification of argumentative units such as background, topic, related work, purpose, solution, result and conclusion. The presented system uses heuristics based on indicator phrase quality, indicator phrase identity, location, sentence length, thematic words, title and headers.
The paper describes a system for generating text abstracts which relies on a purely statistical principle, i.e. a combination of tf*idf weights of words in a sentence. The system takes an article from the corpus, builds a word weight matrix for all content words across all sentences (tf*idf values), determines the sentence weights for all sentences (the sum over tf*idf values), sorts the sentences according to their weights, and extracts the N highest-weighted sentences.
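A compact sketch of this tf*idf sentence-extraction pipeline (our simplification: a crude tokenizer and a smoothed idf):

    import re, math
    from collections import Counter

    def tfidf_summary(document_sentences, corpus_documents, n=3):
        """Weight each sentence by the sum of tf*idf values of its words; extract the N highest-weighted sentences."""
        tokenize = lambda text: re.findall(r"[a-z]+", text.lower())
        num_docs = len(corpus_documents)
        doc_freq = Counter()
        for doc in corpus_documents:
            doc_freq.update(set(tokenize(doc)))
        tf = Counter(w for s in document_sentences for w in tokenize(s))
        idf = lambda w: math.log((1 + num_docs) / (1 + doc_freq[w]))
        scores = [(sum(tf[w] * idf(w) for w in tokenize(s)), i) for i, s in enumerate(document_sentences)]
        top = sorted(scores, reverse=True)[:n]
        return [document_sentences[i] for _, i in sorted(top, key=lambda t: t[1])]   # original order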

Paragraph extraction

The authors propose a method, given a base corpus and word co-occurrences with high resolving power, to establish links between the paragraphs of an article. The paragraph which presents the largest number of links to other paragraphs is considered the most significant one. Briefly, the steps of the proposed method are: in a base corpus, compute the frequency of each word and of each co-occurrence, considering a window spanning from -5 to +5, and do the same for each document; the pairs occurring repeatedly in different paragraphs establish links between them; the most important paragraph is the one presenting the largest number of links to other paragraphs.
The authors propose an approach for extracting key paragraphs based on the idea that whether a word is a key in an article or not depends on the domain to which the article belongs. The context is structured into domain, article and paragraph. A keyword satisfies two conditions: its deviation value in the paragraph is smaller than that in the article, and its deviation value in the article is smaller than that in the domain. The authors apply a term weighting method to extract keywords based on the χ² method. Then, for extracting key paragraphs, they represent every paragraph as a vector of keywords; a clustering algorithm, based on a semantic similarity value between paragraphs, is applied to the sets and produces a set of semantic clusters, which are ordered in descending order of their semantic similarity values.
The paper presents passage-retrieval techniques which are based on chronological decomposition into text segments and semantic decomposition into text themes in order to characterize text structure and then to replace texts by important text excerpts. Roughly, assuming that each text or text excerpt is represented by a vector of weighted terms which captures the occurrence characteristics of terms (e.g. words, phrases), pairwise similarity coefficients, showing the similarity between pairs of texts based on coincidences in the term assignments, can be calculated. Graph structures are used to represent relationships between text components, i.e. the vertices are documents and a link appears between two nodes when they are similar. Various elements of text structure are immediately derivable from such a text-relationship map; for example, the importance of a paragraph might be related to the number of incident branches of the corresponding node on the map, or a central node might be characterized as one with a large number of associated paragraphs.

Sentence extraction

The authors present a system for text summarization that combines frequency-based, knowledge-based and discourse-based techniques. Summarization proceeds by extracting features using term frequency, signature words, subsequent references to full names and aliases, WordNet, and morphological analysis of variants that refer to the same word. Then, in order to select sentences for the summary, each sentence in the document is scored using different combinations of signature word features, and the top n highest-scoring sentences are chosen as a summary of the content of the document.
The paper presents an architecture for a hybrid connectionist-symbolic machine for text summarisation. The main process concerns content selection, and in order to identify generic content selection features, an extensive corpus analysis was carried out on a variety of real-world texts. The process of content selection is based on mappings between surface cues (i.e. lexical items with a semantic/rhetorical load) and intermediary (i.e. rhetorical/semantic criteria) and pragmatic (i.e. theories about communicating agents) features.
The author presents a telegraphic text reduction system which works at the sentence level rather than the document level. The general principle of reduction is based on general linguistic intuitions, for example, proper nouns are generally more important than common nouns, nouns are more important than adjectives, adjectives are more important than articles, subclauses are less important than clauses, etc. The input text is marked up with linguistic structural annotations, and the output is generated according to the level of reduction requested. This is done (tokenisation, annotation with grammatical tags, part-of-speech disambiguation and tagging, syntactic dependencies, etc.) using finite-state techniques. The reduction levels are then applied, retaining more or fewer words around the skeletal parts of the sentence. The output is a telegraphic version of the text which can then be fed into a speech synthesizer.
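A toy sketch of level-based word retention keyed to part-of-speech classes (the retention levels and tag set are illustrative, not the system's actual finite-state grammar):

    # Each reduction level keeps a progressively smaller set of part-of-speech classes.
    RETENTION_LEVELS = [
        {"PROPN", "NOUN", "VERB", "ADJ", "NUM", "ADV", "ADP", "DET"},  # level 0: nearly full text
        {"PROPN", "NOUN", "VERB", "ADJ", "NUM"},                       # level 1: drop function words
        {"PROPN", "NOUN", "VERB"},                                     # level 2: skeletal clause
        {"PROPN", "NOUN"},                                             # level 3: telegraphic
    ]

    def reduce_sentence(tagged_sentence, level):
        """tagged_sentence: list of (word, pos) pairs from some POS tagger.
        Keep only the words whose tag belongs to the requested retention level."""
        keep = RETENTION_LEVELS[level]
        return " ".join(word for word, pos in tagged_sentence if pos in keep)

    # hypothetical tagger output
    tagged = [("The", "DET"), ("minister", "NOUN"), ("strongly", "ADV"),
              ("rejected", "VERB"), ("the", "DET"), ("new", "ADJ"), ("proposal", "NOUN")]
    print(reduce_sentence(tagged, 3))   # -> "minister proposal"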
The paper describes an approach which is intended to be the basic architecture for extracting a set of concise sentences that are indicated or predicted by goals and contexts. The sentence selection algorithm measures the informativeness of each sentence by comparing it with the determined goals. The measurement takes into account the number of different sentence expressions related to the goals, the total number of sentence expressions related to the goals, and the total number of sentence expressions not related to the goals.
The authors describe the system ANES (Automatic News Extraction System). The system comprises two devices: a Reader (which converts the input into tokens, sentences and paragraphs and counts word occurrences and word weights) and an Extractor (which performs sentence weighting and determines the particular sentences to be included in the summary). The process of summary generation has four major constituents: training (to determine the typical frequency of occurrence of words averaged across the represented publications), tf*idf (segregating out a list of signature words using tf*idf), sentence weighting (based on the sum of the weights of the individual signature words), and sentence selection (based on sentence weighting, location, etc.).
The authors discuss text summarization in an automated editing system for question-and-answer packages. Summarization is used to construct the node page of a question-and-answer package, which contains the question or problem discussed in the thread and a summary that should be as short as possible. The summary extraction consists of feature detection and sentence extraction. Feature detection corresponds to string-pattern matching between regular expressions and text portions. In order to condense the extracted texts, the authors propose some rewriting rules.
The authors report on some experiments in document summarisation based on Kupiec et al.'s method. Kupiec et al. use supervised learning to automatically adjust feature weights, using a corpus of research papers and corresponding summaries generated by professional abstractors. In these experiments, the gold-standard sentences are those summary sentences that can be aligned with sentences in the source texts. Once the alignment has been carried out, the system tries to determine the characteristic properties of aligned sentences according to a number of features, e.g. presence of particular cue phrases, location in the text, sentence length, occurrence of thematic words, and occurrence of proper names. Each document sentence receives a score for each of the features, resulting in an estimate of the sentence's probability of also occurring in the summary. In the authors' experiments, the summaries are written by the authors of the documents to be summarized rather than by professional abstractors.
The authors present an extension of Kupiec et al.'s methodology for trainable statistical sentence extraction. The extension concerns the use of knowledge about discourse-level structure. They are interested in the identification of argumentative units such as background, topic, related work, purpose, solution, result and conclusion. The presented system uses heuristics based on indicator phrase quality, indicator phrase identity, location, sentence length, thematic words, title and headers.

Template-filling-based information extraction

FASTUS is a system for extracting information from real-world text based on finite-state machines. It employs a nondeterministic finite-state language model that produces a phrasal decomposition of a sentence into noun groups, verb groups and particles. The paper presents nuances between information extraction (the FASTUS system) and text understanding (the TACITUS system). FASTUS participated in MUC-4, where summarization is based on filling in template slots with the content of newspaper articles on Latin American terrorism. The template filling requires identifying the perpetrators and victims of a terrorist act, the occupations of the victims, the type of physical entity attacked or destroyed, the date, the location, and the effect on the targets. The operation of FASTUS is composed of four steps: triggering, recognizing phrases, recognizing patterns and merging incidents. The recall and precision of FASTUS are respectively 44% and 55%. It can read 2375 words per minute and can analyze one text in an average of 9.6 seconds.
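A toy sketch of the pattern-recognition and template-filling steps, using regular expressions as a stand-in for the finite-state recognizers (the pattern and slot names are invented, not FASTUS's):

    import re

    # Illustrative surface patterns standing in for finite-state recognizers; slot names are invented.
    PATTERNS = [
        (re.compile(r"(?P<perp>[A-Z][\w ]+?) (bombed|attacked) (?P<target>[\w ]+?) in (?P<location>[A-Z][\w ]+)"),
         "ATTACK"),
    ]

    def extract_incidents(sentence):
        """Match trigger patterns against a sentence and fill template slots from the named groups."""
        incidents = []
        for pattern, incident_type in PATTERNS:
            m = pattern.search(sentence)
            if m:
                template = {"type": incident_type}
                template.update(m.groupdict())
                incidents.append(template)
        return incidents

    print(extract_incidents("Guerrillas attacked the power station in San Salvador."))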
The authors classify and review current approaches to software infrastructure for NLP systems, and then present the GATE system. Language engineering systems that provide software infrastructure for NLP can be classified as: additive or markup-based (e.g. SGML), referential or annotation-based (e.g. TIPSTER), and abstraction-based (i.e. the original text is preserved in processing only, e.g. ALEP). GATE adopts a hybrid approach based mainly on TIPSTER. An application, LaSIE, has been realised in GATE and participated in MUC-6.
After the sixth in the series of Message Understanding Conferences, the authors give the history of the series together with an evaluation. We note that the systems participating in the MUC tasks extract information based on named entities, coreference, template elements and scenario templates.
From their experiences of template design for information extraction (MUC, TIPSTER and TREC), the authors discuss the problem of template design as a problem of knowledge representation: what are the essential facts about the situations described in a text? Essential facts are determined by a semantic model of the domain. They give an ontology of template design based on the basic entities, their properties and the relations among them, and the kinds of changes in such properties and relations.
The content of this report is quite similar to the previous paper.
In the healthcare domain, medical professionals use online resources to find journal articles that discuss results pertaining to patients currently under their care. The authors present a design for generating summaries that are tailored to the characteristics of the patient under consideration. After a text analysis phase, the authors observe that journal articles in medicine use a standard format which includes formally marked sections: introduction, methods, statistical analysis, results, discussion and previous work. Within a single section, certain types of information are found; for example, in the methods section, descriptions of the patients in the study are included. The summary design, for retrieving matching information from the articles, comprises: matching patient characteristics using the standard structure of the article; categorizing the article as either a prognosis, diagnosis or treatment article using specific phrases that indicate the category; identifying patient stratification and extracting results using the standard structure; and merging extracted sentence fragments by post-processing the sentences using symbolic techniques to group them together. Experiments have been conducted by presenting a prototype of the system to professionals.
The authors present a core system for information extraction based on generic linguistic knowledge sources. The input to the system is ASCII text and the output is templates. The processing of the data comprises: a tokenizer, morphological and lexical processing, fragment processing, and fragment combination with template generation. Three application systems have been implemented: appointment scheduling via email, classification of event announcements sent via email, and extraction of company information from newspaper articles.
Authors' abstract:
We present a methodology for summarization of news about current events in the form of briefings that include appropriate background (historical) information. The system that we developed, SUMMONS, uses the output of systems developed for the DARPA Message Understanding Conferences to generate summaries of multiple documents on the same or related events, presenting similarities and differences, contradictions, and generalizations among sources of information. We describe the various components of the system, showing how information from multiple articles is combined, organized into a paragraph, and finally, realized as English sentences. A feature of our work is the extraction of the descriptions of entities such as people and places for reuse to enhance a briefing.
Comments:
SUMMONS, developed at Columbia University, is a prototype summarizer -- briefing generator, to be exact -- for a sequence of news items on closely related incidents. The system has been restricted to the domain of terrorist events, and there are good reasons for this narrow focus. SUMMONS feeds on the rich resources of the Columbia NLP group, especially language generation tools, and on the University of Massachusetts' information extraction system that participated in MUC-4. That is, the system could be assembled with relatively little effort. The authors also do not shy away from manual intervention into data when it helps highlight their system's achievements.
A set of news items for summarization is preprocessed (!) by the message understanding system, among other preliminary, largely manual, steps. SUMMONS' main task is a manipulation of the templates produced by this preprocessor, so that the language generator can receive its data. The process is incremental; every new template may add to the summary if one of eight content planning operators discovers a difference that must be accounted for.
The project is, generally speaking, firmly in the language generation area, and its concerns are far from ours. In particular, the very interesting subproject on extracting descriptions of entities from newswire data seems inapplicable in our work. The narrowness of the application domain, while clearly necessary for the success of this project, is not an option in IIA. SUMMONS is, nonetheless, an impressive system even in its present stage, and the ensuing research will be worth watching.
The authors present the system PROFILE, which combines the extraction of entity names and the generation of descriptions using FUF/SURGE, which is based on functional descriptions. The extraction can use an on-line newswire browser or descriptions stored from older newswire, so PROFILE maintains a database of descriptions. This proceeds as follows: extraction of descriptions, categorization of descriptions, and organization of descriptions in a database of profiles.
The authors propose a method that generates summaries of news based on a discourse macro structure (DMS). Their approach rests on the observation that certain types of text conform to a set of style and organization constraints; for example, for news text the DMS is: background, and what is the news. Summarization is then based on DMS template filling. The extraction of DMS components is based on scoring paragraphs using metrics. The metrics integrate a weighting of paragraphs based on term frequency, terms occurring in the title and in the paragraphs, noun phrases, words occurring only in some paragraphs, certain cue phrases, some indications, etc. They classify their approach as summarization-based query expansion.

Text categorization

The authors first describe algorithms that classify texts using extraction patterns and semantic features associated with role fillers in the MUC domain (i.e. the domain of terrorism). Second, they describe the automatic generation of extraction patterns using preclassified texts as input. This is achieved by the word-augmented relevancy signatures algorithm, which uses lexical items to represent domain-specific role relationships instead of semantic features. The system proceeds in two stages: in the first stage, it uses heuristic rules to generate an extraction pattern for every noun phrase in the corpus; the result of this stage is a giant dictionary of extraction patterns. In the second stage, the training corpus is processed a second time using the new extraction patterns; for each pattern, its relevance rate is estimated and the patterns are ranked. The algorithm is used in a text categorization system which allows classification terms to be generated.