Stewart Baillie

June 1996

Research Proposal - beta version

Abstract

This Cognitive Science research proposal investigates the application of linguistic knowledge to computer science technology. The aim is to endow an intelligent software agent operating on the world-wide web (a web spider) with domain understanding. In particular, emphasis will be placed on the contribution of weak NLP and knowledge-based approaches to the understanding of a specific hypermedia domain.

Introduction

The internet has established itself as a powerful communication and information resource[1]. The hypermedia architecture and integrated GUI browsers of the world-wide web have encouraged a thriving net community which now numbers in the millions[2]. As a result, the web is playing an increasingly important role in public, commercial, and academic life. However, the shift to interactive web-space is not without technical and social problems.

One emerging problem is that of information discovery. A web user is generally interested in obtaining relevant information, but not in the discovery process. Unfortunately, the scale, diversity, and heterogeneity of the web are now such that finding relevant information is often difficult and time-consuming. Thus, the discovery task may constitute an expensive distraction incidental to the user's main purpose. One way out of this situation is for the user to delegate the discovery task to a specialist information agent. The user's responsibility is then reduced to negotiating a task specification.

Web Spiders

Automating this style of delegative relationship is a topic of research in the intelligent software agent (ISA) branch of Artificial Intelligence. The aim is to produce software agents capable of autonomous intelligent action. The cyber-space world of the web is ideally suited for this type of application. It is already host to hundreds[3] of information processing software agents, or web spiders[4]. A few examples follow.

* The Amalthaea project investigates an evolving ecosystem of competing and cooperating information discovery and information filtering agents.

* The CiFi project investigates the discovery and retrieval of academic papers and abstracts from the net.

* The FERRET project uses a text skimming parser, dictionary, and domain scripts to investigate conceptual information retrieval based on canonical knowledge representation and case frame matching.

* The Letizia project investigates intelligent user interfaces which assist interactive web browsing by searching ahead and anticipating the user's interests.

* The RBSE Spider project investigates the automatic maintenance of WAIS style indexed URL databases.

* The Sulla project investigates personal proxy web servers built from user interest profiles, where potentially relevant resources are automatically discovered and then cached locally.

* The WebAnts project investigates cooperative multiagent web searches with results sharing.

The developers of web spiders need to address the following issues:

* the negotiation of a task specification

* the purposeful navigation of the web

* the interpretation of hypermedia objects

* the reporting of relevant information

Hyper-understanding

The focus of the proposed research is the third of these issues, hypermedia understanding[5]. An interdisciplinary approach is adopted in which the technologies of computer science are employed to capture linguistic and hypermedia knowledge. For practical reasons, the analysis of hypermedia will be restricted to text objects[6].

Linguistics is traditionally divided into three subdisciplines: syntax, semantics, and pragmatics.

* Syntax refers to the surface forms: the grammatical rules and classes that govern lexical and morphological arrangement.

* Semantics refers to the literal interpretation of meaning from context-free sentences.

* Pragmatics refers to the contribution of context to meaning interpretation.

These traditional levels of abstraction have proved useful in describing and formalising language behaviour. However, the levels are found to be interpenetrating and the boundaries at best vague. This is particularly true of theories of understanding, which must attend to all three levels simultaneously. Noting this caveat, let us apply some linguistic modelling to the web.

The web can be viewed from two perspectives:

* as a semantic network of hyper-documents or

* as a physical network of hyper-objects.

I use the term hyper-document to describe a finite, intentionally structured semantic domain. The domain may consist of a single hyper-object, or of an inter-linked set of hyper-objects. Hyper-objects are the physical stuff of the web: the objects which HTML[7] tags reference, such as text blocks, images, programs, and hyper-pages. Four levels of information can be seen to contribute to the understanding of web hyper-documents:

* the semantic content internal to hyper-objects

* the discourse function of hyper-objects

* the discourse organisation within hyper-documents

* the pragmatic context of hyper-documents

The reconstruction of a hyper-document's meaning involves the interaction of all of these levels.

Semantic content: Hypertext objects contain text. Traditional linguistics breaks language into sentential units for semantic interpretation. In the case of hypertext, it is convenient to expand these units to include text objects such as heading and anchor descriptions. Traditional NLP and statistical approaches may be used to interpret the literal meaning of these units.

Discourse function: A discourse is organised into segments which fulfil specific functions. Natural language offers a range of clues for identifying discourse segments. Hypermedia documents contain explicit clues to this process. HTML acts as a meta-language that describes the relationships between content-bearing objects. It is a kind of hyper-intonation that can indicate the discourse function of hyper-objects. One simple example is the relationship between heading, italicised, and bolded objects and discourse salience (a sketch of this idea follows the four levels). Another is the association of text that follows a conclusion or abstract heading with that discourse intention.

Discourse organisation: It has long been recognised that sentential semantics is insufficient for general understanding. Sentences exist in the wider context of a discourse. By discourse, I mean a deliberate concatenation of semantic units. The arrangement of semantic units within a discourse is meaningful. The structure determines the contribution that each literal semantic unit makes towards the final discourse meaning. The organisation of a hyper-document reflects a discourse structure.

Pragmatic context: A hyper-document is embedded in a wider web structure. The interpretation of a hyper-document is often influenced by its web context. For instance, a university homepage forms the root of an information hierarchy. Thus a "faculty information" anchor would probably link to a subordinate page of faculty links, whereas the same anchor in an individual staff member's homepage would probably link to his or her superordinate faculty's homepage.
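To make the discourse function level concrete, the following is a minimal sketch that assigns crude salience weights to text objects according to the HTML tags that enclose them. The tag set, the weights, and the regular-expression tag scanner are illustrative assumptions of mine rather than a committed design.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch: derive crude discourse-salience weights for text objects
// from the HTML tags that enclose them. The tag set and weights are
// illustrative assumptions only.
public class TagSalience {

    private static final Map<String, Double> TAG_WEIGHTS = new LinkedHashMap<>();
    static {
        TAG_WEIGHTS.put("title", 3.0);  // page titles name the document
        TAG_WEIGHTS.put("h1", 3.0);     // top-level headings
        TAG_WEIGHTS.put("h2", 2.5);
        TAG_WEIGHTS.put("b", 1.5);      // bolded text
        TAG_WEIGHTS.put("i", 1.2);      // italicised text
        TAG_WEIGHTS.put("a", 1.5);      // anchor text often names the link target
    }

    // Extract the text enclosed by each weighted tag and report its weight.
    public static Map<String, Double> weightTextObjects(String html) {
        Map<String, Double> weighted = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : TAG_WEIGHTS.entrySet()) {
            Pattern p = Pattern.compile(
                "<" + e.getKey() + "[^>]*>(.*?)</" + e.getKey() + ">",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
            Matcher m = p.matcher(html);
            while (m.find()) {
                weighted.put(m.group(1).trim(), e.getValue());
            }
        }
        return weighted;
    }

    public static void main(String[] args) {
        String page = "<title>Department of Computer Science</title>"
                    + "<p>General information. <b>Applications close in June.</b></p>"
                    + "<a href=\"courses.html\">Course information</a>";
        weightTextObjects(page).forEach(
            (text, weight) -> System.out.println(weight + "  " + text));
    }
}

Weights of this kind could feed directly into the keyword scoring discussed under The Approach below.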

Approaches to document meaning can employ techniques of:

* statistical analysis or

* linguistic analysis[8].

Statistical approaches by-pass semantics and exploit purely surface orthographical patterns. They are currently the most popular and effective approaches (Harman et al. 1995). However, they are limited in principle to exploiting correlations between form and meaning.
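As a concrete illustration of what a purely surface-form measure amounts to, the sketch below scores a text by counting occurrences of task keywords, with no appeal to meaning. The keyword list and scoring rule are illustrative assumptions; note how the plural "subjects" fails to match the keyword "subject", which is the form-meaning gap just described.

import java.util.Arrays;
import java.util.List;

// Minimal sketch of a purely statistical (surface-form) relevance measure:
// count how often each task keyword occurs in the lower-cased text.
public class KeywordScore {

    public static int score(String text, List<String> keywords) {
        int hits = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (keywords.contains(token)) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("course", "degree", "subject");
        String text = "The degree consists of eight subjects; course entry "
                    + "requirements are listed below.";
        // Prints 2: "degree" and "course" match, but "subjects" does not,
        // because only surface form is compared.
        System.out.println("keyword hits = " + score(text, keywords));
    }
}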

General purpose NLP techniques suffer from no such in-principle limit, but they do suffer from very real in-practice limits. Linguistic theories are immature, especially where semantics and pragmatics are concerned. The latter two rely on vast networks of language and commonsense knowledge. Unfortunately, commonsense is a hard AI problem and currently without a solution. As a result, NLP approaches currently constitute pure research programmes rather than practical technologies. This view is supported by experience in automatic document indexing, where advanced linguistic models can perform worse than simple statistical techniques (Salton and McGill 1983). A more recent review states that "to date natural language processing techniques have not significantly improved the performance of document retrieval" (Harman et al. 1995). Anecdotal evidence is also available from sources such as the conversation-based Julia project, where the robot's sophisticated discourse competence is the result of cleverly engineered pattern matching. Foner (1993) notes that "her parser is shockingly simple, as such things go. It is barely more complex than ELIZA's parser in some ways, in that it does not decompose its inputs into parse trees or anything else that a linguist might suggest".

At present it would appear that strong NLP approaches to language understanding are not mature enough to be practical. Therefore, it is the aim of this research to explore weak NLP approaches. Weak NLP techniques are those not based on overarching formal linguistic theories. They typically employ grammatical heuristics and compiled knowledge, and do not rely on deep parsing or formal semantics.

The Approach

The project will augment a simple statistical keyword matching technique[9] with knowledge-based techniques: specifically, weak NLP techniques, knowledge about the four levels of hyper-document information discussed above, and specific subject knowledge. An important thesis of the research is that knowledge-based approaches to hypertext understanding can significantly improve the performance of statistically-based approaches.
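As a rough indication of the intended combination, the sketch below augments a plain keyword count with two knowledge-based contributions: a boost for matches inside structurally salient HTML objects (headings, titles, anchors) and a boost for terms drawn from a hand-built domain lexicon. The weights, the lexicon, and the tag heuristic are hypothetical placeholders; the actual balance of techniques is what the project will investigate.

import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the proposed combination: a statistical keyword count
// augmented with two knowledge-based boosts. The weights, the domain lexicon
// and the tag heuristic are hypothetical placeholders.
public class AugmentedScore {

    private static final Pattern SALIENT =
        Pattern.compile("<(h[1-6]|title|a)[^>]*>(.*?)</\\1>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static double score(String html, List<String> keywords,
                               List<String> domainLexicon) {
        double score = 0.0;

        // 1. Statistical baseline: raw keyword hits anywhere in the page.
        for (String token : html.toLowerCase().split("\\W+")) {
            if (keywords.contains(token)) score += 1.0;
            // 2. Domain-knowledge boost: terms from a hand-built lexicon.
            if (domainLexicon.contains(token)) score += 0.5;
        }

        // 3. Structural boost: keywords inside headings, titles or anchors.
        Matcher m = SALIENT.matcher(html);
        while (m.find()) {
            for (String token : m.group(2).toLowerCase().split("\\W+")) {
                if (keywords.contains(token)) score += 2.0;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        String page = "<h1>Courses offered</h1><p>The department offers an "
                    + "undergraduate degree and a masters programme.</p>";
        List<String> keywords = Arrays.asList("courses", "degree");
        List<String> lexicon = Arrays.asList("undergraduate", "masters",
                                             "programme", "department");
        System.out.println("score = " + score(page, keywords, lexicon));
    }
}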

Some advantages of taking knowledge-based approaches to hypermedia understanding are that:

* they avoid the commonsense problem

* they present a much smaller programming task than modelling real grammars and their associated syntactic/semantic dictionaries

* they do not require powerful and/or specialised hardware platforms

* they are better scaled to a minor thesis

Some disadvantages are that:

* they lack a formal theoretical base which results in a loss of generality

* they tend to be domain specific and not portable

* they cannot learn (the compiled knowledge is static)

The Generic Project

The MEngSc (Cognitive Science) guidelines require "the implementation of an AI system and an assessment of the implementation". This suggests a knowledge-based approach because it favours practicality over generality. The hypermedia understanding research will centre on an agent application specialised to prosecute information extraction tasks within a particular semantic domain. An existing web agent, or a component thereof, will act as a development shell[10]. The research project will consist of adding domain, hypermedia, and weak NLP knowledge to the shell. The contribution of the various understanding techniques will be appraised. It is hoped that some generalisable web understanding heuristics and knowledge-based development methods will emerge.

More Detailed Speculations

Agent research at the University of Melbourne is new, and no general purpose agent environment yet exists. However, research into hypermedia understanding may be pursued in parallel with a more general agent development programme. In the interim, a stand-in application will be required to interface the understanding agency with the user and with the web. The PageSearcher application is a browser-based, Java-language hypermedia explorer that uses keyword matching. It may provide a vehicle in which to commence the understanding research.

The semantic domain of Computer Science department web pages is under consideration. This domain has the advantage of being well represented on the web and of offering a constrained diversity of structure.

The task of acquiring prospective student course information is under consideration. This would take the form of canned question-answer adjacency pairs such as:

What courses are offered?

What is studied in each degree?
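To indicate how such canned pairs might be represented inside the agent, the following minimal sketch maps each question onto a set of task descriptors that would then be matched against pages in the chosen domain. The questions come from the examples above; the descriptor vocabulary is an illustrative assumption.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: canned questions mapped to task descriptors the agent
// would attempt to satisfy against a department's pages. The descriptor
// vocabulary is an illustrative assumption.
public class CannedQuestions {

    private static final Map<String, List<String>> TASKS = new LinkedHashMap<>();
    static {
        TASKS.put("What courses are offered?",
                  Arrays.asList("course", "degree", "diploma", "offered"));
        TASKS.put("What is studied in each degree?",
                  Arrays.asList("subject", "syllabus", "curriculum", "degree"));
    }

    public static void main(String[] args) {
        TASKS.forEach((question, descriptors) ->
            System.out.println(question + " -> " + descriptors));
    }
}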

The project may consider some of the following techniques:

* The compilation of the four levels of hypermedia information into salient knowledge structures.

* Full document text analysis.

* An extension of speech act theory from semantics to pragmatics. Discourse analysis[11] techniques may be used to identify intentional discourse segments such as abstracts, introductions, conclusions, explanations, discussions, arguments, and examples (see Courant et al. c1990 or Scott and Kamp 1995).

* Task goals communicated as descriptors that preferentially map onto related discourse function segments, such as text, headings, and figures, or onto discourse intention segments, such as introductions, examples, and arguments.

* Task goals communicated as descriptors that categorise words and phrases in ways that exploit patterns observed in the text structures of the web. For example, an ontology of semantic classes such as person, date, number, and subject (see the sketch after this list).

* A taxonomy of web pages may be constructed as part of a mini domain ontology. For example, hypertext pages may be categorised as organisational homepages, personal homepages, link pages, or content pages. Specific knowledge could then be compiled about the function and structure of each page type.

* The parsing of text into syntactic constituents is beyond the scope of this project. However, delimiting text into punctuated adjuncts may be useful in bounding lexical semantics. This may assist in compound term identification, or in disambiguating polysemous words.

* Provide a lexical recogniser or domain thesaurus for homophonous word matching.

* Provide a lexicon of domain relevant words.
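As a sketch of the semantic-class descriptors raised in the list above, the fragment below tags substrings as dates, numbers, or person-name candidates using surface patterns alone. The class names and patterns are illustrative assumptions; a real ontology would be compiled from observation of the chosen domain's pages.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of a surface-pattern recogniser for a few semantic classes
// (date, number, person-name candidate). The classes and patterns are
// illustrative assumptions, not a committed design.
public class SemanticClasses {

    private static final Map<String, Pattern> CLASSES = new LinkedHashMap<>();
    static {
        CLASSES.put("date", Pattern.compile(
            "\\d{1,2}\\s+(January|February|March|April|May|June|July|August|"
          + "September|October|November|December)\\s+\\d{4}"));
        CLASSES.put("number", Pattern.compile("\\b\\d+(\\.\\d+)?\\b"));
        // Very crude person-name candidate: two adjacent capitalised words.
        CLASSES.put("person", Pattern.compile("\\b[A-Z][a-z]+\\s+[A-Z][a-z]+\\b"));
    }

    public static void classify(String text) {
        for (Map.Entry<String, Pattern> e : CLASSES.entrySet()) {
            Matcher m = e.getValue().matcher(text);
            while (m.find()) {
                System.out.println(e.getKey() + ": " + m.group());
            }
        }
    }

    public static void main(String[] args) {
        classify("Applications close on 30 June 1996. Enquiries to Mary Smith, "
               + "room 2.15.");
    }
}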

References

Courant, M., Law, I. and Vauthey, B. (c1990) RESUME - Text Retrieval Using Metatext. In ...

Eichmann, D. (1994) The RBSE Spider -- Balancing Effective Search Against Web Load. First International Conference on the World Wide Web, Geneva, Switzerland, May 25-27. pp. 113-120.

Foner, Leonard (1993) What's an Agent, Anyway? A Sociological Case Study. MIT Media Lab: Cambridge, MA.

URL: http://foner.www.media.mit.edu/people/foner/Julia/Julia.html

Harman, Donna, Schauble, Peter and Smeaton, Alan (1995) Document Retrieval. In Varile, Giovanni and Zampolli, Antonio (Eds) Survey of the State of the Art in Human Language Technology. Ch. 7, Document Processing. pp. 259-262.

URL: http://www.cse.ogi.edu/CSLU/HTLsurvey

Koster, M. (1996) List of Robots, Nexor Corp.,

URL: http://info.webcrawler.com/mak/projects/robots/active.html

Lieberman, Henry (1995) Letizia: An Agent That Assists Web Browsing, In International Joint Conference on Artificial Intelligence, Montreal.

URL: http://lieber.www.media.mit.edu/people/lieber/letizia

Loke, Seng Wai, Davison, Andrew and Sterling, Leon (1996) CiFi: An Intelligent Agent for Citation Finding on the World-Wide Web. University of Melbourne Technical Report 96/4.

Mauldin, Michael (1991) Retrieval Performance in FERRET: A Conceptual Information Retrieval System. In The 14th International Conference on Research and Development in Information Retrieval, ACM SIGIR.

Moukas, Alexandros (1995) Amalthaea: Information Discovery and Filtering using a Multiagent Evolving Ecosystem. MIT Media Lab: Cambridge, MA.

Rickard, Jack (1996) Internet Numbers Redux, Boardwatch Magazine, vol X, issue 4, April 1996, ISSN: 1054-2760

URL: http://www.boardwatch.com/

Rive, Nick (1996) Engsearch Project Report, University of Melbourne Summer Report 96/1.

Salton, Gerard and McGill, Michael J. (1983) Introduction to Modern Information Retrieval. McGraw-Hill: New York. ISBN 0-07-054484-0.

Scott, Donia and Kamp, Hans (1995) Discourse Modeling. In Varile, Giovanni and Zampolli, Antonio (Eds) Survey of the State of the Art in Human Language Technology. Ch. 6, Discourse and Dialogue. pp. 230-233.

URL: http://www.cse.ogi.edu/CSLU/HTLsurvey

AltaVista main page, URL: http://www.altavista.digital.com/

WebAnts home page, URL: http://polarbear.eng.lycos.com:8001/webants/

WebCrawler's - Web Size, Global Navigation Network, Inc.,

URL: http://webcrawler.com/WebCrawler/Facts/Size.html

Sulla - A User Agent for the Web,

URL: http://ricis.cl.uh.edu/agents/sulla.html


[1] Digital Equipment Corporation's AltaVista search engine currently indexes over 30 million hyper-pages (AltaVista 1996 - May). WebCrawler conservatively estimates over 145,000 web servers, six times the number of a year ago (WebCrawler 1996 - May).

[2] Boardwatch Magazine soberly estimates nearly 17 million internet users in May 1996 (Rickard 1996)

[3] The NEXOR corporation web agent index holds over 50 "official" registrations (Koster 1996 - May)

[4] A metaphorical extension: the spider's habitat is the web.

[5] It is common practice to use anthropomorphic terminology in describing Artificial Intelligence objectives. I use such terms to indicate the notional intention of the research rather than to imply isomorphism with humans.

[6] This is not expected to significantly limit hypermedia understanding because the text generally carries the bulk of the information content.

[7] HyperText Markup Language (HTML) provides the syntax of web hypermedia and tags its sentences.

[8] Or hybrids of the two; however, the intent here is to identify the two different approaches.

[9] The research focus is on the contribution of domain knowledge and weak NLP techniques. A more complex statistical model could be adopted, but this would complicate, and possibly confuse, the research effort.

[10] The agent shell has yet to be finalised.

[11] Discourse structure will be ascertained using both linguistic cues and HTML tags.