Next: 2. Dealing with Linguistic Up: Linguistically-motivated Information Retrieval Previous: Linguistically-motivated Information Retrieval

1. Introduction

Information Retrieval (IR) has been developed to provide practical solutions to people's need to find desired information in large collections of data. The IR task can be seen as the "digital twin" of the task of a person looking in a library for material relevant to a certain subject. In both cases, the searcher has an information need which has to be translated into library indices or query terms. The resulting request is then submitted to some system, either a library catalogue or a computerized retrieval system, which in turn suggests (retrieves) relevant material. The searcher will usually find that some of the suggested documents are not actually relevant, and will also suspect that some relevant documents have been missed. For static collections, the effectiveness of such a search can be quantified using two metrics, precision and recall. Precision is defined as the ratio of the number of relevant retrieved documents to the total number of retrieved documents. Recall is the ratio of the number of relevant retrieved documents to the total number of relevant documents in the document collection. For an extended introduction to the IR problem, its history, widely accepted techniques, and retrieval evaluation metrics, the reader should refer to the classical books [1] and [2]; for a collection of classical articles in IR, to [3] (all in Readings for Further Study).
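The two definitions can be made concrete with a minimal sketch. The function name precision_recall, the document identifiers, and the convention of returning 0.0 for an empty set are illustrative assumptions, not part of the original text.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query over a static collection."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # number of relevant retrieved documents
    # Assumption: report 0.0 when a ratio is undefined (empty denominator).
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# while the collection contains 6 relevant documents in total.
p, r = precision_recall(
    ["d1", "d2", "d3", "d4"],
    ["d1", "d2", "d3", "d7", "d8", "d9"],
)
print(p, r)  # 0.75 0.5
```

In this made-up case one retrieved document in four is irrelevant (precision 0.75), and half of the relevant documents were missed (recall 0.5).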

The tremendous increase in digitally available information over the last decade has led to a new challenge in IR. A World-Wide Web search today involves large amounts of information, and going through hundreds of irrelevant hits, as is usually necessary, is very painful. Although IR has been in existence for more than three decades (and as a part of library science for much longer), modern technology is for the most part still based on a simple assumption which often leads to unsatisfactory results. Restricting the problem to textual data, the assumption, implicit or explicit, upon which most commercial IR systems are based is the following:

Definition 1 (naive Keyword Retrieval Hypothesis) If a query and a document have a (key)word in common, then the document is to some extent about the query.

Of course, if they have more keywords in common, then the document is more about the query. Moreover, the keywords are usually augmented with weights indicating their importance as information discriminators. In this view, the IR problem is reduced to matching the "bag" of keywords in the user's query with the "bag" of keywords representing each document. The output of such a matching is usually a ranked list of documents: most relevant first, least relevant last. This relatively simple representation is the computer-age equivalent of library catalogues, and carries the same inadequacies. The most obvious inadequacies originate from linguistic variation, which makes the Keyword Retrieval Hypothesis insufficient because:
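The matching described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: keywords are obtained by whitespace splitting, the score is the plain sum of the weights of shared keywords, and the documents and weights are invented for the example. Real systems use more elaborate weighting schemes.

```python
from collections import Counter


def bag(text):
    """Lowercased bag of words; a stand-in for real keyword extraction."""
    return Counter(text.lower().split())


def score(query, document, weights=None):
    """Sum the weights of the keywords shared by query and document."""
    weights = weights or {}
    shared = bag(query).keys() & bag(document).keys()
    # Assumption: keywords without an explicit weight count as 1.0.
    return sum(weights.get(word, 1.0) for word in shared)


def rank(query, documents, weights=None):
    """Return (doc_id, score) pairs, best first, omitting non-matching documents."""
    scored = [(doc_id, score(query, text, weights)) for doc_id, text in documents.items()]
    matching = [pair for pair in scored if pair[1] > 0]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection and keyword weights.
docs = {
    "d1": "air pollution in large cities",
    "d2": "river pollution and water quality",
    "d3": "a history of library science",
}
weights = {"river": 2.0, "pollution": 1.5}

ranking = rank("river pollution", docs, weights)
print(ranking)  # [('d2', 3.5), ('d1', 1.5)]
```

The document sharing both keywords is ranked first, the one sharing a single keyword second, and the document with no keyword in common is not retrieved at all.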

1. It does not deal with morphological variation, which produces keywords in different numbers, for instance wolf and wolves, or different cases, like man and man's (dealing with cases is trivial for English, but it is crucial for more inflected languages like German or Greek).
2. It does not handle cases where different words are used to represent the same meaning. For this phenomenon we use the term lexical variation. The result is that a query with the keyword film does not retrieve documents which contain its synonym movie.
3. It does not distinguish cases where single words have multiple meanings due to semantical variation. A singer looking for bands will be faced with radio frequency bands as well.
4. It does not deal sufficiently with syntactical variation. A document which contains the phrase "near to the river, air pollution is a major problem" is not about river pollution, although both keywords co-occur in the document. And certainly science library is not the same as library science.
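The four problems listed above can be reproduced with a deliberately naive keyword matcher. The helper names and the example documents below are assumptions made for illustration; only the word pairs and the river sentence come from the text.

```python
def keywords(text):
    """Naive keyword extraction: lowercase, drop commas, split on whitespace."""
    return set(text.lower().replace(",", " ").split())


def matches_any(query, document):
    """True if query and document share at least one keyword."""
    return bool(keywords(query) & keywords(document))


def matches_all(query, document):
    """True if every query keyword occurs somewhere in the document."""
    return keywords(query) <= keywords(document)


# 1. Morphological variation: the plural form is a different string.
miss_morph = not matches_any("wolf", "wolves hunt in packs")

# 2. Lexical variation: the synonym is a different string.
miss_lex = not matches_any("film", "a review of the new movie")

# 3. Semantical variation: the same string with an unrelated meaning.
false_hit_sem = matches_any("bands", "radio frequency bands are regulated")

# 4. Syntactical variation: both keywords co-occur but are unrelated,
#    and a bag of keywords cannot tell the two word orders apart.
false_hit_syn = matches_all(
    "river pollution", "near to the river, air pollution is a major problem"
)
order_lost = keywords("science library") == keywords("library science")

print(miss_morph, miss_lex, false_hit_sem, false_hit_syn, order_lost)
```

All five flags come out true: the first two are relevant documents that are missed, the last three are spurious matches.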

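Linguistic variation degrades the effectiveness of IR systems in terms of both precision and recall. On the one hand, morphological and lexical variation hurt recall, because relevant documents that use a different word form or a synonym are missed. On the other hand, semantical and syntactical variation hurt precision, because documents that merely contain the query words are retrieved although they are not relevant. Moreover, trying to improve recall usually decreases precision, and vice versa.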

Linguistic variation in the IR context may be interpreted as follows: language is not merely a "bag" of words. Language is a means to communicate about concepts, entities and relations which may be expressed in many forms. Word order may matter (as in the science library vs. library science example) or not (general director vs. director general). Moreover, words combine to form phrases and other larger units with a meaning that may not be directly derivable from the individual words. For example, a hot dog, either hot or not, has nothing to do with dogs. Given such considerations, it has been conjectured that a better representation should also include groups of words (phrases) and some form of regularization of words, word order, and meaning. Indeed, many researchers have developed such techniques.
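As a rough illustration of why both phrases and some regularization of word order are needed, the sketch below indexes adjacent word pairs instead of single words. Adjacent pairs are only a stand-in chosen for brevity; they are not the phrase representation developed later in this paper, and the function names are hypothetical.

```python
def bigrams(text):
    """Set of ordered pairs of adjacent words."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def unordered_bigrams(text):
    """Same pairs with word order discarded (a crude order regularization)."""
    return {frozenset(pair) for pair in bigrams(text)}


# Ordered pairs keep the distinction that a bag of words loses.
order_kept = bigrams("science library") != bigrams("library science")

# But they also separate phrases whose word order does not matter.
split_apart = bigrams("general director") != bigrams("director general")

# Discarding the order merges those two again, at the price of
# also merging "science library" with "library science".
merged = unordered_bigrams("general director") == unordered_bigrams("director general")

print(order_kept, split_apart, merged)
```

Neither choice is right for every phrase, which suggests that regularizing word order requires linguistic knowledge rather than a fixed rule.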

This paper discusses a retrieval schema which attempts to overcome the problems originating from the Keyword Retrieval Hypothesis and linguistic variation. It is partly based on [1], and is organized as follows. First, in section 2 we will review some of the most important attempts made to deal with linguistic variation. Then, in the rest of the paper we will discuss the key aspects of a linguistically-motivated retrieval system. Starting in section 3 from a phrase retrieval hypothesis, a naive extension of the Keyword Retrieval Hypothesis, we will address in section 4 a representation of phrases suitable for IR. In section 5, possible regularizations of natural language will be outlined. The weighting of phrasal indexing terms and their matching will be discussed in section 6. An example architecture of such a linguistically-motivated retrieval system will be depicted in section 7. Finally, we will draw some conclusions in section 8.


avi (dot) arampatzis (at) gmail