Information Retrieval (IR) has been developed to provide practical solutions to people's need to find desired information in large collections of data. The IR task can be seen as the ``digital twin'' of the task of a person looking in a library for material relevant to a certain subject. In both cases, the searcher has an information need which has to be translated into library indices or query terms. The resulting query is then submitted to some system, a library catalogue or a computerized retrieval system, which in turn suggests (retrieves) relevant material. The searcher will usually find that some of the suggested documents are not actually relevant, and will also suspect that some relevant documents have been missed. For static collections, the effectiveness of such a search can be quantified using two metrics, precision and recall. Precision is defined as the ratio of the number of relevant retrieved documents to the total number of retrieved documents. Recall is the ratio of the number of relevant retrieved documents to the total number of relevant documents in the document collection. For an extended introduction to the IR problem, its history, widely accepted techniques, and retrieval evaluation metrics, the reader should refer to the classical books  and ; for a collection of classical articles in IR, to  (all in Readings for Further Study).
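As a minimal illustration of the two metrics defined above, the following Python sketch computes precision and recall for a single query. The function name `precision_recall`, the document identifiers, and the convention of returning 0.0 when a denominator is empty are assumptions made for this example only; they are not taken from any particular IR system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query over a static collection."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    # Relevant retrieved documents: the numerator of both ratios.
    hits = len(retrieved & relevant)
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 4 documents retrieved, 5 relevant in the collection,
# 2 documents in common.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7", "d8", "d9"]
precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(precision, recall)  # 0.5 0.4
```

Here half of the retrieved documents are relevant (precision 0.5), while only two of the five relevant documents were found (recall 0.4), which mirrors the searcher's two complaints: irrelevant suggestions and missed material.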
The tremendous increase in digital information over the last decade has led to a new challenge in IR. A World-Wide Web search today involves large amounts of information, and going through hundreds of irrelevant hits, as is usually necessary, is very painful. Although IR has been in existence for more than three decades (and as a part of library science for much longer), modern technology is for the most part still based on a simple assumption which often leads to unsatisfactory results. Restricting the problem to textual data, the assumption, implicit or explicit, upon which most commercial IR systems are based, is that
Linguistic variation in the IR context may be interpreted as follows: language is not merely a ``bag'' of words. Language is a means of communicating about concepts, entities and relations, which may be expressed in many forms. Word order may matter (as in the science library vs. library science example) or not (general director vs. director general). Moreover, words combine to form phrases and other larger units whose meaning may not be directly derivable from the individual words. For example, a hot dog, whether hot or not, has nothing to do with dogs. Given such considerations, it has been conjectured that a better representation should also include groups of words (phrases) and some form of regularization of words, word order, and meaning. Indeed, many researchers have developed such techniques.
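The loss of word order in a ``bag'' of words can be made concrete with a small Python sketch. It compares an unordered word-count representation with ordered word pairs (bigrams), used here only as a crude stand-in for phrases. The helper names `bag_of_words` and `word_bigrams` and the whitespace tokenization are assumptions of this example, not a description of any specific system.

```python
from collections import Counter


def bag_of_words(text):
    """Unordered multiset of lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def word_bigrams(text):
    """Ordered adjacent word pairs, a crude stand-in for phrases."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


# The two phrases are indistinguishable as bags of words ...
same_bag = bag_of_words("science library") == bag_of_words("library science")
# ... but differ once word order is kept.
same_bigrams = word_bigrams("science library") == word_bigrams("library science")
print(same_bag, same_bigrams)  # True False
```

Note that ordered pairs alone would also treat general director and director general as different terms, although their meaning is the same. This is why the representation conjectured above calls for some regularization of word order in addition to phrases.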
This paper discusses a retrieval schema which attempts to overcome the problems originating from the Keyword Retrieval Hypothesis and linguistic variation. It is partly based on , and is organized as follows. First, in section 2 we will review some of the most important attempts made to deal with linguistic variation. Then, in the rest of the paper we will discuss the key aspects of a linguistically-motivated retrieval system. Starting in section 3 from a phrase retrieval hypothesis, a naive extension of the Keyword Retrieval Hypothesis, we will address a representation of phrases suitable for IR in section 4. In section 5, possible regularizations of natural language will be outlined. The weighting of phrasal indexing terms and their matching will be discussed in section 6. An example architecture of such a linguistically-motivated retrieval system will be depicted in section 7. Finally, we will draw some conclusions in section 8.