Next: 4. Representation of Phrases Up: Linguistically-motivated Information Retrieval Previous: 2. Dealing with Linguistic

3. The Phrase Retrieval Hypothesis

The goal of the indexing task is to assign to documents the characterizations (terms) that are deemed to best represent their content. Every term used to characterize documents of the same collection can be seen as adding a new dimension to the characterization. Terms should be assigned in such a way that documents on the same topic are positioned close together in the N-dimensional term space, while those on different topics are placed sufficiently far apart. Terms can be anything from, e.g., trigrams and words to linguistic entities and concepts. In the two extreme cases, each document is characterized by itself, e.g. by its document number, or all documents share exactly the same characterization. The former positions documents as far apart as possible, which leaves no way of retrieving documents on the same topic and is therefore unusable in the IR context. The latter provides no way of discriminating between different topics. Therefore, a suitable characterization must be both usable and discriminating.
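The geometric picture above can be made concrete with a minimal sketch. It assumes a plain bag-of-words representation and cosine similarity as the closeness measure; neither is prescribed by the text, and the three toy documents and all function names are invented for illustration.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words vector: every distinct term is one dimension (an assumption)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection: d1 and d2 share a topic, d3 does not.
docs = {
    "d1": "river pollution harms fish",
    "d2": "pollution of the river harms wildlife",
    "d3": "sound system for the concert hall",
}
vectors = {name: term_vector(text) for name, text in docs.items()}
same_topic = cosine(vectors["d1"], vectors["d2"])
different_topic = cosine(vectors["d1"], vectors["d3"])

# The two extreme characterizations described in the text.
by_id = {name: Counter([name]) for name in docs}     # each document is its own term
constant = {name: Counter(["doc"]) for name in docs}  # every document gets the same term
```

In this sketch `same_topic` is larger than `different_topic`. Characterizing documents by their identifiers makes every pair orthogonal, so nothing on the same topic can be retrieved, while the constant characterization makes every pair identical, so nothing can be discriminated.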

In a keyword-based representation, every document is characterized by a set of keywords, with weights representing the importance of each keyword in characterizing the document. Keywords are usually derived directly from the document's text. Keyword-based representations are only modestly usable and discriminating. Single words are rarely specific enough for accurate representation: the word "system" alone says little, whereas "a sound system" narrows the meaning somewhat. Moreover, a word that occurs with high frequency in a document collection is not a good discriminator. A phrase, on the other hand, even one made up of high-frequency words, may occur in only a few documents and thus be a good discriminator. These observations suggest that a better characterization will make use of phrases. Consequently, a naive phrase retrieval hypothesis can be formalized as follows.

Definition 2 (naive Phrase Retrieval Hypothesis) If a query and a document have a phrase in common, then the document is to some extent about the query.
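A minimal sketch of how Definition 2 differs from keyword matching. It assumes, purely for illustration, that a phrase is a contiguous pair of words; the text deliberately leaves open what a phrase is, and the function names and example documents are hypothetical.

```python
def word_ngrams(text, n=2):
    """Contiguous word n-grams, used here as a stand-in for 'phrases' (an assumption)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shares_keyword(query, document):
    """Keyword matching: query and document have at least one word in common."""
    return bool(set(query.lower().split()) & set(document.lower().split()))

def shares_phrase(query, document, n=2):
    """Naive phrase matching: query and document have at least one n-gram in common."""
    return bool(word_ngrams(query, n) & word_ngrams(document, n))

query = "sound system"
doc_a = "a new sound system was installed in the hall"
doc_b = "the operating system crashed without a sound"

keyword_hits = [shares_keyword(query, d) for d in (doc_a, doc_b)]
phrase_hits = [shares_phrase(query, d) for d in (doc_a, doc_b)]
```

Both documents match on keywords, but only `doc_a` shares the phrase with the query, which is the gain in discrimination that the hypothesis is after.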

The Phrase Retrieval Hypothesis does not solve the problems originating from the Keyword Retrieval Hypothesis and linguistic variation. On the contrary, it raises further questions, such as what a phrase is and how phrases should be used for indexing, weighted, and matched. We use this definition merely as a starting point upon which we will build our framework.

Phrases can be obtained using statistical or syntactic methods. Syntactic phrases appear to be reasonable indicators of content, arguably better than proximity-based statistical phrases, since they account for word-order changes and other structural constructions, e.g. "science library" vs. "library science" vs. "library of science". However, experiments have shown that syntactic methods are not significantly more effective than statistical methods [18,19,20]. This failure of NLP to outperform statistics can be attributed to the poor quality and robustness of existing NLP techniques. Nevertheless, we will adopt a syntactic approach for the time being, assuming that accurate syntactic analysis and disambiguation techniques will become available. We will return to the effectiveness issues of NLP in section 7.
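To illustrate why word order and structure matter, here is a rough sketch of one possible proximity-based statistical phrase extractor. The unordered word-pair representation, the window size, and the small stopword list are all assumptions made for this example and are not taken from the cited experiments.

```python
# Hypothetical stopword list, only large enough for the examples below.
STOPWORDS = {"of", "the", "a", "an"}

def proximity_pairs(text, window=3):
    """Unordered pairs of content words that co-occur within `window` positions."""
    words = text.lower().split()
    pairs = set()
    for i, left in enumerate(words):
        # Look only at the words that follow `left` inside the window.
        for right in words[i + 1:i + window]:
            if left in STOPWORDS or right in STOPWORDS:
                continue
            pairs.add(frozenset((left, right)))
    return pairs

examples = ["science library", "library science", "library of science"]
extracted = [proximity_pairs(text) for text in examples]
```

Under these assumptions all three expressions yield the same single pair, so this extractor cannot tell them apart. A syntactic analysis that identifies heads and modifiers has the information needed to do so.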

Evidence suggests that noun phrases should be considered as semantic units. The most important reasons are:

Noun phrases play a central role in the syntactic description of all natural languages, functioning as subject, as object, and within prepositional phrases.
In Artificial Intelligence, noun phrases are considered to be references to (or descriptions of) complicated concepts [24]. Others regard them as picture producers.

Noun phrases might be good approximations of concepts, but relying on them alone misses other phrases that also correspond to concepts. This observation points to the need to consider other phrases as well, for instance verb phrases. The verb phrase describes a situation or process by relating a main verb to a number of noun phrases and other phrases. Therefore, the linguistically meaningful phrases which may be considered as retrieval terms are at least the noun phrase, including its modifiers, and the verb phrase, including its subject, object and other complements. An abstract representation of these phrases suitable for indexing is needed, and will be defined in section 4.

Phrases can be used in their literal form as terms, although performance is then expected to be inferior to that of keywords. It is well known that, as a corpus grows, the number of keywords grows with the square root of the corpus size. One could expect the same to hold for phrases, but the number of such enriched terms grows even faster, and so does the likelihood that different phrases correspond to the same concept. On the one hand, we would like to use phrases to achieve precision; on the other hand, recall will be too low, because the probability of a phrase re-occurring literally is too low. To deal with this sparsity of phrasal terms, we shall introduce a number of linguistic normalizations (section 5). Linguistic normalization tries to reduce alternative formulations of the same meaning to a normalized form. For example, "river pollution" and "pollution of rivers" are both normalized to the same indexing term, pollution+river.
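The river pollution example can be sketched as a toy normalizer. This is not the normalization procedure of section 5; it assumes only two surface patterns ("modifier head" and "head of modifier") and a very crude plural stripper, and every name in it is hypothetical.

```python
def singular(word):
    """Crude plural stripping; an assumption standing in for a real lemmatizer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize_noun_phrase(phrase):
    """Map 'modifier head' and 'head of modifier' to the term 'head+modifier'."""
    words = [singular(w) for w in phrase.lower().split()]
    if len(words) == 3 and words[1] == "of":
        head, modifier = words[0], words[2]
    elif len(words) == 2:
        modifier, head = words
    else:
        raise ValueError(f"pattern not covered by this toy normalizer: {phrase!r}")
    return f"{head}+{modifier}"

term_a = normalize_noun_phrase("river pollution")
term_b = normalize_noun_phrase("pollution of rivers")
```

Both calls produce `pollution+river`, so a query using one formulation can match a document using the other, which recovers some of the recall lost by matching phrases literally.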

avi (dot) arampatzis (at) gmail