The indexing model we would like to validate is based on the Phrase Retrieval Hypothesis. The idea of using phrasal indexing terms is not new and can already be found in [5,10,17]. It has been explored by several researchers in different ways and with mixed results, e.g. [18,19,21]. However, our approach tries to incorporate various techniques that deal with linguistic variation into a single phrase-based indexing scheme. In this section, the underlying model is briefly described. For a more detailed description the reader may refer to .
According to the linguistic principle of headedness, any phrase has a single word as its head. This head is the main verb in the case of verb phrases, and usually a noun (the last noun before any post-modifiers) in the case of noun phrases. The rest of the phrase consists of modifiers.
Consequently, every phrase can be represented by a phrase frame: the head together with its modifiers, written [head, modifier].
To deal with the sparsity of phrasal terms, linguistic normalization is applied. Its goal is to cluster different but semantically equivalent phrases onto a single representative.
Phrase frames, by definition, incorporate the notion of syntactic normalization, that is, the mapping of semantically equivalent but syntactically different phrases onto one phrase-class representative, the phrase frame. For instance, both retrieval of information and information retrieval are mapped to [retrieval, information].
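The mapping of the two surface forms above onto one frame can be sketched as follows (the helper name and the restriction to two-noun phrases are assumptions made for this illustration, not part of the model description):

```python
def frame_from_np(words):
    """Build a phrase frame (head, modifier) from a simple two-noun phrase.

    Covers the two surface forms discussed in the text:
    - "X of Y"  ->  head X, modifier Y   (post-modifying of-PP)
    - "Y X"     ->  head X, modifier Y   (pre-modifying noun)
    """
    if len(words) == 3 and words[1] == "of":
        head, modifier = words[0], words[2]
    elif len(words) == 2:
        modifier, head = words
    else:
        raise ValueError("only simple two-noun phrases in this sketch")
    return (head, modifier)

# Both surface variants normalize to the same frame:
assert frame_from_np(["retrieval", "of", "information"]) == ("retrieval", "information")
assert frame_from_np(["information", "retrieval"]) == ("retrieval", "information")
```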
Morphological normalization is applied by means of lemmatization to account for morphological variants of the keywords. Verb forms are reduced to the infinitive, inflected forms of nouns to the nominative singular, and comparative and superlative forms of gradable adjectives to the absolute.
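A toy illustration of these three reductions; the lookup table is invented for this example (a real system would use a full morphological lemmatizer rather than a hand-written dictionary):

```python
# Hypothetical lemma table covering the three cases named in the text.
LEMMAS = {
    "visited": "visit",           # verb form -> infinitive
    "conferences": "conference",  # inflected noun -> nominative singular
    "better": "good",             # comparative -> absolute
    "best": "good",               # superlative -> absolute
}

def lemmatize(word):
    """Reduce a word to its lemma; unknown words pass through unchanged."""
    return LEMMAS.get(word, word)

assert [lemmatize(w) for w in ["visited", "conferences", "best"]] == \
       ["visit", "conference", "good"]
```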
Lexico-semantical normalization matches different phrases which are semantically (almost) equivalent by exploiting relations that hold between the meanings of individual words, such as synonymy, hyponymy, meronymy, etc. This normalization may be implemented either by means of lexico-semantical clustering, or by incorporating into the matching function of phrase frames a semantic similarity distance between words (fuzzy matching).
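The fuzzy-matching variant can be sketched as follows. Everything here is an assumption for illustration: the synonym sets, the similarity scores, and the threshold are invented, and a real implementation would draw word relations from a lexical resource:

```python
# Hypothetical synonym sets standing in for a real lexical resource.
SYNSETS = [{"movie", "film"}, {"retrieval", "search"}]

def word_sim(a, b):
    """Crude word similarity: 1.0 for identity, 0.8 for listed synonyms."""
    if a == b:
        return 1.0
    if any(a in s and b in s for s in SYNSETS):
        return 0.8
    return 0.0

def frames_match(f1, f2, threshold=0.5):
    """Fuzzy frame matching: corresponding words must be similar enough."""
    return len(f1) == len(f2) and all(
        word_sim(a, b) >= threshold for a, b in zip(f1, f2))

assert frames_match(("retrieval", "information"), ("search", "information"))
assert not frames_match(("retrieval", "information"), ("conference", "information"))
```

Hyponymy or meronymy could be handled the same way by assigning those relations a lower (but non-zero) similarity score.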
Parts of this linguistically motivated indexing model are still under investigation and development. Consequently, to perform these experiments we had to make some assumptions and adopt some quick solutions, which we describe next.
In order to simplify the structural matching of phrases, and also to raise recall, we currently follow the strategy of unnesting all complicated phrase frames. A composed term like [a, [b, c]] is decomposed into the two frames [b, c] and [a, b], using b as an abstraction for [b, c]. Applied recursively, this decomposition results in binary terms (BT's). As an example, consider the sentences
A student visits a conference on software engineering.
The software engineering conference is visited by some students.
from which, due to syntactic and morphological normalization, the same two frames are initially constructed for both sentences: [engineering, software] and [conference, engineering].
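The unnesting step can be sketched in code. The representation is an assumption made for this illustration: a frame is a list [head, modifier], where the modifier may itself be a nested frame:

```python
def unnest(frame):
    """Recursively decompose a possibly nested frame into binary terms (BT's).

    [a, [b, c]] yields [b, c] and [a, b], with b standing in for [b, c].
    """
    head, modifier = frame
    if isinstance(modifier, list):
        inner_head = modifier[0]  # the head abstracts the nested frame
        return unnest(modifier) + [[head, inner_head]]
    return [[head, modifier]]

# "software engineering conference" as a nested frame:
bts = unnest(["conference", ["engineering", "software"]])
assert bts == [["engineering", "software"], ["conference", "engineering"]]
```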
In the current phase of our experimentation, phrase frames are constructed only from noun phrases, taking into account only prepositional phrase (PP) post-modifiers of nouns starting with the preposition of. Such PPs are more likely to modify the preceding noun than others, for which the PP-attachment problem would have to be solved. We were, however, able to disambiguate the modification structure of complicated noun phrases by applying statistical methods (described in section 4.5). We did not yet apply any lexico-semantical normalization.
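The restriction to of-PP post-modifiers can be illustrated on part-of-speech-tagged input. The simplified tag set, the tuple representation, and the function name are all assumptions of this sketch, not the system's actual interface:

```python
def of_pp_frames(tagged):
    """Extract frames [head, modifier] from noun-of-noun sequences only.

    tagged: list of (word, pos) pairs with a simplified tag set,
    e.g. "N" for nouns and "P" for prepositions. PPs headed by any
    preposition other than "of" are ignored, sidestepping PP attachment.
    """
    frames = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, _), (w3, t3) = tagged[i:i + 3]
        if t1 == "N" and w2 == "of" and t3 == "N":
            frames.append([w1, w3])
    return frames

tagged = [("retrieval", "N"), ("of", "P"), ("information", "N"),
          ("on", "P"), ("demand", "N")]
# Only the of-PP produces a frame; "on demand" is skipped.
assert of_pp_frames(tagged) == [["retrieval", "information"]]
```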