2. A Linguistically-motivated Indexing Model

The indexing model we would like to validate is based on the Phrase Retrieval Hypothesis [1]. The idea of using phrasal indexing terms is not new and can already be found in [5,10,17]. It has been explored by several researchers in different ways and with mixed results, e.g. [18,19,21]. Our approach, however, tries to incorporate various techniques for dealing with linguistic variation into a single phrase-based indexing scheme. This section briefly describes the underlying model; for a more detailed description the reader may refer to [2].

According to the linguistic principle of headedness, any phrase has a single word as its head. In verb phrases the head is the main verb; in noun phrases it is usually a noun (the last noun before any post-modifiers). The rest of the phrase consists of modifiers. Consequently, every phrase can be represented by a phrase frame:

$$\mbox{\it PF} = [h, m]$$

The head $h$ gives the central concept of the phrase and the modifiers $m$ serve to make it more precise. Conversely, the head may be used as an abstraction of the phrase, losing precision but gaining recall. Heads and modifiers can themselves be phrases, so frames can be nested: $\left[\left[h_1,m_1\right],\left[h_2,m_2\right]\right]$. The modifier part may be empty in the case of a bare head; this case is denoted equivalently by $[h,]$ or $[h]$.
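One possible way to picture a phrase frame as a data structure is sketched below. The class name `PhraseFrame`, the helper `abstract`, and the use of `None` for an empty modifier are assumptions made for illustration; they are not taken from the model's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Union

# A term is either a bare word or a nested phrase frame.
Term = Union[str, "PhraseFrame"]


@dataclass(frozen=True)
class PhraseFrame:
    """Phrase frame [h, m]; modifier is None for a bare head [h]."""
    head: Term
    modifier: Optional[Term] = None

    def __str__(self) -> str:
        if self.modifier is None:
            return f"[{self.head}]"
        return f"[{self.head}, {self.modifier}]"


def abstract(term: Term) -> str:
    """Follow heads downwards until a single word is reached.

    This is the 'head as abstraction of the phrase' idea: precision
    is lost, but the word matches more phrases.
    """
    while isinstance(term, PhraseFrame):
        term = term.head
    return term


pf = PhraseFrame("conference", PhraseFrame("engineering", "software"))
print(pf)            # [conference, [engineering, software]]
print(abstract(pf))  # conference
```

Making the frame immutable (`frozen=True`) also makes it hashable, so frames could be used directly as index keys.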

To deal with the sparsity of phrasal terms, linguistic normalization is introduced. Its goal is to cluster different but semantically equivalent phrases (figure 1).

**Figure 1:** Linguistic normalization

We distinguish three kinds of normalization, all of which can be seen as recall-enhancement techniques when phrases are used for indexing: syntactic, morphological and lexico-semantical normalization.

Phrase frames, by their definition, incorporate the notion of syntactic normalization, that is, the mapping of semantically equivalent but syntactically different phrases onto one phrase-class representative, the phrase frame. For instance, both *retrieval of information* and *information retrieval* are mapped to [retrieval,information].
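The mapping in this example can be imitated with a toy function. The two surface patterns it handles and the name `to_phrase_frame` are assumptions made for illustration; a real system would derive the head and modifier from a syntactic analysis rather than from word positions.

```python
from typing import Tuple


def to_phrase_frame(phrase: str) -> Tuple[str, str]:
    """Map a two-noun phrase onto a (head, modifier) pair.

    Toy rules, assumed for illustration only:
      "X of Y" -> (X, Y)   the head precedes its "of" post-modifier
      "Y X"    -> (X, Y)   the head is the last noun of a compound
    """
    words = phrase.lower().split()
    if len(words) == 3 and words[1] == "of":
        return (words[0], words[2])
    if len(words) == 2:
        return (words[1], words[0])
    raise ValueError(f"pattern not covered by this sketch: {phrase!r}")


a = to_phrase_frame("retrieval of information")
b = to_phrase_frame("information retrieval")
print(a == b)  # True: both variants share one representative
```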

Morphological normalization is applied by means of lemmatization to account for morphological variants of the keywords. Verb forms are reduced to the infinitive, inflected forms of nouns to the nominative singular, and comparative and superlative forms of gradable adjectives to the absolute (positive) form.
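As a rough illustration, lemmatization can be thought of as a lookup applied to every word of a frame. The table `LEMMAS` below is hypothetical and covers only a handful of forms; a real system would rely on a full lemmatizer with part-of-speech information.

```python
from typing import Tuple

# Hypothetical lemma table, for illustration only.
LEMMAS = {
    "visits": "visit", "visited": "visit",        # verb forms -> infinitive
    "students": "student",                        # plural -> singular
    "conferences": "conference",
    "better": "good", "best": "good",             # comparative/superlative
}


def lemmatize(word: str) -> str:
    """Return the lemma of a word, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get(word, word)


def normalize_pair(head: str, modifier: str) -> Tuple[str, str]:
    """Lemmatize both parts of a flat (head, modifier) pair."""
    return (lemmatize(head), lemmatize(modifier))


print(normalize_pair("students", "visited"))
print(normalize_pair("student", "visits"))
```

Both calls yield the same pair, so the two surface variants end up under one indexing term.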

Lexico-semantical normalization matches different phrases which are semantically (almost) equivalent by exploiting certain relations that hold between the meanings of individual words, such as synonymy, hyponymy and meronymy. This normalization may be implemented either by means of lexico-semantical clustering, or by incorporating a semantic similarity (or distance) function between words into the matching function of phrase frames (fuzzy matching).
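The fuzzy-matching variant could look roughly like the sketch below. The similarity scores in `WORD_SIM` are invented, and combining the head and modifier similarities by multiplication is an assumption made for illustration; the text does not prescribe a particular combination rule.

```python
from typing import Tuple

Frame = Tuple[str, str]  # flat (head, modifier) pair

# Hypothetical word-level similarity scores in [0, 1]; in practice they
# might come from a thesaurus or from corpus statistics.
WORD_SIM = {
    frozenset({"conference", "workshop"}): 0.8,
    frozenset({"retrieval", "search"}): 0.7,
}


def word_sim(a: str, b: str) -> float:
    """Similarity of two words: 1.0 if identical, table lookup otherwise."""
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset({a, b}), 0.0)


def frame_sim(f1: Frame, f2: Frame) -> float:
    """Assumed combination rule: product of head and modifier similarity."""
    (h1, m1), (h2, m2) = f1, f2
    return word_sim(h1, h2) * word_sim(m1, m2)


print(frame_sim(("retrieval", "information"), ("search", "information")))
```

With a product, a frame pair scores above zero only if both the heads and the modifiers are related to some degree; other rules (for example, a weighted sum favouring the head) would be equally plausible.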

Parts of this linguistically-motivated indexing model are still under investigation and development [9]. Consequently, to perform these experiments we had to make some assumptions and adopt some provisional solutions, which we describe next.

2.1 Current Implementation

In order to simplify the structural matching of phrases, and also to raise recall, we currently follow the strategy of unnesting all complicated phrase frames [9]. A composed term like [a, [b, c]] is decomposed into two frames [b, c] and [a, b], using b as an abstraction for [b, c]. When this decomposition is applied recursively, it results in binary terms (BTs). As an example, consider the sentences

A student visits a conference on software engineering.
The software engineering conference is visited by some students.

from which, due to syntactical and morphological normalization, the same two frames are initially constructed for both sentences:

$$\mbox{\it BT}_1 = \texttt{[student, visit]}, \quad \mbox{\it PF}_1 = \texttt{[visit, [conference, [engineering, software]]]}.$$

$\mbox{\it PF}_1$ is further unnested to

$$\mbox{\it BT}_2 = \texttt{[visit, conference]}, \quad \mbox{\it BT}_3 = \texttt{[conference, engineering]} \quad \mbox{and} \quad \mbox{\it BT}_4 = \texttt{[engineering, software]}.$$

Of course, unnesting makes it all the more important that the syntactic analyzer deduces the correct dependency structure of complicated phrases.
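The recursive decomposition described in this section can be sketched as follows. Frames are represented here as nested Python tuples `(head, modifier)`, and the function names are assumptions made for illustration. Treating a nested head in the same way as a nested modifier is also an assumption, since the text only shows nested modifiers; bare heads are not handled.

```python
from typing import List, Tuple, Union

# A term is a word or a nested (head, modifier) frame.
Term = Union[str, tuple]


def head_word(term: Term) -> str:
    """Abstract a possibly nested term to its head word."""
    while isinstance(term, tuple):
        term = term[0]
    return term


def unnest(frame: tuple) -> List[Tuple[str, str]]:
    """Decompose a nested frame into binary terms.

    [a, [b, c]] yields [a, b] (with b abstracting [b, c]) and then,
    recursively, the binary terms of [b, c].
    """
    head, modifier = frame
    terms = [(head_word(head), head_word(modifier))]
    if isinstance(head, tuple):       # assumed symmetric treatment
        terms.extend(unnest(head))
    if isinstance(modifier, tuple):
        terms.extend(unnest(modifier))
    return terms


pf1 = ("visit", ("conference", ("engineering", "software")))
for bt in unnest(pf1):
    print(bt)
# ('visit', 'conference')
# ('conference', 'engineering')
# ('engineering', 'software')
```

Applied to the frame of the example, the sketch returns the three binary terms given above, in the same order.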

In the current phase of our experimentation, phrase frames are constructed only from noun phrases, and the only post-modifiers of nouns taken into account are prepositional phrases (PPs) starting with the preposition *of*. Such PPs are more likely to modify the immediately preceding noun than PPs with other prepositions, for which the PP-attachment problem would have to be solved. We were, however, able to disambiguate the modification structure of complicated noun phrases by applying statistical methods (described in section 4.5). We have not yet applied any lexico-semantical normalization.


avi (dot) arampatzis (at) gmail