Next: 5. Linguistic Normalization Up: Linguistically-motivated Information Retrieval Previous: 3. The Phrase Retrieval

4. Representation of Phrases

A syntactic phrase can be represented in various ways. At the bottom end of the representation spectrum, a phrase can be represented simply by the unordered set of its words, disregarding all structure. At the other end, all linguistic structure can be taken into account, resulting in complicated parse-tree representations. The choice is a trade-off between syntactic information and ease of phrase extraction.

For example, a simple noun phrase picker could easily be constructed by looking for sequences of articles, adjectives and nouns within a text. A noun phrase extracted this way would contain little information about how its adjectives and nouns are related to each other, except that adjacent words are most probably more closely related than non-adjacent ones. In an unordered set-of-words representation, and assuming there is no special treatment of proper names, the noun phrase

the hillary clinton health care bill proposal

would contain bill clinton, although the phrase obviously does not refer to him. Nevertheless, such co-occurrence of query keywords within a noun phrase has experimentally resulted in clear improvements in precision [25]. A sequence-of-words representation does not contain bill clinton (rightly), but does not contain clinton proposal either (wrongly). Full linguistic parsing would result in a much more precise representation. However, the parse-tree would contain too much linguistic detail, most of which is unnecessary for indexing, since such detail reflects mostly the syntactic description of the natural language used rather than the intended meaning. Since the goal is to derive from syntax a meaning that is adequately precise for retrieval purposes, we will settle for less than full linguistic parsing. Linguistically motivated light parsing has already been shown to slightly improve retrieval results over the classic IR approximation to noun phrase recognition [26].
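The contrast between the two shallow representations can be made concrete with a small Python sketch. It is only an illustration: the part-of-speech tags are supplied by hand (a real system would obtain them from a tagger), and the names `TAGGED`, `pick_noun_phrases`, `set_of_words_pairs` and `sequence_pairs` are hypothetical, introduced here for the example.

```python
from itertools import combinations

# Hand-tagged tokens (assumed tags; a real system would use a POS tagger).
TAGGED = [
    ("congress", "NOUN"), ("debated", "VERB"),
    ("the", "DET"), ("hillary", "NOUN"), ("clinton", "NOUN"),
    ("health", "NOUN"), ("care", "NOUN"), ("bill", "NOUN"),
    ("proposal", "NOUN"), ("yesterday", "ADV"),
]

NP_TAGS = {"DET", "ADJ", "NOUN"}  # articles, adjectives and nouns


def pick_noun_phrases(tagged):
    """Return maximal runs of articles, adjectives and nouns containing a noun."""
    phrases, run = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in NP_TAGS:
            run.append((word, tag))
        else:
            if any(t == "NOUN" for _, t in run):
                phrases.append([w for w, _ in run])
            run = []
    return phrases


def set_of_words_pairs(words):
    """Unordered set-of-words view: any two words of the phrase co-occur."""
    return {frozenset(pair) for pair in combinations(words, 2)}


def sequence_pairs(words):
    """Sequence-of-words view: only adjacent words, in their original order."""
    return set(zip(words, words[1:]))


phrase = pick_noun_phrases(TAGGED)[1]
print(phrase)
print(frozenset(("bill", "clinton")) in set_of_words_pairs(phrase))  # spurious match
print(("bill", "clinton") in sequence_pairs(phrase))                 # rightly absent
print(("clinton", "proposal") in sequence_pairs(phrase))             # wrongly absent
```

Under these assumptions the unordered view matches bill clinton, while the adjacency view rejects it but also misses clinton proposal, which is the trade-off described above.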

As a result, an intermediate representation of noun and verb phrases is desirable, eliminating structures which can be assumed not to be beneficial to IR:

Definition 3 (Noun Phrase for IR) A core noun phrase $\mbox{\it NP}$, from an IR point of view, has the general form:

$$\mbox{\it NP}\ = \mbox{\it det}^\ast\ \mbox{\it pre}^\ast\ \mbox{\it head}\ \mbox{\it post}^\ast$$

where

$\mbox{\it det}$ (determiner) = article, quantifier, number, etc.
$\mbox{\it pre}$ (pre-modifier) = adjective, noun or coordinated phrase.
$\mbox{\it head}$ = usually a noun.
$\mbox{\it post}$ (post-modifier) = prepositional phrase, relative clause, etc.
the asterisk ( $\ast$ ) denotes a list of zero or more elements.

Pre- and post-modifiers may recursively include other NPs.
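As an illustration of Definition 3, the recursive structure can be sketched as a small Python data type. This is a minimal sketch under stated assumptions, not the representation used by any particular parser: the class `NP`, the helper `head_modifier_pairs`, and the analysis chosen for the example phrase are all hypothetical, and modifiers are simplified to either plain words or nested NPs (prepositional phrases and relative clauses are not modelled).

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class NP:
    """NP = det* pre* head post*, with modifiers that may themselves be NPs."""
    head: str
    det: List[str] = field(default_factory=list)
    pre: List[Union[str, "NP"]] = field(default_factory=list)
    post: List[Union[str, "NP"]] = field(default_factory=list)


def head_modifier_pairs(np: NP) -> List[Tuple[str, str]]:
    """Collect (head, modifier-head) pairs, descending into nested NPs."""
    pairs = []
    for mod in np.pre + np.post:
        if isinstance(mod, NP):
            pairs.append((np.head, mod.head))
            pairs.extend(head_modifier_pairs(mod))
        else:
            pairs.append((np.head, mod))
    return pairs


# One assumed analysis of "the hillary clinton health care bill proposal":
# both "hillary clinton" and "health care bill" modify the head "proposal".
phrase = NP(
    det=["the"],
    pre=[
        NP(head="clinton", pre=["hillary"]),
        NP(head="bill", pre=[NP(head="care", pre=["health"])]),
    ],
    head="proposal",
)

for pair in head_modifier_pairs(phrase):
    print(pair)
```

With this analysis the structure relates proposal to clinton but never relates bill to clinton, which is the behaviour that neither the set-of-words nor the sequence-of-words representation can provide.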

Definition 4 (Verb Phrase for IR) A verb phrase $\mbox{\it VP}$, from an IR point of view, has the general form:

$$\mbox{\it VP}\ = \mbox{\it subj}\ \mbox{\it kernel}\ \mbox{\it comp}^\ast$$

where

$\mbox{\it subj}$ (subject) = an NP (in the wide sense, including personal names, personal pronouns etc.)
$\mbox{\it kernel}$ (verbal clause) = inflected form of some verb, possibly composed with other auxiliary verb-forms and adverbs.
$\mbox{\it comp}$ (complements such as object, indirect object, prepositional complement, etc.) = an NP or Prepositional Phrase (PP).
the asterisk ( $\ast$ ) denotes a list of zero or more elements, depending on the transitivity of the verb (e.g. intransitive verbs have no complements, transitive verbs have an object, ditransitive verbs have an object and an indirect object).
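Definition 4 can be sketched in the same way. The Python below is an illustrative assumption rather than a prescribed data model: the classes `NP`, `PP` and `VP`, the helper `transitivity`, and the example sentences are hypothetical, and the NP is reduced to its head word to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class NP:
    head: str  # simplified stand-in: only the head word is kept


@dataclass
class PP:
    prep: str  # the preposition
    np: NP     # the noun phrase it governs


@dataclass
class VP:
    """VP = subj kernel comp*, where each complement is an NP or a PP."""
    subj: NP
    kernel: str
    comp: List[Union[NP, PP]] = field(default_factory=list)


def transitivity(vp: VP) -> str:
    """Classify by the number of NP complements (PP complements are ignored)."""
    n_objects = sum(isinstance(c, NP) for c in vp.comp)
    return {0: "intransitive", 1: "transitive"}.get(n_objects, "ditransitive")


# "the committee adjourned"
v0 = VP(subj=NP("committee"), kernel="adjourned")
# "the senate has rejected the proposal"
v1 = VP(subj=NP("senate"), kernel="has rejected", comp=[NP("proposal")])
# "the senator sent the committee a proposal"
v2 = VP(subj=NP("senator"), kernel="sent",
        comp=[NP("proposal"), NP("committee")])
# "the senate voted on the proposal" (prepositional complement only)
v3 = VP(subj=NP("senate"), kernel="voted", comp=[PP("on", NP("proposal"))])

for vp in (v0, v1, v2, v3):
    print(vp.kernel, "->", transitivity(vp))
```

Counting only NP complements is a deliberate simplification; a verb with a prepositional complement is reported here as intransitive, which a fuller treatment would probably distinguish.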

In accordance with the above definitions, it is possible to perform a parsing that is considerably lighter than full linguistic parsing, while still retaining a reasonable amount of structural information. An example parse-tree is given in Figure 1; it is rather compact in comparison with a full linguistic parse-tree for the same sentence, which would easily have overrun this page.

**Figure 1:** Light parsing for IR purposes.

Of course, it is important that the parser is able to deduce the correct (or at least the most probable) dependency structure in complicated phrases. As we will see next, some elements which are considered of little interest from an IR point of view, e.g. determiners, prepositions, auxiliaries and adverbs, may be eliminated.


avi (dot) arampatzis (at) gmail