Next: 6. Weighting and Matching Up: Linguistically-motivated Information Retrieval Previous: 4. Representation of Phrases

5. Linguistic Normalization

The goal of normalization is to map different but semantically equivalent phrases onto one canonical representative phrase, the phrase frame (figure 2).

**Figure 2:** Linguistic normalization

We distinguish three types of normalization: morphological, syntactical, and lexico-semantical.

5.1 Morphological Normalization

Morphological normalization has traditionally been performed by means of stemming. Non-linguistic stemming, especially when it operates without any lexicon, is rather aggressive and may result in improper conflations. For instance, a Porter-like stemmer without a lexicon will conflate university with universe and organization with organ. Such errors translate into a loss of retrieval precision. The impact is greater for languages that are more heavily inflected than English, because more ambiguities are introduced. Many improper conflations can be avoided simply by checking, after each reduction step, whether the resulting word form exists in a lexicon. Nevertheless, the verb forms attached and suited will still be wrongly reduced to the nouns attache and suite, respectively.

A more conservative approach that takes the linguistic context into account will prevent many of these errors. Conflations can be restricted so that the part-of-speech of a word is retained. In this respect, morphological normalization may be performed by means of lemmatization:

1. Verb forms are reduced to the infinitive.
2. Inflected forms of nouns are reduced to the nominative singular.
3. Comparatives and superlatives of gradable adjectives are reduced to the absolute form.

For this task, the grammatical rules for forming e.g. past participles or noun plurals should be applied in reverse. Furthermore, exception lists are indispensable for handling irregularities such as wolf-wolves, bad-worse-worst, and see-saw-seen.
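The combination of reversed inflection rules, exception lists and part-of-speech restriction can be sketched as follows. This is only an illustration: the rule table, the exception entries beyond those named in the text, and the tiny lexicon are assumptions, not a complete morphology of English.

```python
# Exception lists for irregular forms, keyed by part of speech (examples from the text).
EXCEPTIONS = {
    "noun": {"wolves": "wolf"},
    "adj": {"worse": "bad", "worst": "bad"},
    "verb": {"saw": "see", "seen": "see"},
}

# Inflection rules applied in reverse: (suffix to strip, replacement).
# Hypothetical and far from complete.
REVERSE_RULES = {
    "noun": [("ies", "y"), ("s", "")],
    "verb": [("ied", "y"), ("ed", ""), ("ing", ""), ("s", "")],
    "adj": [("iest", "y"), ("ier", "y"), ("est", ""), ("er", "")],
}

# Toy lexicon of (lemma, part of speech) pairs, an assumption for this sketch.
LEXICON = {
    ("attach", "verb"), ("suit", "verb"), ("attack", "verb"),
    ("attack", "noun"), ("attache", "noun"), ("suite", "noun"),
    ("wolf", "noun"), ("small", "adj"),
}


def lemmatize(word, pos, lexicon=LEXICON):
    """Reduce `word` to its lemma without changing its part of speech."""
    word = word.lower()
    # Irregular forms are looked up first.
    if word in EXCEPTIONS[pos]:
        return EXCEPTIONS[pos][word]
    # Otherwise undo a regular inflection, accepting only lemmas of the same POS.
    for suffix, replacement in REVERSE_RULES[pos]:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if (candidate, pos) in lexicon:
                return candidate
    return word


print(lemmatize("attached", "verb"))  # attach, not the noun attache
print(lemmatize("wolves", "noun"))    # wolf, via the exception list
```

Because the lexicon lookup is restricted to the part of speech of the input, the verb form suited can only be reduced to the verb suit, never to the noun suite.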

Lemmatization is relatively simple and handles mostly inflectional morphology. It is similar to the lexicon-based word normalization referred to in [27]. Note that in some cases lemmatization reduces noun and verb forms to the same lemma. Consider, for instance, the verb form attacked and the plural noun attacks; both will be lemmatized as attack. Although such conflations seem beneficial, a text filtering experiment indicates that the confusion between nouns and verbs introduced by lemmatization decreases effectiveness [28].

Derivational morphology involves semantics and cross part-of-speech word relations, and hence should be approached carefully. Certain derivational transformations may be suggested by syntax. For instance, verbs may be turned into nouns (nominalization) or the other way around, as will be shown in the next section. The remaining derivational morphology should be treated, where possible, by lexico-semantical normalization.

5.2 Syntactical Normalization

According to the linguistic principle of headedness, any phrase has a single head. This head is usually a noun (the last noun before the post-modifiers) in NPs, and the main verb in VPs. The rest of the phrase consists of modifiers. Consequently, every phrase can be mapped onto a phrase frame:

$\begin{displaymath}\mbox{\it PF}= [h,m] \end{displaymath}$

The head h gives the central concept of the phrase, and the list m of modifiers serves to make it more precise. Conversely, the head may be used as an abstraction of the phrase, losing precision but gaining recall. It should be noted that although the head-modifier relation implies semantic dependence, what we have here is a purely syntactic relation. The intention is to produce meaningful indexing terms without deep semantic analysis. Therefore, we refrain from a precise semantic interpretation of any head-modifier relation and treat it simply as an ordered relation.

Heads and modifiers in the form of phrases are recursively defined as phrase frames: [[h₁,m₁],[h₂,m₂]]. The modifier part may be empty in case of a bare head. This case is denoted equivalently by [h,] or [h]. The head may serve as an index for a list of phrases with occurrence frequencies:

[ engineering   1026 ,
      of software 7 ;
      reverse   102 ;
      software  842 ;
      ... ]

where the frequency of a bare head includes that of its modified occurrences. Alternative modifications of the head are separated by semicolons.
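A phrase frame and the head index described above might be represented as in the following sketch. The class name `PF`, the function `build_head_index` and the sample frames are illustrative assumptions; the counts are not the ones from the listing.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class PF:
    """A phrase frame [h, m]: a head plus a (possibly empty) tuple of modifiers.

    Heads and modifiers are either plain words or nested phrase frames.
    """
    head: Union[str, "PF"]
    modifiers: Tuple[Union[str, "PF"], ...] = ()

    def __str__(self):
        if not self.modifiers:
            return f"[{self.head}]"  # bare head
        return f"[{self.head}, {'; '.join(str(m) for m in self.modifiers)}]"


def build_head_index(frames):
    """Count occurrences per head and per (head, modifier) pair.

    The frequency of a bare head includes that of its modified occurrences.
    """
    totals = Counter()
    by_modifier = defaultdict(Counter)
    for frame in frames:
        head = str(frame.head)
        totals[head] += 1
        for modifier in frame.modifiers:
            by_modifier[head][str(modifier)] += 1
    return totals, by_modifier


frames = [
    PF("engineering", ("software",)),
    PF("engineering", ("reverse",)),
    PF("engineering"),
    PF("engineering", ("software",)),
]
totals, by_modifier = build_head_index(frames)
print(totals["engineering"], dict(by_modifier["engineering"]))
```

Indexing by head makes the abstraction step cheap: dropping the modifiers of a frame amounts to looking up the head entry alone.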

Phrase frames are produced by normalizing the phrase representations of definitions 3 and 4. In noun phrases, determiners are of little interest for IR, so they may be eliminated. The normalization of a noun phrase is defined as:

Definition 5 (Noun Phrase Normalization)

$\begin{displaymath}\mbox{\it NP}\ = \mbox{\it det}^\ast \mbox{\it pre}^\ast \mbox{\it head}\ \mbox{\it post}^\ast \mapsto [\mbox{\it head},\mbox{\it pre}^\ast \mbox{\it post}^\ast] \end{displaymath}$

The elements of the list $\mbox{\it pre}^\ast \mbox{\it post}^\ast$ are considered to modify the head independently from each other, and they are separated by semicolons. Hence, any PF containing a list, e.g. [h,m]=[h,m₁;m₂], may be expanded as [h,m₁];[h,m₂]. The noun phrase normalization can be applied recursively on heads and modifiers which include other NPs. For example,

a special lecture on software engineering
$\mapsto$ [lecture, special; on software engineering]
$\mapsto$ [lecture, special; on [engineering, software]]
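The recursive noun phrase normalization and the expansion of modifier lists can be sketched as below. The input format (a dictionary with the keys `det`, `pre`, `head` and `post`) is a hypothetical parser output assumed for this illustration, as are all function and class names.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    head: str
    mods: tuple = ()


class PrepMod(NamedTuple):
    """A post-modifier that keeps its preposition, e.g. on [engineering, software]."""
    prep: str
    frame: Frame


def normalize_np(np):
    """NP = det* pre* head post*  ->  [head, pre* post*]; determiners are dropped.

    Pre-modifiers are words or nested NP dicts; post-modifiers are
    (preposition, NP dict) pairs. Nested NPs are normalized recursively.
    """
    mods = []
    for pre in np.get("pre", []):
        mods.append(pre if isinstance(pre, str) else normalize_np(pre))
    for prep, inner in np.get("post", []):
        mods.append(PrepMod(prep, normalize_np(inner)))
    return Frame(np["head"], tuple(mods))


def expand(frame):
    """[h, m1; m2] -> [h, m1]; [h, m2]. A bare head expands to itself."""
    if not frame.mods:
        return [frame]
    return [Frame(frame.head, (m,)) for m in frame.mods]


def render(item):
    """Print a word, a prepositional modifier or a frame in bracket notation."""
    if isinstance(item, str):
        return item
    if isinstance(item, PrepMod):
        return f"{item.prep} {render(item.frame)}"
    if not item.mods:
        return f"[{item.head}]"
    return f"[{item.head}, {'; '.join(render(m) for m in item.mods)}]"


# "a special lecture on software engineering"
example = {
    "det": ["a"],
    "pre": ["special"],
    "head": "lecture",
    "post": [("on", {"pre": ["software"], "head": "engineering"})],
}
frame = normalize_np(example)
print(render(frame))
print("; ".join(render(f) for f in expand(frame)))
```

The first line printed is the normalized frame of the example; the second shows the same frame expanded into one frame per modifier.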

Prepositions (e.g. on in the last example) may optionally be kept for further semantic analysis, although they are usually dropped for simplicity. However, it must be noted that the space-man on the ship enjoys a different view than the space-man outside the ship, and the space-man without ship is probably not even in space. The impact of prepositions on retrieval performance is not well established, but their careful treatment may be beneficial. Their interpretation can always be postponed until the matching of PFs. Prepositions, conjunctions and other such lexical items were considered as connectors in the characterization language of index expressions [29].

The noun phrase presents only a few opportunities for syntactical normalization. For the verb phrase, more normalizations can be found which preserve its meaning, or rather do not lose information obviously relevant for retrieval purposes. To begin with the kernel, elimination of tense, modality and voice seems reasonable. The obviously meaningful head-modifier combinations are $[\mbox{\it subj},\mbox{\it verb}]$ and $[\mbox{\it verb},\mbox{\it comp}]$ :

Definition 6 (Verb Phrase Normalization I)

$\begin{displaymath}\mbox{\it VP}\ = \mbox{\it subj}\ \mbox{\it kernel}\ \mbox{\it comp}^\ast \mapsto [\mbox{\it subj},\mbox{\it verb}(\mbox{\it kernel})];[\mbox{\it verb}(\mbox{\it kernel}) , \mbox{\it comp}^\ast] \end{displaymath}$

where the function $\mbox{\it verb}(\mbox{\it kernel})$ returns the main verb of the kernel.

For example,

the students will probably attend a special lecture on Monday
$\mapsto$ [the students, attend] ; [attend, a special lecture; on Monday]

In definition 6, the adverbs of the kernel are eliminated. Small experiments have suggested that adverbs have little indexing value [28]. However, they might be more useful when combined with the verbs (or, in the case of noun phrases, the adjectives) they modify, e.g. [attend, probably]. The indexing value of such verb-adverb and adjective-adverb pairs has to be evaluated empirically.

It is possible to map verbs to nouns (nominalization) or vice versa (verbalization). Such a normalization allows the matching of PFs derived from different sources (verb phrases or noun phrases). For example, (to) implement can be nominalized to implementation. Since the opposite transformation is also possible for nominalized verb forms, the choice has to be made on the basis of experimentation. For now, we choose to turn everything into "pictures" (noun phrases) by applying the former alternative, nominalization. This results in a more drastic (and compact) normalization:

Definition 7 (Verb Phrase Normalization II)

$\begin{displaymath}\mbox{\it VP}\ = \mbox{\it subj}\ \mbox{\it kernel}\ \mbox{\it comp}^\ast \mapsto [\mbox{\it nom}(\mbox{\it verb}(\mbox{\it kernel})) , \mbox{\it subj}\ \mbox{\it comp}^\ast] \end{displaymath}$

where the function $\mbox{\it nom}(\mbox{\it verb}(\mbox{\it kernel}))$ nominalizes the main verb of the kernel.

For example,

the students will probably attend a special lecture on Monday
$\mapsto$ [attendance, the students; a special lecture; on Monday]

Similarly, adverbs may be mapped onto adjectives to modify the nominalized verbs, e.g. [attendance, probable]. Cross part-of-speech transformations like these, controlled by syntax, can deal to some extent with derivational morphology, compensating for the conservative nature of lemmatization described in the previous section. Further application of the noun phrase normalization to the phrase frame of the example above eventually results in:

[attendance, students; [lecture, special]; on [Monday]]
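The two verb phrase normalizations (definitions 6 and 7) can be sketched side by side. The tagged kernel format, the tag names and the nominalization table are assumptions made for this illustration; a real system would take them from a parser and a lexicon. Noun phrase normalization of the subject and complements is not applied here.

```python
# Hypothetical nominalization table, playing the role of nom(.) in definition 7.
NOMINALIZATIONS = {"attend": "attendance", "implement": "implementation"}


def main_verb(kernel):
    """verb(kernel): return the lemma tagged as main verb.

    Auxiliaries (tense, modality, voice) and adverbs in the kernel are ignored.
    """
    for lemma, tag in kernel:
        if tag == "VERB":
            return lemma
    raise ValueError("kernel has no main verb")


def normalize_vp_1(subj, kernel, comps):
    """Definition 6: VP -> [subj, verb(kernel)] ; [verb(kernel), comp*]."""
    verb = main_verb(kernel)
    return [(subj, [verb]), (verb, list(comps))]


def normalize_vp_2(subj, kernel, comps):
    """Definition 7: VP -> [nom(verb(kernel)), subj comp*]."""
    noun = NOMINALIZATIONS[main_verb(kernel)]
    return (noun, [subj, *comps])


def show(frame):
    """Render a (head, modifiers) pair in bracket notation."""
    head, mods = frame
    return f"[{head}, {'; '.join(mods)}]"


# "the students will probably attend a special lecture on Monday"
subj = "the students"
kernel = [("will", "AUX"), ("probably", "ADV"), ("attend", "VERB")]
comps = ["a special lecture", "on Monday"]

print(" ; ".join(show(f) for f in normalize_vp_1(subj, kernel, comps)))
print(show(normalize_vp_2(subj, kernel, comps)))
```

The second form is more compact because subject and complements end up as modifiers of a single nominal head, which is what allows it to be matched against frames derived from noun phrases.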

All these normalizations are rather language-dependent, and the final decision on what to include in the phrase frames should be left to linguists and system designers; we have merely suggested some obvious ones.

5.3 Lexico-semantical Normalization

This kind of normalization rests on the observation that certain relations hold between the meanings of individual words. The best known of these lexico-semantical relations are:

synonymy and antonymy,
hyponymy and hypernymy (the is-a relation),
meronymy and holonymy (the part-of relation).

Two important aspects which should be taken into account for this kind of normalization are polysemy and collocations.

A word is polysemous if its meaning depends on the context. For example, the noun note by itself can mean a short letter or a musical note; its context has to clarify the intended meaning. The intended meaning determines which words are lexico-semantically related to the initial word. Using the synonymy relation, we obtain brief for the first meaning and tune for the second. This suggests that the conceptual context of a word should be taken into account.

Collocations are two or more words which often occur adjacently and carry a certain meaning as a unit, e.g. health care. When WORDNET is used to expand a query with hypernyms, the notion health care yields social insurance, which cannot be obtained by expanding the two words separately. This observation suggests that collocations should be considered as single units.

Assuming that the word sense ambiguity originating from polysemy is resolved, three possibilities can be seen for lexico-semantical normalization:

1. Semantical clustering in analogy with stemming. For instance, several synonyms in a context are reduced to one word cluster. The word cluster may be represented by the most frequent of the synonyms.
2. Semantical expansion, expanding a term with all its -nyms. The derived terms may be weighted according to their relation with the initial term.
3. Incorporation of a semantical similarity function into the retrieval function (Fuzzy Matching). Based on a semantical taxonomy, an ontology or a semantical network we can define a semantical similarity function for words.

Semantical clustering is rather aggressive and suffers from the same drawbacks as stemming. For example, two "synonyms" only overlap in meaning and do not actually mean the same thing; the convention to call them "synonyms" depends on the degree of overlap. One question is how extensive these clusters should be, that is, what maximum semantical distance is allowed between two words belonging to the same cluster. Here again, usability and discrimination play an important role. Clusters that are too large will be assigned as indexing terms to too many documents and therefore are not discriminating. Clusters that are too small, e.g. one or two words, will have little impact on performance compared to conventional indexing and thus are not usable. Experimentation should provide a usably discriminating cluster size.

Semantical expansion can partly overcome the cluster size problem by supplying many related terms, weighted in inverse proportion to their semantical distance from the original term. However, expansion can easily result in an explosion of indexing or query terms. The possibility of fuzzy matching seems elegant and exciting, although it is far more computationally expensive than the other two.
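The weighting used in semantical expansion can be illustrated with a breadth-first walk over a small semantic network. Everything in this sketch is assumed for illustration: the toy network, the `scale` constant and the distance cap, which is one simple way to limit the explosion of terms mentioned above.

```python
from collections import deque

# Toy semantic network (assumption): word -> directly related words.
RELATED = {
    "salmon": ["fish"],
    "fish": ["salmon", "trout", "animal"],
    "trout": ["fish"],
    "animal": ["fish"],
}


def expand_term(term, max_distance=2, scale=0.8):
    """Return {term: weight}, with weight = scale / semantic distance.

    The original term keeps weight 1.0. Terms further away than
    `max_distance` steps are not added.
    """
    weights = {term: 1.0}
    distance = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if distance[current] == max_distance:
            continue  # do not expand beyond the cap
        for neighbour in RELATED.get(current, []):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                weights[neighbour] = scale / distance[neighbour]
                queue.append(neighbour)
    return weights


print(expand_term("salmon"))
print(expand_term("salmon", max_distance=1))
```

Lowering `max_distance` trades recall for a smaller set of expansion terms; a suitable value would have to be found experimentally.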

Working out fuzzy matching a bit more, using only the relations of synonymy ($\mbox{\it SYN}$), hyponymy ($\mbox{\it HYPON}$), and hypernymy ($\mbox{\it HYPER}$) between two words x and y, one could define:

$\begin{displaymath}\mbox{\it sim}(x,y)= \left\{ \begin{array}{ll} 1 & x = y \\ 0.9 & x\in\mbox{\it SYN}(y) \\ 0.7^n & x\in\mbox{\it HYPON}_n (y) \\ 0.5^n & x\in\mbox{\it HYPER}_n (y) \\ 0 & {\rm otherwise} \end{array} \right. \end{displaymath}$

(1)

where $a\in\mbox{\it HYPER}_n(b)$ means that a can be found by walking n steps in the graph of hypernyms of b. $a\in\mbox{\it HYPON}_n(b)$ is similarly defined. $\mbox{\it SYN}$ is a symmetric relation, meaning that if $x\in\mbox{\it SYN}(y)$ then $y\in\mbox{\it SYN}(x)$ , so it is sufficient to check only whether one of the two holds. It should be noted that $\mbox{\it sim}$ assumes an order in its arguments: x is a word from a document and y a word from a query. Moreover, hypernyms of query terms are matched with lower weight than hyponyms, to reflect the assumption that a user's query salmon should not retrieve many documents about fish in general, but fish should retrieve documents about salmon.
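A minimal sketch of such an order-sensitive similarity function is given below. The toy hypernym graph and synonym set are assumptions; the weights 0.9, 0.7 and 0.5 are the values that appear in the worked example of this section, and raising the hyponym weight to the power n is assumed here by analogy with the hypernym case.

```python
# Toy lexical graph (assumption): word -> direct hypernyms.
HYPERNYMS = {
    "salmon": {"fish"},
    "fish": {"animal"},
}
# Synonym pairs (assumption); the relation is treated as symmetric.
SYNONYMS = {("student", "pupil")}


def hyper_n(word, n):
    """HYPER_n(word): words reached by walking exactly n hypernym steps."""
    frontier = {word}
    for _ in range(n):
        frontier = {h for w in frontier for h in HYPERNYMS.get(w, ())}
    return frontier


def steps_up(lower, upper, max_steps=5):
    """Smallest n such that upper is in HYPER_n(lower), or None if there is none."""
    for n in range(1, max_steps + 1):
        if upper in hyper_n(lower, n):
            return n
    return None


def sim(x, y):
    """Similarity of document word x to query word y (argument order matters)."""
    if x == y:
        return 1.0
    if (x, y) in SYNONYMS or (y, x) in SYNONYMS:
        return 0.9
    n = steps_up(x, y)  # y is a hypernym of x, i.e. x is in HYPON_n(y)
    if n is not None:
        return 0.7 ** n
    n = steps_up(y, x)  # x is a hypernym of y, i.e. x is in HYPER_n(y)
    if n is not None:
        return 0.5 ** n
    return 0.0


print(sim("salmon", "fish"))  # document about salmon, query fish
print(sim("fish", "salmon"))  # document about fish, query salmon
```

The asymmetry is visible in the two calls: a document word that is more specific than the query word scores higher than one that is more general.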

As an example of fuzzy matching, consider the sentence

The students will probably attend a conference on software engineering.

from which, after syntactical and morphological normalization and elimination of some (assumed) redundant elements, the following phrase frame may be constructed:

[attendance, student; conference; [engineering, software]]

Now let us consider another sentence

The pupils are listening carefully to the tutorial about software engineering.

which in a phrase frame representation becomes:

[listening, pupil; tutorial; [engineering, software]]

Note that listening here represents the nominalized form (the listening) of the verb to listen rather than its progressive form. Using WORDNET's lexical graph, and assuming that the latter sentence is a part of a natural language description of a user's information need (query), the following relations hold:

$\begin{displaymath}{\tt student}=\mbox{\it SYN}({\tt pupil}) \Rightarrow \mbox{\it sim}({\tt student},{\tt pupil})=0.9 \end{displaymath}$

$\begin{displaymath}{\tt conference}=\mbox{\it HYPER}_2({\tt tutorial}) \Rightarrow \mbox{\it sim}({\tt conference},{\tt tutorial})=0.5^2 \end{displaymath}$

The nouns listening and attendance may be matched through the relation which holds between their corresponding verbs:

$\begin{displaymath}{\tt attend}=\mbox{\it HYPON}_1({\tt listen}) \Rightarrow \mbox{\it sim}({\tt attend},{\tt listen})=0.7 \end{displaymath}$

Using these relations, it is now easy to match the two sentences. However, this example is conveniently selected as it results in phrase frames with similar structure. In general, this is not the case, suggesting that such a lexico-semantical similarity function should be a part of a larger structural matching technique.

avi (dot) arampatzis (at) gmail