Next: 4. Experimental Setup Up: An Evaluation of Linguistically-motivated Previous: 2. A Linguistically-motivated Indexing

3. Representational Choices

The different indexing sets we experimented with are summarized below. The acronyms will be used to refer to these choices in the rest of the article.

w

(words): All word-forms found in the text.

Sw

(Stemmed words): All word-forms stemmed by a Porter stemmer. This is a traditional indexing scheme and serves as the baseline against which the effectiveness of the other indexing schemes is compared.

Lw

(Lemmatized words): The same as w, except that all word-forms are lemmatized with respect to their POS category. In all the following choices, lemmatization is applied as standard.

For all of w, Sw and Lw, we eliminate words of low indexing value by using a POS stop-list (see section 4.5).

Ln

(Lemmatized nouns): Nouns and proper nouns are well-known to be important in retrieval. What happens if we omit all other keywords?

Lnj

(Lemmatized nouns and adjectives): The combined effect of using the union of nouns and adjectives is investigated in this experiment. These two categories cover most of the words occurring in noun phrases.

Lnv

(Lemmatized nouns and verbs): We investigate the combined effect of using the union of nouns and verbs.

Lnjv

(Lemmatized nouns, adjectives and verbs): This experiment serves as an indication of what might happen if we include in the indexing language only linguistic entities which are extracted from noun or verb phrases. Moreover, the impact of using adverbs for indexing can be measured indirectly by comparing Lnjv with Lw, since the indexing set Lnjv can be constructed from Lw by removing the adverbs.
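The POS-based indexing sets above can be illustrated with a minimal sketch. It assumes the text has already been lemmatized, POS-tagged and passed through the POS stop-list, so that only nouns, adjectives, verbs and adverbs remain. The one-letter tags, the function name index_terms and the example sentence are hypothetical and are not taken from our actual tagger; proper nouns are simply tagged as nouns here.

```python
# Hypothetical coarse POS tags (assumption: not the tagset used in the experiments).
NOUN, ADJ, VERB, ADV = "N", "J", "V", "R"

# POS categories kept by each indexing set. Lw is modelled as Lnjv plus
# adverbs, following the remark that Lnjv is Lw with the adverbs removed.
INDEX_SETS = {
    "Lw": {NOUN, ADJ, VERB, ADV},
    "Ln": {NOUN},
    "Lnj": {NOUN, ADJ},
    "Lnv": {NOUN, VERB},
    "Lnjv": {NOUN, ADJ, VERB},
}


def index_terms(tagged_lemmas, scheme):
    """Return the lemmas whose POS tag is kept by the given indexing set."""
    keep = INDEX_SETS[scheme]
    return [lemma for lemma, pos in tagged_lemmas if pos in keep]


# Toy input: (lemma, POS) pairs for "heavy rain quickly flooded the streets",
# after lemmatization and stop-list filtering.
example = [("heavy", ADJ), ("rain", NOUN), ("quickly", ADV),
           ("flood", VERB), ("street", NOUN)]

for scheme in ("Lw", "Ln", "Lnj", "Lnv", "Lnjv"):
    print(scheme, index_terms(example, scheme))
```

In this toy example the only difference between the Lw and Lnjv outputs is the adverb quickly, which is the comparison used to measure the impact of adverbs indirectly.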

Lap

(Lemmatized adjacent word-pairs, extracted from NPs): These word-pairs consist of the nouns and adjectives of Lnj, associated to form 2-word phrases by using the adjacency criterion. The hypothesis for this experiment is that adjacent words can be considered semantically related because of their proximity and be taken as one term. We use an extended notion of adjacency by accepting non-adjacent words as adjacent if the in-between words belong to certain POS categories (e.g. determiner, article, or preposition). For instance, the phrase pollution of the air gives the adjacent pair pollution_air.

This is an important experiment because its comparison to Lbt (described next) should measure the effect of syntactical normalization on performance.
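The extended adjacency criterion can be sketched as follows. This is a simplified illustration rather than the extraction code used in the experiments: it assumes a noun phrase is given as a list of (lemma, POS) pairs, and the tag letters, the set of skippable categories and the function name adjacent_pairs are hypothetical.

```python
# Hypothetical coarse POS tags (assumption): N = noun, J = adjective,
# D = determiner/article, P = preposition.
CONTENT = {"N", "J"}      # categories that may take part in a pair
SKIPPABLE = {"D", "P"}    # in-between categories that do not break adjacency


def adjacent_pairs(tagged_np):
    """Pair each noun/adjective with the previous one, skipping over
    determiners, articles and prepositions in between."""
    pairs = []
    prev = None
    for lemma, pos in tagged_np:
        if pos in CONTENT:
            if prev is not None:
                pairs.append(f"{prev}_{lemma}")
            prev = lemma
        elif pos not in SKIPPABLE:
            # Any other category breaks the adjacency chain.
            prev = None
    return pairs


print(adjacent_pairs([("pollution", "N"), ("of", "P"), ("the", "D"), ("air", "N")]))
print(adjacent_pairs([("air", "N"), ("pollution", "N")]))
```

Under these assumptions the first call yields pollution_air and the second air_pollution. The two phrases therefore give different terms, because the adjacency criterion keeps the surface word order.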

Lbt

(Lemmatized binary terms, extracted from NPs): These binary terms consist of the nouns and adjectives of Lnj, associated to form 2-word phrases by using the term modification criterion, i.e. head-modifier pairs. Head-modifier pairs are computationally more expensive than adjacent pairs, since syntactical normalization is required. However, binary terms are syntactically canonical, e.g. both phrases air pollution and pollution of the air are mapped onto the same head-modifier pair, [pollution,air].
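To make the notion of a canonical head-modifier pair concrete, the following toy sketch normalizes only two very simple English NP patterns: pre-modification (air pollution) and a single post-modifying prepositional phrase (pollution of the air). It is a deliberately naive heuristic and not the syntactical normalization used for Lbt, which requires proper syntactic analysis. The tag letters and the function name head_modifier_pairs are hypothetical.

```python
# Hypothetical coarse POS tags (assumption): N = noun, J = adjective,
# D = determiner/article, P = preposition.
CONTENT = {"N", "J"}


def head_modifier_pairs(tagged_np):
    """Toy normalization of a flat NP into (head, modifier) pairs.

    Assumptions: the head is the last noun/adjective before the first
    preposition; every earlier content word modifies it; the last content
    word after the preposition also modifies it. Anything more complex
    (nested PPs, pre-modifiers inside the PP) is not handled.
    """
    left, right = [], []
    seen_prep = False
    for lemma, pos in tagged_np:
        if pos == "P":
            seen_prep = True
        elif pos in CONTENT:
            (right if seen_prep else left).append(lemma)
    if not left:
        return []
    head = left[-1]
    pairs = [(head, modifier) for modifier in left[:-1]]
    if right:
        pairs.append((head, right[-1]))
    return pairs


print(head_modifier_pairs([("air", "N"), ("pollution", "N")]))
print(head_modifier_pairs([("pollution", "N"), ("of", "P"), ("the", "D"), ("air", "N")]))
```

Under these assumptions both calls give the single pair (pollution, air), whereas the adjacency criterion would produce two distinct terms for the same two phrases.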


avi (dot) arampatzis (at) gmail