Our experimental results with linguistically motivated indexing terms suggest that part-of-speech information is beneficial to indexing. We found that a traditional keyword-based indexing set can be reduced to only its nouns and adjectives without hurting effectiveness; the reduction even improved it slightly.
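The noun-and-adjective reduction can be illustrated with a minimal sketch, assuming the text has already been tagged by some part-of-speech tagger. The coarse tag names `NOUN` and `ADJ`, the function name `reduce_to_nouns_and_adjectives`, and the hand-tagged sentence are assumptions made for illustration; they are not necessarily the tag set or tooling used in the experiments.

```python
from typing import Iterable, List, Tuple

# Assumed coarse tag names; a real tagger may use a different tag set.
KEPT_TAGS = frozenset({"NOUN", "ADJ"})


def reduce_to_nouns_and_adjectives(tagged: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep only the tokens whose part-of-speech tag is a noun or an adjective."""
    return [token.lower() for token, tag in tagged if tag in KEPT_TAGS]


# Hand-tagged example sentence (illustrative input, not experimental data).
tagged_sentence = [
    ("Linguistic", "ADJ"),
    ("indexing", "NOUN"),
    ("improves", "VERB"),
    ("retrieval", "NOUN"),
    ("slightly", "ADV"),
]
index_terms = reduce_to_nouns_and_adjectives(tagged_sentence)
```

In this sketch the verb and the adverb are dropped, so the indexing set shrinks while the content-bearing terms remain.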
Augmenting indexing sets with composite terms resulted in significant improvements in effectiveness for both adjacent pairs and head-modifier pairs. Nevertheless, head-modifier pairs did not prove better than adjacent pairs, despite their syntactically canonical nature. The natural language processing techniques we used were very limited, but the investigation suggests that better linguistic tools would improve performance.
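A minimal sketch of augmenting an indexing set with adjacent pairs follows. It assumes the terms arrive as an ordered list and that a composite term is written by joining the two words with an underscore; both the function name `add_adjacent_pairs` and the joining convention are hypothetical choices, not the representation used in the experiments. Head-modifier pairs are not shown, since they require syntactic analysis to identify the head and its modifier.

```python
from typing import List


def add_adjacent_pairs(terms: List[str]) -> List[str]:
    """Return the single terms followed by one composite term per adjacent pair."""
    # zip(terms, terms[1:]) yields each term together with its right neighbour.
    pairs = [f"{left}_{right}" for left, right in zip(terms, terms[1:])]
    return terms + pairs


# Illustrative input only.
augmented = add_adjacent_pairs(["information", "retrieval", "system"])
```

The single terms are kept alongside the pairs, so the composite terms extend the indexing set rather than replace it.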
Lemmatization was not found to produce significant improvements over stemming, although lemmatization is considered less error-prone. In fact, both forms of morphological normalization were found not to improve effectiveness significantly in information-seeking environments characterized by relatively complete and accurate information needs, such as classification, categorization, or routing with sufficient training data. Morphological normalization still seems beneficial, however, when information needs are incomplete and imprecise, as with short retrieval queries or the bootstrapping phase of filtering tasks. In any case, morphological normalization, like part-of-speech information, may be used to assist feature reduction techniques.
Our current research effort addresses several issues. We intend, first, to develop more extensive syntactic normalization techniques; second, to replace the temporary solution of unnesting phrase frames with some kind of structural matching; third, to develop a proper weighting scheme for phrase frames; and fourth, to incorporate lexico-semantic normalization as well. The overall goal is to break out of the traditional, long-standing bag-of-words paradigm. This goal may seem rather ambitious, but it is not impossible.