The purpose of an automated information seeking system is to process information sources and provide users with the information they need. The particular nature of an information seeking process is determined by the characteristics of the information needs and the information sources, such as their rate of change. For instance, Information Retrieval assumes a one-time user request and a static collection of information objects, while Information Filtering assumes a long-term user interest and a dynamic collection. What all seeking processes have in common is a technique for representing needs and sources. Representation makes it possible to automate the computation of relevance between needs and sources. The process of building such representations is widely known as indexing.
Representations are usually derived from the contents of objects. In the case of textual objects (documents), words taken directly from the document's text are augmented with weights and traditionally used to form a bag-of-words representation, which disregards the linguistic context. The inadequacies of this representation have been identified by many researchers. The most obvious ones originate from linguistic variation at the morphological, syntactic, and semantic levels of natural language.
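As an illustration of what is lost, the following minimal sketch builds a bag-of-words representation for two short texts. The tokenization rule and the use of raw term frequency as the weight are assumptions made for this example only; the function name bag_of_words is hypothetical.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map a text to {word: weight}.

    Assumption: the weight is the raw term frequency, and a word is any
    run of ASCII letters after lowercasing.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


doc_a = bag_of_words("The dog bites the man")
doc_b = bag_of_words("The man bites the dog")

# Both texts contain the same words with the same frequencies, so their
# representations are identical although their meanings differ.
print(doc_a == doc_b)
```

Because the two representations coincide, any matching function defined on them cannot tell the two texts apart; this is the loss of linguistic context referred to above.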
Linguistic variation is responsible for the many alternative formulations of a single meaning. Briefly, morphology allows affix changes in words as a result of syntax. In addition, syntax determines whether and how words are associated. Lexical semantics concerns words which can be used in more than one sense or, conversely, senses which can be expressed using different words. Many researchers have developed techniques to deal with linguistic variation; nevertheless, empirical evaluations have produced mixed results, leaving the matter unsettled.
In this article, we describe an experiment performed as a first step towards validating a linguistically-motivated indexing model. This model incorporates, in a single indexing and matching scheme, various techniques dealing with the types of linguistic variation which are most relevant to information seeking tasks. The approach taken here is based on a Part-Of-Speech (POS) tagger and syntactic pattern matching. First, we experimented with representations based on combinations of different POS categories. These representations combine the category of nouns with those of adjectives, verbs, and adverbs. The different representational choices are compared to the baseline of using all keywords as index terms. Then, we experimented with composite terms which were built, first, using a simple criterion such as word adjacency and, second, using syntactic structure such as word modification. We also investigated the effect of morphological normalization by means of lemmatization, which can be seen as POS-directed stemming. Evaluation is done in a classification environment using precision and recall.
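The representational choices outlined above can be sketched roughly as follows. This is an illustration under stated assumptions, not the experimental system itself: the input is taken to be already POS-tagged, the tag names (NOUN, ADJ, VERB, ADV) are hypothetical placeholders for whatever tagset the tagger produces, and only the adjacency criterion for composite terms is shown, not syntactic modification. The precision and recall function uses the standard set-based definitions for a single class.

```python
# Hypothetical coarse tag names; a real tagger's tagset will differ.
CONTENT_TAGS = {"NOUN", "ADJ", "VERB", "ADV"}


def select_terms(tagged, categories):
    """Keep the (lowercased) words whose POS tag is in `categories`."""
    return [word.lower() for word, tag in tagged if tag in categories]


def adjacent_pairs(tagged, categories=CONTENT_TAGS):
    """Build composite terms from immediately adjacent words whose tags
    are both in `categories` (the simple adjacency criterion)."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 in categories and t2 in categories:
            pairs.append((w1.lower(), w2.lower()))
    return pairs


def precision_recall(assigned, relevant):
    """Set-based precision and recall for one class.

    `assigned` holds the documents the classifier put in the class,
    `relevant` the documents that truly belong to it.
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# A hand-tagged toy sentence standing in for tagger output.
tagged = [
    ("The", "DET"), ("automatic", "ADJ"), ("indexing", "NOUN"),
    ("of", "ADP"), ("documents", "NOUN"), ("works", "VERB"), ("well", "ADV"),
]

print(select_terms(tagged, {"NOUN"}))         # nouns only
print(select_terms(tagged, {"NOUN", "ADJ"}))  # nouns combined with adjectives
print(adjacent_pairs(tagged))                 # adjacency-based composite terms
print(precision_recall({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"}))
```

Passing different category sets to select_terms corresponds to the different combinations of nouns with adjectives, verbs, and adverbs, while the all-keywords baseline would keep every word regardless of its tag.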
The rest of this article is organized as follows. In section 2, we briefly present the linguistically-motivated indexing model we want to validate with this line of research. In section 3, we summarize and justify our representational choices. In section 4, we describe the experimental system, the algorithms used, the evaluation measures, the dataset, and the pre-processing applied to it. In section 5, experiments and results are discussed. In section 6, conclusions are drawn and directions for further research are identified.