7. A Linguistically-motivated Retrieval System Architecture

In this section we discuss how a linguistically-motivated retrieval system, like the one described in this article, can be implemented. Until now, we have assumed that an immaculate linguistic analysis is available, disregarding technical implementation details. However, when such a retrieval system is put into practice, the inefficiency and ineffectiveness of currently available NLP techniques become apparent. A major source of ineffectiveness is linguistic ambiguity, some of which can be resolved with shallow methods, while the rest requires sophisticated semantic analysis. Furthermore, NLP can be so time-consuming that it becomes impractical for real-world applications. Lacking deep semantic analysis, some design decisions have to be made in order to make a linguistically-motivated retrieval system usable in the real world.

Given a collection of text documents, the indexing task assigns to each document a characterization in the form of (weighted) phrase frames. Phrase frames are derived from documents through a sequence of processing steps (a sketch of the overall pipeline follows the list):

1. Tokenization.
2. Part-of-speech tagging.
3. Morphological normalization.
4. Collocation identification.
5. Lexico-semantical normalization.
6. Syntactic analysis.
7. Syntactical normalization.
8. Weighting.

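To make the pipeline concrete, a minimal skeleton in Python is sketched below; the stage names are hypothetical placeholders for the eight steps above, not references to an existing system. Sketches of several individual stages follow in the remainder of this section.

    # Minimal indexing-pipeline skeleton: each stage function implements
    # one of the eight steps above and feeds its output to the next.
    # (All stage names are hypothetical.)
    def index_document(text, stages):
        representation = text
        for stage in stages:
            representation = stage(representation)
        return representation

    # Example wiring (hypothetical function names):
    # stages = [tokenize, tag_parts_of_speech, normalize_morphology,
    #           identify_collocations, normalize_lexico_semantics,
    #           parse, normalize_syntax, weight_phrase_frames]
    # phrase_frames = index_document(raw_text, stages)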
The tokenization step consists of detecting sentence boundaries and then dividing sentences into words. This can be implemented adequately using capitalization rules, spacing, tabbing, and considerations of the document's layout.
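A minimal sketch of such rule-based tokenization might look as follows; the regular expressions are illustrative approximations (real systems need extra rules for abbreviations, layout, and so on).

    import re

    def tokenize(text):
        """Split text into sentences, then sentences into words.

        Sentence boundaries are approximated by a period, question mark,
        or exclamation mark followed by whitespace and a capital letter.
        """
        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
        return [re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+", s)
                for s in sentences]

    # tokenize("Tokenization is simple. Or is it?")
    # -> [['Tokenization', 'is', 'simple'], ['Or', 'is', 'it']]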

Part-of-speech tagging assigns a part-of-speech label to each word in a text, depending on the labels assigned to the words around it. More than one label may be assignable to a word, indicating lexical ambiguity in the input. A simple way to overcome this ambiguity is to retain only the most probable label for an ambiguous word, based on the occurrence frequencies of the word under each of its possible parts of speech. Another solution is to postpone lexical ambiguity resolution until syntactic analysis; syntactic rules can resolve some lexical ambiguity, but not all. Taking collocations as single units may also resolve some lexical ambiguity. For example, while social can be either an adjective or a noun, social insurance taken as a single unit is a noun collocation because it functions as a noun. After part-of-speech tagging, morphological normalization is performed, guided by the assigned labels.
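The frequency-based disambiguation just described could be sketched as follows; the toy tag counts stand in for statistics that would be collected from a tagged corpus.

    # Pick the most probable tag for a word, based on how often the word
    # occurs under each of its possible parts of speech in a tagged
    # corpus (toy counts here; a real lexicon would be corpus-derived).
    TAG_COUNTS = {
        'social':    {'ADJ': 95, 'NOUN': 5},
        'insurance': {'NOUN': 100},
    }

    def most_probable_tag(word, default='NOUN'):
        counts = TAG_COUNTS.get(word.lower())
        if not counts:
            return default  # unknown word: fall back to a default tag
        return max(counts, key=counts.get)

    # most_probable_tag('social') -> 'ADJ'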

Static collocation lists or word co-occurrence statistics can be used to identify collocations. Identified collocations are treated as single units in subsequent processing steps. Lexico-semantical normalization is the next step, assuming it is implemented by semantical clustering or expansion. If it is implemented as a semantical similarity function instead, it is performed during the matching of documents to queries rather than during indexing.
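Co-occurrence-based collocation identification can be approximated with a standard association measure such as pointwise mutual information over adjacent word pairs; the following sketch assumes token lists like those produced by the tokenization step above.

    import math
    from collections import Counter

    def score_bigrams(sentences, min_count=5):
        """Rank adjacent word pairs by pointwise mutual information (PMI).

        Frequent, high-PMI pairs (e.g. 'social insurance') are candidate
        collocations to be treated as single units later on.
        """
        unigrams, bigrams = Counter(), Counter()
        for words in sentences:
            unigrams.update(words)
            bigrams.update(zip(words, words[1:]))
        n = sum(unigrams.values())
        scores = {}
        for (w1, w2), c in bigrams.items():
            if c >= min_count:
                scores[(w1, w2)] = math.log(
                    (c * n) / (unigrams[w1] * unigrams[w2]))
        return sorted(scores, key=scores.get, reverse=True)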

Syntactic analysis, or parsing, reveals syntactic relations between words, collocations, and phrases in a sentence. Syntactic relations are identified based on syntactic rules (a grammar). Given the part-of-speech information for a text, syntactic rules can be formulated over sequences of part-of-speech labels, e.g. the combination adjective-noun surrounded by certain other part-of-speech labels is a noun phrase. Structural ambiguity (what-modifies-what) may occur during analysis. For instance, every noun phrase with three or more words, two or more of which are nouns, is a potential source of structural ambiguity. To disambiguate such structures, statistical methods can be applied. In the case of noun phrases, first, frequency information for all 2-word noun phrases is collected from the corpus. Then all 3-word noun phrases are disambiguated by assigning to them the most probable structure based on the frequencies of 2-word noun phrases. Gradually, this can be applied up to n-word noun phrases, based on the frequencies of all previously disambiguated k-word noun phrases (k < n). Where not enough frequency information is available, left dependence may be assigned, since it is the most probable modification structure in the English noun phrase. A similar statistical approach can be developed to resolve the prepositional phrase attachment problem, guided by subcategorization information about nouns and verbs.
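The corpus-based disambiguation of a 3-word noun phrase described above might be sketched as follows; bigram_freq stands for the collected 2-word noun phrase frequencies, and ties (including lack of evidence) default to left dependence, as in the text.

    def bracket_three_word_np(w1, w2, w3, bigram_freq):
        """Choose a bracketing for a 3-word noun phrase.

        Compare corpus frequencies of the competing 2-word sub-phrases:
        left-branching  [[w1 w2] w3] is supported by freq(w1, w2),
        right-branching [w1 [w2 w3]] by freq(w2, w3). Ties default to
        left dependence, the most probable structure in English.
        """
        left = bigram_freq.get((w1, w2), 0)
        right = bigram_freq.get((w2, w3), 0)
        return ((w1, w2), w3) if left >= right else (w1, (w2, w3))

    # bracket_three_word_np('social', 'insurance', 'fraud',
    #                       {('social', 'insurance'): 40,
    #                        ('insurance', 'fraud'): 12})
    # -> (('social', 'insurance'), 'fraud')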

The next step, syntactical normalization, may be incorporated into the parser so that the parser outputs regularized parse tree representations, e.g. phrase frames. As soon as the collection of documents is translated into a phrase frame representation, phrase frames can be weighted according to their frequency characteristics and structure.
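One plausible instantiation of weighting by frequency characteristics is a tf-idf style scheme over phrase frames, sketched below; representing a frame as a hashable value is an assumption made for the sketch, not a commitment of the article.

    import math
    from collections import Counter

    def weight_frames(doc_frames, collection_frames, n_docs):
        """Assign a tf-idf style weight to each phrase frame of a document.

        doc_frames:        list of (hashable) phrase frames of one document
        collection_frames: Counter of document frequencies per frame
        n_docs:            number of documents in the collection
        """
        tf = Counter(doc_frames)
        return {frame: count * math.log(n_docs / max(collection_frames[frame], 1))
                for frame, count in tf.items()}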

A procedure similar to the indexing steps above can be followed to turn a natural language query into a phrase frame representation, allowing queries to be matched against documents. The indexing procedure just described can replace the indexing part of a conventional retrieval system architecture; there is no obvious reason why radical architectural changes should be made. Inverted files and the vector space and probabilistic retrieval models are still suitable and can be adapted to work with linguistically-motivated indexing terms. What really changes is the way indexing terms are extracted from documents, and how they are matched. The current inefficiency and ineffectiveness of NLP techniques can be treated, for the time being, with statistical solutions like the (crude) ones described above. Fortunately, the rapid growth in available computational power, combined with the efforts computational linguists are putting into NLP issues, suggests that the use of linguistically-motivated retrieval systems in everyday practice is merely a matter of time.
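As an illustration of how little the architecture needs to change, the following sketch builds a conventional inverted file keyed on phrase frames instead of words and scores documents with a simple inner-product match; the data layout is hypothetical and continues the representation assumed in the previous sketches.

    from collections import defaultdict

    def build_inverted_file(doc_weights):
        """{doc_id: {frame: weight}} -> {frame: [(doc_id, weight), ...]}"""
        index = defaultdict(list)
        for doc_id, frames in doc_weights.items():
            for frame, weight in frames.items():
                index[frame].append((doc_id, weight))
        return index

    def match(query_frames, index):
        """Inner-product match: sum weights of frames shared with the query."""
        scores = defaultdict(float)
        for frame, q_weight in query_frames.items():
            for doc_id, d_weight in index.get(frame, []):
                scores[doc_id] += q_weight * d_weight
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)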

