The bag-of-words paradigm has dominated commercially available information retrieval systems for about three decades. Systems built on such simple assumptions, like the naive Keyword Retrieval Hypothesis, have endured for two main reasons. First, they are relatively easy to implement (it takes a third-year computer science student with knowledge of a programming language, an IR textbook, and a few days' time). Second, and most important, until recently these systems delivered satisfactory effectiveness when searching collections in the megabyte range.
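To make the simplicity of such systems concrete, the following is a minimal sketch of keyword retrieval under the bag-of-words assumption. It is an illustration only, not the design of any particular system: the names `tokenize`, `score`, and `rank` are hypothetical, and the scoring (summing the document frequency of each distinct query term) is an assumed, deliberately crude choice.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, document):
    """Sum, over the distinct query terms, their frequency in the document.

    Both texts are reduced to unordered bags of words, so word order and
    syntactic structure play no role in the score (assumed scoring scheme).
    """
    doc_counts = Counter(tokenize(document))
    return sum(doc_counts[term] for term in set(tokenize(query)))


def rank(query, documents):
    """Return (document index, score) pairs with a nonzero score, best first."""
    scored = [(i, score(query, doc)) for i, doc in enumerate(documents)]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: (-pair[1], pair[0]))
```

Because the query is treated as an unordered set of terms, `score("retrieval information", doc)` and `score("information retrieval", doc)` are always equal, which is the kind of limitation that motivates moving beyond keywords.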
The digital and networking revolution has made available data in the gigabyte range, exposing the inadequacy of keyword-based systems. Searching for information has become a laborious task for a user who, for example, now has to make her or his own selection from the ``dirty'' output of a World-Wide Web search engine. As a consequence, many researchers have aimed at higher levels of natural language utilization in IR, assuming that better ``understanding'' of the information need, as well as of the information residing in a database, is the key to improving retrieval effectiveness.
However, attempts to break out of the bag-of-words paradigm by employing NLP and other linguistic resources have so far produced inconsistent or at least dubious results. One explanation for why NLP has not had more success in IR is that it does not go far enough. First, the currently available NLP techniques lack accuracy and efficiency; second, it is doubtful whether syntactic structure is a good substitute for semantic content. The evidence so far calls for further investigation and better modeling.
In this article, we have reviewed some of the most important research in the field and discussed a general model for a linguistically motivated retrieval system. We believe that a retrieval schema which is based on the Phrase Retrieval Hypothesis and incorporates linguistic normalization has more potential for improving retrieval effectiveness than keyword-based schemas. We have suggested a suitable model and some techniques; however, whether the discussed techniques work remains unclear, and answering that question requires more empirical data.
Considering that better IR means more user satisfaction, perhaps a more radical change in the focus of IR is needed. Maybe the future of IR is not to provide better ranking of retrieved documents, but to supply the very information a user is seeking. A compact summary of retrieval results, or a brief answer, might be more usable for an average user than a ranked list of hundreds of documents. However, automatic summarization, question answering, and information extraction systems require advanced NLP techniques. Furthermore, the traditional precision- and recall-based retrieval quality metrics may not be able to evaluate the ability of a system to derive such information; consequently, other metrics will have to be developed. Nevertheless, one thing seems certain for the future: NLP and other linguistic resources will become -- if they are not already becoming -- indispensable parts of every effective IR system.
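For reference, the traditional metrics mentioned above can be sketched as follows, using the standard set-based definitions of precision and recall. The function name `precision_recall` and the use of document identifiers as strings are assumptions made for illustration; returning 0.0 for an empty set is likewise an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    retrieved: identifiers of the documents the system returned.
    relevant:  identifiers of the documents judged relevant.
    An empty denominator yields 0.0 (an assumed convention).
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

Both quantities are defined over sets of documents. A system that returns a summary or a direct answer produces no such set to compare against relevance judgments, which is one way to see why these metrics may not carry over unchanged.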