next up previous
Next: 7. A Linguistically-motivated Retrieval Up: Linguistically-motivated Information Retrieval Previous: 5. Linguistic Normalization

   
6. Weighting and Matching

Term weighting is a crucial part of any information retrieval system. Statistical weighting schemes such as tf.idf, which perform well for single terms, do not seem to extend to multi-word terms. Most work on the use of multi-word indexing terms in IR has concentrated on representation and matching strategies; little consideration has been given to weighting and to the scoring of matched documents. An obvious weighting strategy for phrasal terms is to weight a term as a function of the weights of its components. However, such strategies have not produced uniform results [18,30]. We suggest a simple weighting scheme, suitable for phrase frames, which takes into account the modification structure and its depth.

Phrase frames may contain nested phrase frames (sub-frames) at different depths. To simplify the structural matching of complicated phrase frames, the strategy of unnesting can be followed: the unnesting of a phrase frame produces all possible sub-frames, down to single-term frames. This is easier to understand by visualizing a phrase frame as a tree: the root node is the main head, and every node is modified by its child nodes. Such an abstract tree is depicted in figure 3.

  
Figure 3: Tree visualization of a phrase frame p with a sub-frame q at depth k.
\begin{figure}
\centering
\epsfxsize=2.5in
\leavevmode
\epsfbox{frame_tree.eps}
\end{figure}

Unnesting produces all possible triangles q of all possible sizes and depths. The main head of a frame carries more semantic information than any other element in the frame. The other elements modify the head, increasing the amount of semantic information carried by the frame. The amount of information an element adds to the frame is inversely proportional to the element's depth within the frame.
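The unnesting procedure can be sketched as follows. This is a hedged illustration, not the authors' implementation: a phrase frame is modeled here as a nested tuple (head, [modifier frames]), a hypothetical representation chosen only to make the tree structure explicit.

```python
def unnest(frame, depth=0):
    """Yield (sub_frame, depth) pairs for a frame and all nested sub-frames.

    A frame is a tuple (head, [modifier frames]); every modifier of a
    node at depth k is itself a frame rooted at depth k + 1.
    """
    head, modifiers = frame
    yield frame, depth
    for mod in modifiers:
        yield from unnest(mod, depth + 1)

# Hypothetical example: head "system", modified by the sub-frame
# ("retrieval", [("information", [])]).
p = ("system", [("retrieval", [("information", [])])])
for q, k in unnest(p):
    print(q[0], k)
# prints: system 0, retrieval 1, information 2
```

Single-term frames fall out of the recursion as the leaves of the tree, matching the requirement that unnesting proceeds down to single-term frames.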

First we introduce the predicate $\mbox{\it sub}(p,q,k)$ as a shorthand for the expression: phrase frame p has phrase frame q as a sub-frame at depth k. The depth weight of sub-frame q obtained from frame p can be expressed as:

\begin{displaymath}\mbox{\it dw}(q,p)= \sum_{k:\mbox{\it sub}(p,q,k)} \frac{1}{1+k}
\end{displaymath}

The sum accounts for more than one occurrence of sub-frame q within p (rather rare, because stylistic considerations in natural language do not favour repetitions of the same sub-phrase within an NP or VP). Let a document d have the set C(d) of phrase frames as its characterization, augmented with all the unnested terms down to single terms. Then the frame frequency of q within document d can be described as:

\begin{displaymath}\mbox{\it ff}(q,d) = \sum_{p \in C(d)} \mbox{\it dw}(q,p)
\end{displaymath}
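The two definitions above translate directly into code. In this hedged sketch it is assumed that each frame has already been unnested into a list of (sub-frame, depth) pairs; the frame names in the toy example are hypothetical.

```python
def dw(q, unnested_p):
    """Depth weight of sub-frame q within frame p: the sum of 1/(1+k)
    over every depth k at which q occurs in p's unnested pairs."""
    return sum(1.0 / (1 + k) for sub, k in unnested_p if sub == q)

def ff(q, doc_frames):
    """Frame frequency of q in a document, where doc_frames is the list
    of unnested (sub-frame, depth) pair lists of the document's frames."""
    return sum(dw(q, unnested_p) for unnested_p in doc_frames)

# Hypothetical document with a single frame whose unnesting is:
doc = [[("retrieval system", 0), ("retrieval", 1), ("system", 1)]]
print(ff("retrieval", doc))  # 1/(1+1) = 0.5
```

Note how a sub-frame at depth 0 (the frame itself) contributes a full count of 1, while deeper sub-frames contribute progressively less, reflecting the claim that information added is inversely proportional to depth.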

The geometrical length of the document frame vector in the N-dimensional frame space is:

\begin{displaymath}l(d) = \sqrt{\sum_{q \in C(d)} \mbox{\it ff}(q,d)^2}
\end{displaymath}

The weight of frame q within document d is estimated by:

\begin{displaymath}w(q,d) = \mbox{\it ff}(q,d) / l(d)
\end{displaymath}

The similarity between a document d and a query q is then estimated by the dot-product formula:

\begin{displaymath}S(d,q) = \sum_{r \in C(d)\cap C(q)} w(r,d)\ast w(r,q)
\end{displaymath}
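The normalization and matching steps above can be sketched in a few lines. This assumes the frame frequencies ff(q, d) have already been computed and are stored in a dict per document or query; the example values are hypothetical.

```python
import math

def weights(ff_map):
    """Divide each frame frequency by the vector's geometric length l(d),
    yielding the length-normalized weights w(q, d)."""
    l = math.sqrt(sum(f * f for f in ff_map.values()))
    return {q: f / l for q, f in ff_map.items()}

def similarity(doc_ff, query_ff):
    """S(d, q): dot product of weights over the frames common to both."""
    wd, wq = weights(doc_ff), weights(query_ff)
    return sum(wd[r] * wq[r] for r in wd.keys() & wq.keys())

# Hypothetical frame frequencies; identical characterizations score 1.0.
d = {"retrieval system": 1.5, "retrieval": 0.5, "system": 0.5}
print(round(similarity(d, d), 6))  # 1.0
```

Because both vectors are normalized to unit length, S(d, q) is the cosine of the angle between them and ranges between 0 and 1, which makes scores comparable across documents of different lengths.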

Using the last formula, the documents of a collection can be ranked in response to a query.

