2 Thresholding a Ranked List

Essentially, the task of selecting $ K$ is equivalent to thresholding in binary classification or filtering. Thus, we recruited and adapted a method first appeared in the TREC 2000 Filtering Track, namely, the score-distributional threshold optimization (s-d) [2,3].

2.1 The S-D Threshold Optimization

Let us assume an item collection of size $ n$ , and a query for which all items are scored and ranked. Let $ P(s\vert 1)$ and $ P(s\vert)$ be the probability densities of relevant and non-relevant documents as a function of the score $ s$ , and $ F(s\vert 1)$ and $ F(s\vert)$ their corresponding cumulative distribution functions (cdfs). Let $ G_n \in [0,1]$ be the fraction of relevant documents in the collection of all $ n$ documents, also known as generality. The total number of relevant documents in the collection is given by

$\displaystyle R$ $\displaystyle = n\,G_n$ (1)

while the expected numbers of relevant and non-relevant documents with scores greater than $ s$ are

$\displaystyle R_+(s)$ $\displaystyle = R\, \left(1-F(s\vert 1)\right)$ (2)
$\displaystyle N_+(s)$ $\displaystyle = \left(n - R \right) \left(1-F(s\vert)\right)$ (3)

respectively. The expected numbers of the relevant and non-relevant documents with scores $ \le s$ respectively are

$\displaystyle R_-(s)$ $\displaystyle = R - R_+(s)$ (4)
$\displaystyle N_-(s)$ $\displaystyle = \left( n - R \right) - N_+(s) \,$ (5)

Let us now assume an effectiveness measure $ M$ of the form of a linear combination the document counts of the categories defined by the four combinations of relevance and retrieval status, for example a linear utility [18]. From the property of expectation linearity, the expected value of such a measure would be the same linear combination of the above four expected document numbers. Assuming that the larger the $ M$ the better the effectiveness, the optimal score threshold $ s_\theta$ which maximizes the expected $ M$ is

$\displaystyle s_\theta =\operatornamewithlimits{arg\,max}_{s} \left\{ M(R_+(s) , N_+(s) , R_-(s), N_-(s)) \right\}$ (6)

Given $ n$ , the only unknowns which have to be estimated are the densities $ P(s\vert 1)$ and $ P(s\vert)$ (or their cdfs), and the generality $ G_n$ .

So far, this is a clear theoretical answer to predicting $ s_\theta$ for linear effectiveness measures. In Section 2.3 we will see how to deal with non-linear measures, as well as, how to predict rank (rather than score) cut-offs.

2.2 Probability Thresholds

Given the two densities and the generality defined earlier, scores can be normalized to probabilities of relevance straightforwardly [2,14] by using the Bayes' rule.

Normalizing to probabilities is very important in tasks where several rankings need to be fused or merged such as in meta-search/fusion or distributed retrieval. This may also be important for thresholding when documents arrive one by one and decisions have to be made on the spot, depending on the measure under optimization. Nevertheless, it is unnecessary for thresholding rankings since optimal thresholds can be found on their scores directly, and it is furthermore unsuitable given $ F_1$ as the evaluation measure.

While for some measures there exists an optimal fixed probability threshold, for others it does not. Lewis [13] formulates this in terms of whether or not a measure satisfies the probability thresholding principle, and proves that the $ F$ measure does not satisfy it. In other words, how a system should treat documents with, e.g., 50% chance of being relevant depends on how many documents with higher probabilities are available.

The last-cited study also questions whether, for a given measure, an optimal threshold (not necessarily a probability one) exists, and goes on to re-formulate the probability ranking principle for binary classification. A theoretical proof is provided about the $ F$ measure satisfying the principle, so such an optimal threshold does exist. It is just a different rank or score threshold for each ranking.

2.3 The S-D Rank Optimization

The s-d threshold optimization method is based on the assumption that the measure $ M$ is a linear combination of the document counts of the four categories defined by the user and system decisions about relevance and retrieval status. However, measure linearity is not always the case, e.g. the $ F$ measure is non-linear.

Non-linearity complicates the matters in the sense that the expected value of $ M$ cannot be easily calculated. Given a ranked list, some approximations can be made simplifying the issue. If $ G_n$ , $ F(s\vert 1)$ , and $ F(s\vert)$ are estimated on a given ranking, then Equations 2-5 are good approximations of the actual document counts. Plugging those counts into $ M$ , we can now talk of actual $ M$ values rather than expected. The score threshold which maximizes $ M$ is given by Equation 6.

While $ M$ can be optimal anywhere in the score range, with respect to optimizing rank cutoffs we only have to check its value at the scores corresponding to the ranked documents, plus one extra point to allow for the possibility of an empty optimal retrieved set. Let $ s_k$ be the score of the $ k$ th ranked document, and define $ M_k$ as follows:

M_k =
M(R_+(s_k) , N_+(s_k) , R_-(s_k)...
...k)) & k=1, \dotsc, n\\
M(0, 0, R, n-R) & k=0\\

The optimal rank $ K$ is $ \operatornamewithlimits{arg\,max}_{k} M_k$ . This allows for K to become 0, meaning that no document should be retrieved.

avi (dot) arampatzis (at) gmail