5 Experiments

In this section, we will conduct a range of experiments with the truncated models of [5], which we discussed in great detail above. Since our focus is the thresholding problem, we use an off-the-shelf retrieval system: the vector-space model of Apache's $ \mathtt{Lucene}$ .

More information about the collection, topics, and evaluation measures can be found in the overview paper in this volume, and at the TREC Legal web-site.

5.1 Runs

For TREC Legal 2007 and 2008 we created the following runs:

Off-the-shelf LUCENE using the $ \mathtt{RequestText}$ as query, on a stemmed index, using the generic SMART stoplist. The 2007 rankings are truncated at 25k results.

This run is the run labeled $ \mathtt{catchup0701t}$ in [4].

Same as above, but in pre-processing this year's topics, we used the $ \mathtt{RequestText}$ field stop-listed by an extended list in which we manually included low-content words based on the topics of 2006 and 2007. All 2008 rankings are truncated at 100k items.

This runs is the basis for the official submissions labeled $ \mathtt{uva}$ - $ \mathtt{xcons}$ , $ \mathtt{uva}$ - $ \mathtt{xb}$ , and $ \mathtt{uva}$ - $ \mathtt{xk}$ .

For the threshold optimization, we first apply the original version of the score-distributional threshold optimization as it has been used, for example, in the Filtering track [2,3]:
sd original
First fitting a mixture of normal (for relevant) and exponential (for non-relevant) to the score distribution, and then calculate the rank that maximizes the $ F_1$ measure. Note that the fit may indicate an optimal rank threshold beyond the run's length (25k in 2007 and 100k in 2008), in which case we simply select the final rank.

This run corresponds to our official submission labeled $ \mathtt{uva}$ - $ \mathtt{xk}$ .

In this paper, we presented an improved version of the sd method in Section 3. The improvements that have the greatest impact on end-user effectiveness are:
  1. Use of truncated distributions [5] to account for natural score bounds or truncations.
  2. EM is run with different initial parameters, and better termination methods. We also now run it up to 100 times instead of 10.
  3. We used the square error before to select the best fit; we replaced this with the $ \chi^2$ which is more suitable for distributions.
  4. Optimal binning. Before, we used a fixed number of $ \max(5,
t/200)$ bins, which gave 500 bins (or a bit less after a left-truncation of the data) for the 2008 rankings.
Consequently, we provide here additional runs:
Theoretical Truncation
Runs using the theoretical truncation of Section 3.3.1. The B runs is down-sampled (a stratified sample of 1/3).
Theoretical Truncation
Runs using the technical truncation of Section 3.3.2. The A runs are down-sampled (a stratified sample of 1/3). Details of the effect of sampling and binning on the fits are in Table 1.

5.2 Results and Discussion

We first discuss the overall quality of the rankings, and then the main topic of this paper--estimating the cut-off $ K$ .

Table: Ranking quality for the Legal 2007 & 2008. The highest, lowest, and median are of the 23 submissions in 2008 using the $ \mathtt{RequestText}$ field only.
Run $ Prec@5$ $ Recall@B$ $ F_1@R$
Legal07 0.3302 0.1548 0.1328
Legal08 0.4846 0.2036 0.1709
highest 0.5923 0.2779 0.2173
median 0.4154 0.2036 0.1709
lowest 0.0538 0.0729 0.0694

The top half of Table 2 shows several measures on the two underlying rankings, Legal07 and Legal08. We show precision at 5 (all top-5 results were judged by TREC); estimated recall at $ B$ ; and the $ F_1$ of the estimated precision and recall at $ R$ (i.e. the estimated number of relevant documents).

To determine the quality of our rankings in comparison to other systems, we show the highest, lowest, and median performance of all submissions in the bottom half of Table 2. As it turns out, Legal08 obtains exactly the median performance for $ Recall@B$ and $ F_1@R$ when using all relevant documents in evaluation. Both rankings fare somewhat better than the median at $ Prec@5$ and in evaluating with the highly relevant documents only. It is clear that our rankings are far from optimal in comparison with the other submissions. On the negative side, this limits the performance of the s-d method. On the plus side, it makes our rankings good representatives of the median-quality ranking.

Table: Estimating cut-off $ K$ for the Legal 2007 & 2008. The highest, lowest, and median are of the 23 submissions using the $ \mathtt{RequestText}$ field. Statistical significance (t-test, one-tailed) at 95% ($ ^{\circ}$ ) and 99% ( $ ^{\bullet}$ ) against the original sd method.
    2007 2008
Run Truncation $ F_1@K$ $ F_1@K$
sd original None - 0.0681 $ \,^{\mbox{-}}$
B Theoretical 0.0984 0.1361$ ^{\circ}$
A Technical 0.1011 0.1284$ ^{\circ}$
highest   - 0.1848
median   - 0.0974
lowest   - 0.0051

Table 3 shows the results for the various thresholding methods. We see that the original s-d method stays well behind the $ F_1$ @$ R$ in Table 2. Although this comparison is unfair, the mean estimated number of relevant items is generally not known, we expected the original s-d method to do better.

All runs with the improved version of the s-d method lead to significantly better results. The B run use the theoretical truncation of Section 3.3.1, whereas the A runs use the technical truncation of Section 3.3.2. For 2007, the technically truncated model A is superior to the theoretically truncated model B. For 2008, the technically truncated A model lags somewhat behind the theoretically truncated B model. In comparison with the `old' non-truncated model, corresponding to our official TREC 2008 submission, both the truncated models obtain significantly better results.

We also show the highest, lowest, and median performance over the 23 submissions to TREC Legal 2008 (recall that the thresholding task is new at TREC 2008, so there is no comparable data for 2007). Note that the actual value of $ F_1@K$ is a result of both the quality of the underlying ranking and choosing the right threshold. As seen earlier, our ranking has the median $ Recall@B$ and $ F_1@R$ . With the estimated threshold of the s-d model, the $ F_1@K$ is 0.1374, well above the median score of 0.0974.

There is still amble room for improvement. The $ F_1@R$ in Table 2 is 0.1328 for 2007 and 0.1709 for 2008, and we obtain 75-80% of these scores. Obviously, $ R$ is not known in an operational system, and $ F_1@R$ serves as a soft upperbound on performance.

5.3 Further Analysis

Figure: $ F_1$ @$ R$ versus $ F_1$ @$ K$ as estimated by s-d method for all 26 topics of TREC Legal 2008.

Figure: S-D model predictions (top plots) versus the official evaluation (bottom plots).
  \includegraphics[width=4.0in]{plots/}   \includegraphics[width=4.0in]{plots/}  
  Topic 73 (2007)   Topic 105 (2008)  
  \includegraphics[width=4.0in]{plots/}   \includegraphics[width=4.0in]{plots/}  
  Topic 124 (2008)   Topic 145 (2008)  

Figure 6 show the F$ _1$ scores of the Legal 2008 B run, plotted against the "ceiling" of F$ _1$ at the estimated R. We will look in detail at some of the topics from 2007 and 2008 B runs:

Topic 73
$ B$ = 4,085; $ est.R$ = 31,894; $ K_{opt}$ = 22,091.
Topic 105
$ B$ = 36,549; $ est.R$ = 34,424; $ K_{opt}$ = 49,439.
Topic 124
$ B$ = 86,075; $ est.R$ = 20,083; $ K_{opt}$ = 44,524.
Topic 145
$ B$ = 40,315; $ est.R$ = 91,790; $ K_{opt}$ = 82,806.

Figure 7 compares the prediction of the s-d model with the official evaluation's estimated precision, recall, and F$ _1$ . Before discussing each of the topics in detail, an immediate observation is that the estimated (non-interpolated) precision is strikingly different from monotonically declining "ideal" precision curves.

For Topic 73 (Legal 2007), the estimated $ R$ exceeds the length of the ranking, and the $ K_{opt}$ corresponds to the last found relevant document at rank 22,091. The s-d model is clearly aiming too low and estimates $ R$ at 2,720 and $ K$ at 2,593.

Topic 105 (Legal 2008) has an $ R$ of 34,424, well within the length of the ranking, and the s-d model estimates an $ R$ of 36,503, near to the real $ R$ , and an estimated $ K$ of 28,952. The divergence in the prediction of $ K$ may be explained, in part, by the fact that $ K_{opt}$ always corresponds to a point where a relevant document is retrieved, and judged documents are very sparse down at this rank.

Topic 124 (Legal 2008) has an $ R$ of 20,083 and the s-d model predicts an $ R$ of 51,231 and a $ K$ of 43,597. Here, the $ R$ is overestimated but the $ K$ is very close to the $ K_{opt}$ . Topic 145 (Legal 2008) has an $ R$ of 91,790, very close to the length of the ranking. The s-d model predict an $ R$ of 87,060 and a $ K$ of 91,590, both relatively close to the official evaluation especially when bearing in mind that the $ K_{opt}$ is again at the last relevant document in the whole ranking.

avi (dot) arampatzis (at) gmail