In this section, we conduct a range of experiments with the truncated models of [5], discussed in detail above. Since our focus is the thresholding problem, we use an off-the-shelf retrieval system: the vector-space model of Apache's .
More information about the collection, topics, and evaluation measures can be found in the overview paper in this volume, and at the TREC Legal website.
For TREC Legal 2007 and 2008 we created the following runs:
This run is the run labeled in [4].
This run is the basis for the official submissions labeled  ,  , and  .
This run corresponds to our official submission labeled  .
We first discuss the overall quality of the rankings, and then turn to the main topic of this paper: estimating the cutoff K.
Run       P@5      Recall@R   F1@R
Legal07   0.3302   0.1548     0.1328
Legal08   0.4846   0.2036     0.1709
highest   0.5923   0.2779     0.2173
median    0.4154   0.2036     0.1709
lowest    0.0538   0.0729     0.0694
The top half of Table 2 shows several measures on the two underlying rankings, Legal07 and Legal08. We show precision at 5 (all top-5 results were judged by TREC); estimated recall at rank R; and the F1 of the estimated precision and recall at rank R, where R is the estimated number of relevant documents for the topic.
To determine the quality of our rankings in comparison to other systems, we show the highest, lowest, and median performance of all submissions in the bottom half of Table 2. As it turns out, Legal08 obtains exactly the median performance for Recall@R and F1@R when using all relevant documents in evaluation. Both rankings fare somewhat better than the median when evaluating with the highly relevant documents only. It is clear that our rankings are far from optimal in comparison with the other submissions. On the negative side, this limits the performance of the s-d method. On the plus side, it makes our rankings good representatives of a median-quality ranking.
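Concretely, the measures in Table 2 can be computed from a ranked list and a set of judged-relevant documents. A minimal sketch in Python (the function names and toy data are our own, not from the evaluation package):

```python
def precision_at(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def recall_at(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)

def f1_at(ranking, relevant, k):
    """Harmonic mean of precision and recall at rank k."""
    p, r = precision_at(ranking, relevant, k), recall_at(ranking, relevant, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Toy example: a 10-document ranking with R = 4 relevant documents.
ranking = ["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d6", "d5", "d0"]
relevant = {"d3", "d1", "d2", "d4"}
R = len(relevant)
print(precision_at(ranking, relevant, 5))  # P@5
print(f1_at(ranking, relevant, R))         # F1 at rank R
```

Note that in the actual TREC Legal evaluation, precision and recall at these depths are *estimated* from sampled judgments rather than computed exhaustively as above.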
Run           Truncation    F1@K (2007)   F1@K (2008)
s-d original  None          -             0.0681
B             Theoretical   0.0984        0.1361
A             Technical     0.1011        0.1284
highest       -             -             0.1848
median        -             -             0.0974
lowest        -             -             0.0051
All runs with the improved version of the s-d method lead to significantly better results. The B runs use the theoretical truncation of Section 3.3.1, whereas the A runs use the technical truncation of Section 3.3.2. For 2007, the technically truncated A model is superior to the theoretically truncated B model. For 2008, the technically truncated A model lags somewhat behind the theoretically truncated B model. In comparison with the `old' non-truncated model, corresponding to our official TREC 2008 submission, both truncated models obtain significantly better results.
We also show the highest, lowest, and median performance over the 23 submissions to TREC Legal 2008 (recall that the thresholding task is new at TREC 2008, so there is no comparable data for 2007). Note that the actual value of F1@K is a result of both the quality of the underlying ranking and choosing the right threshold. As seen earlier, our ranking has the median Recall@R and F1@R. With the estimated threshold of the s-d model, the F1@K is 0.1374, well above the median score of 0.0974.
There is still ample room for improvement. The F1@R in Table 2 is 0.1328 for 2007 and 0.1709 for 2008, and we obtain 75-80% of these scores. Obviously, R is not known in an operational system, and F1@R serves as a soft upper bound on performance.
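Assuming the 75-80% figure is simply the ratio of each truncated run's F1@K to the corresponding F1@R ceiling, the arithmetic can be checked directly with the values from Tables 2 and 3 (a sketch, our own bookkeeping):

```python
# F1@R ceiling per year (Table 2) and achieved F1@K per truncated run (Table 3).
ceiling = {"2007": 0.1328, "2008": 0.1709}
runs = {
    "B (theoretical)": {"2007": 0.0984, "2008": 0.1361},
    "A (technical)":   {"2007": 0.1011, "2008": 0.1284},
}

for name, scores in runs.items():
    for year, f1 in scores.items():
        ratio = f1 / ceiling[year]
        print(f"{name}, {year}: {100 * ratio:.0f}% of the F1@R ceiling")
```

The resulting ratios fall in the quoted 75-80% range (roughly 74% for the 2007 B run).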

Figure 6 shows the F1@K scores of the Legal 2008 B run, plotted against the "ceiling" of F1 at the estimated R. We will look in detail at some of the topics from the 2007 and 2008 B runs:
Figure 7 compares the predictions of the s-d model with the official evaluation's estimated precision, recall, and F1. Before discussing each of the topics in detail, an immediate observation is that the estimated (non-interpolated) precision is strikingly different from the monotonically declining "ideal" precision curves.
For Topic 73 (Legal 2007), the estimated R exceeds the length of the ranking, and the optimal K corresponds to the last found relevant document at rank 22,091. The s-d model is clearly aiming too low and estimates R at 2,720 and K at 2,593.
Topic 105 (Legal 2008) has an R of 34,424, well within the length of the ranking, and the s-d model estimates an R of 36,503, near to the real R, and an estimated K of 28,952. The divergence in the prediction of K may be explained, in part, by the fact that the optimal K always corresponds to a point where a relevant document is retrieved, and judged documents are very sparse down at this rank.
Topic 124 (Legal 2008) has an R of 20,083 and the s-d model predicts an R of 51,231 and a K of 43,597. Here, R is overestimated but the K is very close to the optimal K. Topic 145 (Legal 2008) has an R of 91,790, very close to the length of the ranking. The s-d model predicts an R of 87,060 and a K of 91,590, both relatively close to the official evaluation, especially when bearing in mind that the optimal K is again at the last relevant document in the whole ranking.
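The "optimal K" referred to for these topics is the cutoff that maximizes F1 over the ranking. A brute-force sketch of that search (function names and toy data are our own):

```python
def optimal_cutoff(rels, total_relevant):
    """Scan every cutoff k and return (k, F1@k) at the maximum.
    rels: 0/1 relevance indicators in rank order;
    total_relevant: number of relevant documents in the collection."""
    best_k, best_f1, found = 0, 0.0, 0
    for k, r in enumerate(rels, start=1):
        found += r
        p = found / k                    # precision at k
        rec = found / total_relevant     # recall at k
        f1 = 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1

# Toy ranking: relevant documents at ranks 1, 2, 4, and 10 (4 relevant in total).
rels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1]
k, f1 = optimal_cutoff(rels, total_relevant=4)
print(k, f1)
```

Between two relevant documents, precision falls while recall stays constant, so F1 can only peak at a rank where a relevant document is retrieved; this is why the optimal K in the topics above always coincides with a relevant document.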
avi (dot) arampatzis (at) gmail