6 Conclusions

We studied the problem of finding an "optimal" point to stop reading a ranked list, by selecting thresholds that optimize the $ F_1$ -measure. The approach taken employs the score-distributional threshold optimization (s-d), a non-parametric method proven effective for binary classification in earlier years. We made significant theoretical and computational improvements over the original method, and identified room for further improvements.

The method uses no other input than the document scores of a standard retrieval run, fit a mixture of (possibly truncated) normal and exponential distributions (normal for relevant, and exponential for non-relevant document scores), and calculate the optimal score threshold given the estimated distributions and their contributing weight. The experiments confirm that the s-d method is effective for determining thresholds, although there is still clear room for improvement: the effectiveness varies considerably per topic, with an average performance of 75-80% of $ F_1@R$ .

Assuming that a normal-exponential mixture is a good approximation for score distributions and that no relevance information is available, we believe that the improved methods described in this paper are a) as general as possible, b) they deal with most known theoretical anomalies and practical difficulties, and consequently, c) they bring us closer to the performance ceiling of s-d thresholding. If the effectiveness is deemed unsatisfactory, further improvements of s-d thresholding should come from using alternative mixtures or training data. Nevertheless, some other mixtures may be more difficult--or even impossible--to estimate. Acknowledgments

We thank Nir Nussbaum for producing the rankings and submitting the official runs. We also benefited from discussions with Stephen Robertson.

This research was supported by the Netherlands Organization for Scientific Research (NWO, grant # 612.066.513, 639.072.601, and 640.001.501).

avi (dot) arampatzis (at) gmail