Let us now elaborate on the form of the two densities and of Section 2.1 and their estimation. ^{3}
Score distributions have been modeled since the early years of IR with various known distributions [21,20,7,6]. However, the trend during the last few years, which started in [3] and was followed up in [8,1,2,14,22], has been to model score distributions by a mixture of normal-exponential densities: normal for relevant, exponential for nonrelevant.
Despite its popularity, it was pointed out recently that, under a hypothesis of how systems should score and rank documents, this particular mixture of normal-exponential presents a theoretical anomaly [17]. In practice, nevertheless, it has stood the test of time in the light of
In this paper, we do not set out to investigate alternative mixtures. We theoretically extend and refine the current model in order to account for practical situations, deal with its theoretical anomaly, and improve its computation. We also check its goodness-of-fit to empirical data using a statistical test; a check that has not been done before, as far as we are aware. At the same time, we explicitly state all parameters involved, try to minimize their number, and find a robust set of values for them.
Let us consider a general retrieval model which in theory produces scores in , where and . By using an exponential distribution, which has semi-infinite support, the applicability of the sd model is restricted to those retrieval models for which . The two densities are given by
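As a rough illustration of the standard normal-exponential mixture, the following Python sketch evaluates a normal density for relevant scores and an exponential density for nonrelevant ones. The names `mu`, `sigma`, `lam`, `s_min` and the mixing weight `g` are assumptions made for this example, as is anchoring the exponential at the lower score bound; they are not necessarily the notation used in the paper.

```python
import math


def normal_pdf(s, mu, sigma):
    """Normal density, used here for relevant-document scores."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def shifted_exp_pdf(s, lam, s_min=0.0):
    """Exponential density starting at s_min (assumed lower score bound),
    used here for nonrelevant-document scores."""
    if s < s_min:
        return 0.0
    return lam * math.exp(-lam * (s - s_min))


def mixture_pdf(s, g, mu, sigma, lam, s_min=0.0):
    """Total score density; g is the (assumed) fraction of relevant documents."""
    return g * normal_pdf(s, mu, sigma) + (1.0 - g) * shifted_exp_pdf(s, lam, s_min)
```

The finite `s_min` in `shifted_exp_pdf` mirrors the restriction stated above: the exponential needs a finite starting point, so the lower end of the score range cannot be unbounded.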
Over the years, two main problems of the normal-exponential model have been identified. We describe each one of them, and then introduce new models which eliminate the first problem and deal partly with the other.
From the point of view of how scores or rankings of IR systems should be, Robertson [17] formulates the recall-fallout convexity hypothesis:
For all good systems, the recall-fallout curve (as seen from [...] recall=1, fallout=0) is convex.
Similar hypotheses can be formulated as conditions on other measures, e.g., the probability of relevance should be monotonically increasing with the score; the same should hold for smoothed precision. Although, in reality, these conditions may not always be satisfied, they are expected to hold for good systems, i.e. those producing rankings satisfying the probability ranking principle (PRP), because their failure implies that systems can be easily improved.
As an example, let us consider smoothed precision. If it declines as score increases for a part of the score range, that part of the ranking can be improved by a simple random reordering [19]. This is equivalent to "forcing" the two underlying distributions to be uniform (i.e. have linearly increasing cdfs) in that score range. This will replace the offending part of the precision curve with a flat one (the least that can be done), improving the overall effectiveness of the system.
Such hypotheses put restrictions on the relative forms of the two underlying distributions. The normal-exponential mixture violates such conditions, only (and always) at both ends of the score range. Although the low-end scores are of little importance, the top of the ranking is very significant, especially for low topics. The problem is a manifestation of the fact that an exponential tail extends further than a normal one.
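The effect of the heavier exponential tail can be made concrete for the probability of relevance, which is a monotonic function of the likelihood ratio of the two densities. For a normal density over an exponential one, the log-likelihood ratio is a downward parabola in the score, so it peaks and then declines. The sketch below uses assumed parameter names (`mu`, `sigma`, `lam`, `s_min`) and an exponential anchored at `s_min`; it covers the probability of relevance only, not precision.

```python
import math


def log_likelihood_ratio(s, mu, sigma, lam, s_min=0.0):
    """log( normal(s) / exponential(s) ) for s >= s_min."""
    log_normal = -0.5 * ((s - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
    log_exp = math.log(lam) - lam * (s - s_min)
    return log_normal - log_exp


def turning_point(mu, sigma, lam):
    """Score at which the log-likelihood ratio stops increasing.

    Setting the derivative -(s - mu) / sigma**2 + lam to zero
    gives s = mu + lam * sigma**2.
    """
    return mu + lam * sigma ** 2
```

Above `turning_point(mu, sigma, lam)` the modeled probability of relevance falls as the score rises, which is the kind of violation at the top of the ranking described above.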
To complicate matters further, our data suggest that such conditions are violated at a different score for the probability of relevance and for precision. Since the measure we are interested in is a combination of recall and precision (and recall by definition cannot have a similar problem), we find for precision. We force the distributions to comply with the hypothesis only when , where the score of the top document; otherwise, the theoretical anomaly does not affect the score range. If is finite, then two uniform distributions can be used in as mentioned earlier. Alternatively, preserving a theoretical support in , the relevant documents distribution can be forced to an exponential in with the same as that of the nonrelevant. We apply the alternative.
In fact, rankings can be further improved by reversing the offending subrankings; this will force the precision to increase with an increasing score, leading to better effectiveness than randomly reordering the subranking. However, the big question here is whether the initial ranking satisfies the PRP or not. If it does, then the problem is an artifact of the normal-exponential model and reversing the subranking may actually be dangerous to performance. If it does not, then the problem is inherent in the scoring formula producing the ranking. In the latter case, the normal-exponential model cannot be theoretically rejected, and it may even be used to detect the anomaly and improve rankings.
It is difficult, however, to determine whether a single ranking satisfies the PRP or not; it has been well known since the early IR years that precision for single queries is erratic, especially at early ranks, justifying the use of interpolated precision. On the one hand, according to interpolated precision all rankings satisfy the PRP, but this is forced by the interpolation. On the other hand, according to simple precision some of our rankings do not seem to satisfy the PRP, but we cannot determine this for sure. We would expect, however, that using precision averaged over all topics should produce a (more or less) declining curve with an increasing rank. Figure 1 suggests that the off-the-shelf system we currently use produces rankings that may not satisfy the PRP for ranks 5,000 to 10,000, on average.

Consequently, we prefer to leave open the question of whether the problem is inherent in some scoring functions or introduced by the combined use of normal and exponential distributions. Being conservative, we just randomize the offending subrankings rather than reversing them. The impact of this on thresholding is that the sd method turns "blind" inside the upper offending range; as one goes down the corresponding ranks, precision would be flat and recall naturally rising, so the optimal threshold can only be below the range.
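The conservative randomization described above can be sketched in a few lines of Python. This is only an illustration: the function name, the half-open rank interval `[start, end)`, and the `seed` parameter are assumptions made for the example, and how the offending range is detected is left out.

```python
import random


def randomize_subranking(ranking, start, end, seed=None):
    """Return a copy of `ranking` with positions [start, end) shuffled.

    Documents outside the offending range keep their ranks; inside it,
    the order is made uniformly random instead of being reversed.
    """
    rng = random.Random(seed)
    segment = list(ranking[start:end])
    rng.shuffle(segment)
    return list(ranking[:start]) + segment + list(ranking[end:])
```

For example, `randomize_subranking(doc_ids, 0, 20)` would shuffle only the top 20 positions of a hypothetical list `doc_ids` and leave the rest of the ranking untouched.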
We will use new models that, although they do not eliminate the problem, also do not always violate such conditions imposed by the PRP (irrespective of whether it holds or not).
In order to enforce support compatibility, Arampatzis et al. [5] introduced truncated models which we will discuss in this and the next section. They introduced a left-truncated at normal distribution for . With this modification, we reach a new mixture model for score distributions with a semi-infinite support in .
In practice, however, scores may be naturally bounded (by the retrieval model) or truncated to the upside as well. For example, cosine similarity scores are naturally bounded at 1. Scores from probabilistic models with a (theoretical) support in are usually mapped to the bounded via a logistic function. Other retrieval models may just truncate at some maximum number for practical reasons. Consequently, it makes sense to introduce a right-truncation as well, for both the normal and exponential densities.
Depending on how one wants to treat the leftovers due to the truncations, two new models may be considered.
There are no leftovers (Figure 2). The underlying theoretical densities are assumed to be the truncated ones, normalized accordingly to integrate to one:
Let and be the random variables corresponding to the relevant and nonrelevant document scores respectively. The expected value and variance of are given by Equations 24 and 25 in Appendix B.3. For , the corresponding Equations are 26 and 27 in Appendix B.4.
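A minimal sketch of the theoretically truncated densities, assuming the usual construction: each density is restricted to the score range and divided by the probability mass it has there, so that it integrates to one. The names `s_min`, `s_max`, `mu`, `sigma` and `lam` are assumptions for this example, as is starting the exponential at `s_min`.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the (untruncated) normal."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def truncated_normal_pdf(s, mu, sigma, s_min, s_max):
    """Normal density restricted to [s_min, s_max] and renormalized."""
    if s < s_min or s > s_max:
        return 0.0
    mass = normal_cdf(s_max, mu, sigma) - normal_cdf(s_min, mu, sigma)
    z = (s - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf / mass


def truncated_exp_pdf(s, lam, s_min, s_max):
    """Exponential density starting at s_min, cut at s_max and renormalized."""
    if s < s_min or s > s_max:
        return 0.0
    mass = 1.0 - math.exp(-lam * (s_max - s_min))
    return lam * math.exp(-lam * (s - s_min)) / mass
```

Because the renormalization removes the leftovers entirely, the mean and variance of each truncated density differ from `mu`, `sigma**2` and the untruncated exponential moments, which is why separate expressions are needed for them.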
The underlying theoretical densities are not truncated, but the truncation is of a "technical" nature. The leftovers are accumulated at the two truncation points introducing discontinuities (Figure 3). For the normal, the leftovers can easily be calculated:
The cdfs corresponding to the above densities are:
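To illustrate the technically truncated model, the sketch below computes the leftover probability masses that accumulate at the two truncation points, together with a cdf for the normal that jumps at those points. The function and parameter names are assumptions for this example, and the exponential is assumed to start at `s_min`, so that it has a leftover only on the right.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the (untruncated) normal."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def normal_leftovers(mu, sigma, s_min, s_max):
    """Normal mass falling below s_min and above s_max."""
    left = normal_cdf(s_min, mu, sigma)
    right = 1.0 - normal_cdf(s_max, mu, sigma)
    return left, right


def exp_right_leftover(lam, s_min, s_max):
    """Mass of an exponential starting at s_min that falls above s_max."""
    return math.exp(-lam * (s_max - s_min))


def technically_truncated_normal_cdf(s, mu, sigma, s_min, s_max):
    """Cdf with point masses at the truncation points.

    The left leftover appears as a jump at s_min, the right one as a
    jump to 1 at s_max; in between, the cdf is the ordinary normal cdf.
    """
    if s < s_min:
        return 0.0
    if s >= s_max:
        return 1.0
    return normal_cdf(s, mu, sigma)
```

Unlike the theoretically truncated model, the density between the truncation points is not rescaled here; the missing mass is simply relocated to the boundaries.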
The equations in this section simplify somewhat when estimating their parameters from down-truncated ranked lists, as we will see in Section 4.1. We do not need to calculate . If, for some measure, the number of nonrelevant documents is required, it can simply be estimated as .
The expected values and variances of and , if needed, have to be calculated starting from Equations 24-27 and taking into account the contribution of the discontinuities. We do not give the formulas in this paper.
For both models the right truncation is optional. For , we get , leading to left-truncated models; this accommodates retrieval models with scoring support in . This is the maximum range that can be achieved with the current mixture, since the restriction of a finite is imposed by the use of the exponential.
When then and . If additionally , then and . Thus we can approximate the standard normal-exponential model well. Consequently, using a truncated model is a valid choice even when truncations are insignificant.
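A small numerical check can make this limiting behaviour tangible. Under the assumption that truncation renormalizes each density by the mass it has inside the score range, that mass tends to one as the truncation points move far away from the bulk of the distribution, so the truncated densities coincide with the untruncated ones. The names below are hypothetical and chosen for this sketch only.

```python
import math


def normal_mass_inside(mu, sigma, s_min, s_max):
    """Normal probability mass inside [s_min, s_max]."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
    return cdf(s_max) - cdf(s_min)


def exp_mass_inside(lam, s_min, s_max):
    """Mass inside [s_min, s_max] of an exponential starting at s_min."""
    return 1.0 - math.exp(-lam * (s_max - s_min))
```

With `mu=0`, `sigma=1`, calling `normal_mass_inside(0.0, 1.0, -10.0, 10.0)` gives a value indistinguishable from one in double precision, whereas tight bounds such as `-1.0` and `1.0` leave only about 68% of the mass inside.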
From a theoretical point of view, it may be difficult to imagine a process producing a truncated normal directly. Truncated normal distributions are usually the results of censoring, meaning that the out-truncated data do actually exist. In this view, the technically truncated model may correspond better to the IR reality. This is also in line with the theoretical arguments for the existence of a full normal distribution [2].
Concerning convexity, neither truncated model always violates such conditions. Consider the problem at the top score range . In the cases of , the problem is out-truncated in both models, while, in theory, it still always exists in the original model. The improvement so far is of a rather theoretical nature. In practice, we should be interested in what happens when . Our extended experiments (not reported in this paper) suggest that truncation helps estimation in producing higher numbers of convex fits within the observed score range. Consequently, the benefits are also practical.
These improvements make the original model more general, and it indeed produces better fits on our data. In fact, the truncated distributions should have been used in the past during parameter estimation even for the original normal-exponential model due to down-truncated rankings.