Categories and Subject Descriptors:
H.3 [Information Storage and Retrieval]: H.3.3 Information Search
and Retrieval; H.3.7 Digital Libraries
General Terms: Measurement, Experimentation, Theory
Keywords: Query length, Power-law, Zipf's law, Transaction log analysis
From analysing query-logs, previous research has suggested that the
distribution of query lengths can be approximated with the
(generalized) Zipf's law
or a power-law [5,7]. The law
appears to fit well to the largest length observations
(where depends on the domain) but not to the whole sample. For
example, data show that the length frequency for web queries peaks at
rather than at single keyword queries, suggesting a .
In the discrete case, the fraction of queries with length is given by
Others, without empirical justification, modeled query lengths with a
Poisson distribution by setting its mean to the average query length
[1].
Using a Poisson distribution, in a population of queries with average
length , the fraction of queries with length is
In this paper, we provide a model for query length. Beyond the theoretical interest, such a model has also practical applications in optimizing query cache size in search engines, in generating simulated queries for efficiency testing, and for effectiveness evaluation [e.g., 1]. Using several query data-sets, we confirm that the right tail of the length distribution is better approximated with a power-law rather than a Poisson, and introduce a truncated model. In the process, we estimate the slope for English queries. Finally, we speculatively explain why some data deviate from a power-law and what this may mean for IR.
We applied the methods of Clauset et al. [2] to automatically determine the scaling parameter, , using maximum likelihood methods, and the lower bound, , by minimizing the Kolmogorov-Smirnov statistic. These values are in the final two columns of Tab. 1 and the resulting distribution is depicted in Fig. 1.^{1}
The method gives reasonable power-law fits by excluding short lengths (). The scaling parameter is around 5 for query lengths. This gives a much steeper slope on log-log plots than the familiar Zipfian distribution of word frequencies. Fig. 1 also shows (with impulses) the Poisson distribution used by [1]; it matches well with the data at short lengths, but clearly lacks the tail.The power-law model matches better the data in wider ranges of lengths than the Poisson model. While the fits on MQ07 and TB06 are good, there are some deviations in the right tails of the TB05 and AOL data. The TB05 data may be better fitted with a similar distribution, namely a power-law with an exponential cutoff [2]: . Note that Eq. 1 can be obtained from the latter for . This distribution is a common alternative because it captures finite-size effects, e.g., earthquake magnitude data have the same lack of tail due to the finite amount of energy in Earth's crust. We can speculate that the data-set was probably created in a way that there was a maximum query length imposed. The imposed maximum can be indirect, e.g., a result of the maximum amount of effort and time users are willing to put into formulating queries. The "bump" in the AOL data seems of a technical nature possibly due to a shift from typed queries to cut-and-paste queries; not knowing whether and how these data are processed, we will not speculate any further.
Since the bulk of queries concentrates at short lengths where a power-law does not fit given the current indexing languages, it makes practical sense to use a mix of truncated Poisson/power-law to model query lengths. In such a practical model, lengths are Poisson-distributed for and power-law-distributed for . The choice of depends on the specific domain, i.e., a combination of features of the document collection, query/indexing language, and pattern of use of the system.
Deviations of real data from the power-law may be explained by either finite-size effects or insufficient specificity of indexing terms. Studies in other fields, e.g., economics, have shown that deviations of data from the power-law at hand are usually an indication of inefficiencies in the system that the data come from. Pennock et al. [4] studied power-law distributions of numbers of web links and found that deviations of data from the power-law per category of websites correlate to how much competition is present in that category. The better the power-law fit, the more competitive the category. For query lengths, we have shown how a simple process of re-indexing on longer text fragments can "fix" some deviations, a fact that may point to inefficiencies in single-word indexing schemes.
Acknowledgments This research was supported by the Netherlands Organization for Scientific Research (NWO, under project # 640.001.501).
This document was generated using the LaTeX2HTML translator Version 2008 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -reuse 0 -no_footnode -numbered_footnotes -white -split 0 -long_titles 10 -show_section_numbers -antialias -antialias_text -address 'avi (dot) arampatzis (at) gmail' -html_version 4.0,latin1,unicode,utf8 -local_icons -no_navigation kto726-arampatzis.tex
The translation was initiated by Avi Arampatzis on 2009-10-16