4. Experimental Setup

Our main concern is to evaluate different indexing schemes. Document classification, categorization, or routing environments provide a good test-bed for such evaluations: given a pre-classified corpus, the measurement of recall is straightforward, while for classical retrieval systems it requires expensive human relevance judgments. The experimental system is based on the vector space model, terms are weighted in a $\mathit{tf} \times \mathit{idf}$ fashion, and classifiers are constructed automatically using Rocchio's (original) relevance feedback method. We are aware that the choices of retrieval model, weighting scheme, and classifier construction method may have a fairly significant impact on performance, but since we are primarily interested in the comparative performance of the different indexing schemes, we settle for widely accepted and proven methods.

The same goes for the evaluation measures. Instead of making a binary decision to either assign a document to a class or not, we allow the system to return a traditional ranked list of documents for every class: most relevant first, least relevant last. Thus, evaluation is done with 11-point interpolated recall-precision and average precision on a dataset from the Reuters-21578 text categorization test collection. Below we discuss in more detail the methods used, the dataset, and the pre-processing applied to it.
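For concreteness, here is a minimal Python sketch of the 11-point interpolated precision measure for a single class; it only illustrates the measure and is not part of the experimental system (whose scripts, as Section 4.5 notes, were written in PERL). The names ranking (document ids sorted by decreasing score) and relevant (the set of ids relevant to the class) are ours.

\begin{verbatim}
def eleven_point_interpolated_precision(ranking, relevant):
    # Record precision and recall at every rank where a
    # relevant document is retrieved.
    precisions, recalls = [], []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / len(relevant))
    # Interpolated precision at recall level r is the maximum
    # precision observed at any recall >= r.
    return [max((p for p, r in zip(precisions, recalls)
                 if r >= level / 10), default=0.0)
            for level in range(11)]
\end{verbatim}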

4.1 The Vector Space Model

In the Vector Space Model [15] documents are represented as vectors of weights:

\begin{displaymath}
  D_i = \langle d_{i1}, d_{i2}, \ldots, d_{ik}, \ldots, d_{iN} \rangle
\end{displaymath} (1)

where $d_{ik}$ is the weight of the $k$th indexing term in the $i$th document, and $N$ is the number of indexing terms being used. Indexing terms are assumed to be stochastically independent. An indexing term may be a word, phrase, n-gram, or some other linguistic entity. The weight of a term for a particular document is a function of the number of times the term occurs in that document, the number of documents the term occurs in, and other information. Of the variety of weighting methods possible, we chose the Cornell ltc weighting commonly used in text retrieval [4]:

\begin{displaymath}
  d_{ik} = \frac{\mathit{tf}_{ik} \times \log(N_D/n_k)}
                {\sqrt{\sum_{j=1}^{N} \left(\mathit{tf}_{ij} \times \log(N_D/n_j)\right)^2}}
\end{displaymath} (2)

where $N_D$ is the number of documents, $n_k$ is the number of documents in which term $k$ appears, and $\mathit{tf}_{ik}$ is:

\begin{displaymath}
  \mathit{tf}_{ik} = \left\{
    \begin{array}{ll}
      0                & {\rm if~} f_{ik} = 0, \\
      \log(f_{ik}) + 1 & {\rm otherwise,}
    \end{array}
  \right.
\end{displaymath} (3)

where $f_{ik}$ is the number of occurrences of term $k$ in document $i$.
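As an illustration of equations (2) and (3), the ltc weighting can be sketched in a few lines of Python (the actual experimental scripts were written in PERL and are not reproduced here); each document is assumed to be a list of its indexing terms.

\begin{verbatim}
import math
from collections import Counter

def ltc_weights(docs):
    """docs: list of term lists; returns a {term: weight}
    dictionary per document, following equations (2) and (3)."""
    n_docs = len(docs)
    # n_k: number of documents in which term k appears
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        # equation (3): tf_ik = log(f_ik) + 1 for f_ik > 0
        tf = {t: math.log(f) + 1 for t, f in Counter(doc).items()}
        raw = {t: w * math.log(n_docs / df[t]) for t, w in tf.items()}
        # equation (2): cosine (length) normalization
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()}
                       if norm > 0 else raw)
    return vectors
\end{verbatim}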

Classification queries (or classifiers) are represented in the same manner; e.g., for a topic $t$,

\begin{displaymath}
  Q_t = \langle q_{t1}, q_{t2}, \ldots, q_{tk}, \ldots, q_{tN} \rangle
\end{displaymath} (4)

is the corresponding vector (using the same set of terms as for the document vectors).

To compute the similarity between a query $Q_t$ and a document $D_i$ we used the dot product formula

\begin{displaymath}
  S(D_i, Q_t) = D_i \cdot Q_t = \sum_{k=1}^{N} d_{ik} \, q_{tk}.
\end{displaymath} (5)
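With sparse vectors stored as dictionaries, equation (5) amounts to a sum over the terms shared by the two vectors; a minimal sketch:

\begin{verbatim}
def similarity(doc_vec, query_vec):
    """Dot product of two sparse {term: weight} vectors
    (equation 5); equal to the cosine measure when both
    vectors are length-normalized."""
    if len(doc_vec) > len(query_vec):   # iterate the shorter one
        doc_vec, query_vec = query_vec, doc_vec
    return sum(w * query_vec.get(t, 0.0) for t, w in doc_vec.items())
\end{verbatim}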

4.2 Classifier Construction

Classifiers were constructed automatically by applying Rocchio's relevance feedback method [13] to a pre-classified set of documents (the training set). Rocchio's algorithm is well known in the IR community, where it is traditionally used for relevance feedback. Classifiers based on Rocchio have proven to be quite effective in filtering [16] and classification [12,8] tasks.

When training documents are ranked for a topic, an ideal classifier would rank all relevant documents above the non-relevant ones. However, such an ideal classifier might simply not exist; we therefore settle for a classifier that maximizes the difference between the average score of relevant documents and the average score of non-relevant ones.

If the similarity between topics and documents is calculated by equation (5), and document vectors are weighted and length-normalized such that $|D_i| = 1, \forall i$ (so that the dot product equals the cosine measure), then Rocchio specifies that the optimal classification query $Q_t$ for topic $t$ should have term $k$ weighted as

\begin{displaymath}
  q_{tk} = \frac{1}{|R_t|} \sum_{i \in R_t} d_{ik}
         - \frac{1}{|N_t|} \sum_{i \in N_t} d_{ik},
\end{displaymath} (6)

where $R_t$ and $N_t$ are the sets of training documents relevant and non-relevant to $t$, respectively, and $|\cdot|$ denotes the number of elements in a set. After training, negative weights are usually set to zero.
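A brief Python sketch of this construction (illustrative only; the dictionary layout and names such as doc_vectors are our assumptions, not the original implementation):

\begin{verbatim}
def rocchio_classifier(doc_vectors, relevant_ids):
    """Build a classification query per equation (6).
    doc_vectors: {doc_id: {term: weight}} of length-normalized
    training documents; relevant_ids: ids relevant to the topic.
    Negative weights are zeroed after training."""
    non_relevant_ids = set(doc_vectors) - set(relevant_ids)
    q = {}
    for ids, sign in ((relevant_ids, 1.0), (non_relevant_ids, -1.0)):
        if not ids:
            continue
        for doc_id in ids:
            for term, w in doc_vectors[doc_id].items():
                q[term] = q.get(term, 0.0) + sign * w / len(ids)
    return {t: w for t, w in q.items() if w > 0}
\end{verbatim}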

4.3 Term Selection

Term selection (also called feature selection or feature reduction in classification) is an important task. Training data usually contain a very large number of distinct terms; it is not uncommon to end up with thousands of terms in the indexing vocabulary. Applying feature reduction techniques to text classification tasks has been found not to impair classification accuracy, even for reductions by up to a factor of ten [20,12], and it is also economical in time and space.

How many terms are sufficient for a topic representation is rather topic dependent and should be determined experimentally. However, assuming that relevant documents are likely to resemble each other (even in their lengths) more than non-relevant documents do, a sensible number of terms would be proportional to the expected average number of unique terms in relevant documents:

\begin{displaymath}
  N = c \times \mbox{average number of unique terms in relevant documents}.
\end{displaymath} (7)

The constant $c$ is topic dependent.

The average number of unique terms in relevant documents was calculated on the training data, and the value of the constant $c$ was estimated experimentally for our dataset: after a few tuning experiments, we set $c = 0.4$. For every classifier, only the terms with the top $N$ (equation 7) Rocchio weights were selected and the rest were removed. This resulted in a great reduction in the number of terms without a significant drop in performance. It should be noted that $c$ was tuned for the Sw experiment and assumed to hold for all other experiments as well, so the Sw experiment may be favored over the rest.
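The selection step itself is straightforward; a sketch, assuming the Rocchio classifier of Section 4.2 is a {term: weight} dictionary and relevant_docs are the term lists of the relevant training documents (both names are ours):

\begin{verbatim}
def select_terms(classifier, relevant_docs, c=0.4):
    """Keep only the terms with the top N Rocchio weights,
    where N follows equation (7); c = 0.4 as tuned above."""
    avg_unique = (sum(len(set(d)) for d in relevant_docs)
                  / len(relevant_docs))
    n = int(c * avg_unique)
    top = sorted(classifier.items(),
                 key=lambda kv: kv[1], reverse=True)[:n]
    return dict(top)
\end{verbatim}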

   
4.4 The Dataset

We evaluated on a dataset from the Reuters-21578 (distribution 1.0) text categorization test collection, a resource freely available for research in Information Retrieval, Machine Learning, and other corpus-based studies.

We produced the Modified Apte (ModApte) split (training set: 9,603 documents; test set: 3,299 documents; unused: 8,676 documents). The ModApte split is a subset of the Reuters documents about economic topics, such as income, gold, and money-supply. [7] discuss some examples of the (not always obvious) policies used by the human indexers in deciding whether a document belongs to a particular topic category.

Because Rocchio's algorithm requires each training document to have at least one topic, we further screened out the training documents with no topic category. Of course, we did not remove any of the 3,299 test documents, since that would have made our results incomparable with other studies. Documents can be assigned more than one topic, i.e., this is a multi-classification task. We used only the topics which have at least one relevant training document and at least one relevant test document (90 topics in total).

Since there is a large variation in the number of relevant training documents per topic, we evaluate separately on small and large topics. Topics with ten or fewer training documents (32 in total) are considered small; the rest (58 in total) are considered large.

    
4.5 Pre-processing

In order to obtain the appropriate indexing terms from the dataset for every experiment, we applied some pre-processing, performed in six steps:

1.
Tokenization (script written in PERL): Detection of sentence boundaries followed by division of sentences into words.

2.
Part-of-speech tagging: Brill's rule-based tagger [3] was employed to obtain POS information for the contents of the dataset. The tagger comes with a lexicon derived from both the Penn Treebank tagging of the Wall Street Journal (WSJ) and the Brown Corpus. Conveniently, the WSJ articles are, like the Reuters documents, about economic topics, which increased the reliability of tagging the Reuters corpus.

3.
Shallow parsing and term extraction (script written in PERL): Syntactic pattern matching on the POS tags to extract noun phrases for the Lap and Lbt experiments. For the Lbt experiment, the extracted noun phrases were further syntactically normalized and unnested, while for Lap they were simply broken down into adjacent word pairs. For the rest of the experiments, the corresponding terms were extracted based on the POS tags.

4.
POS stop-listing (only for w, Sw, and Lw): It is well known that the use of a stop-list improves the quality of an indexing set. Traditionally, a stop-list is constructed by taking a predetermined list of function words (articles, prepositions, etc.). Since our approach is based on a POS tagger, we instead used a POS stop-list to remove all words belonging to the following categories: determiners (e.g. a, the, all), prepositions and subordinating conjunctions (e.g. in, to, of), pronouns (e.g. I, yours), the infinitive marker to, modal verbs (e.g. would, must), the verbs to be and to have, and coordinating conjunctions (e.g. and).

5.
Disambiguation of the NP structure (only for Lbt; PERL script): Noun phrases with more than two words can be structurally ambiguous. To disambiguate the modification structure we applied statistical methods. First, we collected frequency information from the corpus for all 2-word noun phrases. Then all 3-word noun phrases were disambiguated by assigning to them the most probable structure based on the frequencies of the 2-word noun phrases. Gradually, this was applied up to n-word noun phrases, based on the frequencies of all previously disambiguated k-word noun phrases (k < n). Where not enough frequency information was available, left-dependence was assigned, since it is the most probable modification structure in the English noun phrase. A sketch of the base 3-word case is given after this list.

6.
Morphological normalization: Lemmatization was performed according to the POS information by using WORDNET's v1.6 [11] morphology library functions.

For Sw, words were stemmed using the Porter stemmer of the Lingua::Stem (version 0.30) extension to PERL.
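The NP disambiguation of step 5 can be illustrated for the base 3-word case as follows; this is a Python sketch of the idea, not the original PERL script, and the name pair_freq (mapping 2-word sequences to their corpus frequencies) is ours. Longer phrases are handled analogously, comparing the frequencies of previously disambiguated sub-phrases.

\begin{verbatim}
def bracket_np3(w1, w2, w3, pair_freq):
    # Bracket the more frequent adjacent pair; fall back to
    # left-dependence (the most probable structure in English
    # NPs) on ties or missing counts.
    left = pair_freq.get((w1, w2), 0)
    right = pair_freq.get((w2, w3), 0)
    if right > left:
        return (w1, (w2, w3))    # right-branching: [w1 [w2 w3]]
    return ((w1, w2), w3)        # left-branching:  [[w1 w2] w3]
\end{verbatim}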

