We propose a heuristic for selecting the documents displayed to a user
that improves both learning efficiency and retrieval efficiency in
interactive document retrieval. The heuristic is based on the extreme
bias between positive and negative examples.
We conducted experiments to evaluate the effectiveness of the proposed
heuristic for active learning, using a set of articles that is widely used
in the Text Retrieval Conference (TREC). For comparison with our approach,
two information retrieval methods were adopted. The first is conventional
Rocchio-based relevance feedback. The second is the conventional selection
rule for SVM-based active learning. The experiments confirmed that the
proposed system outperformed both.
In interactive document retrieval, displayed documents are ordered by
calculating a degree of relevance. In SVM-based interactive document
retrieval, the degree of relevance is evaluated as the signed distance
from the optimal hyperplane. It has not been made clear how this signed
distance relates to the vector space model used in the Rocchio-based
method. We show that SVM-based retrieval is associated with the
conventional Rocchio-based method through a comparative analysis of
their relevance evaluation.
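
As a minimal sketch of ranking by signed distance, assuming a weight vector
`w` and bias `b` have already been obtained from a trained linear SVM (the
function names, toy vectors, and values below are hypothetical and not
taken from the experiments):

```python
import numpy as np

def signed_distance(docs, w, b=0.0):
    """Signed distance of each row of `docs` from the hyperplane w.x + b = 0."""
    return (docs @ w + b) / np.linalg.norm(w)

def rank_by_relevance(docs, w, b=0.0):
    """Indices of documents, largest signed distance (most relevant) first."""
    return np.argsort(-signed_distance(docs, w, b), kind="stable")

# Toy document vectors (rows are documents); illustrative values only.
docs = np.array([[3.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
w = np.array([1.0, -1.0])  # hypothetical weight vector from a trained SVM
b = 0.0                    # hypothetical bias term

order = rank_by_relevance(docs, w, b)
print(order)
```

Documents on the positive side of the hyperplane and far from it are shown
first; documents on the negative side are shown last.
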
This analysis shows that the equation for the weight vector in SVM-based
relevance feedback is equivalent to the update equation for the query
vector in the Rocchio-based method. The degree of relevance in the
SVM-based method evaluates the product of the norm of the target document
vector and its cosine similarity with the weight vector. In contrast, the
degree of relevance in the Rocchio-based method evaluates only the cosine
similarity with the query vector.
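
The correspondence can be written out as follows, assuming the standard
textbook forms of both methods. The symbols are introduced here for
illustration and are not taken from the text: \lambda_i are the SVM
multipliers, y_i the labels, D_r and D_n the sets of documents judged
relevant and non-relevant, and \alpha, \beta, \gamma the Rocchio
coefficients.

```latex
% SVM weight vector: positive examples added, negative examples subtracted
\mathbf{w} = \sum_i \lambda_i y_i \mathbf{x}_i
           = \sum_{i:\,y_i=+1} \lambda_i \mathbf{x}_i
           - \sum_{i:\,y_i=-1} \lambda_i \mathbf{x}_i,
\qquad \lambda_i \ge 0

% Rocchio query update: same structure with uniform weights per class
\mathbf{q}' = \alpha\,\mathbf{q}
            + \frac{\beta}{|D_r|} \sum_{\mathbf{d} \in D_r} \mathbf{d}
            - \frac{\gamma}{|D_n|} \sum_{\mathbf{d} \in D_n} \mathbf{d}

% Degree of relevance of a document x under each method
f_{\mathrm{SVM}}(\mathbf{x})
  = \frac{\mathbf{w} \cdot \mathbf{x} + b}{\|\mathbf{w}\|}
  = \|\mathbf{x}\| \cos\theta(\mathbf{w}, \mathbf{x}) + \frac{b}{\|\mathbf{w}\|}
\qquad
f_{\mathrm{Rocchio}}(\mathbf{x}) = \cos\theta(\mathbf{q}', \mathbf{x})
```

Under these assumptions the term b/\|\mathbf{w}\| is the same for every
document, so the SVM ranking is determined by \|\mathbf{x}\|\cos\theta,
which differs from the Rocchio score only by the document norm.
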
Based on this finding, we propose a cosine kernel, equivalent to cosine
similarity, that is suitable for SVM-based interactive document retrieval.
We confirmed the effectiveness of the method using the proposed cosine
kernel by comparing it experimentally with conventional relevance feedback
for the Boolean, term frequency (TF), and term frequency-inverse document
frequency (TFIDF) representations of document vectors. The experimental
results on a Text Retrieval Conference data set show that the cosine
kernel is effective for all document representations, especially the TF
representation.
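
One plausible way to realize such a kernel is sketched below, assuming it
is defined as the inner product of the two document vectors divided by the
product of their norms. The exact definition used in the experiments is
not given here, and the function name, the `eps` guard, and the toy data
are hypothetical.

```python
import numpy as np

def cosine_kernel(X, Y, eps=1e-12):
    """Gram matrix K[i, j] = X[i].Y[j] / (|X[i]| * |Y[j]|)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_norm = np.linalg.norm(X, axis=1, keepdims=True)  # shape (n_x, 1)
    y_norm = np.linalg.norm(Y, axis=1, keepdims=True)  # shape (n_y, 1)
    # eps guards against division by zero for all-zero document vectors.
    return (X @ Y.T) / np.maximum(x_norm * y_norm.T, eps)

# Toy TF vectors: the first two documents differ only in length.
docs = np.array([[2.0, 0.0], [4.0, 0.0], [0.0, 3.0]])
K = cosine_kernel(docs, docs)
print(K)
```

Because the kernel removes the document norm, the first two toy documents
are treated as identical even though one is twice as long. A function of
this shape could, for example, be passed as a callable kernel to an SVM
implementation that accepts one, such as scikit-learn's `SVC`.
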