Query Improvement In Information Retrieval Using Genetic Algorithms - A Report on the Experiments of the TREC Project

Jing-Jye Yang, Robert R. Korfhage (Department of Information Science) and Edie Rasmussen (Department of Library Science), School of Library and Information Science, University of Pittsburgh

Abstract

We have been developing an adaptive method using genetic algorithms to modify user queries, based on relevance judgments. This algorithm was adapted for the Text REtrieval Conference (TREC). The method is shown to be applicable to large text collections, with additional relevant documents being presented to users as the genetic modification proceeds. The algorithm also exhibits some interesting phenomena, such as parallel searching. Further studies are planned to adjust the system parameters to improve its effectiveness.

1. Introduction

Information retrieval can be viewed as searching a high-dimensional document space to bring relevant documents to users. It is known, however, that this is not an easy task. Because of the complexity of document writing styles and the difficulty users have in expressing their information requests, the retrieval results often frustrate users. Returning to the system with a modified query becomes unavoidable if a user wants to improve the retrieval results. However, most systems provide little or no guidance to the user for modifying the original query, adding to the frustration. Using relevance feedback to help users solve this problem has long been viewed as a promising approach. Although the study of relevance feedback began more than twenty years ago, it has attracted increasing research attention in recent years. With advances in computer hardware and software, the feedback process, which requires more computing time and a friendlier user interface than traditional methods, will come to dominate the information retrieval area. However, feedback techniques also need to be improved.
They should include more intelligent mechanisms which embed the ability to adapt to a changing environment and handle the feedback process automatically. We have been developing an adaptive method using genetic algorithms to modify user queries automatically. Our preliminary experiments, based on several simple data collections, showed promising results (Yang and Korfhage, 1992a, 1992b). This algorithm was applied to the TREC project. This paper reports the experiments and some of the results.

2. The Model

Our system is based on a vector space model. Within the model, both documents and queries are represented as vectors. A document vector D_i with n keywords can be represented as:

D_i = (t_i1, t_i2, ..., t_in)

where t_ij (j = 1, ..., n) is the assigned value of the jth keyword. Either weighted or unweighted document keywords can be used. For unweighted document keywords, t_ij is either 0 or 1. For weighted document keywords, the value of t_ij is a real number in the range [0, 1]. In this model a query Q with m keywords is also represented in vector format as:

Q = (qt_1, qt_2, ..., qt_m)

where qt_k (k = 1, ..., m) is the kth keyword in the query, the keywords being extracted from the user's natural-language request. However, no weights are assigned to query terms in the initial stage; query term weights are assigned by the system during the processing to be described later. In general, under vector space models, document retrieval is based on a similarity measurement between the query and the documents. This means that documents with high similarity to the query are judged more relevant to the query and should be retrieved first. In our model, the similarity measure is based on a metric, the Lp metric, which measures the distance between document vectors and query vectors. This means that the shorter the distance between a document and the query, the higher the similarity between them.
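As an illustrative sketch only (the function and variable names below are ours, not from the system described in this paper), the distance-based similarity just described, with p = 2 and binary document term weights, can be computed as:

```python
# Sketch of an Lp distance between a document and a query, computed
# only over terms present in the query. Document terms are unweighted
# (binary); query weights lie in [0, 1].
def lp_distance(doc_terms, query_weights, p=2):
    """doc_terms: set of keywords present in the document.
    query_weights: dict mapping each query term to its weight."""
    total = 0.0
    for term, qw in query_weights.items():
        t = 1.0 if term in doc_terms else 0.0  # binary document weight
        total += abs(t - qw) ** p
    return total ** (1.0 / p)

# A document sharing more highly weighted query terms lies closer to the query.
d_near = lp_distance({"oil", "spill"}, {"oil": 0.9, "spill": 0.8})
d_far = lp_distance({"oil"}, {"oil": 0.9, "spill": 0.8})
```

In the actual system a document with no terms in common with the query is never considered at all; the sketch above omits that filtering step.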
The distance value between a document D_i and the query Q is calculated, using the Lp metric, by the formula:

DIS(D_i, Q) = || D_i - Q || = ( Σ_k | t_ik - qt_k |^p )^(1/p)

where k (k = 1, ..., m) ranges over the set of query terms. Theoretically, the value of p can be in the range [1, ∞). In the current experiments, p is set to 2; that is, the distance measure is the Euclidean distance. Note that only terms which are present in the query are used in the formula. This means that if a term is present in a document but not in the query, the term is assumed to be of no concern to, or not known by, the user. In other words, a document will be considered in the retrieval process only if it has at least one term in common with the query. However, variations exist in the actual implementation. The keyterms in documents are unweighted in the TREC databases we have used. That is, the elements in the document vectors are binary, 0 or 1. This choice was made for two reasons. First, for a large database of this kind, obtaining document term weights using general term weighting methods, such as the inverse document frequency (e.g., Salton and Buckley, 1988), requires counting term frequencies over all of the documents in the database, which is resource intensive for a database of this size. Second, previous experiments using the Cranfield database indicate that our system performs equally well with either weighted or unweighted document terms. Each query vector is generated from a topic provided in the TREC project. In general, a topic includes several descriptive items: the sequential number of the topic, domain, title, description, narrative, concepts, factors, nationality, etc. The terms of a query vector are derived from the title and the concepts of the topic. In some cases the nationality is added, if it is important to meet the requirements of the narrative. The query vector term weights are assigned by the system at the beginning of each run.

3.
The Algorithm

The method we developed is based on the relevance feedback concept. However, the algorithm underlying the process is totally different from those used in other relevance feedback studies. Our system is also based on genetic algorithms, which are known for their efficiency and effectiveness in searching large problem spaces to find optimal solutions (e.g., Holland, 1975; Goldberg, 1989). The basic concept behind genetic algorithms is that there exists a population of individual variants (the individuals can be of various types, depending on the actual problem to be solved) in an environment, each trying to survive by exchanging information and competing with the others. Genetic algorithms use methods analogous to natural evolution mechanisms to generate those individuals that provide the optimal or near-optimal solution to the problem. The adaptability of genetic algorithms is due to the fact that individual selection and manipulation are adaptive to the existing environment. The environment evaluates the fitness of the individuals, and the results form the feedback from the environment to the system where the genetic process is applied. The efficiency of the algorithm is due to parallel searching, with a set of individuals exploring the problem space simultaneously. Information about better structures is stored, exchanged, and transmitted from generation to generation, and combined to produce individuals having all or most of the better features of their predecessors. A few researchers have applied genetic algorithms to information retrieval. Gordon (1988) used a genetic algorithm to modify document descriptions to facilitate the retrieval of relevant documents for a group of users who are interested in similar topics. Frieder and Siegelmann (1991) developed a genetic algorithm to obtain an optimal allocation of documents onto nodes in distributed multiprocessors.
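The generic evolutionary cycle just described can be sketched as follows (a toy fitness function and parameter values of our own choosing, not the exact operators of the system described in this paper):

```python
import random

# One generation of a generic genetic algorithm: evaluate, select,
# reproduce, mutate, cross over. The fitness function here is a toy
# (sum of genes) purely for illustration.
def ga_step(population, fitness, mutation_rate=0.1):
    scores = [fitness(ind) for ind in population]
    avg = sum(scores) / len(scores)
    # Selection: keep individuals at or above the average fitness.
    survivors = [ind for ind, s in zip(population, scores) if s >= avg]
    # Reproduction: refill the population by copying survivors.
    offspring = [list(random.choice(survivors)) for _ in population]
    # Mutation: occasionally assign a gene a new random value in [0, 1).
    for ind in offspring:
        if random.random() < mutation_rate:
            ind[random.randrange(len(ind))] = random.random()
    # Two-point crossover: exchange a middle segment between pairs.
    for a, b in zip(offspring[::2], offspring[1::2]):
        i, j = sorted(random.sample(range(len(a)), 2))
        a[i:j], b[i:j] = b[i:j], a[i:j]
    return offspring

# Ten individuals of five genes each, evolved for three generations.
population = [[random.random() for _ in range(5)] for _ in range(10)]
for _ in range(3):
    population = ga_step(population, fitness=sum)
```

In the retrieval setting described below, each individual is a vector of query term weights and the fitness comes from relevance judgments rather than a closed-form function.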
In vector space models document retrieval can be seen as searching a very high-dimensional document space to find the area(s) where most of the documents relevant to a user query are located. We have been developing a genetic algorithm for document retrieval that works through the modification of query term weights. The steps of the genetic query modification process can be described as follows.

(1) Generate query variants: As described above, genetic algorithms apply to situations where there is a group of individuals, so a set of query individuals is needed; they are generated in this step. Each query individual vector has the same elements as the original (user's) query. Each element within a query vector represents a weight corresponding to the related term in the original query. The weights are assigned randomly using a random number generator, with the values of the weights scaled to the range [0, 1). The number of query variants is set at the beginning of the experiments and remains fixed for the rest of each run.

(2) Individual query performance evaluation: This process evaluates the performance of each individual query variant. First, the distances between query variants and documents are calculated, using the formula given in Section 2. Second, for each query individual, those documents whose distance values to this query individual are less than a given threshold are viewed as relevant to the user request and are retrieved. Finally, the retrieval performance (P) of each query individual is assessed using the formula:

P = A - B - C

where A is the number of retrieved documents which are judged relevant to the user request; B is the number of retrieved documents which are not relevant; and C is the number of relevant documents that are not retrieved by the query individual. The performance values for the query variants are then used in the genetic modification.
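As a minimal sketch (the helper name and set-based inputs are our own), the performance formula can be computed as:

```python
# Sketch of the performance measure P = A - B - C for one query variant.
# A: retrieved and relevant; B: retrieved but not relevant;
# C: relevant but not retrieved. Inputs are sets of document identifiers.
def performance(retrieved, relevant):
    a = len(retrieved & relevant)
    b = len(retrieved - relevant)
    c = len(relevant - retrieved)
    return a - b - c

# Example: 3 hits, 2 false drops, 1 miss -> P = 3 - 2 - 1 = 0.
p = performance({"d1", "d2", "d3", "d4", "d5"}, {"d1", "d2", "d3", "d6"})
```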
(Note: the P value can also take the form P = A - B, in case the C value is unknown, as is normal in actual online retrieval. Note also that P is an internal value used by the system and is unknown to the user.)

(3) The genetic process: After the evaluation process, genetic operators are used to modify the structures of the query variants, with the aim of improving the retrieval effectiveness of the variants.

(a) Selection: In this step those query individuals whose performances are greater than the average performance of all the query variants are selected and retained for further processing. The other query individuals are discarded.

(b) Reproduction: In this process the survivors from the previous step are reproduced. In our algorithm, the reproduction algorithm developed by Baker (1987) is used. The number of offspring for each individual depends on the ratio of its performance to the total performance of all the survivors, and on the population size. Therefore, the surviving individual with the highest performance will have the most offspring. Also in this step, the query variants are rearranged in a random sequence for later processing. We call the set of query variants after reproduction the semi-new generation.

(c) Mutation: In the mutation step, some individuals are selected at random from the semi-new generation, and some of their genes (i.e., the term weights) are selected and assigned new values, also in the range [0, 1). The number of individuals to be selected depends on the mutation ratio chosen for the experiment. The selection of genes and the new value assignments are also done randomly.

(d) Crossover: The crossover process mates pairs of query individuals in the semi-new generation and exchanges parts of their term weights. Here, the two-point crossover method is used. For each pair, two positions in the term vectors are selected randomly, and all term weights between the two positions are exchanged between the two individuals.
Thus, two new individuals are born. However, it may be the case that not all individuals are chosen to mate with others; the crossover ratio controls the number of mates. After the crossover operation, a totally new generation has been created, and the process goes back to step 2 and continues until a preset condition is satisfied: either a fixed number of generations has been processed or no more relevant documents are retrieved.

The genetic retrieval algorithm is distinguished from other relevance feedback research in several ways: (1) Several query individual variants exist and explore the document space simultaneously. (2) To modify the term weights, it is not necessary to calculate and analyze the document term weights, at least in the current study. Query term weight modification is done by the genetic mechanisms, through the interchange of term weights between query individuals and the introduction of new weights by the mutation operation. (3) Only term weights are modified; no new terms are introduced and no terms are deleted. The importance of a term is shown by the value of its weight.

4. Preprocessing of Documents and Queries

As in most information retrieval systems, the documents and topics in the TREC project need to be preprocessed before they can be used by our algorithm.

(1) Document processing: Document retrieval in our system is based on keyword matching. In the current system, a keyword(1) is a single word. For the documents, to facilitate retrieval we need to create inverted files. The processes for creating the inverted files can be described as follows.

(a) Keyword extraction: Keywords are extracted for each document. Stopwords (2,529 words in total on the list) are deleted, and keywords are stemmed using the Porter stemming algorithm as implemented by Fox (Frakes, 1992).
The document number from which a keyword is extracted is also stored with that keyword, for it will be used to create the inverted files and later to facilitate retrieval of the full text of the document. Also, an "address file" is generated which indicates, for each document, the offset of its location within the original file in which the document is stored.

(b) Creation of inverted files: An inverted file is created for each database to facilitate the retrieval process. The file is organized as a keyword followed by a list of the document numbers from which the keyword was extracted. Three levels of index files are generated for the inverted file to reduce the search time. Each index file includes keywords, their offsets in the inverted file, and the offset in the immediately higher-level file (for the second and third levels).

(2) Query processing: Queries are generated from the topics provided in the TREC project. For each topic a single query vector (the original query) is generated, which consists of a list of keywords from the title and the concepts of the topic. For some queries information about the nationality is also included if it is necessary to satisfy the requests in the narrative descriptions of the related topics. The stemming algorithm is also applied to the terms in the query. For the training queries we did not add any terms from the other descriptive items in the topics. But for some ad hoc queries, we added several keywords from the narrative because we thought that those keywords would be useful for identifying relevant documents. The routing queries are the query individuals from the last generation of the training queries. Ten query individual vectors were generated for each original query. For the training queries the initial query term weights for the query individuals were assigned randomly, and then modified by the genetic algorithm.
The final query individual vectors from the training set were used as routing queries with no weights being modified.

(1) Keyword, keyterm and term will be used interchangeably in this paper.

The assignment of term weights to the ad hoc queries is different. Only one query vector (called query individual one) is generated for each ad hoc query. The term weights in query individual one were assigned either by using the weights of the same terms in the final generation of the training topics, if they existed, or by the researchers, by referring to the weights of related terms in the final generation of the training topics or to their importance for the topic. Moreover, in the ad hoc queries, a term weight in query individual one could be assigned different values depending on which database the query is run against. Based on query individual one, another nine query individuals were generated, with the weight of each term in the nine individuals being normally distributed around the corresponding term weight in query individual one.

One interesting question in query processing is how to handle the situation where a concept with a NOT requirement is present in a user's request (i.e., in the TREC topics). This is hard to deal with in some retrieval systems, and in the TREC topics there are several cases which include NOT in the concepts and the narrative items. Our solution to this problem is to assign a negative weight to a keyword if it is governed by the NOT in a concept or narrative item. Since our system is based on a distance measure, a negative assignment causes the documents which include this keyword to lie at a longer distance from the query than those without the keyword.

5. System Configuration

Some features of our system in the TREC project are described in this section.
(1) Window size for document retrieval: In much feedback retrieval research, there is a fixed window size which determines the number of documents to be retrieved. The value was usually set between 5 and 20 (e.g., Salton, 1971). Aalbersberg (1992) used a window size of one in the feedback retrieval process and showed that the precision values for four standard databases are, on average, higher than those achieved with a size greater than one. However, Aalbersberg also suggested that variable relevance feedback (Ide, 1971) is worth implementing in term-weighting IR systems. In our original experiments on small databases a threshold value was generated for each query at the beginning of an experiment, as the criterion for determining the retrieval window. If the distance between a document and a query individual was less than the threshold value, the document was viewed as relevant to the query and was retrieved. Thus we had a variable window size. However, in dealing with the large databases we added another factor to decide the window size. As in the previous method, a threshold value is generated first as each original query is read into the system. Documents whose distance is less than the threshold are regarded as relevant to the query and are printed for evaluation. However, if more than forty documents satisfy the condition, only the top forty are retrieved for evaluation; if there is a tie, the selection is random among those tied. This number seems reasonable given the time constraints and focus of the study.

(2) User involvement and relevance judgment: Since our system uses relevance feedback (on the training queries and the ad hoc queries), users (evaluators) are needed to evaluate the retrieved documents. The evaluators are our researchers in the TREC project. The document evaluation process runs as follows.
At each iteration, for each query individual the retrieved documents are printed with both their full text (or abstracts or summaries, if available) and their document numbers. An evaluator examines the retrieved documents based on his/her reading of the text (or abstract or summary) and understanding of the topic, and marks the document numbers which are judged relevant to the topic. The results of the judgments from the evaluators are fed back to the system. The system then applies the genetic modification process (as described in Section 3) based on the feedback information.

(3) The feedback process: The feedback process can be described as follows. The system uses the relevance judgments to calculate the performance (P) of each query individual based on the function described in Section 3. However, the parameters in the formula were changed as follows. First, since there are no known relevant documents for each query, the value C is dropped. Second, in our initial trials on a few training topics we found that for these very large databases many nonrelevant documents are retrieved in the first generation. This would cause individuals which retrieved a few relevant and many nonrelevant documents to get lower P values than those which retrieved no documents or only a few nonrelevant documents. So instead of using equal weights for the values of A and B, we gave more weight to A. The formula became: P = 10*A - B. This increases the chance of survival for those individuals which retrieve a few relevant documents. The weight 10 was chosen for the parameter A because in several initial trials (with the weight equal to 1, 5 and 10), 10 brought more relevant documents into the modified generation than the other weights. However, values higher than 10 were not used, to keep the query variants from converging too quickly to a final query that might be a local optimum. After computing the performance for each variant, genetic operations are applied to generate a set of new query individuals.
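The adjusted fitness P = 10*A - B can be sketched as follows (the helper name and example document sets are ours, for illustration only):

```python
# Sketch of the adjusted fitness P = 10*A - B: relevant retrievals (A)
# are weighted ten times as heavily as nonrelevant ones (B).
def trec_performance(retrieved, relevant, a_weight=10):
    a = len(retrieved & relevant)   # retrieved and judged relevant
    b = len(retrieved - relevant)   # retrieved but nonrelevant
    return a_weight * a - b

# One relevant hit outweighs several nonrelevant retrievals:
# 10*1 - 3 = 7, versus 0 for a variant that retrieves nothing.
p = trec_performance({"d1", "d2", "d3", "d4"}, {"d1"})
```

Under the unweighted formula P = A - B this variant would have scored 1 - 3 = -2, below the empty-handed variant's 0, which is exactly the effect the heavier A weight is meant to avoid.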
Those new query individuals are then used by the system to retrieve documents. However, documents which have been retrieved in previous generations are not presented in the new generation, even if they satisfy the selection condition. This strategy is similar to the residual collection method used by Chang et al. (1971). The feedback process continues until either no more relevant documents are retrieved for any query variant or the user (the evaluator) decides to stop the process, mainly because of time constraints.

6. Statistical Data

(1) Computing time: Time is given for two types of processing.

(a) Time for document processing: Document processing time includes keyword extraction, sorting and merging, and building the inverted files. Two systems were used, a SUN-670 and SUN SPARC/IPC(s), depending on availability. Table 1 gives the time used for several stages of document processing.

(b) Time for document retrieval per generation: The time required to retrieve documents for each iteration (generation) depends on several factors: the number of terms in a query, the generation, and the computing system used. Table 2 gives the average retrieval time for three generations, based on topics 1-50 and the first dataset (disk one).

(2) Storage space for data structures: Several types of data structure were created to facilitate retrieval: inverted files, indexed files, and address files, which consist of document numbers and their locations in the databases. The storage space used for these files (disk one only) is shown in Table 3.

(3) Machine capability: The capabilities of our systems are: SUN-670: 32 Megabytes RAM and 40 MHz CPU clock rate; SUN SPARC/IPC: 24 Megabytes RAM and 25 MHz CPU clock rate.

(4) Manpower: In total four people participated in the project, two Ph.D. students and two faculty members. One Ph.D. student worked full time on this project and the other part time; one faculty member worked half time and the other part time.

Table 1.
Time for document processing (Unit: minutes)

                         DOE    AP           ZIFF         WSJ          FR
Dataset                  1st    1st    2nd   1st    2nd   1st    2nd   1st   2nd
Keyword extraction       129    148    169   130    43    182    125   19    16
Sorting and merging      293    281    NA    248    72    404    398   18    11
Create inverted files    29     NA     27    28     11    NA     34    2     3
* NA: Not Available

Table 2. Average retrieval time for topics 1-50 on the first dataset (Unit: minutes)

Generation 0   Generation 1   Generation 2
    3:56           2:19           2:05

Table 3. Total amount of storage for inverted and indexed files -- disk one only (Unit: Megabytes)

                 DOE     AP      ZIFF    WSJ
Inverted files   162.3   199.8   143.7   223.4
Indexed files    3.0     2.2     2.4     2.1
Address files    4.3     1.7     1.7     2.6

7. Results

This section describes several results of using the genetic algorithm on the TREC document collection. The examples provided are from training queries (topics 1 to 50) on the DOE database.

(1) Query convergence: In large document collections like the TREC databases, the genetic algorithm caused the query variants to converge within 3 to 6 generations in most cases. As an example, Table 4 shows the term weights of the query individuals in the first generation, 0, and the last generation, 5, for topic 3. For most of the query terms the weights of the query individuals converged to a single value in the final generation. Although a few variations existed, they were caused by the mutation operation. Table 5 shows a similar situation for topic 12, where the final generation is 4. An interesting phenomenon is how the query term weights changed. The genetic operators select the query individuals which have higher performance values than the average performance of all the individuals and exchange parts of their term weights using mutation and crossover. Although the two operations are random, the results are interesting.
The experiments show that from generation to generation the average weight of each term in the query individuals gradually moves toward the value in the final converged generation, although small variations may exist. As an example, Table 6 shows the changes in the average term weights for topic 3, from generation 0 to generation 5. The values at generation 5 are almost identical to those in Table 4.

(2) Effects of query term weight modification: The goal of query convergence using the genetic algorithm is to find the query individual with the highest performance, that is, the one retrieving more relevant documents than its predecessors. Evidence from the experiments has shown that the GA works as expected. In most cases, new relevant documents were brought to the user in each generation, until convergence at the final generation. Tables 7 through 11 show the numbers of new relevant documents retrieved in each generation for the five databases. Observing the results, one interesting phenomenon arises: for a given topic the algorithm may retrieve different numbers of relevant documents on distinct databases. This seems reasonable, since the five databases may concentrate on different areas. Moreover, the retrieval patterns for the WSJ and AP databases, which presumably cover similar topics, look similar to each other.
Table 4. Query individuals on Topic 3 (11 terms)

Generation = 0
0  0.18 0.31 0.53 0.95 0.17 0.70 0.23 0.49 0.12 0.08 0.39
1  0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15 0.63
2  0.31 0.35 0.92 0.52 0.40 0.61 0.79 0.93 0.87 0.87 0.67
3  0.76 0.58 0.39 0.36 0.20 0.83 0.42 0.46 0.98 0.13 0.21
4  0.96 0.74 0.41 0.78 0.76 0.96 0.03 0.32 0.76 0.24 0.59
5  0.04 0.96 0.32 0.06 0.44 0.92 0.57 0.12 0.57 0.25 0.50
6  0.24 0.48 0.41 0.87 0.43 0.36 0.38 0.04 0.16 0.52 0.70
7  0.10 0.40 0.77 0.24 0.34 0.23 0.30 0.30 0.89 0.04 0.65
8  0.40 0.68 0.73 0.94 0.23 0.84 0.97 0.78 0.43 0.67 0.81
9  0.16 0.28 0.14 0.86 0.75 0.21 0.14 0.29 0.80 0.22 0.56

Generation = 5
0  0.28 0.37 0.98 0.54 0.77 0.76 0.03 0.45 0.69 0.24 0.78
1  0.28 0.22 0.98 0.54 0.77 0.56 0.03 0.32 0.76 0.24 0.58
2  0.28 0.22 0.98 0.54 0.77 0.66 0.03 0.45 0.69 0.24 0.78
3  0.28 0.22 0.98 0.54 0.77 0.66 0.03 0.45 0.69 0.24 0.78
4  0.28 0.22 0.98 0.54 0.77 0.66 0.03 0.45 0.69 0.24 0.78
5  0.28 0.22 0.98 0.54 0.77 0.66 0.03 0.45 0.69 0.24 0.78
6  0.28 0.42 0.98 0.54 0.77 0.66 0.13 0.45 0.69 0.24 0.78
7  0.28 0.22 0.98 0.54 0.77 0.66 0.03 0.32 0.76 0.24 0.58
8  0.28 0.37 0.98 0.54 0.77 0.67 0.13 0.32 0.76 0.24 0.55
9  0.28 0.37 0.98 0.54 0.77 0.66 0.03 0.32 0.76 0.24 0.58

Table 5. Query individuals on Topic 12 (12 terms)

Generation = 0
0  0.18 0.31 0.53 0.95 0.17 0.70 0.23 0.49 0.12 0.08 0.39 0.28
1  0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15 0.63 0.31 0.35
2  0.92 0.52 0.40 0.61 0.79 0.93 0.87 0.87 0.67 0.76 0.58 0.39
3  0.36 0.20 0.83 0.42 0.46 0.98 0.13 0.21 0.96 0.74 0.41 0.78
4  0.76 0.96 0.03 0.32 0.76 0.24 0.59 0.04 0.96 0.32 0.06 0.44
5  0.92 0.57 0.12 0.57 0.25 0.50 0.24 0.48 0.41 0.87 0.43 0.36
6  0.38 0.04 0.16 0.52 0.70 0.10 0.40 0.77 0.24 0.34 0.23 0.30
7  0.30 0.89 0.04 0.65 0.40 0.68 0.73 0.94 0.23 0.84 0.97 0.78
8  0.43 0.67 0.81 0.16 0.28 0.14 0.86 0.75 0.21 0.14 0.29 0.80
9  0.22 0.56 0.72 0.20 0.99 0.25 0.43 0.76 0.86 0.89 0.98 0.40

Generation = 4
0  0.92 0.57 0.12 0.67 0.25 0.50 0.28 0.28 0.96 0.84 0.41 0.60
1  0.92 0.57 0.12 0.57 0.25 0.50 0.07 0.15 0.96 0.74 0.41 0.40
2  0.92 0.57 0.12 0.47 0.25 0.50 0.09 0.35 0.96 0.98 0.43 0.36
3  0.92 0.57 0.12 0.57 0.25 0.50 0.09 0.15 0.96 0.77 0.41 0.47
4  0.92 0.57 0.12 0.57 0.25 0.50 0.09 0.21 0.96 0.74 0.41 0.47
5  0.92 0.57 0.12 0.57 0.25 0.50 0.09 0.28 0.96 0.84 0.41 0.47
6  0.92 0.57 0.12 0.72 0.25 0.50 0.26 0.31 0.96 0.84 0.41 0.47
7  0.92 0.57 0.12 0.52 0.25 0.50 0.09 0.12 0.96 0.74 0.41 0.52
8  0.92 0.57 0.12 0.67 0.25 0.50 0.26 0.15 0.96 0.74 0.41 0.52
9  0.92 0.57 0.12 0.67 0.25 0.50 0.26 0.15 0.96 0.74 0.41 0.52

Table 6. Average query term weights in each generation for topic 3

Gen.  t1    t2    t3    t4    t5    t6    t7    t8    t9    t10   t11
0     .343  .515  .560  .612  .449  .631  .460  .451  .640  .317  .571
1     .514  .493  .747  .688  .585  .772  .592  .672  .734  .426  .671
2     .396  .386  .881  .552  .510  .787  .315  .516  .763  .348  .684
3     .301  .349  .938  .526  .511  .811  .262  .444  .785  .213  .700
4     .286  .342  .968  .556  .696  .746  .060  .346  .746  .240  .649
5     .280  .285  .980  .540  .770  .661  .050  .398  .718  .240  .700

Table 7. # of relevant documents retrieved - DOE database

Topic  Generations 0 1 2 3 4 5
01  0
02  0
03  5 8 5 9 2
04  0
05  0
06  0
07  0
08  0
09  0
10  2 0
11  4 3
12  29 15 1 2
13  23
14  0
15  0
16  0
17  0
18  0
19  0
20  0
21  27 5
22  0
23  0
24  36 15 5 2
25  10 0 1 10
26  6 0 5
27  58 22 15 4 22
28  1
29  0
30  0
31  0
32  0
33  10 3 10
34  0
35  0
36  0
37  0
38  3
39  14
40  2 2
41  0
42  3 1
43  4
44  0
45  3 2 0
46  0 1
47  0
48  4
49  9
50  0

Table 8. # of relevant documents retrieved - FR database

Topic  Generations 0 1 2 3 4 5
01  0
02  0
03  0
04  0
05  1
06  0
07  0
08  0
09  0
10  0
11  1
12  2
13  0
14  0
15  0
16  0
17  19 18 6
18  0
19  0
20  0
21  0
22  0 1 2
23  0
24  0
25  0
26  0
27  0
28  0
29  0
30  0
31  0
32  0
33  0
34  0
35  0
36  0
37  0
38  0
39  0
40  0
41  0
42  0
43  0
44  0
45  0
46  0
47  0
48  0
49  0
50  0

Table 9. # of relevant documents retrieved - WSJ database. Table 10.
# of relevant documents retrieved - AP database

Topic Generation Topic Generation 0 1 2 3 4 5 6 0 1 2 3 4 5 6 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 0 29 1 30 0 30 20 31 0 31 3 4 32 0 32 0 32 0 33 6 33 0 1 1 34 3 34 0 35 2 2 35 0 1 36 1 36 0 37 0 37 2 38 1 38 0 39 0 39 0 40 1 40 0 41 0 41 0 42 0 42 2 1 43 0 43 0 44 0 44 7 5 45 0 45 0 46 1 14 46 0 19 47 22 47 8 48 0 48 1 8 49 23 20 19 1 49 12 50 0 50 0

* means more relevant documents were retrieved after this generation, but their number has been added into this generation.

01 15 10 4 8 02 24 9 1 03 30 5 04 11 9 7 8 2 05 3 06 26 2 5 5 5 7 07 4 5 08 15 11 8 09 22 4 5 2 10 43 24 6 5 3 11 10 3 2 12 26 2 13 29 14 2 15 45 16 0 17 2 18 31 19 50 20 1 21 18 22 21 23 2 24 28 25 1 26 6 27 1 28 6 16 8 12 8 2 22* 50 30 5 31 23 32* 6 10 13 7 3 7 2 3 9 3 7 1 25 13 2 7 9 1 3 1 1 12 6 19 5 12 11 2 17 5 7 7 6 0 2 1 11 14 14 8 3 7 26* 5 4 0 2 0 3 14 2 1 30 18 11 31 16 23 8 20 14 21 45 38 3 1 67 6 0 12 0 3 0 4 8 10 21 2 5* 28 35 62 74 88 29 48* 0 1 4 3 2 3 1 65 15 24 7 3 4 5* 1 2 16 14 10 2 0 0 1 2 0 3 1

Table 11. # of relevant documents retrieved - ZIFF database

Topic  Generations 0 1 2 3 4 5 6
01  3 1 2
02  10 1 0
03  24 10 7 6
04  0
05  2
06  0
07  1
08  0
09  0
10  0
11  3
12  0
13  0
14  0
15  47 3 2
16  0
17  0
18  0
19  0
20  18 11 15
21  2 1
22  4
23  0
24  0 4
25  0
26  73 75 10 3 5 5
27  28 27 17
28  34 29 8
29  9 4 1
30  8 0 2
31  11 5 1
32  18 4
33  13 20 13 15 7 6 9*
34  24 29 7 2
35  2
36  3
37  59 38 1
38  6 4 3 7 14 2
39  6
40  3
41  15
42  12
43  2
44  43
45  10 6 1
46  14 20
47  26 13 14 2
48  1 0 5
49  72 14 5
50  2

(3) Effects of the genetic operators - mutation and crossover: The previous examples have shown the overall effect of the GA operations, where the genetic modification of the query term weights has improved retrieval results.
However, in most situations the function of the two important operators - mutation, which assigns a new weight to a randomly selected term, and crossover, which exchanges individual term weights between randomly selected pairs - is difficult to display. However, in experiments on the DOE database we found one example, topic 13, which shows how mutation and crossover can create a new query individual retrieving, in this example, all of the relevant documents which were retrieved partly by one of its parents and partly by the other, together with another new query individual retrieving none. Table 12 depicts this situation. The top part of the table shows the term weights of the query individuals: the left side shows the two original query individuals - the parents - and the right side the children after mutation and crossover. The rest of the table shows the documents retrieved by parent 1, parent 2, and the children. We can see that one of the children has inherited all of the proper term weights from the parents, and this child retrieved all of the relevant documents retrieved by both parents. Note that although two of the term weights were changed by mutation, this does not affect the combination effect, because the difference between the changed weights and the original ones from one of the parents is not significant. Of course, the effects of crossover and mutation are not always this clear-cut.

(4) Parallel search using multiple query individuals: The genetic algorithm, using multiple individuals in the document retrieval process, provides the capability of parallel search, with different query individuals searching the document space simultaneously. Each individual, with the same terms but different weights, may retrieve different documents located in different areas of the document space. This makes the genetic search distinct from other search methods, where only one query is used.
The effects of parallel search can be explained by showing the documents retrieved by several query individuals. Here we give the results from topic 24. Table 13 displays the term weights of the query individuals in the first generation, and Table 14 shows the relevant documents retrieved by four different query individuals from Table 13. We can see that for these four queries there is no overlap among the relevant documents retrieved. Each query individual emphasizes different concepts by assigning high weights to them, and documents more closely related to those concepts are retrieved by different query individuals. The situation can also be viewed as having the query individuals handle the AND and OR operations simultaneously, which is impossible for methods using only one query unless they use the Boolean model. However, many researchers reject the Boolean model because it is difficult for users to apply the AND/OR operations correctly (e.g., Bookstein, 1985).

Table 12. The combined effect of crossover and mutation, topic 13

Term weights (t1-t5):
  parent 1: .77 .65 .77 .78 .82
  parent 2: .46 .98 .13 .21 .96
  child 1:  [weights partly illegible in the source; includes .65 and .77]
  child 2:  [weights partly illegible in the source; includes .98 and .13]
Note: in the original, underlining marks a weight set by crossover and double underlining marks a weight changed by mutation.

Documents retrieved by parent 1: DOE1-05-0747, DOE1-11-0383, DOE1-44-0234, DOE1-49-0051, DOE1-69-0957, DOE1-81-0200, DOE2-20-0798, DOE2-27-0146, DOE2-32-1039, DOE2-37-0525, DOE2-48-0100, DOE2-49-0342, DOE2-71-0973.
Documents retrieved by parent 2: DOE1-24-0861, DOE1-26-0555, DOE1-36-0791, DOE1-50-0948, DOE1-79-0679, DOE2-27-0093, DOE2-69-1023, DOE2-70-0787, DOE2-71-0280, DOE2-80-0365.
Documents retrieved by child 1: all of the documents retrieved by parent 1 and parent 2.
Documents retrieved by child 2: none.
The parallel search can continue if the individuals retrieving different relevant documents have the same power, that is, if the differences in the number of relevant documents brought by each of them are small. The results in Table 15 and Table 16 come from the same topic as Tables 13 and 14, but from the second generation (generation 1), after feedback and genetic modification of the first generation. It can be seen that, due to the genetic operations, parts of several query individuals have begun converging. At this point the queries group into three clusters of similar queries: (0, 3, 7, 8, 9), (1, 5, 6), and (2, 4) (Table 15). Within each group several term weights are identical. However, parallel search still continues: differences in the other weights result in query individuals that retrieve completely different sets of documents. For example, contrast query 7 with the others in its cluster. Other examples from this study show that this effect may be observed with differences in only one or two term weights. Tables 17 and 18 show that the parallel search for this topic continued to the final generation, 3, where for each term almost all of the weights had converged to a single value. But query 7 has two different values, at terms 4 and 21, which cause it to retrieve totally different relevant documents than the other query individuals. However, this is not always the case. In most situations the genetic algorithm will force the query variants to converge to a single query; that is, all of the query individuals in the final generation retrieve the same relevant documents. The example on topic 24 also shows, as in previous cases, that new relevant documents are brought to the user in each generation, as indicated by the `*` following the document numbers.

Table 13.
Term weights of query individuals of topic 24, the first generation (generation 0). [The 21 term weights (t1-t21) for the ten query individuals (0-9) could not be recovered from the source; the column layout was lost.]

Table 14. Relevant documents retrieved by query individuals 0, 3, 6, and 8 for topic 24, the first generation (generation 0). [The check-mark layout associating each document with the individual that retrieved it could not be recovered from the source.]

(5) Precision and recall

The average performance of the current genetic algorithm can be shown using precision and recall values. Here we use data provided by NIST to show the results for the ad hoc queries and the routing queries. Note that in our experiment the ad hoc queries were used only on the second dataset (disk two).
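For reference, the 11-point and 3-point average precision figures reported in the evaluation tables are simple means of the precision values at fixed recall levels. A minimal sketch follows, using the ad hoc values from Table 19; the interpolation performed by the official evaluation program is not reproduced here:

```python
def n_point_average(prec_at_recall, points):
    """Mean of the precision values at the given recall levels."""
    return sum(prec_at_recall[r] for r in points) / len(points)

# Ad hoc run: recall level -> average precision (values from Table 19)
adhoc = {0.0: 0.5622, 0.1: 0.2723, 0.2: 0.2025, 0.3: 0.1382,
         0.4: 0.0708, 0.5: 0.0309, 0.6: 0.0279, 0.7: 0.0155,
         0.8: 0.0136, 0.9: 0.0000, 1.0: 0.0000}

eleven_pt = n_point_average(adhoc, sorted(adhoc))    # ~0.1213, as in Table 19
three_pt = n_point_average(adhoc, [0.2, 0.5, 0.8])   # ~0.082; Table 19 reports
                                                     # 0.0824 from unrounded values
```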
Table 19 and Table 20 display the average precision at the 11 recall levels for the ad hoc queries and the routing queries, respectively.

8. Discussion

The experimental results on the TREC document collection have shown that the genetic approach to query modification can be applied to large and varied databases. Evidence has indicated that genetic query modification leads to the convergence of the query term weights. The results have also shown that additional relevant documents can be retrieved by modifying the query term weights using the genetic approach. We have also shown, through specific examples, the effects of crossover and mutation, and how variations in query term weights affect the retrieval of relevant documents. We believe that the genetic algorithm technique introduces new information into the retrieval process, allowing the system to do a better job of query modification. Evidence has shown that even without the use of genetic techniques our system employs parallel search, although the genetic modification brings additional relevant documents. For the system designer interested in Boolean techniques, this permits exploration of several Boolean variants simultaneously, and also facilitates use of Boolean combinations of the individual query variants.

Table 15.
Term weights of query individuals of topic 24, the second generation (generation 1)

    t1  t2  t3  t4  t5  t6  t7  t8  t9  t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21
0  .18 .31 .53 .95 .17 .70 .23 .49 .12 .08 .39 .28 .37 .96 .38 .48 .69 .77 .78 .82 .15
1  .60 .24 .45 .79 .08 .48 .15 .25 .94 .61 .99 .48 .80 .76 .54 .77 .49 .10 .59 .35 .14
2  .49 .61 .42 .13 .26 .04 .98 .11 .38 .65 .35 .55 .36 .57 .48 .16 .74 .57 .55 .29 .87
3  .18 .31 .53 .95 .28 .70 .23 .49 .12 .08 .39 .28 .37 .98 .54 .77 .53 .37 .78 .82 .15
4  .49 .61 .42 .13 .26 .04 .67 .25 .94 .61 .99 .48 .80 .75 .48 .16 .62 .17 .55 .29 .87
5  .60 .24 .45 .79 .08 .48 .47 .11 .38 .65 .35 .55 .36 .55 .38 .48 .53 .10 .59 .35 .14
6  .60 .24 .45 .79 .08 .48 .15 .25 .94 .61 .79 .28 .37 .98 .54 .77 .65 .77 .78 .69 .14
7  .18 .31 .53 .95 .17 .70 .23 .49 .12 .08 .59 .48 .80 .74 .38 .48 .53 .10 .59 .48 .98
8  .18 .31 .53 .95 .17 .70 .23 .49 .12 .08 .39 .28 .37 .98 .54 .77 .65 .77 .78 .82 .15
9  .18 .31 .53 .95 .17 .70 .23 .49 .12 .08 .39 .28 .37 .98 .54 .77 .65 .77 .78 .82 .15

Table 16. Relevant documents retrieved by query individuals for topic 24, the second generation (* marks newly retrieved documents). [The check-mark layout associating each document with the individuals (0-9) that retrieved it could not be recovered from the source.]

Table 17. Query individuals on topic 24 (generation 3)

    t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11  t12  t13  t14  t15  t16  t17  t18  t19  t20  t21
0  0.18 0.31 0.53 0.95 0.17 0.70 0.23 0.49 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15
1  0.18 0.31 0.53 0.95 0.17 0.70 0.23 0.49 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15
2  0.18 0.31 0.53 0.95 0.17 0.70 0.33 0.59 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.75 0.77 0.78 0.82 0.15
3  0.18 0.31 0.53 0.95 0.28 0.70 0.23 0.49 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.55 0.77 0.78 0.82 0.15
4  0.18 0.31 0.53 0.95 0.17 0.70 0.33 0.59 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15
5  0.18 0.31 0.53 0.95 0.17 0.70 0.23 0.49 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15
6  0.18 0.31 0.53 0.95 0.17 0.70 0.33 0.59 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15
7  0.18 0.31 0.53 0.95 0.17 0.70 0.13 0.39 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.98
8  0.18 0.31 0.53 0.95 0.17 0.70 0.23 0.49 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15
9  0.18 0.31 0.53 0.95 0.17 0.70 0.23 0.49 0.12 0.08 0.39 0.28 0.37 0.98 0.54 0.77 0.65 0.77 0.78 0.82 0.15

Table 18. Relevant documents retrieved by query individuals for topic 24 (generation 3) (* marks newly retrieved documents). [The check-mark layout could not be recovered from the source; one group of documents, including DOE1-57-0764 and DOE1-75-0356, was retrieved by most query individuals, while a second group of documents was each retrieved by only a single individual.]

Table 19.
Precision and recall for ad hoc queries

Top ranked evaluation. Run number: upitt3, all. Number of queries: 50.
Total number of documents over all queries - Retrieved: 5383; Relevant: 5923; Rel_ret: 1249.

Recall - Precision Averages:
  at 0.00  0.5622
  at 0.10  0.2723
  at 0.20  0.2025
  at 0.30  0.1382
  at 0.40  0.0708
  at 0.50  0.0309
  at 0.60  0.0279
  at 0.70  0.0155
  at 0.80  0.0136
  at 0.90  0.0000
  at 1.00  0.0000
Average precision for all points (11-pt Avg): 0.1213
Average precision for 3 intermediate points (0.20, 0.50, 0.80) (3-pt Avg): 0.0824

Recall:
  at 5 docs    0.0192
  at 15 docs   0.0509
  at 30 docs   0.0849
  at 100 docs  0.1753
  at 200 docs  0.2206
Precision:
  at 5 docs    0.3280
  at 15 docs   0.2867
  at 30 docs   0.2533
  at 100 docs  0.1844
  at 200 docs  0.1249

Table 20. Precision and recall for routing queries

Top ranked evaluation. Run number: upitt3, all. Number of queries: 50.
Total number of documents over all queries - Retrieved: 7611; Relevant: 14216; Rel_ret: 1277.

Recall - Precision Averages:
  at 0.00  0.5285
  at 0.10  0.2057
  at 0.20  0.0649
  at 0.30  0.0085
  at 0.40  0.0000
  at 0.50  0.0000
  at 0.60  0.0000
  at 0.70  0.0000
  at 0.80  0.0000
  at 0.90  0.0000
  at 1.00  0.0000
Average precision for all points (11-pt Avg): 0.0734
Average precision for 3 intermediate points (0.20, 0.50, 0.80) (3-pt Avg): 0.0216

Recall:
  at 5 docs    0.0090
  at 15 docs   0.0215
  at 30 docs   0.0362
  at 100 docs  0.0842
  at 200 docs  0.1007
Precision:
  at 5 docs    0.3520
  at 15 docs   0.3013
  at 30 docs   0.2727
  at 100 docs  0.2102
  at 200 docs  0.1277

9. Failure Analysis

Comparing our results with those provided by NIST, the precision values at the eleven recall points indicate that our system performs better than the median level on about half of the topics for the ad hoc queries (Figure 1). In several queries (54, 60, 79, 81, 84, 91, and 100) we sent only a few documents (fewer than 10), and for queries 59, 61, 68, 80, and 99 we sent no documents, because the threshold values limited the number of documents retrieved. Thus the precision values for those queries are zero or very low.
The same situation happened for some routing queries. The precision values for the routing queries in our system are lower than the median level in most cases (Figure 2). Besides the threshold inhibition mentioned above, we observed that, due to the query convergence in the final generation of the training topic, most query variants retrieved the same documents. Some query individuals in the intermediate generations which retrieved different relevant documents than the last generation may not have survived. We think this caused the situation in which fewer relevant documents were retrieved for the routing queries. Problems also arose with specific queries due to our pre-processing of the documents. Several circumstances were not considered in designing our system. For example, some special keywords, such as AT&T, and "M", which was used in some documents to represent "million", were not processed, but were significant in some topics. Another factor is that in the AP, WSJ, and ZIFF databases more than one text, each on a different topic, could comprise a single document. Since we did not separate them, the keyword match could cause a document to be retrieved because keywords from different text parts matched the query, even though the document itself is not relevant.

Figure 1. Precision values on 50 ad hoc queries. [The plot could not be recovered from the source; it compares the Upitt precision values with the NIST median by topic number.]

Figure 2. Precision values on 32 routing queries. [The plot could not be recovered from the source.]
10. Further Studies

In dealing with these large databases, our algorithm shows many interesting phenomena which we could not pursue due to the short time period of TREC, and which could form the agenda for further studies.

(1) The document term weights

Although we use the vector space model described in Section 2, unweighted keywords are assigned to describe the documents. We have already mentioned the two factors on which this decision was based. Moreover, the databases cover a variety of subjects, which may make general term weights less reliable than in most experimental databases, which deal with only a narrow area. While our Cranfield study supports this decision, the effect of weighted versus unweighted document keywords in large databases should be studied further.

(2) Window size for document retrieval

Although we established the maximum size of the retrieved document set as 40, it would be interesting to set the window size smaller than 40 at the beginning of the process and gradually increase it as the query individuals move toward convergence. Various thresholds and cutoff points are often set arbitrarily; further experimentation is planned to investigate the effect of changes in these values.

(3) Document classification

In this experiment we found many cases in which different query individuals retrieved different documents. This may arise because the database contains relevant documents that belong to different classes and can be retrieved only by using different queries. This fact would never be discovered by systems that rely on a single query. Our present algorithm aims at convergence to a single query form. We need a mechanism which will detect the development of distinct classes and automatically separate the query into two or more distinct queries. Similar ideas, not based on genetic algorithms, have been discussed in document classification (Worona, 1971) and in query splitting (Borodin et al., 1971).
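A class-detection mechanism of this kind could start from the retrieved sets themselves: if groups of query individuals retrieve mutually disjoint documents, the query is a candidate for splitting. A hypothetical sketch; the function name and the overlap criterion are our illustrative assumptions, not part of the paper's algorithm:

```python
def split_candidates(retrieved):
    """Group query individuals whose retrieved-document sets overlap.
    Each resulting group of individuals is a candidate for a separate query."""
    groups = []  # each entry: (merged document set, list of individual ids)
    for qid, docs in retrieved.items():
        docs = set(docs)
        overlapping = [g for g in groups if g[0] & docs]
        for g in overlapping:           # merge every group this individual touches
            groups.remove(g)
            docs |= g[0]
        members = [qid] + [m for g in overlapping for m in g[1]]
        groups.append((docs, members))
    return [sorted(m) for _, m in groups]

# Individuals 0 and 3 share a document; individual 7 retrieves a disjoint set,
# suggesting the query should be split into two distinct queries.
print(split_candidates({0: ["d1", "d2"], 3: ["d2", "d5"], 7: ["d9"]}))
# -> [[0, 3], [7]]
```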
(4) Term addition and deletion

In our feedback process, no terms were added to or removed from the original query. This may cause lower recall. However, determining which terms should be added or removed is not an easy task (e.g., Harman, 1988, 1992). It would be interesting to see how this problem can be handled in the genetic process.

References

Aalbersberg, I.J. (1992). Incremental relevance feedback. Proceedings of the 15th Annual International SIGIR Conference, pp. 11-22. Copenhagen, Denmark (June, 1992).

Baker, J.E. (1987). Reducing bias and inefficiency in the selection algorithm. In J.J. Grefenstette (Ed.), Proceedings of the Second International Conference on Genetic Algorithms, pp. 14-21. NJ: Lawrence Erlbaum Associates, Publishers.

Bookstein, A. (1985). Probability and fuzzy-set applications to information retrieval. In M.E. Williams (Ed.), Annual Review of Information Science and Technology, Vol. 20, pp. 117-151.

Borodin, A., Kerr, L. and Lewis, F. (1971). Query splitting in relevance feedback systems. In G. Salton (Ed.), The SMART Retrieval System: experiments in automatic document processing. NJ: Prentice-Hall.

Chang, Y.K., Cirillo, C. and Razon, J. (1971). Evaluation of feedback retrieval using modified freezing, residual collection, and test and control groups. In G. Salton (Ed.), The SMART Retrieval System: experiments in automatic document processing. New Jersey: Prentice-Hall, Inc.

Frakes, W.B. (1992). Stemming algorithms. In W.B. Frakes and R. Baeza-Yates (Eds.), Information Retrieval: data structures and algorithms, pp. 131-160. NJ: Prentice Hall.

Frieder, O. and Siegelmann, H. (1991). On the allocation of documents in multiprocessor information retrieval systems. Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research & Development in Information Retrieval, pp. 230-239. Chicago, Illinois.

Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning.
New York: Addison-Wesley Publishing Company, Inc.

Gordon, M.D. (1988). The necessity for adaptation in modified Boolean document retrieval systems. Information Processing & Management, 24(3), pp. 339-347.

Harman, D. (1988). Towards interactive query expansion. Proceedings of the Eleventh Annual International ACM/SIGIR Conference, pp. 321-331.

Harman, D. (1992). Relevance feedback revisited. Proceedings of the 15th Annual International SIGIR Conference, pp. 1-10. Copenhagen, Denmark (June, 1992).

Holland, J.H. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor: The University of Michigan Press.

Ide, E. (1971). New experiments in relevance feedback. In G. Salton (Ed.), The SMART Retrieval System: experiments in automatic document processing, Chapter 16. New Jersey: Prentice-Hall, Inc.

Salton, G. (Ed.). (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. NJ: Prentice-Hall, Inc.

Salton, G. (1986). Recent trends in automatic information retrieval. Proceedings of ACM/SIGIR Conference on Research and Development in Information Retrieval, pp. 1-10. Pisa, Italy (September, 1986).

Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), pp. 513-523.

Worona, S. (1971). Query clustering in a large document space. In G. Salton (Ed.), The SMART Retrieval System: experiments in automatic document processing, Chapter 12. New Jersey: Prentice-Hall, Inc.

Yang, J.J. and Korfhage, R.R. (1992a). Adaptive information retrieval systems in the vector space model. Proceedings of the Symposium on Document Analysis and Information Retrieval, pp. 134-150. Las Vegas, Nevada (March, 1992).

Yang, J.J. and Korfhage, R.R. (1992b). Query modification using genetic algorithms in vector space models. Research Reports, L!5045/1592001, Department of Information Science, University of Pittsburgh.