Retrieval Experiments with a Large Collection using PIRCS

K.L. Kwok, L. Papadopoulos and Kathy Y.Y. Kwan
email: kklqc@cunyvm.bitnet
Computer Science Department, Queens College, CUNY, Flushing, NY 11367

ABSTRACT

Our strategy for Information Retrieval and for the TREC experiments is based on techniques that have previously been demonstrated to work for small to medium size collections: 1) use of document components for retrieval and term weighting; 2) two-word phrases to achieve better precision and recall; 3) combination of retrieval methods; and 4) a network implementation with learning capability to support feedback and query expansion. Evaluation shows that we return the best results for Category B in both ad hoc and routing retrievals, and our approach is comparable to the best methods used in the Category A experiments. It appears that techniques that work for small collections, such as combining soft-boolean retrieval with a probabilistic model, user relevance feedback, and feedback with query expansion, also work for this large collection.

1. Introduction

Over the past years, we have built an experimental system for automated information retrieval (IR) called PIRCS, an acronym for Probabilistic Indexing and Retrieval - Components - System. It provides storage and retrieval capability based on collection statistics, using content terms as index terms for document representation. Its design is based on a network and was conceived with flexibility in mind more than efficiency or performance, so that future and unforeseen approaches to IR may be supported. As it turns out, when the TREC experiments were announced, we found that our system fits nicely with the requirements. Certain changes (mostly because of large file sizes) and a number of peripheral programs were needed, but the basic design remains intact. Because available RAM and disk space are limited, we can only handle the Category B Wall Street Journal files (WSJ, 500 MByte raw data); however, we foresee no problem running the system on the Category A files (2 GByte raw data) if we have the appropriate hardware. In what follows we describe our strategies for IR and TREC in Section 2, the system description in Section 3 and the Appendix, results and discussion in Section 4, and conclusions in Section 5.

2. Strategies for IR and TREC Experiments

Over the past twenty-five or so years, a number of techniques have become known to work in IR for small to medium collections. These include: manually created thesauri and phrases to enhance recall and precision; term weighting that can account for importance of content and discrimination; user relevance feedback and feedback with query expansion; combining multiple retrieval methods; and using multiple document and query representations. Many other methods exist and have been experimented with (such as clustering, natural language processing, and various artificial intelligence techniques), but their effectiveness in general is still in question. Our strategy for IR and for the TREC experiments is based on the following choices, which reflect the previously workable techniques: 1) use of document components; 2) two-word phrases; 3) combination of retrieval methods; and 4) a network implementation with learning. We shall discuss how and why each of these methodologies may contribute to the effectiveness of our results.

2.1. Use of Document Components and Query Formulation

We view each source document not as monolithic, but as constituted of components.
Document components are utilized in two ways: for constructing a more restricted context for term weighting, retrieval and feedback, and for defining initial weights for the content terms used in representation. These are discussed in the following sub-sections.

2.1.1 Sub-Documents as Document Components

A survey of the WSJ files shows that document lengths vary substantially, from a couple of lines to hundreds of lines containing several thousand words. Moreover, many documents carry unrelated news stories, separated by three dashes `---` on an independent line. We believe that treating such documents as monolithic objects will have an adverse impact on: a) precision, because there is a high probability that homographs would occur in a sense and context different from what one intends; b) term weighting, because unreliable estimates of the necessary probabilities for the index terms might result and affect retrieval; c) feedback and query expansion, because documents that are long and mix unrelated topics make these processes imprecise; and d) system output effectiveness, because after retrieval, users still have to work through a large document to locate the relevant passage.

There may be some risks in breaking documents up. Boolean ANDs would not be satisfied if the two factors of an AND happen to be split into separate sub-documents. Coordinate matching would count fewer matching terms if one does not somehow combine the counts of the sub-documents for ranking purposes, assuming all matched terms are topically relevant. List queries with term weights may or may not suffer: even though a sub-document would have fewer term matches than a full document, shorter document lengths may lead to higher term weights depending on the weighting method used. Since all documents are treated in the same fashion, the sub-document weights will be affected to the same degree. If we allow for a substantial chunk of text, such as a few hundred words, a writer generally would have a chance to express what s/he intends fairly completely, either in this or in another sub-document. Since we use neither Boolean retrieval nor coordinate matching, we believe using sub-documents with more uniform lengths outweighs the risks. Our experimental results seem to support our conjecture.

Therefore the first processing we did was to break each document into component units of approximately equal lengths. We create a new component (which we call a sub-document, with two more digits attached at the end of the original document ID, assuming a breakup of at most 100 sub-documents per document) whenever we recognize the story separation mark `---`, or when we have a run of text exceeding N words and ending on a paragraph boundary. A sub-document should not be too short lest it carry little content. Moreover, since each sub-document is an independent entry consuming space and time resources, we do not want to exceed a certain limit, which we arbitrarily set at about double the original number of documents. After some experimentation, we found that a break at N=360 raw words satisfies our design goal. The original number of documents in WSJ is (first half plus second half) 98733+74486 = 173219; after breaking into components, we end up with 192935+156880 = 349815 sub-documents. This forms our database for subsequent processing. We would prefer to break documents based on more sophisticated strategies such as context, but we have not done so.
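As a concrete illustration of the breakup rule just described, the following is a minimal sketch that splits one raw document into sub-documents; it assumes the raw file has already been parsed into (document ID, text) pairs, and the function name and paragraph conventions shown are illustrative rather than the actual PIRCS code.

```python
import re

N = 360  # approximate sub-document length in raw words, as chosen above

def split_into_subdocuments(doc_id, text, max_words=N):
    """Split one WSJ document into sub-documents at `---` story separators, or
    when a run of text exceeds max_words and a paragraph boundary is reached.
    Sub-document IDs append a two-digit suffix to the original document ID."""
    # unrelated stories are separated by three dashes on a line of their own
    stories = re.split(r'(?m)^\s*---\s*$', text)
    pieces = []
    for story in stories:
        paragraphs = [p for p in re.split(r'\n\s*\n', story) if p.strip()]
        buf, count = [], 0
        for para in paragraphs:
            buf.append(para)
            count += len(para.split())
            if count >= max_words:          # break only on a paragraph boundary
                pieces.append('\n\n'.join(buf))
                buf, count = [], 0
        if buf:
            pieces.append('\n\n'.join(buf))
    # at most 100 sub-documents per document, hence the two-digit suffix
    return [(f"{doc_id}{i:02d}", piece) for i, piece in enumerate(pieces[:100])]
```

Under this scheme a hypothetical document ID `D123` whose text runs well past 360 words would yield sub-documents `D12300`, `D12301`, and so on.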
2.1.2 Query Formulation

In PIRCS without soft-boolean, queries and documents are items of the same category, each containing a list of content terms. For a query, we obtain the list from the <title>, <desc>, <narr> and <con> paragraphs of the topic in a fully automatic fashion. These paragraphs also go through a filtering program that removes standard introductory phrases such as: `To be relevant, a document (will | must | ..) (discuss | cite | report | ..)', etc. If a capital `NOT' is found in a sentence, the rest of the sentence including the `NOT' is also removed, because it is difficult for list queries to handle negation. We understand that this does not completely solve the negation problem, because some `not's are not capitalized and many negations are expressed by other means. The remaining words from the four paragraphs are then merged and processed against the collection dictionary to form a query representation. No breakup into sub-documents is done for topics.

We also manually form a boolean query for each topic for soft-boolean retrieval, thus providing both an alternative representation for queries and a different retrieval method. We essentially scan the same paragraphs as before. Sometimes we also consult the document frequency of a term to screen out high frequency terms, to arrive at a smaller expression. We might occasionally add to the boolean expression some new terms that are not in the original paragraphs. However, the way our evaluation program works, these new terms are ignored because they are not part of the automatically formed query.

2.1.3 Initial Term Weighting based on Single Terms as Conceptual Components

After the previous processes, we apply the use of document components a second time. We regard each content term within a sub-document or query as an independent concept. This allows us to use the principle of document self-recovery to give initial weights to each term of an item, or to use the simpler Inverse Collection Term Frequency (ICTF) weighting [1,2]. Because the former requires experimentally adjusting some parameters and we did not have sufficient relevance judgment information, we decided to use the simpler ICTF for the initial weighting of a term in an item (query or document), as follows:

    w_ik = ln[p/(1-p)] + ln[(1-s_ik)/s_ik].    (1)

w_ik is the weight given to term k in item i; p = 1/50, a constant chosen based on previous experience; and s_ik = (F_k - d_ik)/(N_W - L_i) if item i is a document, and s_ik = F_k/N_W if item i is a query. Here F_k = Σ_i d_ik is the collection frequency of term k, d_ik is the term frequency of term k in item i, L_i = Σ_k d_ik is the length of item i, and N_W = Σ_i L_i = Σ_k F_k is the total number of term occurrences in the database. The fraction (1-s_ik)/s_ik inside the logarithm in Eqn. 1 is approximately N_W/F_k if N_W >> F_k >> d_ik, hence the nomenclature ICTF.
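The ICTF weight of Eqn. 1 can be computed directly from the collection statistics defined above; the sketch below is illustrative only (the function name is ours, not part of PIRCS).

```python
import math

P = 1.0 / 50.0   # the constant p of Eqn. 1

def ictf_weight(d_ik, F_k, L_i, N_W, is_query):
    """Initial term weight w_ik of Eqn. 1.
    d_ik: frequency of term k in item i; F_k: collection frequency of term k;
    L_i: length of item i; N_W: total term occurrences in the collection."""
    if is_query:
        s_ik = F_k / N_W
    else:
        s_ik = (F_k - d_ik) / (N_W - L_i)
    return math.log(P / (1.0 - P)) + math.log((1.0 - s_ik) / s_ik)
```

Since (1-s_ik)/s_ik is close to N_W/F_k for rare terms, the second component behaves like an inverse collection term frequency, which is the property the weighting is named after.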
2.2. Two-word Phrases and Other Vocabulary Control

Document frequencies (of terms) of a few hundred are small compared with a few hundred thousand documents. Yet a few hundred high-ranked documents that are irrelevant would stretch a user's patience to great limits. We therefore believe that in large collection environments, precision enhancement tools are very important. Syntactic phrases, or statistically generated phrases within a tight context, would probably be useful as indexing terms. However, we do not yet have these tools for the experiments. A look at the collection also shows that WSJ jargon contains many two-word phrases that are combinations of very common words. Examples are `big three', `buy down', `drug money', `go public', `take over', etc. By themselves, many of the single terms would be screened out because they are either on the stopword list or have high document frequencies. In combination, however, these two-word phrases are precise in meaning and would probably impact favorably on both precision and recall. We therefore spent a fair amount of manual effort to record these content-specific combinations in a two-word phrase file. During processing, whenever such a two-word phrase occurs in adjacent positions in a sentence, a new index term is created consisting of the combination, in addition to the single terms. Such two-word combinations are also common in other fields. Our file has 396 such pairs, containing also some from the computing field, and not necessarily consisting only of function or common words. We would like to have a larger set, but we had neither the resources nor the domain expertise. Another vocabulary control tool is a stopword list with 595 entries, some of which are morphological variants of the same word. In addition, we use Porter's stemming algorithm [3] for suffix stripping. There are many other vocabulary control tools such as spell-checkers, proper noun and date identification, synonym lists, thesauri, semantic nets, etc. These could be added in the future.
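A minimal sketch of how the two-word phrase file can be applied during indexing follows; it assumes the matching is done on the raw token stream of a sentence before stopword removal (so that common words like `go' and `public' can still pair up), and the combined-term naming convention is illustrative.

```python
def add_phrase_terms(tokens, phrase_pairs):
    """Given the token stream of one sentence and a set of recorded two-word
    phrases, e.g. {('drug', 'money'), ('go', 'public')}, return the index terms:
    all single terms plus a combined term for each adjacent pair found in the
    phrase file."""
    terms = list(tokens)
    for a, b in zip(tokens, tokens[1:]):
        if (a, b) in phrase_pairs:
            terms.append(a + '_' + b)   # new index term for the combination
    return terms
```

The single terms are kept alongside the combined term, so a phrase match adds evidence rather than replacing the individual words.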
2.3. Multiple Retrieval Methods

It is well known that different retrieval algorithms on the same collection lead to different retrieval results, and that combining them can substantially enhance effectiveness [2,4,5,6]. Our network approach to IR (Section 2.4) can support multiple retrieval methods fairly easily. Three methods of retrieval are used in PIRCS for these experiments, and they agree nicely with the different types of TREC query requirements:

(a) Query-Focused Retrieval. This method addresses the question: given a query q_a, should document d_i be considered relevant to it? This point of view is appropriate for the ad hoc environment, where we have a set of static documents being processed against a query stream. Each query serves as a focus with respect to which documents are ranked. An approximate probabilistic measure W_ia of this answer for d_i is evaluated as follows and is used as the RSV (retrieval status value) to rank the whole collection with respect to q_a:

    W_ia = Σ_k w_ak * d_ik/L_i.    (2)

w_ak is the weight of term k in query q_a as given in Eqn. 1, and d_ik/L_i measures the proportion with which term k is used in d_i. The sum is over all terms k that overlap between d_i and q_a.

(b) Document-Focused Retrieval. This method addresses the reverse question: given document d_i, should query q_a be considered relevant to it? This point of view is appropriate for routing, where we have a set of static queries being processed against a document stream. Each document serves as a focus with respect to which queries can be ranked. However, evaluation of retrieval is usually done with respect to a specific query q_a, and so an approximate probabilistic measure V_ia of this answer is given to each document d_i and used as the RSV as follows:

    V_ia = Σ_k w_ik * d_ak/L_a.    (3)

(c) Soft-Boolean Retrieval (Query-Focused). This method depends on a boolean query being available for each topic, which we create manually for this experiment. The idea is that topics represented as a list of terms (as in (a) and (b) above) have no structure. Boolean queries have structure, but their logic is `hard' and we lose the ability to attach term importance or to rank the collection. The soft-boolean approach allows one to retain the query structure yet provide weights for both the terms and the boolean operators, so that we can soften the logic and provide ranking capability. We have followed the extended Boolean approach of the vector model [7] for this implementation, but with simplified weights. For example, a query q_a may be of the form:

    q_a = B_b AND_p X,  with  X = C_c OR_p D_d.    (4)

B, C, D are index terms with weights b, c, d respectively, AND and OR are the boolean operators, each given a fixed weight p=2, and all clause weights are set to 1. If document d_i also has these three terms with weights 0 <= b', c', d' <= 1, then the similarity between d_i and q_a can be evaluated recursively as:

    x' = Sim(d_i,X) = sqrt[ ((cc')^2 + (dd')^2) / (c^2 + d^2) ],
    Sim(d_i,q_a) = 1 - sqrt[ (b^2(1-b')^2 + 1*(1-x')^2) / (b^2 + 1) ].    (5)

All document and query term weights are taken from the edges of the net, so that the system is fully automatic once the boolean expression has been defined manually. Our retrieval results for automatic query construction are then based on combining methods (a) and (b), thus: W_ia^auto = (W_ia + V_ia)/2. Those for manual query construction are based on W_ia^man = r*W_ia^auto + s*Sim(d_i,q_a). Both make use of a combination of retrieval methods. The constants r and s are chosen as 0.65 and 0.35 respectively. The objective is that adding soft-boolean structure may enhance the retrieval results of the automatic method for the same queries. Our soft-boolean evaluation algorithm currently only accounts for terms that also appear in the network for this query; additional terms that may have been inserted manually are ignored.
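The three retrieval methods and the combination rules can be sketched as follows; this is an illustrative reimplementation of Eqns. 2-5, not the PIRCS code itself, and the data-structure choices (term-to-weight dictionaries, a nested tuple for the boolean tree) are ours.

```python
def query_focused_rsv(query_wt, doc_tf, doc_len):
    """Eqn. 2: W_ia = sum over shared terms k of w_ak * (d_ik / L_i)."""
    return sum(w_ak * doc_tf[k] / doc_len for k, w_ak in query_wt.items() if k in doc_tf)

def document_focused_rsv(doc_wt, query_tf, query_len):
    """Eqn. 3: V_ia = sum over shared terms k of w_ik * (d_ak / L_a)."""
    return sum(w_ik * query_tf[k] / query_len for k, w_ik in doc_wt.items() if k in query_tf)

def pnorm_sim(node, doc_wt, p=2.0):
    """Soft-boolean (extended Boolean) similarity with p = 2, as in Eqn. 5.
    node is ('term', k) or (op, [(child, operand_weight), ...]) with op in {'and', 'or'};
    doc_wt maps a term to its normalized document weight in [0, 1]."""
    if node[0] == 'term':
        return doc_wt.get(node[1], 0.0)
    op, operands = node
    denom = sum(a ** p for _, a in operands)
    if op == 'or':
        num = sum((a * pnorm_sim(child, doc_wt, p)) ** p for child, a in operands)
        return (num / denom) ** (1.0 / p)
    num = sum((a * (1.0 - pnorm_sim(child, doc_wt, p))) ** p for child, a in operands)
    return 1.0 - (num / denom) ** (1.0 / p)

def combined_rsv(W_ia, V_ia, sim=None, r=0.65, s=0.35):
    """W_ia^auto = (W_ia + V_ia)/2;  W_ia^man = r*W_ia^auto + s*Sim(d_i, q_a)."""
    auto = (W_ia + V_ia) / 2.0
    return auto if sim is None else r * auto + s * sim

# The example of Eqn. 4, q_a = B_b AND_2 (C_c OR_2 D_d) with clause weight 1, becomes:
# query = ('and', [(('term', 'B'), b), (('or', [(('term', 'C'), c), (('term', 'D'), d)]), 1.0)])
```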
2.4. Network Implementation with Learning

2.4.1 Network for Routing, Ad Hoc and Feedback without/with Query Expansion

The use of a network can provide a unified view of many retrieval algorithms and is a flexible tool for implementation. In PIRCS, retrieval methods (a) and (b) of the previous section are implemented as feed-forward and feed-backward processing in a Query-Term-Document (Q-T-D) network, as presented in [8,9]. A binary tree representing a boolean expression can also be hung onto the net for method (c). These are shown in Figs. 1 and 2.

[Fig. 1: 3-Layer PIR (Q-T-D) Network. Fig. 2: Soft-Boolean Query Network.]

The edges of the net are initialized as follows: w_ak (t_k acting on q_a) as in Eqn. 1, and w_ki (d_i acting on t_k) = d_ik/L_i as in Eqn. 2, and similarly for w_ka and w_ik. An activation of d_i = 1 gated through w_ki deposits on t_k an activation equal to the proportion with which the term is used in d_i. These activated terms spread to a target q_a gated through w_ak, which carries the odds that q_a is relevant given t_k. The sum of activations received at q_a implements the query-focused retrieval method (a) of Section 2.3, while processing in the reverse QTD direction implements the document-focused retrieval method (b). Moreover, edge weights from the net can also be used to initialize the tree leaf nodes for soft-boolean evaluation in the DTQ direction (Fig. 2).

Relevance feedback, which has been demonstrated in numerous experiments to be the most effective tool for improving retrieval, is modeled as training in our net. For example, the edge weight w_ak (or w_ik) adapts according to a learning algorithm based on the average activation x_k induced in term k by the relevant set, the current p factor (see Eqn. 1) on the edge, and a learning rate η_Q (or η_D), thus:

    DTQ: Δw_ak = Δp_ak / [p_ak^old (1 - p_ak^old)],   Δp_ak = η_Q (x_k - p_ak^old)
    QTD: Δw_ik = Δp_ik / [p_ik^old (1 - p_ik^old)],   Δp_ik = η_D (x_k - p_ik^old)

See [8,9] for details and Fig. 3 illustrating the DTQ process. Learning in both directions is allowed: DTQ query-focused training is done when we know the set of documents relevant to a query and corresponds to probabilistic retrieval [10,11], and QTD document-focused training is done when we know the set of queries relevant to a document and corresponds to probabilistic indexing [12]. Query-focused training lets query representations improve with experience and prepares them to match new similar documents better, and is normally associated with relevance feedback [13]. Document-focused training lets document representations improve with experience and prepares them to match new similar queries better, and is normally associated with dynamic document space modification [13]. They provide a precision enhancing tool because term weights are rendered `sharper' towards relevant items. Our network implementation differs from the traditional approaches in that two sets of weights are associated with an item, e.g. w_ik and w_ki. The w_ki = d_ik/L_i are the observed properties and are not modified, while w_ik embeds inference and adapts as more evidence is gathered. Moreover, the effects of training from both the query-focused and the document-focused processes are combined into one RSV, W_ia^auto, which has been shown to lead cooperatively to better ranking results. These ideas were first introduced in a series of papers [14,15,1,2]. With learning capability, the net becomes a two-layer direct-connect artificial neural network in each direction.

[Fig. 3: Query-Focused DTQ Learning. Fig. 4: DTQ Learning with Expansion.]

An additional feature allowed in our feedback learning is query expansion. The idea is that terms that are highly activated by the set of relevant documents should be highly related to the concepts and topics wanted for that query. These terms can therefore be added to the original query as illustrated in Fig. 4, and could be expected to enhance both recall and precision, since these terms come from relevant samples. Experiments with small collections have shown that this is indeed the case [16,8,9]. This tool resembles an automatic thesaurus, the associated terms being derived from user experience. From the network viewpoint, query expansion corresponds to growing new edges from a query to highly activated terms during relevance feedback, and weights are assigned according to the following learning rule, which we introduced in [8,9]:

    DTQ: w_ka = a*x_k;   p_ak = η_Q*x_k,   w_ak = ln[p_ak/(1-p_ak)] + ln[N_W/F_k].

In the TREC WSJ collection, we realize that documents can contain totally different stories and can also be exceptionally lengthy, while both feedback and feedback with query expansion require a restricted context to work. This is a major reason why we decided to break documents into sub-documents, as discussed in Section 2.1.1.
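A sketch of the feedback adaptation and the expansion-edge creation, as we read them from the update rules above, is given below; the function names and learning-rate arguments are ours, and this is an illustration under those assumptions rather than the exact PIRCS routine.

```python
import math

def adapt_edge(w_old, p_old, x_k, eta):
    """One feedback step for an adaptive edge: p moves toward the average activation
    x_k induced by the relevant set, and the weight changes by delta_p / [p_old(1-p_old)],
    consistent with the w = ln[p/(1-p)] + const form of Eqn. 1."""
    delta_p = eta * (x_k - p_old)
    delta_w = delta_p / (p_old * (1.0 - p_old))
    return w_old + delta_w, p_old + delta_p

def new_expansion_edge(x_k, F_k, N_W, eta_q):
    """Grow a new query-term edge for an expansion term: p_ak is seeded from the
    activation x_k, and the weight takes the Eqn. 1 form with an ICTF-like
    second component ln(N_W / F_k)."""
    p_ak = eta_q * x_k
    w_ak = math.log(p_ak / (1.0 - p_ak)) + math.log(N_W / F_k)
    return w_ak, p_ak
```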
2.4.2 Implementation of Retrieval in a Network

To satisfy TREC requirements, we need to simulate the different querying circumstances given in the following table:

    Query Construction      Query Type
                          ad hoc    routing
    ---------------------------------------
    automatic             PIRCS1    PIRCS1
    manual                PIRCS2    PIRCS2
    feedback              PIRCS3    n.a.
    fdbk + qry expansion  PIRCS4    n.a.

We have submitted results for all six types of experiments. The previous sections describe the rationale and how the RSVs are calculated in our approach. The following discusses the processing of routing, ad hoc and feedback queries:

(a) Routing Queries: a special requirement in TREC is to do routing retrieval. This simulates the classification of new test documents with respect to a set of static queries that have been trained with past documents. The (past) training documents are the first half of the TREC collection, c(A). We process c(A) and capture the necessary statistics of term usage. The queries from the first set of topics are then processed against c(A), and a network is created using ICTF weighting. Query-focused DTQ learning is now applied with the supplied relevant (but not irrelevant) documents, and the resultant edge weights on the Q-T side of the net are saved. Note that each relevant document is split into multiple sub-documents for this learning, and because we do not have relevance judgments at the sub-document level, we chose not to expand the queries. These become our routing queries q(A). The test documents from collection B, c(B), are then processed against c(A) as if they were queries, so that only collection A statistics are used for the ICTF term weighting. The q(A) are now loaded with c(B) and the dictionary from c(A) to form a new network, from which routing retrieval results based on W_ia^auto and W_ia^man (Section 2.3) are obtained, for the fully automatic and manual routing queries.

(b) Ad Hoc Queries: collection B is now re-processed as an addition to collection A, forming a total collection c(AB) and accumulating their total term usage statistics as well as a new dictionary. The ad hoc topics are processed to form ad hoc queries q(B) against the whole c(AB). We did not perform document-focused QTD learning of c(AB) based on known relevant queries in q(A). The ad hoc queries q(B), the total dictionary, and the total collection c(AB) then form a network using ICTF weighting, from which both the automatic and the manual ad hoc query retrieval results (W_ia^auto and W_ia^man) are obtained.

(c) Feedback Queries: To simulate user relevance feedback, we employ the ten highest ranked sub-documents of the automatic ad hoc results in (b), and determine their relevance to their respective topics manually. Two queries out of 25 have no relevant documents within the first ten, and they are removed, leaving 23. We have performed two types of feedback training: without and with query expansion. In both cases, query-focused and document-focused training were done. Our method of adding terms to existing queries during DTQ direction training is based on [8,9]. The n_a documents relevant to a query q_a are instantiated to 1, and their activation of each term node, after being gated through the weights w_ki = d_ik/L_i, is averaged. The highest 20 activated terms with document frequency less than 2000 are then used to expand q_a. However, not all of these terms are different from the existing terms of q_a, so that on average each query was expanded with about 12.3 terms (284 new terms for 23 queries). After this, training proceeds in the QTD direction. We have not used expansion for documents because usually there is only one relevant query per document. We have done only one round of feedback iteration.
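The expansion-term selection of (c) can be sketched as below; it is a minimal sketch assuming each relevant sub-document is available as a (term-frequency dictionary, length) pair, with illustrative names throughout.

```python
def expansion_terms(relevant_subdocs, query_terms, doc_freq, top_x=20, df_limit=2000):
    """Average each term's activation (d_ik / L_i) over the relevant sub-documents,
    keep the top_x most activated terms whose document frequency is below df_limit,
    and return only those not already present in the query."""
    activation = {}
    for tf, length in relevant_subdocs:
        for k, d_ik in tf.items():
            activation[k] = activation.get(k, 0.0) + d_ik / length
    n = float(len(relevant_subdocs))
    avg = {k: v / n for k, v in activation.items() if doc_freq.get(k, 0) < df_limit}
    ranked = sorted(avg, key=avg.get, reverse=True)[:top_x]
    return [k for k in ranked if k not in query_terms]
```

Filtering the already-present terms after taking the top 20 is what yields the roughly 12 genuinely new terms per query reported above.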
3. System Description

Our system was designed over a number of years for IR experimentation. Its primary goal is to be flexible, so that unforeseen developments or algorithms can be supported for testing without much change in the basic implementation. Many intermediate files are produced to provide more convenient hooks to and from the system. These will be trimmed in later versions. It was originally programmed in Pascal, but has been translated and is now running in a UNIX and C environment on a Sun SparcStation 2GS. The system has 48 MB RAM, a TREC-dedicated 1.7 GB disk drive, plus about 0.5 GB on another drive that we can occasionally use. Our system characteristics and timings are described in the Appendix of this paper.

4. Results and Discussions

Our runs are named PIRCS'n', where n=1 denotes fully automatic retrieval, n=2 denotes manual, i.e. automatic combined with soft-boolean where the boolean queries are manually created, n=3 denotes feedback using automatic retrieval, and n=4 is the same as n=3 but with query expansion. Although we use sub-documents for indexing and retrieval, our final 200-document output list has only unique document IDs; all sub-documents of the same document ID are suppressed except for the one with the highest rank. Precision-recall and other measures are gathered in the Appendix of this volume. Our runs are based on twenty-five queries for ad hoc and another twenty-five for routing retrievals, which we recognize as too few and perhaps not sufficiently reflective of the variety of query types in real life. On the other hand, the evaluation curves and average values serve as a lower bound on the power of our system, because relevance judgments are rendered only for the combined first 100 documents of all systems. This means documents ranked 101 to 200 in our retrieval are automatically considered irrelevant if they are not in the judged relevant set. Also, Category B texts come from a single source, viz. WSJ of multiple years, unlike Category A with texts from multiple sources. With these reservations we would like to make the following observations:

(a) Inter-site comparison shows that the PIRCS results are the best among Category B participants for both ad hoc and routing retrievals, and comparable to the best methods in Category A based on one evaluation with crnlB.

(b) Some of the topics have very specific requirements for documents to be relevant. For example: Topic #1 needs antitrust cases that result from a complaint, not routine review; #2 needs acquisitions between a U.S. company and another, non-U.S. company; #53 needs leveraged buyout cases valued at or above $200 million; while #60 requires a policy change from merit-pay to seniority or vice versa. Data like `above $200 million' or other numerics are removed either because they are on our stopword list or because of high frequency. Other topics involve very general concepts that require the system to understand their specific inferences. Examples are: course of action to decrease the U.S. deficit (#7); the body of water being polluted (#12); specific commercial applications of superconductors (#21); hypocritical and conflicting policies of the U.S. government (#74). The possible `course of action', `commercial applications', `conflicting policies', etc. are essentially open-ended. Yet others need synonym lists or other aids to interpret proper terms in order not to miss documents. Examples are: Japanese, U.S. or foreign companies (#2, 3), European Community or countries (#5, #69), third world or developing countries (#4, 6), or economic indicators (#8). When would a proper noun, if identifiable, represent a company? And if it does, is it a foreign company? Also, a few of the topics, like #3, 15, 53, 56 and 66, have short descriptions with many general words, so that after stemming and stop-word processing these queries end up with few terms. PIRCS does not have tools for these problems. Its precision values at 50% recall and at the 11-pt average for ad hoc and routing retrievals are tabulated below:

                      Ad Hoc              Routing
                  PIRCS1  PIRCS2      PIRCS1  PIRCS2
    50% Recall     .276    .278        .340    .342
    11-pt Avg      .311    .322        .343    .369

The ad hoc precision value of about 0.28 at 50% recall says that, averaged over 25 queries, if one wants to retrieve half of all relevant documents, one would have to read about eleven documents to get three relevant ones; for routing it is about nine documents to get three relevant ones. Routing queries receive some help from the few relevant documents provided for training purposes, just as in the experiments first reported in [8], where we simulate users posing queries equipped with some known relevant documents. The 11-pt average precision values sample eleven recall points and simulate a uniform distribution of users with different recall needs, and may reflect actual usage better. Effectiveness improves to between three in ten and three in nine retrieved documents being relevant for ad hoc, and slightly better for routing. These results naturally leave much room for improvement; but considering that PIRCS1 is fully automatic and relies only on statistical methods, the results seem reasonable for this large WSJ collection. See also the analysis in (c) and (f).

(c) Another evaluation of the system is to look at the precision-recall values at different cut-offs of 5, 15, 30, 100 and 200 retrieved documents. This may give users a better `feel' than the hypothetical 11-pt average. A question is what these values should be compared to. The theoretical limit is of course 1.0, for perfect recall and precision. However, this would punish the system unfairly. For example, at 15 retrieved documents, many queries have x relevant documents with x > 15. Hence the best recall at this cut-off should be 15/x. This we call the best operational recall, in contrast to the theoretical best of 1.0. Similarly, if x < 15, these queries would have a best operational precision at this cut-off of x/15, instead of 1.0. We have listed below for PIRCS1 the precision-recall values at various cut-offs, and also the best operational values for comparison:

    Cut-off:                     5      15     30     100    200
    Ad Hoc
      Best Oper. Recall:       .146    .298   .456   .828   .916
      Recall:                  .066    .130   .204   .419   .586
                                45%     44%    45%    51%    64%
      Best Oper. Precision:    .984    .944   .899   .689   .469
      Precision:               .624    .523   .480   .364   .281
                                63%     55%    53%    53%    60%
    Routing
      Best Oper. Recall:       .083    .241   .387   .777   .915
      Recall:                  .048    .145   .228   .443   .576
                                58%     60%    59%    57%    63%
      Best Oper. Precision:    1.00    .995   .945   .781   .551
      Precision:               .608    .597   .521   .398   .294
                                61%     60%    55%    51%    53%
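The best operational values used in the table can be computed per query as follows; this is a worked sketch (the reported figures are averages of these per-query values over the 25 topics, and the function name is ours).

```python
def best_operational(x_relevant, cutoff):
    """Best operational recall and precision at a cut-off for a query with
    x_relevant known relevant documents: recall can be at most cutoff/x_relevant,
    and precision at most x_relevant/cutoff, each capped at 1.0."""
    best_recall = min(1.0, cutoff / float(x_relevant))
    best_precision = min(1.0, x_relevant / float(cutoff))
    return best_recall, best_precision

# e.g. a query with 50 known relevant documents, at a cut-off of 15:
# best operational recall = 15/50 = 0.30, best operational precision = 1.0
```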
For ad hoc retrieval at the 30 retrieved document cut-off, for example, the table means that on average we get about 20% (.204) of all the relevant documents, but nearly one out of two (.480) retrieved documents is relevant. However, because of the number of known relevant documents for each of the queries, the best operational recall and precision at this cut-off are only .456 and .899. Our .204 recall and .480 precision achieve 45% and 53% of these best values. On the whole we have achieved between 44-64% of the best operational recall and 53-63% of the best operational precision values for ad hoc, and respectively between 57-63% and 51-61% for routing. These are quite respectable figures.

(d) Based on our strategy and experimental data, techniques that work for small collections work for this large WSJ collection as well. For example, using the 11-pt average precision values, PIRCS2 performs better than PIRCS1 in both the ad hoc (0.322 vs 0.311, +3.6%) and routing (0.369 vs 0.343, +7.6%) environments. Thus, adding soft-boolean as a third retrieval method helps, but requires forming the boolean queries manually. An illustration of the better results of PIRCS2 over PIRCS1 can be seen by comparing the pairs of methods query by query using the 11-pt average measure, where `equal' means values within 5%:

                    AD HOC              ROUTING
              PIRCS1 vs PIRCS2    PIRCS1 vs PIRCS2
    better:           5                   3
    equal:            9                   6
    worse:           11                  16

Both PIRCS3 and PIRCS4 make use of feedback based on evaluating the first ten retrieved sub-documents of PIRCS1, but PIRCS4 employs query expansion as well. These methods are automatic. Comparing PIRCS4 with PIRCS3 in ad hoc feedback retrieval shows that query expansion is better (0.305 vs 0.282, +8.1%) than no expansion. Note that in this feedback only 23 queries are used, because topics #66 and #73 have no relevant documents in the first ten retrieved. PIRCS3 and PIRCS4 have also been re-evaluated using the frozen rank method, viz. the ten documents used for feedback are given the same rank as in PIRCS1, while the other retrieved documents follow. In addition, the two queries without relevant documents in the first ten are put back carrying their PIRCS1 results. This way, PIRCS1, 3f and 4f can be directly compared. It can be seen that feedback by itself works (0.3407 vs 0.3107, +9.7%) and query expansion improves it further (0.3634 vs 0.3107, +17%). There are about 5.5 (126/23) relevant documents in the first ten retrieved.

(e) To illustrate the power of feedback, we have tabulated the number of queries that perform better, about equal, or worse between pairs of methods using the 11-pt average measure:

              PIRCS1 vs PIRCS3f    PIRCS3f vs PIRCS4f
    better:           1                    5
    equal:            5                    3
    worse:           19                   17

PIRCS3f overwhelmingly improves over PIRCS1, 19 to 1 with 5 ties. This is because PIRCS1 starts with low performance. PIRCS4f improves over PIRCS3f, but not by such a wide margin, because PIRCS3f already achieves better results. It can be seen that feedback, with or without query expansion, is definitely worthwhile within this collection. Looking at the recall and precision at the various retrieved document cut-offs in the Appendix of this volume, it can be seen that PIRCS4f performs better than PIRCS3f, except at the high precision cut-off of 5. This reflects that query expansion is more of a recall enhancement tool. This can also be seen from the number of relevant documents retrieved at the 200 document cut-off: 1403, 1526 and 1582 respectively for PIRCS1, 3f and 4f. In the high precision region, the added terms could lead to more noise. The effect however is small, and may be related to the number and type of added terms. Thus, query expansion behaves like an automatic thesaurus, the terms added being indirectly based on user relevance.
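The frozen-rank re-evaluation described in (d) can be sketched as follows; this is our reading of the procedure (the documents examined for feedback keep their PIRCS1 ranks, and the remaining slots are filled in order from the feedback run), not code from PIRCS.

```python
def frozen_rank_list(initial_ranking, feedback_docs, feedback_ranking, cutoff=200):
    """Merge a feedback run with the initial run: documents shown to the user for
    feedback keep their ranks from the initial ranking, and every other slot is
    filled, in order, by the feedback run's remaining documents."""
    rest = (d for d in feedback_ranking if d not in feedback_docs)
    merged = []
    for d in initial_ranking[:cutoff]:
        merged.append(d if d in feedback_docs else next(rest, None))
    return [d for d in merged if d is not None]
```

Queries for which no relevant documents were found in the first ten simply keep their PIRCS1 list unchanged, as stated above.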
(f) It is tempting to compare the results of PIRCS on this large WSJ collection with those on small collections. We have listed below the ad hoc 10-pt average precision of the four standard collections popular in IR research [2], together with that of the WSJ PIRCS1 results:

    MED            CRAN            CACM          WSJ           ISI
    (30q, 1.03Kd)  (225q, 1.40Kd)  (52q, 3.2Kd)  (25q, 350Kd)  (76q, 1.46Kd)
    0.472          0.374           0.297         0.263         0.174

MED (medicine) and CRAN (aerodynamics) use more specific, scientific vocabulary and can be expected to have better performance. CACM (computer science) terminology generally is less `scientific' and can often be general vocabulary; its performance and that of WSJ are similar. ISI (information science) has very broad, nonspecific queries and quite general vocabulary, and has the worst retrieval results. We were afraid WSJ would have low performance like ISI. Listed below is a comparison between the CACM [2] and WSJ PIRCS1 P-R curves:

    Recall:  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
    WSJ:     .78  .66  .55  .41  .31  .28  .19  .11  .07  .04  .01
    CACM:    .62  .50  .41  .36  .30  .24  .18  .15  .11  .09

A characteristic of WSJ is that precision falls off to much smaller values in the high recall region, compared with CACM or other small collections. One should expect a large collection to have more noise, and it is interesting that this noise impacts predominantly the low signal region. For example, WSJ has a generality ratio of only 0.00077 versus 0.00478 for CACM, i.e. more than six times as much noise in WSJ. Another reason for this phenomenon, perhaps to a lesser degree, is the incomplete evaluation procedure for documents ranked between 101 and 200, as discussed earlier. In the low recall region, however, the precision of WSJ is comparable to or better than the small collections. Why is that? Our interpretation is that, first, queries are much richer and better formed in WSJ compared to those in CACM. Second, when a collection is large, there is a very good chance that a number of relevant documents exist that use closely the same terms as the queries describing their content, especially if the queries are well worded. These documents will rank high, and hence precision at low recall does not suffer in spite of the adverse generality ratio. In the high recall region, relevant documents do not express their content in terms similar to the queries and have few term matches, and interference from poor generality-ratio noise magnifies. Techniques to improve precision at high recall would therefore be very important if one needs exhaustive search. It has been quite popular to criticise IR research on the grounds that small collection results do not reflect those of large collections. Our hope is that as experience is gained with more of these large scale collections, we might be able to predict their behavior based on small collection results. Small collection experiments can be performed in a matter of hours, while large collections take days using current technology. It should be noted that all the databases considered here are from one homogeneous field of knowledge. Many commercial database providers produce CDROMs that are homogeneous and of about the same size as WSJ, and these results may be relevant to them.

5. Conclusion

Our system called PIRCS, an acronym for Probabilistic Indexing and Retrieval - Components - System, has been shown to be able to support storage and retrieval for a large collection of 0.5 GB. If appropriate hardware is available, and with some software modification, it should support 2 GB size collections.
Our strategy for IR and TREC consists of: 1) the use of document components to provide a more restricted context for retrieval and feedback, and to provide initial ICTF term weighting; 2) two-word phrases, especially those consisting of stopwords and high-frequency stems, as a precision and recall enhancing tool; 3) a combination of retrieval methods, including soft-boolean, to capture cooperative effects between retrieval methods; and 4) a network implementation with learning to implement feedback and feedback with query expansion, resulting in a two-layer direct-connect artificial neural network with adaptive architecture. Our approach leads to results that are better than expected.

6. Acknowledgment

We would like to thank our department Chairman and the Dean of Mathematics and Natural Science for their support throughout the project. This work is partially supported by a grant from DARPA and a PSC-CUNY grant #6-63288.

References

1. Kwok, K.L (1989). A neural network for probabilistic information retrieval. Proc. ACM SIGIR 12th Ann. Intl. Conf. on R&D in IR. N.J. Belkin & C.J. van Rijsbergen, eds. ACM: NY, pp. 21-30.
2. Kwok, K.L (1990). Experiments with a component theory of probabilistic information retrieval based on single terms as document components. ACM TOIS 8:363-386.
3. Porter, M.F (1980). An algorithm for suffix stripping. Program 14:130-137.
4. Smith, M (1990). Aspects of the p-norm model of information retrieval: syntactic query generation, efficiency, and theoretical properties. TR 90-1128, Ph.D. Thesis, Cornell University.
5. Turtle, H.R & Croft, W.B (1991). Evaluation of an inference network-based retrieval model. ACM TOIS 9:187-222.
6. Fox, E.A; Nunn, G.L & Lee, W.C (1988). Coefficients for combining concept classes in a collection. Proc. ACM SIGIR 11th Ann. Intl. Conf. on R&D in IR. Y. Chiaramella, ed. PUG: Grenoble, pp. 291-307.
7. Salton, G; Fox, E.A & Wu, H (1983). Extended boolean information retrieval. Comm. ACM 26:1022-1036.
8. Kwok, K.L (1991). Query modification and expansion in a network with adaptive architecture. Proc. ACM SIGIR 14th Ann. Intl. Conf. on R&D in IR. A. Bookstein, Y. Chiaramella, G. Salton & V.V. Raghavan, eds. ACM: NY, pp. 192-201.
9. Kwok, K.L (199x). A network approach to probabilistic information retrieval. Submitted for publication.
10. Robertson, S.E & Sparck Jones, K (1976). Relevance weighting of search terms. J. ASIS 27:129-146.
11. van Rijsbergen, C.J (1979). Information Retrieval, 2nd Ed. Butterworths: London.
12. Maron, M.E & Kuhns, J.L (1960). On relevance, probabilistic indexing and information retrieval. J. ACM 7:216-244.
13. Salton, G (1989). Automatic Text Processing. Addison-Wesley: NY.
14. Kwok, K.L (1986). An interpretation of index term weighting schemes based on document components. Proc. 1986 ACM Conf. on R&D in IR. F. Rabitti, ed. ACM: NY, pp. 275-283.
15. Kwok, K.L & Kuan, W (1988). Experiments with document components for indexing and retrieval. Inform. Proc. Mgmnt. 24:405-417.
16. Salton, G & Buckley, C (1990). Improving retrieval performance by relevance feedback. J. ASIS 41:288-297.

Appendix: System Summary and Timing

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures
  1. Stopword List  YES
     a. how many words in list?  595
  2. is a controlled vocabulary used?  NO
  3. stemming  YES
     a. standard stemming algorithms  YES -- which ones?  PORTER'S ALGORITHM
     b. morphological analysis  NO
  4. term weighting  YES
  5. phrase discovery  NO
     a. what kind of phrase?
     b. using statistical methods
     c. using syntactic methods
  6. syntactic parsing  NO
  7. word sense disambiguation  NO
  8. heuristic associations  NO
     a. short definition of these associations
  9. spelling checking (with manual correction)  NO
  10. spelling correction  NO
  11. proper noun identification algorithm  NO
  12. tokenizer (recognizes dates, phone numbers, common patterns)  NO
      a. which patterns are tokenized?
  13. are the manually-indexed terms used?  NO
  14. other techniques used to build data structures (brief description)
      A TABLE OF 396 MANUALLY CREATED 2-WORD PHRASES. WHEN THESE ARE IDENTIFIED IN ADJACENT POSITIONS IN DOCUMENTS OR QUERIES THEY ARE USED AS ADDITIONAL INDEX TERMS.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index  YES
     a. total amount of storage (megabytes)  372
     b. total computer time to build (approximate number of hours)  95+11+2 = 108 FOR 500 MB, CLOCK TIME.
     c. is the process completely automatic?  YES, IF SUFFICIENT DISK. NOT IN THIS EXPERIMENT.
        if not, approximately how many hours of manual labor?  0.5
     d. are term positions within documents stored?  NO, BUT SENTENCE POSITIONS YES.
     f. single terms only?  YES, EXCEPT FOR I.A.14.
  2. clusters  NO
     a. total amount of storage (megabytes)
     b. total computer time to build (approximate number of hours)
     c. brief description of clustering method
     d. is the process completely automatic?
        if not, approximately how many hours of manual labor?
  3. ngrams, suffix arrays, signature files  NO
     a. total amount of storage (megabytes)
     b. total computer time to build (approximate number of hours)
     c. brief description of methods used
     d. is the process completely automatic?
        if not, approximately how many hours of manual labor?
  4. knowledge bases  NO
     a. total amount of storage (megabytes)
     b. total number of concepts represented
     c. type of representation (frames, semantic nets, rules, etc.)
     d. total computer time to build (approximate number of hours)
     e. total manual time to build (approximate number of hours)
     f. use of manual labor
        (1) mostly manually built using special interface
        (2) mostly machine built with manual correction
        (3) initial core manually built to "bootstrap" for completely machine-built completion
        (4) other (describe)
     g. auxiliary files needed for machine use
        (1) machine-readable dictionary (which one?)
        (2) other (identify)
  5. special routing structures (what?)  SEE I.B.6: NETWORK NODE AND EDGE FILES. ROUTING USING NETWORK NODE AND EDGE FILES IS STRAIGHTFORWARD.
     a. total amount of storage (megabytes)  NODE FILE: 4x7.5, EDGE FILE: 4x4. NETWORK SEGMENTED INTO 4 BECAUSE OF INSUFFICIENT RAM.
     b. total computer time to build (approximate number of hours)  40+5+1+4x0.2 = 46.8, STARTING FROM TEXT FILE.
     c. is the process completely automatic?  YES, IF SUFFICIENT RAM AND DISK SPACE.
     d. brief description of methods used
        1. PROCESS (OLD) COLLECTION A.
        2. PROCESS QUERIES AGAINST COLLECTION A.
        3. PROCESS NEW COLLECTION B AS IF THE DOCUMENTS WERE QUERIES, TO MAKE USE OF COLLECTION A STATISTICS.
        4. COMBINE QUERIES, (OLD) DICTIONARY AND COLLECTION B INTO A NETWORK FOR RETRIEVAL.
  6. other data structures built from TREC text (what?)
     1. SUBDOCUMENT FILE          2. CODED FILE
     3. DOCID CHECKING FILE       4. TERMID CHECKING FILE
     5. DOCNUM FILE               6. TERMNUM (DICTIONARY) FILE
     7. DIRECT FILE               8. INDEX TO DIRECT FILE
     9. NODE FILE                 10. EDGE FILE
     THE SYSTEM WAS DEVELOPED FOR EXPERIMENTAL RESEARCH, WITH FLEXIBILITY TO GENERATE OTHER DATA. SOME OF THE FILES ARE NOT NECESSARY FOR RETRIEVAL.
     a. total amount of storage (megabytes)
        1. 481   2. 324   3. 7   4. 4   5. 11   6. 6   7. 372   8. 19   9. 4x14   10. 4x9
     b. total computer time to build (approximate number of hours)
        1. 1.5   2,3,4,5,6. 95   7,8. 11   9,10. 4x0.25 = 1
     c. is the process completely automatic?  YES, IF SUFFICIENT RAM AND DISK SPACE. FOR THIS EXPERIMENT, NO.
        if not, approximately how many hours of manual labor?  2
     d. brief description of methods used
        RAW TEXT --> SUBDOCUMENT FILE
        SUBDOCUMENT --> CODED FILE, DOCID FILE, TERMID FILE, DOCNUM FILE, TERMNUM (DICTIONARY) FILE. A ZIPF-LAW PROGRAM TRUNCATES THE DICTIONARY VIA USER-ASSIGNED LIMITS.
        CODED, TERMNUM --> DIRECT FILE WITH INDEX
        DIRECT --> INVERTED FILE
        DIRECT, INVERTED --> NODE, EDGE FILES.

C. Data built from sources other than the input text
  1. internally-built auxiliary files
     a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)  DOMAIN SPECIFIC
     b. type of file (thesaurus, knowledge base, lexicon, etc.)  WORD PAIR
     c. total amount of storage (megabytes)  0.005
     d. total number of concepts represented  396
     e. type of representation (frames, semantic nets, rules, etc.)
     f. total computer time to build (approximate number of hours)  0 (THIS IS A FILE CREATED VIA EDITOR)
        (1) if already built, how much time to modify for TREC?
     g. total manual time to build (approximate number of hours)  16
        (1) if already built, how much time to modify for TREC?
     h. use of manual labor
        (1) mostly manually built using special interface
        (2) mostly machine built with manual correction
        (3) initial core manually built to "bootstrap" for completely machine-built completion
        (4) other (describe)  SEARCH FOR WSJ TERMINOLOGY IN LIBRARY AND FROM TOPICS.
  2. externally-built auxiliary file  NONE
     a. type of file (Treebank, WordNet, etc.)
     b. total amount of storage (megabytes)
     c. total number of concepts represented
     d. type of representation (frames, semantic nets, rules, etc.)

II. Query construction (please fill out a section for each query construction method used)

A. Automatically built queries (ad-hoc)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. total computer time to build query (cpu seconds)  5 (AVERAGE FOR EACH QUERY)
  3. which of the following were used?
     a. term weighting with weights based on terms in topics  YES + OTHER WEIGHTS
     b. phrase extraction from topics  NO
     c. syntactic parsing of topics  NO
     d. word sense disambiguation  NO
     e. proper noun identification algorithm  NO
     f. tokenizer (recognizes dates, phone numbers, common patterns)  NO
        (1) which patterns are tokenized?
     g. heuristic associations to add terms  NO
     h. expansion of queries using previously-constructed data structure (from part I)  YES
        (1) which structure?  WORD-PAIR PHRASE FILE
     i. automatic addition of Boolean connectors or proximity operators  NO
     j. other (describe)  NONE

B. Manually constructed queries (ad-hoc)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. average time to build query (minutes)  300 MINUTES FOR 25 QUERIES
  3. type of query builder
     a. domain expert  NO
     b. computer system expert  YES
  4. tools used to build query
     a. word frequency list  SOMETIMES
     b. knowledge base browser (knowledge base described in part I)  NO
        (1) which structure from part I
     c. other lexical tools (identify)  NO
  5. which of the following were used?
     a. term weighting  YES
     b. Boolean connectors (AND, OR, NOT)  YES
     c. proximity operators  NO
     d. addition of terms not included in topic  YES
        (1) source of terms  WORD-PAIR PHRASE FILE
     e. other (describe)

C. Feedback (ad-hoc)
  1. initial query built by method 1 or method 2?  METHOD 1
  2. type of person doing feedback
     a. domain expert  NO
     b. system expert  YES
  3. average time to do complete feedback
     a. cpu time (total cpu seconds for all iterations)  12 PER QUERY PER ITERATION - NO EXPANSION; 50 PER QUERY PER ITERATION - WITH EXPANSION.
     b. clock time from initial construction of query to completion of final query (minutes)  60 PER QUERY TO DO RELEVANCE JUDGMENT.
  3. average number of iterations  1
     a. average number of documents examined per iteration  10
  4. minimum number of iterations  1
  5. maximum number of iterations  1
  6. what determines the end of an iteration?  DEADLINE + LACK OF MANPOWER
  6. feedback methods used
     a. automatic term reweighting from relevant documents  YES
     b. automatic query expansion from relevant documents
        (1) all terms in relevant documents added  NO
        (2) only top X terms added (what is X)  TOP 20 MOST `ACTIVATED' TERMS THAT HAVE DOCUMENT FREQUENCY < 2000 WERE USED. BECAUSE MANY WERE ALREADY IN THE QUERY, ABOUT 12 ON AVERAGE WERE NEW AND ADDED PER QUERY.
        (3) user selected terms added  NO
     c. other automatic methods (brief description)  FEEDBACK IS BASED ON SUB-DOCUMENTS.
     d. manual methods  NO
        (1) using individual judgment with no set algorithm
        (2) following a given algorithm (brief description)

D. Automatically built queries (routing)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. total computer time to build query (cpu seconds)  5 (AVERAGE FOR EACH QUERY)
  3. which of the following were used in building the query?
     a. terms selected from
        (1) topic  YES
        (2) all training documents  NO
        (3) only documents with relevance judgments  NO
     b. term weighting
        (1) with weights based on terms in topics  YES
        (2) with weights based on terms in all training documents  YES
        (3) with weights based on terms from documents with relevance judgments  YES
     c. phrase extraction  NO
        (1) from topics
        (2) from all training documents
        (3) from documents with relevance judgments
     d. syntactic parsing  NO
        (1) of topics
        (2) of all training documents
        (3) of documents with relevance judgments
     e. word sense disambiguation  NO
        (1) using topic
        (2) using all training documents
        (3) using documents with relevance judgments
     f. proper noun identification algorithm  NO
        (1) from topics
        (2) from all training documents
        (3) from documents with relevance judgments
     g. tokenizer (recognizes dates, phone numbers, common patterns)  NO
        (1) which patterns are tokenized?
        (2) from topics
        (3) from all training documents
        (4) from documents with relevance judgments
     h. heuristic associations to add terms  NO
        (1) from topics
        (2) from all training documents
        (3) from documents with relevance judgments
     i. expansion of queries using previously-constructed data structure (from part I)  YES
        (1) which structure?  WORD-PAIR PHRASE FILE
     j. automatic addition of Boolean connectors or proximity operators  NO
        (1) using information from the topics
        (2) using information from all the training documents
        (3) using information from documents with relevance judgments
     k. other (brief description)  NO

E. Manually constructed queries (routing)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. average time to build query (minutes)  300 MINUTES FOR 25 QUERIES
  3. type of query builder
     a. domain expert  NO
     b. system expert  YES
  4. data used for building query
     a. from training topic  YES
     b. from all training documents  NO
     c. from documents with relevance judgments  NO
     d. from other sources (what?)  NO
  5. tools used to build query
     a. word frequency list  SOMETIMES
     b. knowledge base browser (knowledge base described in part I)
        (1) which structure from part I
     c. other lexical tools (identify)
     d. machine analysis of training documents
        (1) describe
  5. which of the following were used?
     a. term weighting  YES
     b. Boolean connectors (AND, OR, NOT)  YES
     c. proximity operators  NO
     d. addition of terms not included in topic  NO
        (1) source of terms
     e. other (brief description)  NO

III. Searching

A. Total computer time to search (cpu seconds)
  1. retrieval time (total cpu seconds between when a query enters the system and when a list of document numbers is obtained)
     18-30 PER QUERY, NO SOFT-BOOLEAN (COMBINING 2 METHODS).
     30-70 PER QUERY, WITH SOFT-BOOLEAN (COMBINING 3 METHODS).
  2. ranking time (total cpu seconds to sort document list)  4-12 PER QUERY.

B. Which methods best describe your machine searching methods
  1. vector space model  NO
  2. probabilistic model  YES
  3. cluster searching  NO
  4. ngram matching  NO
  5. Boolean matching  NO
  6. fuzzy logic (include your definition)  YES, SOFT-BOOLEAN
  7. free text scanning  NO
  8. neural networks  YES
  9. conceptual graph matching  NO
  10. other (describe)  NONE

C. What factors are included in your ranking?
  1. term frequency  YES
  2. inverse document frequency  NO
  3. other term weights (where do they come from?)  INVERSE COLLECTION TERM FREQUENCY, TOTAL WORD OCCURRENCES.
  4. semantic closeness (as in semantic net distance)  NO
  5. position in document  NO
  6. syntactic clues (state how)  NO
  7. proximity of terms  NO
  8. information theoretic weights  NO
  9. document length  YES
  10. completeness (what % of the query terms are present)  NO
  11. ngram frequency  NO
  12. word specificity (i.e., animal vs. dog vs. poodle)  NO
  13. word sense frequency  NO
  14. cluster distance  NO
  15. other (specify)  NONE

IV. What machine did you conduct the TREC experiments on?  SPARC-2GS
    How much RAM did it have?  48 MB
    What was the clock rate of the CPU?  40 MHz

V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?
     TIME WAS SPENT TO TRUNCATE RECORD SIZES TO SAVE SPACE AND FIT CERTAIN STRUCTURES IN MEMORY, AND TO REPLACE SOME LINKED LISTS WITH ARRAYS.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?
     YES, 50-100%. LOTS OF CODE WAS TRANSLATED FROM PASCAL TO C AND USED AS IS.
  3. What features is your system missing that it would benefit by if it had them?
     A DEDICATED SPARCSTATION WITH MORE RAM AND DISK SPACE.