TREC2 Document Retrieval Experiments using PIRCS

K.L. Kwok & L. Grunfeld
Computer Science Dept., Queens College, CUNY
Flushing, NY 11367. email: kklqc@cunyvm.cuny.edu

Abstract

We performed the full experiments using our network implementation of the component probabilistic indexing and retrieval model. Documents were enhanced with a list of semi-automatically generated two-word phrases, and queries with automatic Boolean expressions. An item self-learning procedure was used to initiate network edge weights for retrieval. The initial results we submitted were above median for ad hoc, and below median for routing. They were not up to expectation because of a bad choice of high-frequency cut-off for terms, and no query expansion for routing. Later experiments showed that our system does return very good results after correcting the earlier problems and adjusting some parameters. We also re-designed our system to handle virtually any number of large files in an incremental fashion, and to do retrieval and learning by initiating our network on demand, without first creating a full inverted file.

1. Introduction

In TREC1 our system, called PIRCS (acronym for Probabilistic Indexing and Retrieval - Components - System), took part as a Category B participant, handling only the 0.5 GB Wall Street Journal collection because both our software and hardware were not sufficient for the full set of text files. In TREC2 we participated in Category A. However, during a large portion of the time period we had to face fairly uncertain and sometimes difficult conditions. Plans to install a dedicated SPARC10 workstation with associated large memory and disk drives did not materialize until about three weeks before the deadline. Before this period, the SPARC2 workstation that we had been using was also shared with other users during the semester. Certain things that we wished to do were not done, and corners were cut to fit programs and data on the existing system. Much of our time was spent revamping our software to be more efficient in space and time utilization. Our focus remains as in TREC1, namely, to improve representations of documents and queries, to test different learning methods, and to combine different retrieval methods to improve the final ranked retrieval output. Section 2 summarizes our retrieval network; Section 3 discusses our improved system design; Section 4 is on item representation; Sections 5 & 6 are about our learning and retrieval procedures; Section 7 discusses the results we submitted and Section 8 contains results of our later experiments. Section 9 follows with the conclusion.

2. A Retrieval Network in PIRCS

Our retrieval process is based on a three-layer Q-T-D (Query-Term-Document) network, details of which are given in [KwPK93, Kwok9x]. Here we give a review. From Fig. 1, DTQ query-focused retrieval means spreading an initial activation of 1 (one) from a document d_i towards query q_a, gated by the intervening edges w_ik and w_ka. The resultant activation received at q_a is W_dq = sum_k w_ik * w_ka, and is the retrieval status value (RSV) of d_i with respect to q_a. When activation initiated at q_a spreads towards d_i, the activation received at d_i is W_qd = sum_k w_ak * w_ki, which is our QTD document-focused RSV for d_i. Combining the two additively, W = W_dq + W_qd, gives our basic retrieval ranking function. Edge weights w_ik and w_ak represent items (d_i or q_a) acting on terms t_k and reflect the usage of terms within items.
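To make the two retrieval directions concrete, the following is a minimal sketch (our illustration, not the PIRCS implementation) of the additive ranking function just described. The dictionary layout and all numeric values are invented for the example; only the combination W = W_dq + W_qd follows the text above.

```python
# Minimal sketch of the DTQ/QTD spreading-activation ranking described above.
# All weight values and the tiny example data are illustrative, not PIRCS output.

def rsv(doc_to_term, term_to_query, query_to_term, term_to_doc):
    """Combine query-focused (DTQ) and document-focused (QTD) activations."""
    # DTQ: activation 1 leaves the document, passes w_ik then w_ka.
    w_dq = sum(w_ik * term_to_query.get(t, 0.0) for t, w_ik in doc_to_term.items())
    # QTD: activation 1 leaves the query, passes w_ak then w_ki.
    w_qd = sum(w_ak * term_to_doc.get(t, 0.0) for t, w_ak in query_to_term.items())
    return w_dq + w_qd          # basic PIRCS ranking value W = W_dq + W_qd

# Toy example: one document and one query sharing some terms.
doc_to_term   = {"bank": 0.4, "loan": 0.3, "rate": 0.3}   # w_ik
term_to_doc   = {"bank": 1.2, "loan": 0.9, "rate": 0.7}   # w_ki
query_to_term = {"bank": 0.6, "rate": 0.4}                # w_ak
term_to_query = {"bank": 1.5, "rate": 1.0}                # w_ka

print(rsv(doc_to_term, term_to_query, query_to_term, term_to_doc))
```

In PIRCS itself these edges live in the in-memory network described in Section 3.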
Edge weights w_ka and w_ki (representing t_k acting on q_a or d_i) embed Bayesian inference and are initialized based on a component consideration of probabilistic indexing and retrieval. These weights can improve via a learning process when relevant documents are known for queries and vice versa: DTQ query-focused training when we know the set of documents and their components relevant to a query, and QTD document-focused training when we know the set of queries and their components relevant to a document. Query-focused training prepares queries to match new, similar documents better, while document-focused training helps documents to match new, similar queries better. With this learning capability, the net behaves as, and can be viewed as, a superposition of two 2-layer direct-connect artificial neural networks, one in each direction.

If a Boolean expression for query q_a is known, it can also be represented as a tree and hung onto the net as shown in Fig. 2. Edge weights from the net are used to initialize the tree leaf nodes, from which activation spreads to the query root node. Processing at the nodes implements soft-Boolean evaluation [SaFW83] in the DTQ direction, resulting in a third RSV, S. The RSVs are later combined for ranking retrieval outputs.

Fig. 1: 3-Layer PIR Network
Fig. 2: Soft-Boolean Query Network

3. System Design

Our previous software for TREC1 had extraneous processing that produced several intermediate files for other purposes. These consume a lot of disk space for large-scale collections. We spent a substantial amount of time revamping our system, resulting in a more streamlined flow as follows:

PreProcess --> Create       --> Initiate --> Network  --> Retrieve --> Evaluate
Documents      Direct Files     Network      Learning     and Rank
                                   ^             ^
PreProcess Queries ----------------+             |
Relevance File ----------------------------------+

In addition, we also took the opportunity to redesign our system, resulting in several innovative features described in the following subsections.

3.1 Full Inverted File Eliminated

It is useful to view a textbase as a document-by-term matrix. If one stores the matrix row-wise, we call it a direct file, in contrast to an inverted file which stores the matrix column-wise. The inverted file is used to support fast retrieval, while the direct file is useful for feedback learning and query expansion when given certain documents judged relevant to certain queries. In addition, the raw text file is useful for display purposes after a retrieval ranked list is produced. If one assumes that each of these three files is approximately equal in size at N bytes, we need a minimum of 3N bytes, which is quite substantial. Removing the raw text may not win user support during display unless the direct file encodes all stopwords, punctuation and paragraph structure of the original. Most systems drop the direct file and re-produce a subset of it from raw text when needed, resulting in a requirement of 2N bytes. We choose, however, to keep the direct file and produce our network with respect to the queries dynamically as needed, without first producing an inverted file.
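A minimal sketch of this idea, assuming a toy direct-file layout of our own (document ID mapped to a list of (term, count) pairs): instead of maintaining a full inverted file, postings are assembled on demand for just the terms the current queries need.

```python
from collections import defaultdict

# Illustrative direct file: document id -> list of (term, within-document count).
direct_file = {
    "WSJ870101-0001": [("bank", 3), ("loan", 1)],
    "AP880101-0002":  [("rate", 2), ("bank", 1)],
}

def build_partial_inverted(direct_file, wanted_terms):
    """Build inverted lists only for the terms the current queries use,
    so the 'network' can live in memory without a full inverted file."""
    inverted = defaultdict(list)          # term -> [(doc id, count), ...]
    for doc_id, postings in direct_file.items():
        for term, count in postings:
            if term in wanted_terms:
                inverted[term].append((doc_id, count))
    return inverted

# Terms of the queries under attention (what Section 3.2 calls the active term set).
print(build_partial_inverted(direct_file, {"bank", "rate"}))
```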
Our network actually contains both direct and `inverted' data. It resides in memory to support learning and retrieval. By this approach we achieve several objectives for our system: a) reduce the `dead' time between a collection being acquired and its availability for searching, since the full inverted file is not produced; b) satisfy the 2N bytes of space requirement; c) support fast feedback learning and query expansion by having the direct file available. The price we pay is having to create the network dynamically, unlike the full inverted file which is produced once. However, since the full inverted file is too large to reside in memory, reading parts of it in for retrieval is also time consuming.

3.2 Reduced Network Size

We produce our network in memory according to the queries under attention and the terms used in them. Since memory is limited, documents are divided into subcollections (Section 3.3) and queries are processed five to ten at a time. We define the active term set (ATS) of a network as the set of terms used by the current queries and their feedback documents, if any. The latter provide terms for query expansion. Only edges that connect items to the active term set are initiated in the network, reducing network space requirements substantially. With the current implementation, for a 1 GB subcollection and 7 queries the node and edge files together take about 40 MB. Using clock time as a measure, producing the network requires about 40 minutes; learning is fast, about 2 minutes; and retrieval and ranking take another 8 minutes. We hope that further improvements in design and faster hardware in the future can improve these figures substantially.

3.3 Master and Subcollection File Structure

We view the TREC experiments as document retrieval from multiple collections, with retrieval results reported in one single ranked list of documents for each topic (query). Although only four or five collections such as Wall Street Journal, Associated Press, etc. are given in TREC, in reality there could be many more. We consider three methods of file design to approach the problem:

a) A centralized approach where all documents from all collections are processed as if from a single document stream, producing a centralized dictionary containing full usage statistics and a giant direct file. From these, networks can be initialized. The idea is simple, but it has drawbacks in that eventually the direct file would exceed file/disk size limitations, and software has to be designed to handle data crossing file/disk boundaries. Moreover, it is inherently fragile to create single files of this size. The advantage of this approach is that the RSVs calculated are directly comparable for all documents and a single ranked list is produced without difficulty.

b) An independent collections approach where each source collection, of about 0.5-1.0 GB say, forms a textbase with its own local dictionary and direct file for network initiation and retrieval. One simply repeats the process for as many collections as necessary. This is the preferred approach, and if one has n processors and sufficient disk space, n separate textbases can be created for learning and retrieval in parallel, saving substantial time. The problem is how to combine the retrieval lists from each into a single ranked list, since each textbase has its own term usage statistics and calculates RSVs for ranking within its own environment. Classical Boolean retrieval and coordinate matching pose no problem.
Some retrieval strategies may produce RSVs that are comparable across collections in theory; but after approximations are taken, it is questionable whether this is still true. A similar problem exists for retrieval from distributed databases such as the WAIS environment.

c) For TREC2 we settled on a hybrid subcollections approach, treating each source as a subcollection within a master. We create a master centralized dictionary as in a), capturing full usage statistics and serving all the subcollections, and create separate direct files for each subcollection as in b). The central dictionary has about 620,000 unique terms after processing 2 GB from Disk1 and Disk2, and is relatively small. It captures global term usage statistics, while the individual direct files capture local usage statistics within items. Separate networks are then created for each subcollection with edge weights based on the correct global and local statistics as in a), assuring that retrieval lists contain RSVs that are directly comparable. This approach combines the advantages of both a) and b), and can also function in a parallel distributed environment.

4. Item Representation

As in TREC1, a number of preprocessing steps, mainly for the purpose of improving the representations of documents and queries, are done as follows.

4.1 Vocabulary Control

In addition to a manual stopword list of about 630 words and another 528 manually identified two-word phrases, we also process samples from Disk1 and Disk2 covering all five source types (WSJ, AP, FR, ZIFF and DOE, about 100 MB each) to produce two-word phrases based on adjacency within sentence context. Our objective is to remedy losses of recall and precision due to the removal of high-frequency terms. Our criteria for phrases are that each word pair must have a frequency of at least 40, and either one or both components must be high frequency (>= 10000). A casual scan of the resultant list led us to remove some obvious non-content phrases (such as `author describ', `paper attempt', `two way'). The result is 13,787 pairs. These, plus the previously prepared manual list (which consists mostly of phrases with at least one stopword, as well as some phrases identified in the `topics'), are then treated as if they were single index terms to be identified during document and query processing [BuSA93]. They have their own global and local usage statistics, and can improve individual collection retrieval effectiveness by a few percent to 10%.

After documents are processed, we invoke Zipf's law to remove low-frequency (<= 4) and high-frequency (>= 16000) words from being used for representation. Low-frequency terms lead to too many nodes, and high-frequency terms lead to too many edges, both consuming valuable memory space. Unfortunately, our high-frequency cut-off of 16000 was a bad choice. In fact, we used a high cut-off of 12000 for routing because at that earlier period we were short of resources. The effect is that our queries became too short and many useful terms (such as `platform' in query #80, `crimin' for criminal in query #87, etc.) were screened out. We discovered that this is a major factor in our disappointing results. Later experiments use a high cut-off of 50000.

4.2 Subdocuments

As in TREC1 we segment documents into subdocument units to deal with the problems of WSJ documents having multiple unrelated stories, and of long documents in general. A really long document is FR89119-0111, which has 400,748 words.
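Before moving on, here is a rough sketch of the phrase-selection criteria of Section 4.1 above; the function name and data layout are ours, and real processing would of course run over stemmed, stopped sentence streams from the source files.

```python
from collections import Counter

def candidate_phrases(sentences, word_freq, min_pair_freq=40, high_freq=10000):
    """Count adjacent word pairs within sentences and keep pairs that are
    frequent enough and contain at least one high-frequency word."""
    pair_freq = Counter()
    for words in sentences:                       # each sentence: list of stems
        for w1, w2 in zip(words, words[1:]):      # adjacency within the sentence
            pair_freq[(w1, w2)] += 1
    return [pair for pair, f in pair_freq.items()
            if f >= min_pair_freq
            and (word_freq.get(pair[0], 0) >= high_freq
                 or word_freq.get(pair[1], 0) >= high_freq)]

# Toy demonstration with artificially low thresholds.
sents = [["trade", "deficit", "figures"], ["trade", "deficit", "grows"]]
freqs = {"trade": 12000, "deficit": 800}
print(candidate_phrases(sents, freqs, min_pair_freq=2, high_freq=10000))
```

Pairs surviving the thresholds would then be scanned manually for non-content phrases before being added to the dictionary as single index terms.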
Our criterion is to break documents either on story boundaries or on the next paragraph boundary after a run of 360 words, for all collections. We have not found convenient story boundary indicators in collections other than WSJ (which uses three dashes `---' on a line). With this scheme, the total number of subdocuments from Disk 1&2 becomes 1,281,233 compared with an original 742,611.

After the TREC2 deadline, we had the resources to investigate the effects of subdocuments on retrieval. Experiments were performed on the individual subcollections WSJ1, FR2, FR1 and AP2, using segmentation sizes of 360, 720 and 1080 words. For WSJ1, we further break on story boundaries only. Results are tabulated in Table 1. It appears that for the abnormally long documents of FR, breaking into subdocuments is definitely worthwhile, achieving improvements of over 20% compared with no segmentation. However, for the newswire documents of WSJ1 and AP2, subdocuments have marginal performance effects: a little better for large chunk sizes, and a little worse for small chunks. It seems that a chunk size of about 720-1000 words would get the benefits of both types. Using different chunk sizes for different collections would probably not be worth the effort. We would like to point out that, other than effectiveness, considerations such as isolating relevant sections for output display and for more precise feedback judgment also make document segmentation worthwhile. In particular, a number of long documents used in feedback for query expansion could easily overload memory space in our network.

                   As Is    Break on 1080   Break on 720   Break on 360   "Stories"
WSJ1  Avg P        0.421    0.432           0.428          0.424          0.418
      # of docs    98733    127151          134819         149611         193881
FR1   Avg P        0.289    0.351           0.354          0.354
      # of docs    26207    50055           64650          108374
FR2   Avg P        0.333    0.372           0.420          0.421
      # of docs    20108    51607           86787          -
AP2   Avg P        0.423    0.423           0.404          0.414
      # of docs    84678    85616           95867          146354

Table 1: Document Segmentation Average Precision using 50 Queries Q2

4.3 Queries

Topics are preprocessed to remove introductory phrases and non-content terms based on word sequence patterns. We continue to use words from the title, description, narrative and concepts sections of the topics to form queries. Experiments without the narrative section show a slight decrease in performance for our system, in contrast to [Crof93]. We also try to produce Boolean expressions automatically as queries from the description and concepts sections. This is done by using punctuation to delineate phrases and ANDing the words within them. Phrases are then ORed. This is a very crude way of getting Boolean expressions for later soft-Boolean retrieval processing.

5. Learning Procedures

In our network of Fig. 1 the edge weights determine the retrieval outcome. w_ik and w_ak capture the proportion of term t_k in d_i or q_a; they are fixed and obtained from the manifestation of terms in the respective items. w_ka (and w_ki) has a log odds factor, and an inverse collection term frequency (ICTF) factor which is regarded as a constant for a term t_k, as follows:

    w_ka = ln [r_ak / (1 - r_ak)] + ICTF_k                    (1)

Here r_ak is the conditional probability that, given relevance to q_a, term t_k occurs, and it needs to be learnt. It is unknown unless one has a sample of documents relevant to q_a. This is not applicable to initial ad hoc retrievals, where a document collection is being processed against a new query with no known results. Relevance feedback information can remedy this later, but it is not available at the beginning.
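As a small illustration of Eq. (1), a term-to-query weight could be assembled as below once r_ak and ICTF_k are available; the numeric values are invented, and the exact definition of ICTF_k follows [Kwok90] rather than this sketch.

```python
import math

def term_query_weight(r_ak, ictf_k):
    """Eq. (1): log-odds of term occurrence given relevance, plus the
    collection-wide ICTF factor for the term."""
    return math.log(r_ak / (1.0 - r_ak)) + ictf_k

# Illustrative numbers only.  With no relevance information available, one
# may drop the log-odds factor and fall back to ICTF weighting alone.
print(term_query_weight(r_ak=0.2, ictf_k=3.5))
```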
One way is to ignore the log odds factor in Eq. (1), as done in our TREC1 experiments, resulting in ICTF weighting. A better way, which we use for TREC2, is to include item self-learning to determine r_ak and so initiate the term weights. This is shown in Fig. 3 and is based on the following argument. Consider a document d_i containing certain concepts and topics. Imagine its author wishing to query the textbase for the same topics as this document s/he has written; what query would be most suitable? Naturally the author's own words in the document can serve as the `query', and there is also known to be one relevant item for this `query' in the collection, viz. the document itself. In other words, every item is assumed to be self-relevant.

One relevant item is however not sufficient for estimating r_ak. Our method is to consider each document as constituted of many independent conceptual components, each described by a list of terms. We therefore work in a component universe rather than in the document collection. Components can be units such as sentences or phrases, but we have used single terms for simplicity. Right from the start then, even without any relevance feedback, we can divide the component universe into two parts: one set relevant, and the other non-relevant, to each `query'. Standard probabilistic retrieval theory now enables us to define this `query' optimally -- meaning that the defined `query' will rank its set of relevant components optimally with respect to the other components when used for retrieval. The definition of this `query' (i.e. its terms and weights) becomes the initial indexing representation of the document, and it is used in our ad hoc QTD retrieval. This is the principle of document self-recovery introduced in [Kwok90] and implemented as a self-learning process in a network [Kwok89, Kwok9x], shown in Fig. 4. One can argue that this relevant set of components from one document is too small. But that is all the information one has at this stage. Previous experiments [Kwok90] show that this kind of weighting can outperform ICTF weighting by a few percent. Moreover, our network self-learning parameters can be adjusted to provide a smooth transition from ICTF to full self-learn weights, or any value in between.

Fig. 3: Item Self-Learn Using Its Own Self-Relevant Components (documents are not monolithic, but constituted of components)
Fig. 4: Query Self-Learning

We invoke the same self-learning procedure for a query by adding the query to the collection as a `document' temporarily, resulting in self-learn initial weights for the indexing representation of the query. This weight is used in our ad hoc DTQ retrieval. The bottom line is that this component consideration enables the probabilistic model to self-bootstrap, allows term frequencies in items to be employed instead of `binary' occurrence, and takes account of query-focused and document-focused retrieval in a cooperative fashion.

In the case of routing retrieval, relevant documents are available for training each topic q_a. These documents are again considered as constituted of components, and are used in the network to estimate r_ak. The estimate should be better than self-learning because the sample size of relevant components is much larger.
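A minimal sketch, under the simplifying assumption that each relevant document is available as a bag of term counts, of the average activation a relevant set deposits on each term; this average is the quantity the update rule of Eq. (2) below moves r_ak toward. The normalization used here is our own illustration, not the exact PIRCS component weighting.

```python
from collections import defaultdict

def average_activation(relevant_docs):
    """relevant_docs: list of {term: count} dicts for the known relevant items.
    Returns, for each term, the mean over the relevant set of that term's
    share of a document's components."""
    totals = defaultdict(float)
    for doc in relevant_docs:
        doc_len = sum(doc.values()) or 1
        for term, count in doc.items():
            totals[term] += count / doc_len       # this document's activation on the term
    n = len(relevant_docs) or 1
    return {term: s / n for term, s in totals.items()}

# Two toy relevant documents for one routing topic.
print(average_activation([{"drug": 2, "cartel": 1}, {"drug": 1, "seizure": 3}]))
```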
Our learning algorithm updates the conditional probability r_ak (and r_ik) as follows (Fig. 5):

    delta r_ak = eta_Q * (x_k - r_ak^old)                    (2)

Here eta_Q is a learning rate for training on the query side, and x_k is the average activation deposited on term t_k by the given relevant set. If a term t_k not in the original query is highly activated by the relevant set, it can also be linked to q_a, with its edge weight derived from this average activation in the same manner (3). This implements query expansion as was also done in TREC1.

Fig. 5: DTQ Learning & Expansion

6. Retrieval Methodology

To satisfy TREC2 requirements, we submitted results named as follows: pircs1: routing, no training; pircs2: routing, with training from Disk1 relevants (Disk2 not used) and no query expansion; pircs3: ad hoc, no soft-Boolean added; pircs4: ad hoc, with soft-Boolean added. Routing allows training on the old Q2 topic set (topics 51-100) before doing retrieval on the new Disk3 documents. Disk3 term usage statistics are not used. Ad hoc retrieval involves using the new Q3 topic set (topics 101-150) to do retrieval on the old documents of Disk1 and Disk2. All our queries are automatically constructed. We did not perform feedback experiments using Q3.

For the baseline routing (pircs1) and ad hoc (pircs3) retrievals, we use the item self-learn (SL) edge weights. Routing pircs2 denotes retrieval based on further learning from known relevant samples and represents an improvement over pircs1. There can be hundreds of known relevants for each of the Q2 topics among the documents in Disk1 and Disk2, as given by the results of TREC1. One way of employing them is to do a retrieval (ranking) of Disk1 and Disk2 documents, and then make use of the first n (say n=100) relevant documents for feedback learning. However, we did not have enough resources to create a network and do retrieval (ranking) on 2 GB of documents at that time, and had to settle on a simplified strategy. First, we decided to use Disk1 only (1 GB) for training. Secondly, we believed a retrieval (ranking) of Disk1 was still expensive and perhaps not necessary; rather, we just select `nonbreak' relevants from Disk1 for training. `Nonbreak' means documents that do not get split into multiple subdocuments based on our criterion given in Section 4.2. The idea is that the quality of documents for training is important, and short relevants are the choice. They may not be those ranked early during a retrieval. With these simplifications, a network is produced with Disk1, the query-term edges are trained, and the result is stored for later routing retrieval using Disk3.

Ad hoc pircs4 denotes retrieval based on combining the baseline pircs3 with a soft-Boolean retrieval. The pircs4 ranking formula becomes r*W + s*S (see definitions in Section 2). Our Boolean expressions for queries are produced automatically as discussed in Section 4.3, and edge weights are used to initiate the leaf nodes of the Boolean expression tree.

7. Discussion of Submitted Results

From our routing retrieval table in the master appendix of this volume, it can be seen that pircs2 improves over pircs1 by about 7% based on average non-interpolated precision (.266 vs .249) and about 3.8% based on relevants retrieved (6135 vs 5913), showing that our simplified method of using only the Disk1 `nonbreak' training documents still works. We did not do a retrieval and rank.
Compared with the other sites, our result is below median both using the average non-interpolated precision for individual queries (18 better, 2 equal and 30 below median), and using the relevants retrieved at 100 documents (18 better, 8 equal and 24 below median). If we assume the existence of an overall `maxi-system' that produces the best non-interpolated precision values among all sites for all 50 queries, then its average precision over all queries is 0.5054 with 8348 relevants retrieved. Our pircs2 achieves only .266/.505 = 52.7% of that average precision, but 6135/8348 = 73.5% of the relevants retrieved.

From the ad hoc retrieval table in the appendix of this volume, it can be seen that pircs4, which is pircs3 combined with automatic soft-Boolean retrieval, improves over pircs3 by only about 1%. Processing time, however, increases substantially. Our automatic Boolean expressions are crudely formed; manual Boolean queries may do better. Compared with other sites, our result is above median both using the average non-interpolated precision for individual queries (34 better, 2 equal and 14 below median), and using the relevants retrieved at 100 documents (36 better, 4 equal and 10 below median). The `maxi-system' has an average precision over all queries of 0.4354 and 9027 relevants retrieved. pircs4 achieves about 0.298/0.435 = 68.5% of this best precision value and 7464/9027 = 82.7% of the relevants retrieved. These are much better than for routing. It would be most useful and interesting if one could choose the best reported result for each query before the answers are known. For these experiments our high-frequency term cut-off was 16000, which is still too low. The next section discusses our later results.

8. Further Experimental Results

After the TREC2 conference, we decided to repeat both experiments. We realized that our disappointing results were due to several factors: 1) a bad high-frequency term cut-off leading to insufficient representation; 2) no query expansion; 3) insufficient training samples; and 4) parameters needing tuning. Except for 4), these are remedied as follows: the high-frequency cut-off is set at 50000; learning for routing is done from both Disk1 and Disk2, using only documents that `break' into six or fewer subdocuments; and query expansion is also done. The runs are named in Table 2 as: pircs5: routing, with learning but no query expansion; pircs6: routing, query expansion level of 20; pircs7: routing, `upperbound', no expansion; pircs8: ad hoc without Boolean queries.

As in TREC1, our query expansion level of 20 actually adds fewer than 20 terms because some of the top-ranked terms may already appear in the query. It can be seen that the results are substantially better than those in Section 7. In particular, pircs6, routing with query expansion, has an average precision of 0.355 and retrieves 7476 relevants out of 10489. These are 12% and 5% respectively better than pircs5 (0.318, 7098), routing with learning but no query expansion, and achieve 70.3% and 89.6% of the maxi-system values. The corresponding average precision and relevants retrieved for the ad hoc retrieval pircs8 are 0.344 and 8279 out of 10785, representing 79% and 91.7% of the ad hoc maxi-system respectively. At 20 documents retrieved, the precision values for routing and ad hoc are respectively 0.583 and 0.564. This means that, averaging over 50 queries, over 11 of the first 20 documents retrieved are relevant. Considering the size of these textbases, these are quite good results.
These numbers are user-oriented, and users naturally hope to see 100% precision. As discussed in TREC1, from a system point of view the precision at n documents retrieved should not be compared to the theoretical value of 1.0, but to an operational maximum value x/n if the total number of relevants x for a query is less than n. For example, at n=100 documents retrieved, 20 routing and 16 ad hoc queries have total relevants x less than 100. The operational maximum precision averaged over 50 queries for routing is only 0.8, and that for ad hoc is 0.871. At 100 documents, the routing pircs6 value of 0.439 and the ad hoc pircs8 value of 0.468 therefore achieve 54.9% and 53.7% of the operational maximum respectively.

                                    Revised Routing                        New Ad Hoc
                          pircs5    pircs6     %      pircs7     %         pircs8
Total number of documents over all 50 queries
  Retrieved:              50000     50000             50000                50000
  Relevant:               10489     10489             10489                10785
  Rel_ret:                 7098      7476   + 5.3      7385   + 4.0         8279
Interpolated recall - precision averages:
  0.0                      .765      .814   + 6.4      .793   + 3.7         .823
  0.1                      .554      .628   +13.3      .594   + 7.2         .561
  0.2                      .491      .546   +11.2      .534   + 8.6         .505
  0.3                      .421      .469   +11.4      .469   +11.4         .456
  0.4                      .380      .413   + 8.7      .416   + 9.5         .417
  0.5                      .336      .363   + 8.0      .368   + 9.5         .368
  0.6                      .270      .310   +14.8      .314   +16.3         .320
  0.7                      .212      .240   +13.2      .241   +13.7         .246
  0.8                      .150      .165   +10.0      .171   +14.0         .160
  0.9                      .079      .099   +25.3      .098   +24.0         .087
  1.0                      .014      .006   -57.1      .015   + 7.1         .012
Average precision (non-interpolated) over all rel docs
                           .318      .355   +11.6      .350   +10.1         .344
Precision at:
  5 docs                   .600      .660   +10.0      .624   + 4.0         .612
  10 docs                  .582      .632   + 8.6      .598   + 2.7         .572
  15 docs                  .545      .617   +13.2      .572   + 5.0         .573
  20 docs                  .527      .583   +10.6      .563   + 6.8         .564
  30 docs                  .507      .553   + 9.1      .534   + 5.3         .540
  100 docs                 .402      .439   + 9.2      .427   + 6.2         .468
  200 docs                 .334      .360   + 7.8      .353   + 5.7         .396
  500 docs                 .222      .238   + 7.2      .232   + 4.5         .264
  1000 docs                .142      .150   + 5.6      .148   + 4.2         .166
R-precision
  Exact                    .358      .385   + 7.5      .385   + 7.5         .378

Table 2: Revised Routing and New Ad Hoc Retrieval Results

The R-precision Exact calculates the precision value at x documents retrieved, where x is the known number of relevants for each query, and can be compared with the theoretical value of 1.0.

The `upperbound' retrieval pircs7 (suggested by Sparck Jones) means performing learning from the known Disk3 (not Disk 1&2) relevant documents before retrieval. In other words, we assume the answer documents are known for training, and this represents the best that probabilistic training can provide using our system. This is however not the true upperbound [Spar79] for retrieval from Disk3, because the vocabulary and usage statistics are still those of Disk 1&2. The vocabulary is retained for comparison with the routing results. pircs7 retrieval achieves an average precision of 0.350, which improves over pircs5 (training from Disk 1&2) by about 10% in average precision and about 4% in relevants retrieved. Of course, in real life the answer documents are not known; but it is interesting to note that query expansion using Disk 1&2 documents can provide similar performance, showing the importance of query expansion.

We later concentrated on routing and discovered that additional gains can be achieved by fine tuning the parameters in our model. For learning: 1) we find that our original method of using only `nonbreak' documents in the given set of relevants actually outperforms other document selection strategies, including using all relevants, using `break six', or
ranking and selecting the n best. Moreover, these `nonbreak' documents total only 5225, fewer than 1/3 of the 16114 relevants used, and training is therefore very efficient. (There are actually 16400 relevants from Disk 1&2, but during processing a small percentage was lost.) 2) All edges and their weights on the query side of the network are determined by the activations deposited by the relevant documents; this means the original query plays no part in their determination. 3) Negative edge weights are set to a small positive weight of 0.1. For retrieval: 4) after ranking, several subdocuments of the same document ID may rank high, and we combine their largest three RSVs in the ratio 1 : 0.2 : 0.05 as the single reported RSV for the whole document. Previously we ignored the third, and the ratio for combining the largest two was different. We choose to stop at two or three subdocuments because noise from long documents may creep back in. Such tuning of parameters led to the results in Table 3 for our latest routing runs. We use the convention `Trained Exp K' to denote query expansion level K, with K=0 meaning weight adaptation without adding new terms. The `No Train.' column shows results without using any known relevants for training and serves as the basis for comparison.

             No Train.  Trained Exp 0   %     Trained Exp 20   %     Trained Exp 40   %     Trained Exp 80   %     Trained Exp 100   %
Total number of documents over all 50 queries
  Retrieved:   50000       50000              50000                  50000                  50000                  50000
  Relevant:    10489       10489              10489                  10489                  10489                  10489
  Rel_ret:      6551        6961    + 6.0      7496     +14.0         7646     +17.0         7695     +17.0         7712     +18.0
Interpolated recall - precision averages:
  0.00         0.7200      0.7480   + 4.0     0.8471    +18.0        0.8475    +18.0        0.8646    +20.0        0.8660    +20.0
  0.10         0.5124      0.5815   +13.0     0.6645    +30.0        0.6751    +32.0        0.6810    +33.0        0.6801    +33.0
  0.20         0.4431      0.5254   +19.0     0.5981    +35.0        0.6116    +38.0        0.6135    +38.0        0.6115    +38.0
  0.30         0.4016      0.4728   +18.0     0.5371    +34.0        0.5413    +35.0        0.5465    +36.0        0.5452    +36.0
  0.40         0.3486      0.4402   +26.0     0.4751    +36.0        0.4774    +37.0        0.4829    +39.0        0.4878    +40.0
  0.50         0.2970      0.3862   +30.0     0.4167    +40.0        0.4288    +44.0        0.4229    +42.0        0.4214    +42.0
  0.60         0.2382      0.3048   +28.0     0.3496    +47.0        0.3681    +55.0        0.3699    +55.0        0.3690    +55.0
  0.70         0.1945      0.2430   +25.0     0.2772    +43.0        0.2880    +48.0        0.2815    +45.0        0.2843    +46.0
  0.80         0.1284      0.1865   +45.0     0.1911    +49.0        0.1937    +51.0        0.1999    +56.0        0.2005    +56.0
  0.90         0.0740      0.0860   +16.0     0.1130    +53.0        0.1144    +55.0        0.1219    +65.0        0.1238    +67.0
  1.00         0.0119      0.0187   +57.0     0.0140    +18.0        0.0107    -10.0        0.0114    - 4.0        0.0171    +44.0
Average precision (non-interpolated) over all rel docs
               0.2905      0.3517   +21.0     0.3962    +36.0        0.4050    +39.0        0.4084    +41.0        0.4095    +41.0
Precision at:
  5 docs       0.5600      0.5760   + 3.0     0.6960    +24.0        0.7160    +28.0        0.7320    +31.0        0.7280    +30.0
  10 docs      0.5440      0.5820   + 7.0     0.6880    +26.0        0.6860    +26.0        0.6980    +28.0        0.7000    +29.0
  15 docs      0.5173      0.5627   + 9.0     0.6573    +27.0        0.6707    +30.0        0.6800    +31.0        0.6813    +32.0
  20 docs      0.4910      0.5510   +12.0     0.6470    +32.0        0.6540    +33.0        0.6630    +35.0        0.6610    +35.0
  30 docs      0.4653      0.5313   +14.0     0.6147    +32.0        0.6173    +33.0        0.6240    +34.0        0.6267    +35.0
  100 docs     0.3698      0.4396   +19.0     0.4824    +30.0        0.4930    +33.0        0.4974    +35.0        0.5002    +35.0
  200 docs     0.3049      0.3562   +17.0     0.3887    +27.0        0.3945    +29.0        0.4002    +31.0        0.4004    +31.0
  500 docs     0.2038      0.2241   +10.0     0.2452    +20.0        0.2490    +22.0        0.2500    +23.0        0.2498    +23.0
  1000 docs    0.1310      0.1392   + 6.0     0.1499    +14.0        0.1529    +17.0        0.1539    +17.0        0.1542    +18.0
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact        0.3346      0.3942   +18.0     0.4251    +27.0        0.4283    +28.0        0.4281    +28.0        0.4291    +28.0

Table 3: New Routing Results at Several Query Expansion Levels
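As a side note on point 4) above, the subdocument merging step can be sketched as follows; the subdocument naming convention and the grouping code are our own illustration, while the 1 : 0.2 : 0.05 ratio is the one quoted in the text.

```python
from collections import defaultdict

def merge_subdoc_rsvs(subdoc_scores, weights=(1.0, 0.2, 0.05)):
    """subdoc_scores: list of (subdoc id, rsv); the parent document id is taken
    as the part before the final '.' (an illustrative naming convention)."""
    by_doc = defaultdict(list)
    for subdoc_id, rsv in subdoc_scores:
        parent = subdoc_id.rsplit(".", 1)[0]
        by_doc[parent].append(rsv)
    merged = {}
    for doc_id, rsvs in by_doc.items():
        top = sorted(rsvs, reverse=True)[:len(weights)]   # largest three RSVs
        merged[doc_id] = sum(w * r for w, r in zip(weights, top))
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

scores = [("FR890119.1", 4.0), ("FR890119.2", 3.0), ("FR890119.3", 2.5),
          ("WSJ870101.1", 3.8)]
print(merge_subdoc_rsvs(scores))
```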
It can be seen, using the measure of average precision over all recall points, that training without query expansion improves over no training by 21%, and training with expansion at level 40 improves over the basis by about 39%. This measure, as well as the R-precision, seems to level off from expansion level 40 onwards. However, the number of relevants retrieved improves from 6551 (no training) to 7712 (expansion level 100) in a monotone fashion. Higher query expansion levels appear to improve the high-recall region of the precision-recall curve without materially affecting the low-recall region, as observed in [Kwok9x] using the WSJ collection only. Precision values at the different cutoffs of documents retrieved seem to level off at expansion level 80. At the 20 retrieved documents cutoff, we now achieve a precision over 0.65, meaning that more than 13 of the 20 documents retrieved are relevant on average. This tuning of parameters gives us over 10% additional improvement above that obtained in the revised routing results of Table 2. It appears that a query expansion level of 40 achieves a compromise between good effectiveness and good efficiency for our system. We did not do massive query expansion at high levels of 200 or more. However, the results are comparable to the best of those reported at the TREC2 conference.

9. Conclusion

We have upgraded our PIRCS system to use dynamic network creation for learning and retrieval, and to handle files in a master-subcollection design. The former approach allows us to eliminate full inverted file creation, resulting in a 2 x collection size space requirement, reduced `dead' time before a collection becomes searchable, and fast learning. The latter approach renders our system sufficiently flexible to handle a large number of files in a robust fashion, yet produce a retrieval ranked list as if all documents were in one file. Although our submitted results for TREC2 were not up to expectation because of insufficient resources at the time of the experiments, the reasons for the behavior of our system were isolated. New experiments show that PIRCS can provide highly competitive retrieval effectiveness in both ad hoc and routing environments.

Acknowledgment

Ms. Lu Chi-Ni provided the program for our two-word phrase detection. We would like to acknowledge the continual support of the Department Chairman and the Dean of Mathematics and Natural Science at Queens College throughout the project. This work is partially supported by a grant from ARPA and a PSC-CUNY grant #663288.

References

[BuSA93] Buckley, C., Salton, G. & Allan, J. (1993). Automatic retrieval with locality information using SMART. In: The First Text REtrieval Conference (TREC-1). Harman, D.K. (Ed.). NIST Special Publication 500-207. pp. 59-72.

[Crof93] Croft, W.B. (1993). The University of Massachusetts Tipster project. In: The First Text REtrieval Conference (TREC-1). Harman, D.K. (Ed.). NIST Special Publication 500-207. pp. 101-105.

[KwPK93] Kwok, K.L., Papadopoulos, L. & Kwan, Y.Y. (1993). Retrieval experiments with a large collection using PIRCS. In: The First Text REtrieval Conference (TREC-1). Harman, D.K. (Ed.). NIST Special Publication 500-207. pp. 153-172.

[Kwok90] Kwok, K.L. (1990). Experiments with a component theory of probabilistic information retrieval based on single terms as document components. ACM TOIS 8:363-386.

[Kwok9x] Kwok, K.L. (199x). A network approach to probabilistic information retrieval. Accepted for publication in ACM TOIS.

[SaFW83] Salton, G., Fox, E.A. & Wu, H. (1983).
Extended boolean information retrieval. Comm. ACM 26:1022-1036.

[RoSp76] Robertson, S.E. & Sparck Jones, K. (1976). Relevance weighting of search terms. J. ASIS 27:129-146.

[Spar79] Sparck Jones, K. (1979). Experiments in relevance weighting of search terms. Info. Proc. Mgmt. 15:133-144.