Retrieval Experiments with a Large Collection using PIRCS

K.L. Kwok, L. Papadopoulos and Kathy Y.Y. Kwan
email: kklqc@cunyvm.bitnet
Computer Science Department, Queens College, CUNY, Flushing, NY 11367

ABSTRACT

Our strategy for Information Retrieval and for the TREC experiments is based on techniques that have previously been demonstrated to work for small to medium size collections: 1) use of document components for retrieval and term weighting; 2) two-word phrases to achieve better precision and recall; 3) combination of retrieval methods; and 4) a network implementation with learning capability to support feedback and query expansion. Evaluation shows that we return the best results for Category B in both ad hoc and routing retrievals, and our approach is comparable to the best methods used in the Category A experiments. It appears that techniques that work for small collections, such as combining soft-boolean retrieval with a probabilistic model, user relevance feedback, and feedback with query expansion, also work for this large collection.

1. Introduction

Over the past years, we have built an experimental system for automated information retrieval (IR) called PIRCS, an acronym for Probabilistic Indexing and Retrieval - Components - System. It provides storage and retrieval capability based on collection statistics, using content terms as index terms for document representation. Its design is based on a network and was conceived with flexibility in mind more than efficiency or performance, so that future and unforeseen approaches to IR may be supported. As it turns out, when the TREC experiments were announced, we found that our system fits nicely with the requirements. Certain changes (mostly because of large file sizes) and a number of peripheral programs were needed, but the basic design remains intact. Because available RAM and disk space are limited, we can only handle the Category B Wall Street Journal files (WSJ, 500 MByte raw data); however, we foresee no problem running the system on the Category A files (2 GByte raw data) if we have the appropriate hardware. In what follows we describe our strategies for IR and TREC in Section 2, the system description in Section 3 and the Appendix, results and discussion in Section 4, and conclusions in Section 5.

2. Strategies for IR and TREC Experiments

Over the past twenty-five or so years, a number of techniques have become known to work in IR for small to medium collections. These include: manually created thesauri and phrases to enhance recall and precision; term weighting that can account for importance of content and discrimination; user relevance feedback and feedback with query expansion; combining multiple retrieval methods; and using multiple document and query representations. Many other methods exist and have been experimented with (such as clustering, natural language processing, and various artificial intelligence techniques), but their effectiveness in general is still in question. Our strategy for IR and for the TREC experiments is based on the following choices, which reflect the previously workable techniques: 1) use of document components; 2) two-word phrases; 3) combination of retrieval methods; and 4) a network implementation with learning. We shall discuss how and why each of these methodologies may contribute to the effectiveness of our results.

2.1. Use of Document Components and Query Formulation

We view each source document not as monolithic, but as constituted of components.
Document components are utilized in two ways: for constructing a more restricted context for term weighting, retrieval and feedback, and for defining initial weights for the content terms used in representation. These are discussed in the following sub-sections.

2.1.1 Sub-Documents as Document Components

A survey of the WSJ files shows that document lengths vary substantially, from a couple of lines to hundreds of lines containing several thousand words. Moreover, many documents carry unrelated news stories, separated by three dashes `---` on an independent line. We believe that treating such documents as monolithic objects will have an adverse impact on: a) precision, because there is a high probability that homographs would occur in a sense and context different from what one intends; b) term weighting, because unreliable estimates of the necessary probabilities for the index terms might result and affect retrieval; c) feedback and query expansion, because documents that are long and mix unrelated topics make these processes imprecise; and d) system output effectiveness, because after retrieval, users still have to work through a large document to locate the relevant passage.

There may be some risks in breaking documents up. Boolean ANDs would not be satisfied if the two factors of an AND happen to be split into separate sub-documents. Coordinate matching would count fewer matching terms if one does not somehow combine the counts of the sub-documents for ranking purposes, assuming all matched terms are topically relevant. List queries with term weights may or may not suffer: even though a sub-document would have fewer term matches than a full document, shorter document lengths may lead to higher term weights depending on the weighting method used. Since all documents are treated in the same fashion, the sub-document weights will be affected to the same degree. If we allow for a substantial chunk of text, such as a few hundred words, a writer generally would have a chance to express what s/he intends fairly completely, either in this or in another sub-document. Since we use neither Boolean retrieval nor coordinate matching, we believe using sub-documents with more uniform lengths outweighs the risks. Our experimental results seem to support our conjecture.

Therefore the first processing we did was to break each document into component units of approximately equal lengths. We create a new component (which we call a sub-document, with two more digits attached at the end of the original document ID, assuming a breakup of at most 100 sub-documents per document) whenever we recognize the story separation mark `---`, or when we have a run of text exceeding N words and ending on a paragraph boundary. A sub-document should not be too short lest it carry little content. Moreover, since each sub-document is an independent entry consuming space and time resources, we do not want to exceed a certain limit, which we arbitrarily set at about double the original number of documents. After some experimentation, we found that a break at N=360 raw words satisfies our design goal. The original number of documents in WSJ is (first half plus second half) 98733+74486 = 173219; after breaking into components, we end up with 192935+156880 = 349815 sub-documents. This forms our database for subsequent processing. We would prefer to break documents based on more sophisticated strategies such as context, but we have not done so.
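As a concrete illustration of the breakup rule just described, the following is a minimal sketch that splits one raw document into sub-documents; it assumes the raw file has already been parsed into (document ID, text) pairs, and the function name and paragraph conventions shown are illustrative rather than the actual PIRCS code.

```python
import re

N = 360  # approximate sub-document length in raw words, as chosen above

def split_into_subdocuments(doc_id, text, max_words=N):
    """Split one WSJ document into sub-documents at `---` story separators, or
    when a run of text exceeds max_words and a paragraph boundary is reached.
    Sub-document IDs append a two-digit suffix to the original document ID."""
    # unrelated stories are separated by three dashes on a line of their own
    stories = re.split(r'(?m)^\s*---\s*$', text)
    pieces = []
    for story in stories:
        paragraphs = [p for p in re.split(r'\n\s*\n', story) if p.strip()]
        buf, count = [], 0
        for para in paragraphs:
            buf.append(para)
            count += len(para.split())
            if count >= max_words:          # break only on a paragraph boundary
                pieces.append('\n\n'.join(buf))
                buf, count = [], 0
        if buf:
            pieces.append('\n\n'.join(buf))
    # at most 100 sub-documents per document, hence the two-digit suffix
    return [(f"{doc_id}{i:02d}", piece) for i, piece in enumerate(pieces[:100])]
```

Under this scheme a hypothetical document ID `D123` whose text runs well past 360 words would yield sub-documents `D12300`, `D12301`, and so on.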
2.1.2 Query Formulation

In PIRCS without soft-boolean, queries and documents are items of the same category, each containing a list of content terms. For a query, we obtain the list from the <title>, <desc>, <narr> and <con> paragraphs of the topic in a fully automatic fashion. These paragraphs also go through a filtering program that removes standard introductory phrases such as: `To be relevant, a document (will | must | ..) (discuss | cite | report | ..)', etc. If a capital `NOT' is found in a sentence, the rest of the sentence including the `NOT' is also removed, because it is difficult for list queries to handle negation. We understand that this does not completely solve the negation problem, because some `not's are not capitalized and many negations are expressed by other means. The remaining words from the four paragraphs are then merged and processed against the collection dictionary to form a query representation. No breakup into sub-documents is done for topics.

We also manually form a boolean query for each topic for soft-boolean retrieval, thus providing both an alternative representation for queries and a different retrieval method. We essentially scan the same paragraphs as before. Sometimes we also consult the document frequency of a term to screen out high frequency terms, to arrive at a smaller expression. We might occasionally add to the boolean expression some new terms that are not in the original paragraphs. However, the way our evaluation program works, these new terms are ignored because they are not part of the automatically formed query.

2.1.3 Initial Term Weighting based on Single Terms as Conceptual Components

After the previous processes, we apply the use of document components a second time. We regard each content term within a sub-document or query as an independent concept. This allows us to use the principle of document self-recovery to give initial weights to each term of an item, or to use the simpler Inverse Collection Term Frequency (ICTF) weighting [1,2]. Because the former requires experimentally adjusting some parameters and we did not have sufficient relevance judgment information, we decided to use the simpler ICTF for the initial weighting of a term in an item (query or document), as follows:

    w_ik = ln[p/(1-p)] + ln[(1-s_ik)/s_ik].    (1)

w_ik is the weight given to term k in item i; p = 1/50, a constant chosen based on previous experience; and s_ik = (F_k - d_ik)/(N_W - L_i) if item i is a document, and s_ik = F_k/N_W if item i is a query. Here F_k = Σ_i d_ik is the collection frequency of term k, d_ik is the term frequency of term k in item i, L_i = Σ_k d_ik is the length of item i, and N_W = Σ_i L_i = Σ_k F_k is the total number of term occurrences in the database. The fraction (1-s_ik)/s_ik inside the logarithm in Eqn. 1 is approximately N_W/F_k if N_W >> F_k >> d_ik, hence the nomenclature ICTF.
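The ICTF weight of Eqn. 1 can be computed directly from the collection statistics defined above; the sketch below is illustrative only (the function name is ours, not part of PIRCS).

```python
import math

P = 1.0 / 50.0   # the constant p of Eqn. 1

def ictf_weight(d_ik, F_k, L_i, N_W, is_query):
    """Initial term weight w_ik of Eqn. 1.
    d_ik: frequency of term k in item i; F_k: collection frequency of term k;
    L_i: length of item i; N_W: total term occurrences in the collection."""
    if is_query:
        s_ik = F_k / N_W
    else:
        s_ik = (F_k - d_ik) / (N_W - L_i)
    return math.log(P / (1.0 - P)) + math.log((1.0 - s_ik) / s_ik)
```

Since (1-s_ik)/s_ik is close to N_W/F_k for rare terms, the second component behaves like an inverse collection term frequency, which is the property the weighting is named after.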
2.2. Two-word Phrases and Other Vocabulary Control

Document frequencies (of terms) of a few hundred are small compared with a few hundred thousand documents. Yet a few hundred high-ranked documents that are irrelevant would stretch a user's patience to great limits. We therefore believe that in large collection environments, precision enhancement tools are very important. Syntactic phrases, or statistically generated phrases within a tight context, would probably be useful as indexing terms. However, we do not yet have these tools for the experiments. A look at the collection also shows that WSJ jargon contains many two-word phrases that are combinations of very common words. Examples are `big three', `buy down', `drug money', `go public', `take over', etc. By themselves, many of the single terms would be screened out because they are either on the stopword list or have high document frequencies. In combination, however, these two-word phrases are precise in meaning and would probably impact favorably on both precision and recall. We therefore spent a fair amount of manual effort to record these content-specific combinations in a two-word phrase file. During processing, whenever such a two-word phrase occurs in adjacent positions in a sentence, a new index term is created consisting of the combination, in addition to the single terms. Such two-word combinations are also common in other fields. Our file has 396 such pairs, containing also some from the computing field, and not necessarily consisting only of function or common words. We would like to have a larger set, but we had neither the resources nor the domain expertise. Another vocabulary control tool is a stopword list with 595 entries, some of which are morphological variants of the same word. In addition, we use Porter's stemming algorithm [3] for suffix stripping. There are many other vocabulary control tools such as spell-checkers, proper noun and date identification, synonym lists, thesauri, semantic nets, etc. These could be added in the future.
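A minimal sketch of how the two-word phrase file can be applied during indexing follows; it assumes the matching is done on the raw token stream of a sentence before stopword removal (so that common words like `go' and `public' can still pair up), and the combined-term naming convention is illustrative.

```python
def add_phrase_terms(tokens, phrase_pairs):
    """Given the token stream of one sentence and a set of recorded two-word
    phrases, e.g. {('drug', 'money'), ('go', 'public')}, return the index terms:
    all single terms plus a combined term for each adjacent pair found in the
    phrase file."""
    terms = list(tokens)
    for a, b in zip(tokens, tokens[1:]):
        if (a, b) in phrase_pairs:
            terms.append(a + '_' + b)   # new index term for the combination
    return terms
```

The single terms are kept alongside the combined term, so a phrase match adds evidence rather than replacing the individual words.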
2.3. Multiple Retrieval Methods

It is well known that different retrieval algorithms on the same collection lead to different retrieval results, and that combining them can substantially enhance effectiveness [2,4,5,6]. Our network approach to IR (Section 2.4) can support multiple retrieval methods fairly easily. Three methods of retrieval are used in PIRCS for these experiments, and they agree nicely with the different types of TREC query requirements:

(a) Query-Focused Retrieval. This method addresses the question: given a query q_a, should document d_i be considered relevant to it? This point of view is appropriate for the ad hoc environment, where we have a set of static documents being processed against a query stream. Each query serves as a focus with respect to which documents are ranked. An approximate probabilistic measure W_ia of this answer for d_i is evaluated as follows and is used as the RSV (retrieval status value) to rank the whole collection with respect to q_a:

    W_ia = Σ_k w_ak * d_ik/L_i.    (2)

w_ak is the weight of term k in query q_a as given in Eqn. 1, and d_ik/L_i measures the proportion with which term k is used in d_i. The sum is over all terms k that overlap between d_i and q_a.

(b) Document-Focused Retrieval. This method addresses the reverse question: given document d_i, should query q_a be considered relevant to it? This point of view is appropriate for routing, where we have a set of static queries being processed against a document stream. Each document serves as a focus with respect to which queries can be ranked. However, evaluation of retrieval is usually done with respect to a specific query q_a, and so an approximate probabilistic measure V_ia of this answer is given to each document d_i and used as the RSV as follows:

    V_ia = Σ_k w_ik * d_ak/L_a.    (3)

(c) Soft-Boolean Retrieval (Query-Focused). This method depends on a boolean query being available for each topic, which we create manually for this experiment. The idea is that topics represented as a list of terms (as in (a) and (b) above) have no structure. Boolean queries have structure, but their logic is `hard' and we lose the ability to attach term importance or to rank the collection. The soft-boolean approach allows one to retain the query structure yet provide weights for both the terms and the boolean operators, so that we can soften the logic and provide ranking capability. We have followed the extended Boolean approach of the vector model [7] for this implementation, but with simplified weights. For example, a query q_a may be of the form:

    q_a = B_b AND_p X,  with  X = C_c OR_p D_d.    (4)

B, C, D are index terms with weights b, c, d respectively, AND and OR are the boolean operators, each given a fixed weight p=2, and all clause weights are set to 1. If document d_i also has these three terms with weights 0 <= b', c', d' <= 1, then the similarity between d_i and q_a can be evaluated recursively as:

    x' = Sim(d_i,X) = sqrt[ ((cc')^2 + (dd')^2) / (c^2 + d^2) ],
    Sim(d_i,q_a) = 1 - sqrt[ (b^2(1-b')^2 + 1*(1-x')^2) / (b^2 + 1) ].    (5)

All document and query term weights are taken from the edges of the net, so that the system is fully automatic once the boolean expression has been defined manually. Our retrieval results for automatic query construction are then based on combining methods (a) and (b), thus: W_ia^auto = (W_ia + V_ia)/2. Those for manual query construction are based on W_ia^man = r*W_ia^auto + s*Sim(d_i,q_a). Both make use of a combination of retrieval methods. The constants r and s are chosen as 0.65 and 0.35 respectively. The objective is that adding soft-boolean structure may enhance the retrieval results of the automatic method for the same queries. Our soft-boolean evaluation algorithm currently only accounts for terms that also appear in the network for this query; additional terms that may have been inserted manually are ignored.
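The three retrieval methods and the combination rules can be sketched as follows; this is an illustrative reimplementation of Eqns. 2-5, not the PIRCS code itself, and the data-structure choices (term-to-weight dictionaries, a nested tuple for the boolean tree) are ours.

```python
def query_focused_rsv(query_wt, doc_tf, doc_len):
    """Eqn. 2: W_ia = sum over shared terms k of w_ak * (d_ik / L_i)."""
    return sum(w_ak * doc_tf[k] / doc_len for k, w_ak in query_wt.items() if k in doc_tf)

def document_focused_rsv(doc_wt, query_tf, query_len):
    """Eqn. 3: V_ia = sum over shared terms k of w_ik * (d_ak / L_a)."""
    return sum(w_ik * query_tf[k] / query_len for k, w_ik in doc_wt.items() if k in query_tf)

def pnorm_sim(node, doc_wt, p=2.0):
    """Soft-boolean (extended Boolean) similarity with p = 2, as in Eqn. 5.
    node is ('term', k) or (op, [(child, operand_weight), ...]) with op in {'and', 'or'};
    doc_wt maps a term to its normalized document weight in [0, 1]."""
    if node[0] == 'term':
        return doc_wt.get(node[1], 0.0)
    op, operands = node
    denom = sum(a ** p for _, a in operands)
    if op == 'or':
        num = sum((a * pnorm_sim(child, doc_wt, p)) ** p for child, a in operands)
        return (num / denom) ** (1.0 / p)
    num = sum((a * (1.0 - pnorm_sim(child, doc_wt, p))) ** p for child, a in operands)
    return 1.0 - (num / denom) ** (1.0 / p)

def combined_rsv(W_ia, V_ia, sim=None, r=0.65, s=0.35):
    """W_ia^auto = (W_ia + V_ia)/2;  W_ia^man = r*W_ia^auto + s*Sim(d_i, q_a)."""
    auto = (W_ia + V_ia) / 2.0
    return auto if sim is None else r * auto + s * sim

# The example of Eqn. 4, q_a = B_b AND_2 (C_c OR_2 D_d) with clause weight 1, becomes:
# query = ('and', [(('term', 'B'), b), (('or', [(('term', 'C'), c), (('term', 'D'), d)]), 1.0)])
```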
2.4. Network Implementation with Learning

2.4.1 Network for Routing, Ad Hoc and Feedback without/with Query Expansion

The use of a network can provide a unified view of many retrieval algorithms and is a flexible tool for implementation. In PIRCS, retrieval methods (a) and (b) of the previous section are implemented as feed-forward and feed-backward processing in a Query-Term-Document (Q-T-D) network, as presented in [8,9]. A binary tree representing a boolean expression can also be hung onto the net for method (c). These are shown in Figs. 1 and 2.

[Fig. 1: 3-Layer PIR (Q-T-D) Network. Fig. 2: Soft-Boolean Query Network.]

The edges of the net are initialized as follows: w_ak (t_k acting on q_a) as in Eqn. 1, and w_ki (d_i acting on t_k) = d_ik/L_i as in Eqn. 2, and similarly for w_ka and w_ik. An activation of d_i = 1 gated through w_ki deposits on t_k an activation equal to the proportion with which the term is used in d_i. These activated terms spread to a target q_a gated through w_ak, which carries the odds that q_a is relevant given t_k. The sum of activations received at q_a implements the query-focused retrieval method (a) of Section 2.3, while processing in the reverse QTD direction implements the document-focused retrieval method (b). Moreover, edge weights from the net can also be used to initialize the tree leaf nodes for soft-boolean evaluation in the DTQ direction (Fig. 2).

Relevance feedback, which has been demonstrated in numerous experiments to be the most effective tool for improving retrieval, is modeled as training in our net. For example, the edge weight w_ak (or w_ik) adapts according to a learning algorithm based on the average activation x_k induced in term k by the relevant set, the current p factor (see Eqn. 1) on the edge, and a learning rate η_Q (or η_D), thus:

    DTQ: Δw_ak = Δp_ak / [p_ak^old (1 - p_ak^old)],   Δp_ak = η_Q (x_k - p_ak^old)
    QTD: Δw_ik = Δp_ik / [p_ik^old (1 - p_ik^old)],   Δp_ik = η_D (x_k - p_ik^old)

See [8,9] for details and Fig. 3 illustrating the DTQ process. Learning in both directions is allowed: DTQ query-focused training is done when we know the set of documents relevant to a query and corresponds to probabilistic retrieval [10,11], and QTD document-focused training is done when we know the set of queries relevant to a document and corresponds to probabilistic indexing [12]. Query-focused training lets query representations improve with experience and prepares them to match new similar documents better, and is normally associated with relevance feedback [13]. Document-focused training lets document representations improve with experience and prepares them to match new similar queries better, and is normally associated with dynamic document space modification [13]. They provide a precision enhancing tool because term weights are rendered `sharper' towards relevant items. Our network implementation differs from the traditional approaches in that two sets of weights are associated with an item, e.g. w_ik and w_ki. The w_ki = d_ik/L_i are the observed properties and are not modified, while w_ik embeds inference and adapts as more evidence is gathered. Moreover, the effects of training from both the query-focused and the document-focused processes are combined into one RSV, W_ia^auto, which has been shown to lead cooperatively to better ranking results. These ideas were first introduced in a series of papers [14,15,1,2]. With learning capability, the net becomes a two-layer direct-connect artificial neural network in each direction.

[Fig. 3: Query-Focused DTQ Learning. Fig. 4: DTQ Learning with Expansion.]

An additional feature allowed in our feedback learning is query expansion. The idea is that terms that are highly activated by the set of relevant documents should be highly related to the concepts and topics wanted for that query. These terms can therefore be added to the original query as illustrated in Fig. 4, and could be expected to enhance both recall and precision, since these terms come from relevant samples. Experiments with small collections have shown that this is indeed the case [16,8,9]. This tool resembles an automatic thesaurus, the associated terms being derived from user experience. From the network viewpoint, query expansion corresponds to growing new edges from a query to highly activated terms during relevance feedback, and weights are assigned according to the following learning rule, which we introduced in [8,9]:

    DTQ: w_ka = a*x_k;   p_ak = η_Q*x_k,   w_ak = ln[p_ak/(1-p_ak)] + ln[N_W/F_k].

In the TREC WSJ collection, we realize that documents can contain totally different stories and can also be exceptionally lengthy, while both feedback and feedback with query expansion require a restricted context to work. This is a major reason why we decided to break documents into sub-documents, as discussed in Section 2.1.1.
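A sketch of the feedback adaptation and the expansion-edge creation, as we read them from the update rules above, is given below; the function names and learning-rate arguments are ours, and this is an illustration under those assumptions rather than the exact PIRCS routine.

```python
import math

def adapt_edge(w_old, p_old, x_k, eta):
    """One feedback step for an adaptive edge: p moves toward the average activation
    x_k induced by the relevant set, and the weight changes by delta_p / [p_old(1-p_old)],
    consistent with the w = ln[p/(1-p)] + const form of Eqn. 1."""
    delta_p = eta * (x_k - p_old)
    delta_w = delta_p / (p_old * (1.0 - p_old))
    return w_old + delta_w, p_old + delta_p

def new_expansion_edge(x_k, F_k, N_W, eta_q):
    """Grow a new query-term edge for an expansion term: p_ak is seeded from the
    activation x_k, and the weight takes the Eqn. 1 form with an ICTF-like
    second component ln(N_W / F_k)."""
    p_ak = eta_q * x_k
    w_ak = math.log(p_ak / (1.0 - p_ak)) + math.log(N_W / F_k)
    return w_ak, p_ak
```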
2.4.2 Implementation of Retrieval in a Network

To satisfy TREC requirements, we need to simulate the different querying circumstances given in the following table:

    Query Construction      Query Type
                          ad hoc    routing
    ---------------------------------------
    automatic             PIRCS1    PIRCS1
    manual                PIRCS2    PIRCS2
    feedback              PIRCS3    n.a.
    fdbk + qry expansion  PIRCS4    n.a.

We have submitted results for all six types of experiments. The previous sections describe the rationale and how the RSVs are calculated in our approach. The following discusses the processing of routing, ad hoc and feedback queries:

(a) Routing Queries: a special requirement in TREC is to do routing retrieval. This simulates the classification of new test documents with respect to a set of static queries that have been trained with past documents. The (past) training documents are the first half of the TREC collection, c(A). We process c(A) and capture the necessary statistics of term usage. The queries from the first set of topics are then processed against c(A), and a network is created using ICTF weighting. Query-focused DTQ learning is now applied with the supplied relevant (but not irrelevant) documents, and the resultant edge weights on the Q-T side of the net are saved. Note that each relevant document is split into multiple sub-documents for this learning, and because we do not have relevance judgments at the sub-document level, we chose not to expand the queries. These become our routing queries q(A). The test documents from collection B, c(B), are then processed against c(A) as if they were queries, so that only collection A statistics are used for the ICTF term weighting. The q(A) are now loaded with c(B) and the dictionary from c(A) to form a new network, from which routing retrieval results based on W_ia^auto and W_ia^man (Section 2.3) are obtained, for the fully automatic and manual routing queries.

(b) Ad Hoc Queries: collection B is now re-processed as an addition to collection A, forming a total collection c(AB) and accumulating their total term usage statistics as well as a new dictionary. The ad hoc topics are processed to form ad hoc queries q(B) against the whole c(AB). We did not perform document-focused QTD learning of c(AB) based on known relevant queries in q(A). The ad hoc queries q(B), the total dictionary, and the total collection c(AB) then form a network using ICTF weighting, from which both the automatic and the manual ad hoc query retrieval results (W_ia^auto and W_ia^man) are obtained.

(c) Feedback Queries: To simulate user relevance feedback, we employ the ten highest ranked sub-documents of the automatic ad hoc results in (b), and determine their relevance to their respective topics manually. Two queries out of 25 have no relevant documents within the first ten, and they are removed, leaving 23. We have performed two types of feedback training: without and with query expansion. In both cases, query-focused and document-focused training were done. Our method of adding terms to existing queries during DTQ direction training is based on [8,9]. The n_a documents relevant to a query q_a are instantiated to 1, and their activation of each term node, after being gated through the weights w_ki = d_ik/L_i, is averaged. The highest 20 activated terms with document frequency less than 2000 are then used to expand q_a. However, not all of these terms are different from the existing terms of q_a, so that on average each query was expanded with about 12.3 terms (284 new terms for 23 queries). After this, training proceeds in the QTD direction. We have not used expansion for documents because usually there is only one relevant query per document. We have done only one round of feedback iteration.
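The expansion-term selection of (c) can be sketched as below; it is a minimal sketch assuming each relevant sub-document is available as a (term-frequency dictionary, length) pair, with illustrative names throughout.

```python
def expansion_terms(relevant_subdocs, query_terms, doc_freq, top_x=20, df_limit=2000):
    """Average each term's activation (d_ik / L_i) over the relevant sub-documents,
    keep the top_x most activated terms whose document frequency is below df_limit,
    and return only those not already present in the query."""
    activation = {}
    for tf, length in relevant_subdocs:
        for k, d_ik in tf.items():
            activation[k] = activation.get(k, 0.0) + d_ik / length
    n = float(len(relevant_subdocs))
    avg = {k: v / n for k, v in activation.items() if doc_freq.get(k, 0) < df_limit}
    ranked = sorted(avg, key=avg.get, reverse=True)[:top_x]
    return [k for k in ranked if k not in query_terms]
```

Filtering the already-present terms after taking the top 20 is what yields the roughly 12 genuinely new terms per query reported above.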
3. System Description

Our system was designed over a number of years for IR experimentation. Its primary goal is to be flexible, so that unforeseen developments or algorithms can be supported for testing without much change in the basic implementation. Many intermediate files are produced to provide more convenient hooks to and from the system. These will be trimmed in later versions. It was originally programmed in Pascal, but has been translated and is now running in a UNIX and C environment on a Sun SparcStation 2GS. The system has 48 MB RAM, a TREC-dedicated 1.7 GB disk drive, plus about 0.5 GB on another drive that we can occasionally use. Our system characteristics and timings are described in the Appendix of this paper.

4. Results and Discussions

Our runs are named PIRCS'n', where n=1 denotes fully automatic retrieval, n=2 denotes manual, i.e. automatic combined with soft-boolean where the boolean queries are manually created, n=3 denotes feedback using automatic retrieval, and n=4 is the same as n=3 but with query expansion. Although we use sub-documents for indexing and retrieval, our final 200-document output list has only unique document IDs; all sub-documents of the same document ID are suppressed except for the one with the highest rank. Precision-recall and other measures are gathered in the Appendix of this volume. Our runs are based on twenty-five queries for ad hoc and another twenty-five for routing retrievals, which we recognize as too few and perhaps not sufficiently reflective of the variety of query types in real life. On the other hand, the evaluation curves and average values serve as a lower bound on the power of our system, because relevance judgments are rendered only for the combined first 100 documents of all systems. This means documents ranked 101 to 200 in our retrieval are automatically considered irrelevant if they are not in the judged relevant set. Also, Category B texts come from a single source, viz. WSJ of multiple years, unlike Category A with texts from multiple sources. With these reservations we would like to make the following observations:

(a) Inter-site comparison shows that the PIRCS results are the best among Category B participants for both ad hoc and routing retrievals, and comparable to the best methods in Category A based on one evaluation with crnlB.

(b) Some of the topics have very specific requirements for documents to be relevant. For example: Topic #1 needs antitrust cases that result from a complaint, not routine review; #2 needs acquisitions between a U.S. company and another, non-U.S. company; #53 needs leveraged buyout cases valued at or above $200 million; while #60 requires a policy change from merit-pay to seniority or vice versa. Data like `above $200 million' or other numerics are removed either because they are on our stopword list or because of high frequency. Other topics involve very general concepts that require the system to understand their specific inferences. Examples are: course of action to decrease the U.S. deficit (#7); the body of water being polluted (#12); specific commercial applications of superconductors (#21); hypocritical and conflicting policies of the U.S. government (#74). The possible `course of action', `commercial applications', `conflicting policies', etc. are essentially open-ended. Yet others need synonym lists or other aids to interpret proper terms in order not to miss documents. Examples are: Japanese, U.S. or foreign companies (#2, 3), European Community or countries (#5, #69), third world or developing countries (#4, 6), or economic indicators (#8). When would a proper noun, if identifiable, represent a company? And if it does, is it a foreign company? Also, a few of the topics, like #3, 15, 53, 56 and 66, have short descriptions with many general words, so that after stemming and stop-word processing these queries end up with few terms. PIRCS does not have tools for these problems. Its precision values at 50% recall and at the 11-pt average for ad hoc and routing retrievals are tabulated below:

                      Ad Hoc              Routing
                  PIRCS1  PIRCS2      PIRCS1  PIRCS2
    50% Recall     .276    .278        .340    .342
    11-pt Avg      .311    .322        .343    .369

The ad hoc precision value of about 0.28 at 50% recall says that, averaged over 25 queries, if one wants to retrieve half of all relevant documents, one would have to read about eleven documents to get three relevant ones; for routing it is about nine documents to get three relevant ones. Routing queries receive some help from the few relevant documents provided for training purposes, just as in the experiments first reported in [8], where we simulate users posing queries equipped with some known relevant documents. The 11-pt average precision values sample eleven recall points and simulate a uniform distribution of users with different recall needs, and may reflect actual usage better. Effectiveness improves to between three in ten and three in nine retrieved documents being relevant for ad hoc, and slightly better for routing. These results naturally leave much room for improvement; but considering that PIRCS1 is fully automatic and relies only on statistical methods, the results seem reasonable for this large WSJ collection. See also the analysis in (c) and (f).

(c) Another evaluation of the system is to look at the precision-recall values at different cut-offs of 5, 15, 30, 100 and 200 retrieved documents. This may give users a better `feel' than the hypothetical 11-pt average. A question is what these values should be compared to. The theoretical limit is of course 1.0, for perfect recall and precision. However, this would punish the system unfairly. For example, at 15 retrieved documents, many queries have x relevant documents with x > 15. Hence the best recall at this cut-off should be 15/x. This we call the best operational recall, in contrast to the theoretical best of 1.0. Similarly, if x < 15, these queries would have a best operational precision at this cut-off of x/15, instead of 1.0. We have listed below for PIRCS1 the precision-recall values at various cut-offs, and also the best operational values for comparison:

    Cut-off:                     5      15     30     100    200
    Ad Hoc
      Best Oper. Recall:       .146    .298   .456   .828   .916
      Recall:                  .066    .130   .204   .419   .586
                                45%     44%    45%    51%    64%
      Best Oper. Precision:    .984    .944   .899   .689   .469
      Precision:               .624    .523   .480   .364   .281
                                63%     55%    53%    53%    60%
    Routing
      Best Oper. Recall:       .083    .241   .387   .777   .915
      Recall:                  .048    .145   .228   .443   .576
                                58%     60%    59%    57%    63%
      Best Oper. Precision:    1.00    .995   .945   .781   .551
      Precision:               .608    .597   .521   .398   .294
                                61%     60%    55%    51%    53%
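The best operational values used in the table can be computed per query as follows; this is a worked sketch (the reported figures are averages of these per-query values over the 25 topics, and the function name is ours).

```python
def best_operational(x_relevant, cutoff):
    """Best operational recall and precision at a cut-off for a query with
    x_relevant known relevant documents: recall can be at most cutoff/x_relevant,
    and precision at most x_relevant/cutoff, each capped at 1.0."""
    best_recall = min(1.0, cutoff / float(x_relevant))
    best_precision = min(1.0, x_relevant / float(cutoff))
    return best_recall, best_precision

# e.g. a query with 50 known relevant documents, at a cut-off of 15:
# best operational recall = 15/50 = 0.30, best operational precision = 1.0
```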
For ad hoc retrieval at the 30 retrieved document cut-off, for example, the table means that on average we get about 20% (.204) of all the relevant documents, but nearly one out of two (.480) retrieved documents is relevant. However, because of the number of known relevant documents for each of the queries, the best operational recall and precision at this cut-off are only .456 and .899. Our .204 recall and .480 precision achieve 45% and 53% of these best values. On the whole we have achieved between 44-64% of the best operational recall and 53-63% of the best operational precision values for ad hoc, and respectively between 57-63% and 51-61% for routing. These are quite respectable figures.

(d) Based on our strategy and experimental data, techniques that work for small collections work for this large WSJ collection as well. For example, using the 11-pt average precision values, PIRCS2 performs better than PIRCS1 in both the ad hoc (0.322 vs 0.311, +3.6%) and routing (0.369 vs 0.343, +7.6%) environments. Thus, adding soft-boolean as a third retrieval method helps, but requires forming the boolean queries manually. An illustration of the better results of PIRCS2 over PIRCS1 can be seen by comparing the pairs of methods query by query using the 11-pt average measure, where `equal' means values within 5%:

                    AD HOC              ROUTING
              PIRCS1 vs PIRCS2    PIRCS1 vs PIRCS2
    better:           5                   3
    equal:            9                   6
    worse:           11                  16

Both PIRCS3 and PIRCS4 make use of feedback based on evaluating the first ten retrieved sub-documents of PIRCS1, but PIRCS4 employs query expansion as well. These methods are automatic. Comparing PIRCS4 with PIRCS3 in ad hoc feedback retrieval shows that query expansion is better (0.305 vs 0.282, +8.1%) than no expansion. Note that in this feedback only 23 queries are used, because topics #66 and #73 have no relevant documents in the first ten retrieved. PIRCS3 and PIRCS4 have also been re-evaluated using the frozen rank method, viz. the ten documents used for feedback are given the same rank as in PIRCS1, while the other retrieved documents follow. In addition, the two queries without relevant documents in the first ten are put back carrying their PIRCS1 results. This way, PIRCS1, 3f and 4f can be directly compared. It can be seen that feedback by itself works (0.3407 vs 0.3107, +9.7%) and query expansion improves it further (0.3634 vs 0.3107, +17%). There are about 5.5 (126/23) relevant documents in the first ten retrieved.

(e) To illustrate the power of feedback, we have tabulated the number of queries that perform better, about equal, or worse between pairs of methods using the 11-pt average measure:

              PIRCS1 vs PIRCS3f    PIRCS3f vs PIRCS4f
    better:           1                    5
    equal:            5                    3
    worse:           19                   17

PIRCS3f overwhelmingly improves over PIRCS1, 19 to 1 with 5 ties. This is because PIRCS1 starts with low performance. PIRCS4f improves over PIRCS3f, but not by such a wide margin, because PIRCS3f already achieves better results. It can be seen that feedback, with or without query expansion, is definitely worthwhile within this collection. Looking at the recall and precision at the various retrieved document cut-offs in the Appendix of this volume, it can be seen that PIRCS4f performs better than PIRCS3f, except at the high precision cut-off of 5. This reflects that query expansion is more of a recall enhancement tool. This can also be seen from the number of relevant documents retrieved at the 200 document cut-off: 1403, 1526 and 1582 respectively for PIRCS1, 3f and 4f. In the high precision region, the added terms could lead to more noise. The effect however is small, and may be related to the number and type of added terms. Thus, query expansion behaves like an automatic thesaurus, the terms added being indirectly based on user relevance.
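The frozen-rank re-evaluation described in (d) can be sketched as follows; this is our reading of the procedure (the documents examined for feedback keep their PIRCS1 ranks, and the remaining slots are filled in order from the feedback run), not code from PIRCS.

```python
def frozen_rank_list(initial_ranking, feedback_docs, feedback_ranking, cutoff=200):
    """Merge a feedback run with the initial run: documents shown to the user for
    feedback keep their ranks from the initial ranking, and every other slot is
    filled, in order, by the feedback run's remaining documents."""
    rest = (d for d in feedback_ranking if d not in feedback_docs)
    merged = []
    for d in initial_ranking[:cutoff]:
        merged.append(d if d in feedback_docs else next(rest, None))
    return [d for d in merged if d is not None]
```

Queries for which no relevant documents were found in the first ten simply keep their PIRCS1 list unchanged, as stated above.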
(f) It is tempting to compare the results of PIRCS on this large WSJ collection with those on small collections. We have listed below the ad hoc 10-pt average precision of the four standard collections popular in IR research [2], together with that of the WSJ PIRCS1 results:

    MED            CRAN            CACM          WSJ           ISI
    (30q, 1.03Kd)  (225q, 1.40Kd)  (52q, 3.2Kd)  (25q, 350Kd)  (76q, 1.46Kd)
    0.472          0.374           0.297         0.263         0.174

MED (medicine) and CRAN (aerodynamics) use more specific, scientific vocabulary and can be expected to have better performance. CACM (computer science) terminology generally is less `scientific' and can often be general vocabulary; its performance and that of WSJ are similar. ISI (information science) has very broad, nonspecific queries and quite general vocabulary, and has the worst retrieval results. We were afraid WSJ would have low performance like ISI. Listed below is a comparison between the CACM [2] and WSJ PIRCS1 P-R curves:

    Recall:  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
    WSJ:     .78  .66  .55  .41  .31  .28  .19  .11  .07  .04  .01
    CACM:    .62  .50  .41  .36  .30  .24  .18  .15  .11  .09

A characteristic of WSJ is that precision falls off to much smaller values in the high recall region, compared with CACM or other small collections. One should expect a large collection to have more noise, and it is interesting that this noise impacts predominantly the low signal region. For example, WSJ has a generality ratio of only 0.00077 versus 0.00478 for CACM, i.e. more than six times as much noise in WSJ. Another reason for this phenomenon, perhaps to a lesser degree, is the incomplete evaluation procedure for documents ranked between 101 and 200, as discussed earlier. In the low recall region, however, the precision of WSJ is comparable to or better than the small collections. Why is that? Our interpretation is that, first, queries are much richer and better formed in WSJ compared to those in CACM. Second, when a collection is large, there is a very good chance that a number of relevant documents exist that use closely the same terms as the queries describing their content, especially if the queries are well worded. These documents will rank high, and hence precision at low recall does not suffer in spite of the adverse generality ratio. In the high recall region, relevant documents do not express their content in terms similar to the queries and have few term matches, and interference from poor generality-ratio noise magnifies. Techniques to improve precision at high recall would therefore be very important if one needs exhaustive search. It has been quite popular to criticise IR research on the grounds that small collection results do not reflect those of large collections. Our hope is that as experience is gained with more of these large scale collections, we might be able to predict their behavior based on small collection results. Small collection experiments can be performed in a matter of hours, while large collections take days using current technology. It should be noted that all the databases considered here are from one homogeneous field of knowledge. Many commercial database providers produce CDROMs that are homogeneous and of about the same size as WSJ, and these results may be relevant to them.

5. Conclusion

Our system called PIRCS, an acronym for Probabilistic Indexing and Retrieval - Components - System, has been shown to be able to support storage and retrieval for a large collection of 0.5 GB. If appropriate hardware is available, and with some software modification, it should support 2 GB size collections.
Our strategy for IR and TREC consists of: 1) the use of document components to provide a more restricted context for retrieval and feedback, and to provide initial ICTF term weighting; 2) two-word phrases, especially those consisting of stopwords and high-frequency stems, as a precision and recall enhancing tool; 3) a combination of retrieval methods, including soft-boolean, to capture cooperative effects between retrieval methods; and 4) a network implementation with learning to implement feedback and feedback with query expansion, resulting in a two-layer direct-connect artificial neural network with adaptive architecture. Our approach leads to results that are better than expected.

6. Acknowledgment

We would like to thank our department Chairman and the Dean of Mathematics and Natural Science for their support throughout the project. This work is partially supported by a grant from DARPA and a PSC-CUNY grant #6-63288.

References

1. Kwok, K.L (1989). A neural network for probabilistic information retrieval. Proc. ACM SIGIR 12th Ann. Intl. Conf. on R&D in IR. N.J. Belkin & C.J. van Rijsbergen, eds. ACM: NY, pp. 21-30.
2. Kwok, K.L (1990). Experiments with a component theory of probabilistic information retrieval based on single terms as document components. ACM TOIS 8:363-386.
3. Porter, M.F (1980). An algorithm for suffix stripping. Program 14:130-137.
4. Smith, M (1990). Aspects of the p-norm model of information retrieval: syntactic query generation, efficiency, and theoretical properties. TR 90-1128, Ph.D. Thesis, Cornell University.
5. Turtle, H.R & Croft, W.B (1991). Evaluation of an inference network-based retrieval model. ACM TOIS 9:187-222.
6. Fox, E.A; Nunn, G.L & Lee, W.C (1988). Coefficients for combining concept classes in a collection. Proc. ACM SIGIR 11th Ann. Intl. Conf. on R&D in IR. Y. Chiaramella, ed. PUG: Grenoble, pp. 291-307.
7. Salton, G; Fox, E.A & Wu, H (1983). Extended boolean information retrieval. Comm. ACM 26:1022-1036.
8. Kwok, K.L (1991). Query modification and expansion in a network with adaptive architecture. Proc. ACM SIGIR 14th Ann. Intl. Conf. on R&D in IR. A. Bookstein, Y. Chiaramella, G. Salton & V.V. Raghavan, eds. ACM: NY, pp. 192-201.
9. Kwok, K.L (199x). A network approach to probabilistic information retrieval. Submitted for publication.
10. Robertson, S.E & Sparck Jones, K (1976). Relevance weighting of search terms. J. ASIS 27:129-146.
11. van Rijsbergen, C.J (1979). Information Retrieval, 2nd Ed. Butterworths: London.
12. Maron, M.E & Kuhns, J.L (1960). On relevance, probabilistic indexing and information retrieval. J. ACM 7:216-244.
13. Salton, G (1989). Automatic Text Processing. Addison-Wesley: NY.
14. Kwok, K.L (1986). An interpretation of index term weighting schemes based on document components. Proc. 1986 ACM Conf. on R&D in IR. F. Rabitti, ed. ACM: NY, pp. 275-283.
15. Kwok, K.L & Kuan, W (1988). Experiments with document components for indexing and retrieval. Inform. Proc. Mgmnt. 24:405-417.
16. Salton, G & Buckley, C (1990). Improving retrieval performance by relevance feedback. J. ASIS 41:288-297.

Appendix: System Summary and Timing

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures
  1. Stopword List  YES
     a. how many words in list?  595
  2. is a controlled vocabulary used?  NO
  3. stemming  YES
     a. standard stemming algorithms  YES -- which ones?  PORTER'S ALGORITHM
     b. morphological analysis  NO
  4. term weighting  YES
  5. phrase discovery  NO
     a. what kind of phrase?
     b. using statistical methods
     c. using syntactic methods
  6. syntactic parsing  NO
  7. word sense disambiguation  NO
  8. heuristic associations  NO
     a. short definition of these associations
  9. spelling checking (with manual correction)  NO
  10. spelling correction  NO
  11. proper noun identification algorithm  NO
  12. tokenizer (recognizes dates, phone numbers, common patterns)  NO
      a. which patterns are tokenized?
  13. are the manually-indexed terms used?  NO
  14. other techniques used to build data structures (brief description)
      A TABLE OF 396 MANUALLY CREATED 2-WORD PHRASES. WHEN THESE ARE IDENTIFIED IN ADJACENT POSITIONS IN DOCUMENTS OR QUERIES THEY ARE USED AS ADDITIONAL INDEX TERMS.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
  1. inverted index  YES
     a. total amount of storage (megabytes)  372
     b. total computer time to build (approximate number of hours)  95+11+2 = 108 FOR 500 MB, CLOCK TIME.
     c. is the process completely automatic?  YES, IF SUFFICIENT DISK. NOT IN THIS EXPERIMENT.
        if not, approximately how many hours of manual labor?  0.5
     d. are term positions within documents stored?  NO, BUT SENTENCE POSITIONS YES.
     f. single terms only?  YES, EXCEPT FOR I.A.14.
  2. clusters  NO
     a. total amount of storage (megabytes)
     b. total computer time to build (approximate number of hours)
     c. brief description of clustering method
     d. is the process completely automatic?
        if not, approximately how many hours of manual labor?
  3. ngrams, suffix arrays, signature files  NO
     a. total amount of storage (megabytes)
     b. total computer time to build (approximate number of hours)
     c. brief description of methods used
     d. is the process completely automatic?
        if not, approximately how many hours of manual labor?
  4. knowledge bases  NO
     a. total amount of storage (megabytes)
     b. total number of concepts represented
     c. type of representation (frames, semantic nets, rules, etc.)
     d. total computer time to build (approximate number of hours)
     e. total manual time to build (approximate number of hours)
     f. use of manual labor
        (1) mostly manually built using special interface
        (2) mostly machine built with manual correction
        (3) initial core manually built to "bootstrap" for completely machine-built completion
        (4) other (describe)
     g. auxiliary files needed for machine use
        (1) machine-readable dictionary (which one?)
        (2) other (identify)
  5. special routing structures (what?)  SEE I.B.6: NETWORK NODE AND EDGE FILES. ROUTING USING NETWORK NODE AND EDGE FILES IS STRAIGHTFORWARD.
     a. total amount of storage (megabytes)  NODE FILE: 4x7.5, EDGE FILE: 4x4. NETWORK SEGMENTED INTO 4 BECAUSE OF INSUFFICIENT RAM.
     b. total computer time to build (approximate number of hours)  40+5+1+4x0.2 = 46.8, STARTING FROM TEXT FILE.
     c. is the process completely automatic?  YES, IF SUFFICIENT RAM AND DISK SPACE.
     d. brief description of methods used
        1. PROCESS (OLD) COLLECTION A.
        2. PROCESS QUERIES AGAINST COLLECTION A.
        3. PROCESS NEW COLLECTION B AS IF THE DOCUMENTS WERE QUERIES, TO MAKE USE OF COLLECTION A STATISTICS.
        4. COMBINE QUERIES, (OLD) DICTIONARY AND COLLECTION B INTO A NETWORK FOR RETRIEVAL.
  6. other data structures built from TREC text (what?)
     1. SUBDOCUMENT FILE          2. CODED FILE
     3. DOCID CHECKING FILE       4. TERMID CHECKING FILE
     5. DOCNUM FILE               6. TERMNUM (DICTIONARY) FILE
     7. DIRECT FILE               8. INDEX TO DIRECT FILE
     9. NODE FILE                 10. EDGE FILE
     THE SYSTEM WAS DEVELOPED FOR EXPERIMENTAL RESEARCH, WITH FLEXIBILITY TO GENERATE OTHER DATA. SOME OF THE FILES ARE NOT NECESSARY FOR RETRIEVAL.
     a. total amount of storage (megabytes)
        1. 481   2. 324   3. 7   4. 4   5. 11   6. 6   7. 372   8. 19   9. 4x14   10. 4x9
     b. total computer time to build (approximate number of hours)
        1. 1.5   2,3,4,5,6. 95   7,8. 11   9,10. 4x0.25 = 1
     c. is the process completely automatic?  YES, IF SUFFICIENT RAM AND DISK SPACE. FOR THIS EXPERIMENT, NO.
        if not, approximately how many hours of manual labor?  2
     d. brief description of methods used
        RAW TEXT --> SUBDOCUMENT FILE
        SUBDOCUMENT --> CODED FILE, DOCID FILE, TERMID FILE, DOCNUM FILE, TERMNUM (DICTIONARY) FILE. A ZIPF-LAW PROGRAM TRUNCATES THE DICTIONARY VIA USER-ASSIGNED LIMITS.
        CODED, TERMNUM --> DIRECT FILE WITH INDEX
        DIRECT --> INVERTED FILE
        DIRECT, INVERTED --> NODE, EDGE FILES.

C. Data built from sources other than the input text
  1. internally-built auxiliary files
     a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)  DOMAIN SPECIFIC
     b. type of file (thesaurus, knowledge base, lexicon, etc.)  WORD PAIR
     c. total amount of storage (megabytes)  0.005
     d. total number of concepts represented  396
     e. type of representation (frames, semantic nets, rules, etc.)
     f. total computer time to build (approximate number of hours)  0 (THIS IS A FILE CREATED VIA EDITOR)
        (1) if already built, how much time to modify for TREC?
     g. total manual time to build (approximate number of hours)  16
        (1) if already built, how much time to modify for TREC?
     h. use of manual labor
        (1) mostly manually built using special interface
        (2) mostly machine built with manual correction
        (3) initial core manually built to "bootstrap" for completely machine-built completion
        (4) other (describe)  SEARCH FOR WSJ TERMINOLOGY IN LIBRARY AND FROM TOPICS.
  2. externally-built auxiliary file  NONE
     a. type of file (Treebank, WordNet, etc.)
     b. total amount of storage (megabytes)
     c. total number of concepts represented
     d. type of representation (frames, semantic nets, rules, etc.)

II. Query construction (please fill out a section for each query construction method used)

A. Automatically built queries (ad-hoc)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. total computer time to build query (cpu seconds)  5 (AVERAGE FOR EACH QUERY)
  3. which of the following were used?
     a. term weighting with weights based on terms in topics  YES + OTHER WEIGHTS
     b. phrase extraction from topics  NO
     c. syntactic parsing of topics  NO
     d. word sense disambiguation  NO
     e. proper noun identification algorithm  NO
     f. tokenizer (recognizes dates, phone numbers, common patterns)  NO
        (1) which patterns are tokenized?
     g. heuristic associations to add terms  NO
     h. expansion of queries using previously-constructed data structure (from part I)  YES
        (1) which structure?  WORD-PAIR PHRASE FILE
     i. automatic addition of Boolean connectors or proximity operators  NO
     j. other (describe)  NONE

B. Manually constructed queries (ad-hoc)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. average time to build query (minutes)  300 MINUTES FOR 25 QUERIES
  3. type of query builder
     a. domain expert  NO
     b. computer system expert  YES
  4. tools used to build query
     a. word frequency list  SOMETIMES
     b. knowledge base browser (knowledge base described in part I)  NO
        (1) which structure from part I
     c. other lexical tools (identify)  NO
  5. which of the following were used?
     a. term weighting  YES
     b. Boolean connectors (AND, OR, NOT)  YES
     c. proximity operators  NO
     d. addition of terms not included in topic  YES
        (1) source of terms  WORD-PAIR PHRASE FILE
     e. other (describe)

C. Feedback (ad-hoc)
  1. initial query built by method 1 or method 2?  METHOD 1
  2. type of person doing feedback
     a. domain expert  NO
     b. system expert  YES
  3. average time to do complete feedback
     a. cpu time (total cpu seconds for all iterations)  12 PER QUERY PER ITERATION - NO EXPANSION; 50 PER QUERY PER ITERATION - WITH EXPANSION.
     b. clock time from initial construction of query to completion of final query (minutes)  60 PER QUERY TO DO RELEVANCE JUDGMENT.
  3. average number of iterations  1
     a. average number of documents examined per iteration  10
  4. minimum number of iterations  1
  5. maximum number of iterations  1
  6. what determines the end of an iteration?  DEADLINE + LACK OF MANPOWER
  6. feedback methods used
     a. automatic term reweighting from relevant documents  YES
     b. automatic query expansion from relevant documents
        (1) all terms in relevant documents added  NO
        (2) only top X terms added (what is X)  TOP 20 MOST `ACTIVATED' TERMS THAT HAVE DOCUMENT FREQUENCY < 2000 WERE USED. BECAUSE MANY WERE ALREADY IN THE QUERY, ABOUT 12 ON AVERAGE WERE NEW AND ADDED PER QUERY.
        (3) user selected terms added  NO
     c. other automatic methods (brief description)  FEEDBACK IS BASED ON SUB-DOCUMENTS.
     d. manual methods  NO
        (1) using individual judgment with no set algorithm
        (2) following a given algorithm (brief description)

D. Automatically built queries (routing)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. total computer time to build query (cpu seconds)  5 (AVERAGE FOR EACH QUERY)
  3. which of the following were used in building the query?
     a. terms selected from
        (1) topic  YES
        (2) all training documents  NO
        (3) only documents with relevance judgments  NO
     b. term weighting
        (1) with weights based on terms in topics  YES
        (2) with weights based on terms in all training documents  YES
        (3) with weights based on terms from documents with relevance judgments  YES
     c. phrase extraction  NO
        (1) from topics
        (2) from all training documents
        (3) from documents with relevance judgments
     d. syntactic parsing  NO
        (1) of topics
        (2) of all training documents
        (3) of documents with relevance judgments
     e. word sense disambiguation  NO
        (1) using topic
        (2) using all training documents
        (3) using documents with relevance judgments
     f. proper noun identification algorithm  NO
        (1) from topics
        (2) from all training documents
        (3) from documents with relevance judgments
     g. tokenizer (recognizes dates, phone numbers, common patterns)  NO
        (1) which patterns are tokenized?
        (2) from topics
        (3) from all training documents
        (4) from documents with relevance judgments
     h. heuristic associations to add terms  NO
        (1) from topics
        (2) from all training documents
        (3) from documents with relevance judgments
     i. expansion of queries using previously-constructed data structure (from part I)  YES
        (1) which structure?  WORD-PAIR PHRASE FILE
     j. automatic addition of Boolean connectors or proximity operators  NO
        (1) using information from the topics
        (2) using information from all the training documents
        (3) using information from documents with relevance judgments
     k. other (brief description)  NO

E. Manually constructed queries (routing)
  1. topic fields used  <TITLE>, <DESC>, <NARR>, <CON>
  2. average time to build query (minutes)  300 MINUTES FOR 25 QUERIES
  3. type of query builder
     a. domain expert  NO
     b. system expert  YES
  4. data used for building query
     a. from training topic  YES
     b. from all training documents  NO
     c. from documents with relevance judgments  NO
     d. from other sources (what?)  NO
  5. tools used to build query
     a. word frequency list  SOMETIMES
     b. knowledge base browser (knowledge base described in part I)
        (1) which structure from part I
     c. other lexical tools (identify)
     d. machine analysis of training documents
        (1) describe
  5. which of the following were used?
     a. term weighting  YES
     b. Boolean connectors (AND, OR, NOT)  YES
     c. proximity operators  NO
     d. addition of terms not included in topic  NO
        (1) source of terms
     e. other (brief description)  NO

III. Searching

A. Total computer time to search (cpu seconds)
  1. retrieval time (total cpu seconds between when a query enters the system and when a list of document numbers is obtained)
     18-30 PER QUERY, NO SOFT-BOOLEAN (COMBINING 2 METHODS).
     30-70 PER QUERY, WITH SOFT-BOOLEAN (COMBINING 3 METHODS).
  2. ranking time (total cpu seconds to sort document list)  4-12 PER QUERY.

B. Which methods best describe your machine searching methods
  1. vector space model  NO
  2. probabilistic model  YES
  3. cluster searching  NO
  4. ngram matching  NO
  5. Boolean matching  NO
  6. fuzzy logic (include your definition)  YES, SOFT-BOOLEAN
  7. free text scanning  NO
  8. neural networks  YES
  9. conceptual graph matching  NO
  10. other (describe)  NONE

C. What factors are included in your ranking?
  1. term frequency  YES
  2. inverse document frequency  NO
  3. other term weights (where do they come from?)  INVERSE COLLECTION TERM FREQUENCY, TOTAL WORD OCCURRENCES.
  4. semantic closeness (as in semantic net distance)  NO
  5. position in document  NO
  6. syntactic clues (state how)  NO
  7. proximity of terms  NO
  8. information theoretic weights  NO
  9. document length  YES
  10. completeness (what % of the query terms are present)  NO
  11. ngram frequency  NO
  12. word specificity (i.e., animal vs. dog vs. poodle)  NO
  13. word sense frequency  NO
  14. cluster distance  NO
  15. other (specify)  NONE

IV. What machine did you conduct the TREC experiments on?  SPARC-2GS
    How much RAM did it have?  48 MB
    What was the clock rate of the CPU?  40 MHz

V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?
     TIME WAS SPENT TO TRUNCATE RECORD SIZES TO SAVE SPACE AND FIT CERTAIN STRUCTURES IN MEMORY, AND TO REPLACE SOME LINKED LISTS WITH ARRAYS.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?
     YES, 50-100%. LOTS OF CODE WAS TRANSLATED FROM PASCAL TO C AND USED AS IS.
  3. What features is your system missing that it would benefit by if it had them?
     A DEDICATED SPARCSTATION WITH MORE RAM AND DISK SPACE.