Multilevel Ranking in Large Text Collections Using FAIRS

S-C. Chang, H. Dediu, H. Azzam, M-W. Du
GTE Laboratories

Abstract

A description of a general-purpose multilevel ranking information retrieval prototype is presented. The methods used in weighting and ranking the retrieved documents are discussed. Experiments with the TREC92 collection of text and queries have been conducted without manual pre-processing. Initial results have shown the multilevel ranking scheme to be highly competitive in precision and recall relative to other ranking strategies.

1.0 Introduction

Information retrieval research at GTE Laboratories has led to the development of a prototype system called FAIRS (Friendly Adaptable Information Retrieval System). FAIRS has evolved into a functional and flexible system currently running on SunOS, HP-UX, Ultrix, AIX, VMS and PC platforms. It has been used in environments as diverse as literature searching, library operations, research and development, customer support, market analysis and management. FAIRS is being further tested as a retrieval engine for very large collections of text, such as those presented by the TREC92 collection and wide-area distributed collections.

FAIRS is designed to minimize user effort in the preparation of text and the learning of query syntax, while providing a user-modifiable multilevel ranking scheme (patent pending). Experiments with the TREC92 collections of text and queries have been conducted with no human intervention in the processing of either text or queries. The results of experiments against the collection of Wall Street Journal articles are listed in Section 3.4.

2.0 System Overview

FAIRS uses pure text as its information base, while allowing flexible links into non-textual information. FAIRS extracts information out of an unstructured, amorphous collection of data in four main steps:

1. First, it partitions (logically) a raw text file or a collection of such files into retrievable record units. This simple record partitioning is necessary and sufficient for indexing to begin. The goal is to use source information as-is [1].

2. Second, FAIRS automatically constructs an index. A feature exists where deletions are permitted from an index. Statistics on the collections can also be generated at this time. Such statistics are used in normalizing the weighting of retrieved documents.

3. The third step involves the user queries. Queries are accepted and interpreted in an intelligent and sensitive manner [1,2]. A flexible approach to the understanding of the query is essential to providing good responses.

4. Finally, once the query is processed, the relevant hits are retrieved quickly and displayed in an ordered list ranked according to a relevance measure. The records corresponding to the hits can then be viewed, printed or mailed in their entirety on demand.
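As a concrete (and much simplified) illustration of steps 1 and 2, the following Python sketch partitions a raw file into records and builds a small inverted index together with the per-word record counts later usable for weighting. It is not the FAIRS implementation; the marker choice, function names and index layout are assumptions made only for this example.

```python
# Illustrative sketch of steps 1 and 2 (not the FAIRS implementation):
# logically partition a raw ASCII file into records on a user-chosen
# end-of-record marker, then build a simple inverted index plus the
# collection statistic (per-word record count) usable for weighting.
from collections import defaultdict

def partition(path, end_of_record="\n\n"):
    """Split a raw text file into retrievable records, using the file as-is."""
    with open(path, encoding="ascii", errors="replace") as f:
        text = f.read()
    return [rec for rec in text.split(end_of_record) if rec.strip()]

def build_index(records):
    """Map word -> {record id -> occurrence count}; also report word popularity."""
    index = defaultdict(dict)
    for rec_id, rec in enumerate(records):
        for word in rec.lower().split():
            index[word][rec_id] = index[word].get(rec_id, 0) + 1
    # popularity = number of records containing the word
    popularity = {word: len(postings) for word, postings in index.items()}
    return index, popularity
```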
2.1 Characteristics

In FAIRS, both the responses to information requests and the way relevance is determined can be customized by the user. FAIRS can provide a tool for decision making by presenting to the user all the relevant facts in an elegant and timely fashion. There are some interesting and novel features of FAIRS and research issues associated with each of these strategies. They will be discussed in sequence.

2.1.1 Using Files As-Is

Raw textual information is broken into records (i.e., the units that will subsequently be retrieved). Consistent with the goal of using the information as-is (i.e., eliminating file conversion), FAIRS does not require the input files to be in any particular format [1]. Any ASCII file can be indexed and searched. However, if a file does have a logical field structure that the user wants to use in searching (e.g., to restrict a search to text in the NAME field of records), then this structure must be described to FAIRS.

If the information files do not have "record" or "field" structures, then an implicit method must be used to partition these files into records. To use the files as-is, FAIRS allows the user to impose a record structure. Any character string(s) can be designated as end-of-record marker(s). Such a page/record structure has been used successfully, and a number of documents have been quickly transformed into textbases that could be processed by FAIRS. Alternatively, a fixed line count, such as 66 or 78 lines, can be used as an end-of-record marker. Furthermore, each file may be declared as an individual record if multiple files are present. Records can be further divided into fields by using any character string(s) as field markers.

For retrieval, FAIRS requires the information base being searched to be partitioned into indexed, retrievable records. Either the record/field structure in the original text file can be used or, in the absence of such structure, it can be imposed. In either event, a description of the record/field format will control how FAIRS breaks the raw text file into records. Note that a file format description is separate from the described input file, which can be used as-is (without conversion).

These methods of record definition are the result of practical experience with text collections generated in typical business environments. A flexible input text structure or pre-processor is essential to the effectiveness of a general-purpose information retrieval system.

2.1.2 Stop Lists

FAIRS constructs an index to support the full-text retrieval of records. A user-defined stop list can be used to limit the words indexed. For the TREC92 experiments the stop word list consisted of 280 words. It was customized with the addition of several common embedded SGML strings and spurious field definitions such as: docno, doc, text, journal, title, summary, descriptor, author, fileid, file_id, &bullet, &para, &degrees, etc.

2.1.3 Stemming

Words are stemmed at index time and at query parsing time. The stemming rules used are those published by Paice [4]. The rules are incomplete but are claimed to be satisfactory. Words shorter than 4 characters are not stemmed. Conflation is used as an alternative to the inadequacies of truncation. Paice's scheme does not return "correct" roots, but does ensure that members of a family reduce to the same root, and members of different families reduce to different roots.

2.1.4 Query Formation

FAIRS handles free association queries for information [1,2]. Free association is the essence of being "user friendly" in information retrieval, rather than graphical user interface features such as windows, icons, mice, and pulldown menus. Free association means that there is no set of keywords that must be known and used exactly. Queries do not have to be phrased as Boolean expressions with ANDs and ORs. Words, phrases, and even word stems thought to be relevant are listed in a free association fashion. A FAIRS query has the simple form:

    term1[:w1], term2[:w2], ..., termn[:wn]

where termi can be a single word or a phrase, wi is a user-specified weight for that term, and square brackets denote an optional weight. There will be a hit for each record in the information base that contains at least one word from the query. Commas are the only delimiters used to delineate phrases, so that the relative adjacency relation among search words (i.e., the distance between two words that appear as a query term, such as "scenic view" or "software engineering") can be considered automatically. The weight wi is used at ranking time; if omitted, it is set by default to 1.
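To make the syntax concrete, the following minimal Python sketch parses such a query into weighted terms. It is an illustration only, not the FAIRS parser; the function name and the treatment of malformed weights are assumptions.

```python
# Illustrative parser for the query form term1[:w1], term2[:w2], ...
# Commas delimit terms/phrases; an omitted weight defaults to 1.
def parse_query(query):
    terms = []
    for chunk in query.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        term, sep, weight = chunk.rpartition(":")
        if sep and weight.strip().replace(".", "", 1).isdigit():
            terms.append((term.strip(), float(weight)))
        else:
            terms.append((chunk, 1.0))   # default weight when none is given
    return terms

# parse_query("scenic view, breakfast:4, software engineering")
# -> [('scenic view', 1.0), ('breakfast', 4.0), ('software engineering', 1.0)]
```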
2.1.5 Previewing Records

The ranking results are shown to the user via several interactive and selectable preview features. The preview features show some aspects of the top-ranked records. The aspects include the first (or all) lines containing query words, coverage and frequency data, or simply the first few lines of highly ranked records. This is designed to give users a better feel for how relevant the retrieved records really are, and thus for how well the ranking rules work and how they should be customized. After examining a preview or the full text of the highly ranked records, the user can then revise the query or even change the ranking strategy.

2.1.6 Synonyms

A synonym definition capability (glossary) is also available within FAIRS. Given a word, FAIRS will retrieve instances of that word, as well as instances of its synonyms. The option can be invoked to broaden a query. Users can build their own vocabulary or invoke and modify the system-wide synonym dictionary. This can be used to ease the problem of different word usages by different people. A very fast elastic string matching algorithm (patent pending) [5] is being evaluated for inclusion among the query expansion features of FAIRS.
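In effect, the glossary broadens the parsed query before retrieval. A tiny illustrative sketch follows; the dictionary format and names are assumptions, not the FAIRS glossary format.

```python
# Tiny illustration of glossary-based query broadening (Section 2.1.6).
def expand(terms, glossary):
    """Return the weighted query terms plus their synonyms at the same weight."""
    expanded = list(terms)
    for term, weight in terms:
        for synonym in glossary.get(term, []):
            expanded.append((synonym, weight))
    return expanded

# expand([("car", 2.0)], {"car": ["automobile", "vehicle"]})
# -> [('car', 2.0), ('automobile', 2.0), ('vehicle', 2.0)]
```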
2.1.7 Displaying Records

When a request is made, users will always be presented with the retrieved records in their full-text form. Furthermore, FAIRS provides several ways to associate non-text information with each record. There are basically two types of links: implicit and explicit. Implicit links use source information as-is, while explicit links involve special fields embedded in the source text. Implicit links are useful in situations where an implied one-to-one mapping may be established between records and image files. Explicit links may be used to express one-to-many relations between records and other media.

2.1.8 Ranking

One of the most interesting aspects of FAIRS is its unconventional ranking scheme for determining the relative relevance of retrieved records. The ranking scheme is designed to mimic the human relevancy judgement process. When a person is asked to determine the relative relevance of two records, he is likely to first weigh them using a set of criteria. If the two records have the same weight under that set of criteria, a secondary set of criteria may be used to differentiate them, and so on. The criteria used can be highly heuristic. The adaptation of this "multilevel ranking scheme" has been filed with the Patent Office in the United States.

To enable FAIRS to use free association in place of Boolean semantics, a multilevel ranking model [1,2] for full-text information retrieval has been developed and implemented. FAIRS ranks records with respect to a particular query according to a set of rules. The default rules consist of six attributes in six levels. The six attributes are the importance, popularity, frequency and location of a search word, and the size and record ID of the record it occurs in. Each attribute may have a positive, negative or no (neutral) impact on the relevance judgement of a record. Such an arrangement also guarantees the automatic consideration of coverage (i.e., the percentage of different query words covered by a record), which is next to impossible to implement in a Boolean environment [1,2]. The ranking rules of FAIRS are always accessible and modifiable by the user. Descriptions of the attributes chosen for FAIRS follow:

FREQUENCY: The number of occurrences of the keyword in the record. This attribute may be used to reflect interest in finding records with more repetitions of a given term. That is, when set to have a positive impact, the more instances of a term in a record, the more relevant the record. Therefore, the record with the higher frequency of a term is more likely to be retrieved.

IMPORTANCE: FAIRS provides the searcher the ability to assign an arbitrary weight (importance) to each query keyword; thus the user has additional control over how records are retrieved. For example, specifying breakfast:4 assigns a weight of 4 to the term breakfast in a query. In general, keyword weighting allows the searcher to change how FAIRS sorts records in order to identify the most relevant ones. If a keyword is not weighted, FAIRS supplies a default weight which is in inverse proportion to its input position in the query (words that came to mind first deserve more weight). Therefore, if the searcher chooses to weight some or all keywords in a query with a range of weights, records strong in the heavily-weighted keywords are ranked before others. This attribute defaults to having a positive impact on the relevance judgement.

POPULARITY: This is the number of records in the entire collection in which a term occurs, as opposed to the number of times it appears within the retrieved record. For example, if the word software appeared, at least once, in 15 records in a collection, its popularity is considered to be 15. This attribute is usually used in the negative sense and, by default, FAIRS assumes that the more popular a term is, the less effective it is in retrieval.

REC_ID: The record ID is the location of a record in the collection. It may indicate the age of the record. This is also useful when the records are arranged according to their degree of significance (in either increasing or decreasing order). This attribute is a good example showing that there is no perfect ranking rule that works in all situations: it may have either a positive or negative impact, depending on user intention. Although there is no obviously better default setting for this attribute, we chose positive as the default, so as to give priority to later records, as they are likely to be more timely.

REC_SIZE: The total word count of the record. This attribute may be used to counter (normalize) the size advantage that a larger record may have over smaller ones during rank judgement. A larger record may have more keywords simply because it contains more words. We therefore set its default to have a negative impact on the relevance judgement. Of course, when records are of similar size, this attribute will have minimal effect, and should probably be disabled to improve response time.

WORD_LOC: The location of the first occurrence of the keyword in the record. A negative setting (the default) of this attribute assumes that important words appear at the beginning of a collection or record. For example, headings or titles which contain keywords describing the contents of a document usually appear at the beginning. Of course, this depends solely on how the contents of the information are organized. This serves as another example of the context-sensitivity of the ranking process.
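The levels act as successive tie-breakers: records are first ordered by their level-1 weight, with ties resolved at level 2, and so on. The following illustrative sketch shows this level-by-level ordering; the rule representation and names are assumptions, and the per-level weights themselves would come from one of the computation methods described in Section 2.1.9.

```python
# Illustrative sketch of the multilevel comparison (not the FAIRS code).
# Each level's rule assigns one of the six attributes an impact of
# +1 (positive), -1 (negative) or 0 (neutral); level_weight(rec_id, level_rules)
# stands in for either weight-computation method of Section 2.1.9.
ATTRIBUTES = ("importance", "popularity", "frequency", "rec_size", "rec_id", "word_loc")

def rank(record_ids, rules, level_weight):
    """Order records by level-1 weight, breaking ties with successive levels."""
    def key(rec_id):
        # One weight per level; Python compares tuples lexicographically,
        # which realizes the level-by-level tie-breaking of the scheme.
        return tuple(level_weight(rec_id, level_rules) for level_rules in rules)
    return sorted(record_ids, key=key, reverse=True)
```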
The following table shows the default ranking rules:

Table 1: Default Ranking Attribute Settings

    Level   Imp   Pop   Freq   Size   ID    Loc
      1     ---   ---   ---    ---    ---   ---
      2
      3     ---   neg   ---    ---    ---   ---
      4     pos   neg   pos    neg    ---   ---
      5
      6     pos

The first level, having no attribute values, automatically accounts for coverage if the weight computation is done using Method 1, as described below in Section 2.1.9.

To perform the TREC92 experiments, we changed the ranking rules as follows: Size was introduced on level one to balance the effect of the very large records found in the Federal Register. Consequently, we also had to specify impacts for the Importance, Popularity and Frequency attributes on the first level, since they are the most important criteria for ranking. The following table shows the TREC92 ranking rules:

Table 2: Ranking Attribute Settings for TREC92

    Level   Imp   Pop   Freq   Size   ID    Loc
      1     pos   neg   pos    neg    ---   ---
      2     pos   neg   ---    neg    ---
      3     pos   neg   pos    neg    ---
      4
      5     ---   ---   ---    ---    pos
      6

2.1.9 Weight Computation

Two methods were available to compute the weight of a document.

Method 1

The weight of a retrieved record r at level l is determined by the following formula:

    W_l[r] = \sum_{i=1}^{K} e[i] \prod_{j=1}^{A} a_j[i]^{s_{l,j}}

where:

    e[i]     = 1 if the ith keyword exists in the record, and 0 otherwise,
    a_j[i]   = the value of attribute j (for the ith keyword),
    A        = the number of attributes (currently 6),
    K        = the total number of query keywords,
    s_{l,j}  = 1 if the impact of attribute j at level l is positive, 0 if it is neutral, and -1 if it is negative.

In other words, the weight at a certain level is the sum over the query keywords of the product of the attribute values at that level. This weight computation method automatically accounts for coverage since, for each keyword with e[i] = 1, the product of attributes is never 0 (a neutral attribute contributes a factor of a_j[i]^0 = 1). The impact value of each attribute is configured by the user either before running FAIRS or before delivering the query. To set the value before running FAIRS, a file must be created containing the initial values. A default value is set during system start-up.
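A minimal Python rendering of Method 1 under these definitions follows; it is a sketch only, and the data layout and the guard against zero-valued attributes are assumptions of this example.

```python
# Illustrative rendering of Method 1 (not the FAIRS code): for each query
# keyword present in the record, multiply the attribute values raised to
# their impact (+1, 0 or -1) and sum the products over the keywords.
# attrs[i][j] is assumed to hold the value of attribute j for keyword i
# in this record; impacts[j] is the s_{l,j} setting for the level.
def method1_weight(present, attrs, impacts):
    weight = 0.0
    for i, e in enumerate(present):
        if not e:                      # keyword i does not occur in the record
            continue
        prod = 1.0
        for j, s in enumerate(impacts):
            a = attrs[i][j] or 1.0     # assumption: guard zero-valued attributes
            prod *= a ** s             # neutral impact (s = 0) contributes a factor of 1
        weight += prod
    return weight
```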
Method 2

One disadvantage of Method 1 is that it lacks a common reference scale that evenly distributes the influence of the attributes on the weight of the document. For example, consider a textbase with one very large record, record number 1, and assume the following settings for the variable attributes:

    FREQUENCY  = neutral  (s[1] = 0)
    POPULARITY = neutral  (s[2] = 0)
    WORD_LOC   = neutral  (s[3] = 0)
    REC_ID     = neutral  (s[4] = 0)
    REC_SIZE   = positive (s[5] = 1)

The contributing factors to the weight of a retrieved document are then IMPORTANCE and REC_SIZE. Because of its large size (a[5] >> 0), record 1 will be irrelevantly retrieved in response to a large percentage of pending queries. To avoid this scenario, Method 2 uses the following formula to compute the normalized weight of a retrieved record r at level l:

    W_l[r] = \sum_{i=1}^{K} e[i] \frac{1}{C} \sum_{j=1}^{A} s_{l,j} \frac{a_j[i]}{a_j^{max}}

where:

    e[i]      = 1 if keyword i exists in the record, and 0 otherwise,
    a_j[i]    = the value of attribute j (for the ith keyword),
    A         = the number of attributes (currently 6),
    K         = the total number of query keywords,
    s_{l,j}   = 1 if the impact of attribute j at level l is positive, 0 if it is neutral, and -1 if it is negative,
    C         = the number of attributes whose impact value is not 0,
    a_j^{max} = the maximum value of attribute j.

To avoid the effect of disproportionately large attribute values, a_j[i]/a_j^{max} is set to 1 if a_j[i] is among the largest 2% of the attribute's values. Method 2 has the advantage of accommodating large attribute values by normalizing them with respect to their maximum. However, this method does not automatically account for coverage, since s_{l,j} contributes as a multiplicand rather than as an exponent as in Method 1. Therefore, if the attributes are all neutral (level 1 in Table 1, the default attributes), the weight would be 0, even though query terms occur in the record r. We are therefore still looking for ways to improve the method.
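A corresponding sketch of Method 2 follows; it is again only an approximation, with all names assumed and the 2% capping rule simplified to clipping the normalized value at 1.

```python
# Illustrative rendering of Method 2 (not the FAIRS code): signed, normalized
# attribute contributions averaged over the C non-neutral attributes.
# attr_max[j] is the collection-wide maximum of attribute j (assumed > 0).
def method2_weight(present, attrs, impacts, attr_max):
    non_neutral = [j for j, s in enumerate(impacts) if s != 0]
    C = len(non_neutral)
    if C == 0:
        return 0.0                     # all-neutral level: no coverage credit (the drawback noted above)
    weight = 0.0
    for i, e in enumerate(present):
        if not e:
            continue
        weight += sum(impacts[j] * min(attrs[i][j] / attr_max[j], 1.0)
                      for j in non_neutral) / C
    return weight
```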
3.0 Experiments

The TREC92 collection of text and topics was used to quantify and analyze the performance of FAIRS in the domain of very large collections. The entire collection includes 2.3 Gb of text from various sources and in various formats.

3.1 System Configuration

The hardware platform used for indexing and query processing was an IBM RS/6000 320 workstation running the AIX 3.0 operating system. The core memory (RAM) installed was 32 Mb. The net disk space available was 4,169,728 Kb. The CPU clock rate was 25 MHz, with a MIPS rating of 27. The disk access time was 9.8 ms on average.

3.2 Indexing Performance

Our experiments showed that performance in indexing is strictly constrained by I/O wait. We used several techniques to reduce this constraint and optimize throughput. We scheduled index runs to run simultaneously, thus keeping the CPU busy during I/O wait. We used a kernel which employs automatic disk caching (providing an order of magnitude improvement in indexing time). To further exploit caching, large amounts of high-speed core memory should be used (our limit of 32 Mb is by no means ideal). The I/O bottleneck also implies that the very fastest disks be used. We chose 9.8 ms disks as the fastest available for the platform at a reasonable price. Using the above configuration, the sustained indexing throughput was on the order of 10 days (240 hrs.) per Gb. The storage overhead for index output was 110%. That is, for 1 Gb of text, 2.1 Gb of space should be available before indexing is initiated. By this measure, the entire TREC92 collection of 2.3 Gb would require 4.83 Gb of storage. Since only 4.17 Gb was available, we participated in category B (Wall Street Journal only).

3.3 Query Processing

Before submitting the raw topics to FAIRS, they were converted automatically into FAIRS-compatible queries of the form described in Section 2.1.4. The TREC92 topics were pre-processed by a simple syntactic term token generator. The pre-processor removed stop words, stemmed topic terms and removed their case. It used an ad-hoc term weighting scheme to assign IMPORTANCE weights to terms according to their positions in the topic (e.g., terms occurring in the title section were assigned an arbitrarily higher importance than those in other sections). The importance weights were assigned with a minimum of attention. As described below, failure analysis later showed that great improvements in precision could be made if pre-processing were better tuned with term expansion, duplicate removal and a heuristic importance measure.

After conversion, queries were submitted to the retrieval engine through a non-interactive interface. The time for processing (ranking) was approximately 30 minutes per query. Again, I/O wait was the dominant factor in running time. The same system solutions which apply to indexing can be applied with equal effectiveness to query processing. Modification of some of the internal data structures for index storage (concordance list optimization) could also improve ranking running time.

3.4 Retrieval Performance

FAIRS participated in the following categories: Ad-hoc Wall Street Journal disk 1 (AW1), Ad-hoc Wall Street Journal disk 2 (AW2), and Routing, Category B (RB). Relevance judgements are as follows: AW1: 50 topics, AW2: 50, RB: 25; total 125. Of these judgements, FAIRS' recall rates were ranked as follows: 77 on or above the median, 45 below the median, and 3 undetermined (1 tied, 2 with 0 relevant); that is, 61.6% on or above the median, 36% below, and 2.4% undetermined.

Table 3: Recall Performance Summary

    Category   % >= med   % < med
    AW1        66.0       32.0
    AW2        46.0       50.0
    RB         84.0       16.0
    Total      61.6       36.0
    % rel/ret: 54

Besides recall, the relevance judgements made available recall/precision figures across 11 points of recall. The average of the 11 recall/precision figures is called the 11-pt. average. FAIRS' 11-pt. average rates were ranked as follows: 79 on or above the median, 43 below the median, and 3 undetermined. The percentages for 11-pt. averages are: 63.2% on or above, 34.4% below, 2.4% undetermined.

Table 4: 11-pt. Average Performance Summary

    Category   % >= med   % < med
    AW1        85.7       31.0
    AW2        45.8       54.2
    RB         84.0       16.0
    Total      63.2       34.4

3.4.1 AW1

Ad-hoc WSJ disk 1 results were submitted by three systems. In this category, across all systems' submissions, 4,056 documents were judged relevant; 10,000 were submitted by FAIRS in response to 50 queries, of which 1,561 were among the relevant. The average 11-point average (over 50 queries and 11 recall rates for each query) was 0.2083. The distribution of relevant retrieved (recall) over the 50 topics was: 20 ranked best, 12 on the median, 16 worst, 2 tied. The distribution of 11-pt. averages over the 50 topics was: 22 ranked best, 14 on the median, 13 worst, 1 tied.

Table 5: AW1 Recall/Precision Performance

              Best       Med        Worst
    Recall    20 (40%)   13 (26%)   16 (32%)
    11-pt.    22 (44%)   14 (28%)   12 (24%)

Total recall placed FAIRS second among the 3 participants, and first in 11-pt. average recall/precision.

3.4.2 AW2

Ad-hoc WSJ disk 2 results were submitted by two systems. In this category 2,172 documents were judged relevant; 10,000 were submitted in response to 50 queries, of which 1,188 were among the relevant. The 11-point average precision was 0.2216. The distribution of relevant retrieved (recall) over the 50 topics was: 17 ranked best, 22 worst, 9 tied, 2 unevaluated. Of the tied, the 11-pt. averages favored FAIRS 6 times and the other system 3 times.

Table 6: AW2 Recall/Precision Performance

              Best       Worst      Tied      Not evaluated
    Recall    17 (34%)   22 (44%)   9 (18%)   2 (4%)
    11-pt.    22 (44%)   26 (52%)   ---       2 (4%)

For those queries for which there was a tie in recall values, there were two queries which had 0 records judged relevant. Of the remaining 9, we considered the 11-pt. average as a tie-breaker; the result was 6 best, 3 worst. Combining the recall and 11-pt. averages, for AW2, FAIRS had 23 submissions on or above the median (46%), 25 below (50%), and 4% undetermined.
3.4.3 RB

Routing, Category B results were submitted by 7 systems. For those topics judged, 3,766 documents were considered relevant; 5,000 were submitted by FAIRS in response to 25 queries. Of those submissions, 1,124 were among the relevant. The distribution of relevant retrieved (recall) over the 25 topics was: 2 ranked best, 13 ranked above the median, 6 on the median, and 4 below, for a total of 21 on or above the median and 4 below.

The following graph illustrates the performance index of the recall rates of FAIRS compared to the group. It shows that FAIRS is above average most of the time. The average recall PI of FAIRS is 65.8.

[Figure: recall performance index per query, FAIRS vs. group]

The next graph shows the performance index of the 11-point average of FAIRS compared to the group. It again shows FAIRS to be above average most of the time. The average 11-point-average PI of FAIRS is 61.8.

[Figure: 11-point-average performance index per query, FAIRS vs. group]

Table 7: RB Recall/Precision Performance

              Relation to Median
              >          =          <
    Recall    15 (60%)   6 (24%)    4 (16%)
    11-pt.    15 (60%)   6 (24%)    4 (16%)

This is the only group that had enough participants to make a comparative performance analysis meaningful. We compared our 11-point average and recall rates for each query to the best, the median, and the worst scores for that query. The performance index (PI) is calculated as follows:

    PI = 50 + 50 (score - median)/(best - median),   if score > median
    PI = 50 (score - worst)/(median - worst),        if score <= median
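A small Python transcription of this index, assuming best > median > worst for the query (illustrative only):

```python
# Illustrative performance-index computation: 100 at the best score,
# 50 at the median, 0 at the worst (assumes best > median > worst).
def performance_index(score, best, median, worst):
    if score > median:
        return 50.0 + 50.0 * (score - median) / (best - median)
    return 50.0 * (score - worst) / (median - worst)

# e.g. performance_index(0.3, best=0.4, median=0.2, worst=0.0) -> 75.0
```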