System Summary and Timing

Organization Name: Queens College, City University of New York
List of Run IDs: pircsAAS, pircsAAL, pircsAM1, pircsAM2, pircsCw, pircsCwc, pircsRG0, pircsRG6

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: English - 630; Chinese - 2000
- Stemming Algorithm: Porter's algorithm
- Term Weighting: yes
- Phrase Discovery?:
  - Kind of Phrase: adjacent 2-word
  - Method Used (statistical, syntactic, other): statistical
- Tokenizer?:
- Manually-Indexed Terms?: yes for pircsAM1 and pircsAM2
- Other Techniques for building Data Structures:

Statistics on Data Structures Built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
  - Run ID: pircsRG0, pircsRG6
  - Type of Structure: network of linked NODE and EDGE files capturing the query expansion terms and learned weights, built dynamically
  - Total Storage (in MB): about 90 MB for 10 queries per 500 MB textbase
  - Total Computer Time to Build (in hours): 2
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: built from direct files of queries and documents and known relevant-document information
- Other Data Structures built from TREC text
  - Run ID: pircsAAS, pircsAAL, pircsAM1, pircsAM2, pircsCw, pircsCwc
  - Type of Structure: compressed, truncated direct file; network of linked NODE and EDGE files built at query time
  - Total Storage (in MB): direct file - about 80 MB per 500 MB raw text; network - about 60 MB for 10 queries per 500 MB textbase
  - Total Computer Time to Build (in hours): direct file - about 20 minutes; network - about 5 minutes per 10 queries
  - Automatic Process?
    (If not, number of manual hours): yes
  - Brief Description of Method: built from direct files of queries and documents

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): English stopwords
  - Total Storage (in MB): 0.004
  - Number of Concepts Represented: 630
  - Type of Representation: array
  - Total Computer Time to Modify for TREC (if already built): none; already exists
  - Use of Manual Labor
- Internally-built Auxiliary File
  - Domain (independent or specific): independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): English adjacent 2-word phrases
  - Total Storage (in MB): 1
  - Number of Concepts Represented: 80134
  - Type of Representation: array
  - Total Computer Time to Build (in hours): already exists
  - Total Computer Time to Modify for TREC (if already built):
  - Use of Manual Labor
- Internally-built Auxiliary File
  - Domain (independent or specific): independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): Chinese stopword and lexicon list
  - Total Storage (in MB): 0.04
  - Number of Concepts Represented: about 15000
  - Type of Representation: array
  - Total Computer Time to Build (in hours): 0.5
  - Total Manual Time to Build (in hours): 200
  - Use of Manual Labor
    - Initial Core Manually Built to "bootstrap" for Completely Machine-Built Completion: yes; initial core of about 2000 entries
- Externally-built Auxiliary File

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: pircsAAS - desc; pircsAAL, pircsCw, pircsCwc - all sections
- Average Computer Time to Build Query (in CPU seconds): 3
- Method Used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: English 2-word
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?: yes
    - Structure Used: network
  - Automatic Addition of Boolean Connectors or Proximity
    Operators?:

Automatically Built Queries (Routing)
- Topic Fields Used: title, desc, narr, con
- Average Computer Time to Build Query: 4 hours
- Method Used in Query Construction
  - Terms Selected From
    - Topics: yes
    - All Training Documents: training documents are selected via a genetic algorithm from the top retrieved documents with relevance judgments
    - Only Documents with Relevance Judgments: yes
  - Term Weighting with Weights Based on Terms in
    - Documents with Relevance Judgments: yes
  - Phrase Extraction from
    - Topics: yes
    - Documents with Relevance Judgments: yes
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
    - Structure Used: network
  - Automatic Addition of Boolean Connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: pircsAM1 - desc; pircsAM2 - all
- Average Time to Build Query (in minutes): pircsAM1 - about 30; pircsAM2 - about 3 hours for 50 queries
- Type of Query Builder
  - Computer System Expert: yes
- Tools Used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method Used in Query Construction
  - Term Weighting?: yes
  - Addition of Terms not Included in Topic?:
    - Source of Terms: pircsAM2 - human knowledge

Searching

Search Times
- Run ID: pircsAAS, pircsAAL, pircsAM1, pircsAM2
- Computer Time to Search (Average per Query, in CPU seconds): about 108 on 2 GB using 2-stage retrieval
  - Component Times: 1st stage - 48; 2nd stage - 60

Search Times
- Run ID: pircsRG0, pircsRG6
- Computer Time to Search (Average per Query, in CPU seconds): about 8 minutes clock time for pircsRG0 and pircsRG6
  - Component Times: build network - 8 min (per 10 queries); retrieval - 80 min (per 10 queries); sort, merge, and reformat results - 2 min

Machine Searching Methods
- Probabilistic Model?: yes
- Boolean Matching?: yes
- Neural Networks?: yes

Factors in Ranking
- Term Frequency?: yes
- Other Term Weights?: yes; within-document term frequency, inverse collection term frequency
- Proximity of Terms?: yes; 2-word phrases
- Document Length?: yes

Machine Information
- Machine Type for TREC Experiment: 2 x Sparc 10 model 30
- Was the Machine Dedicated or Shared: dedicated
- Amount of Hard Disk Storage (in MB): 13000
- Amount of RAM (in MB): 128

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: not much change from TREC-4
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: probably half the time
- Features the System is Missing that would be beneficial: more specificity in representation; ability to differentiate contexts

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by the above questions:
  1. Handles variable-length documents by segmenting them into subdocuments of about 550 words.
  2. Trains with subdocuments in routing.
  3. Training documents for routing are selected by a genetic-algorithm search.
  4. The system does not build a full inverted file.
  5. The system retrieves from subcollections and combines the ranked lists from each into one final retrieval list; the subcollections are served by one master lexicon.
  6. Can combine multiple retrieval methods.
  7. Supports Chinese text processing and retrieval in GB encoding.
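The phrase answers above (adjacent 2-word phrases, discovered statistically, with a stopword list) can be illustrated with a minimal sketch. The stopword list, frequency threshold, and function names here are illustrative assumptions, not the PIRCS implementation:

```python
# Sketch of statistical adjacent 2-word phrase discovery: count pairs of
# adjacent non-stopword terms across a collection and keep frequent ones.
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}  # tiny stand-in list

def adjacent_pairs(tokens):
    """Yield adjacent 2-word candidates whose members are not stopwords."""
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1 not in STOPWORDS and w2 not in STOPWORDS:
            yield (w1, w2)

def discover_phrases(docs, min_freq=2):
    """Keep adjacent pairs occurring at least min_freq times across docs."""
    counts = Counter()
    for doc in docs:
        counts.update(adjacent_pairs(doc.lower().split()))
    return {pair for pair, n in counts.items() if n >= min_freq}
```

A real phrase file built this way would be held as a sorted array, as the auxiliary-file answers above indicate.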
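The ranking factors listed above (within-document term frequency, inverse collection term frequency, document length) might be combined as in the following sketch. The exact PIRCS activation formulas are not given in this summary, so the normalization below is a generic assumption, with the 550-word subdocument size reused as the average-length constant:

```python
# Generic tf-icf scoring sketch, assuming a simple saturating length
# normalization; not the actual PIRCS weighting formula.
import math

def term_weight(tf, doc_len, coll_freq, coll_size, avg_len=550.0):
    """Weight one query term in one (sub)document."""
    norm_tf = tf / (tf + doc_len / avg_len)            # length-normalized tf
    icf = math.log((coll_size + 1) / (coll_freq + 1))  # inverse collection freq
    return norm_tf * icf

def score(query_terms, doc_tf, doc_len, coll_freqs, coll_size):
    """Sum per-term weights over the query terms."""
    return sum(
        term_weight(doc_tf.get(t, 0), doc_len, coll_freqs.get(t, 1), coll_size)
        for t in query_terms
    )
```

Under this sketch a document scores higher when it matches query terms more often and when those terms are rare in the collection, and long documents are damped by the length term.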
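Features 1 and 5 above (segmenting documents into ~550-word subdocuments, then combining ranked lists into one final list) can be sketched as follows. Taking the best-scoring subdocument as the parent document's score is one plausible combination rule; the summary does not specify how PIRCS merges subdocument scores:

```python
# Sketch of subdocument segmentation and rank combination; the max-over-
# subdocuments rule is an illustrative assumption.
def segment(tokens, size=550):
    """Split a document's token list into subdocuments of about `size` words."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)] or [[]]

def rank_documents(docs, score_fn):
    """Score every subdocument and rank parent documents by their
    best-scoring subdocument."""
    ranked = []
    for doc_id, tokens in docs.items():
        best = max(score_fn(sub) for sub in segment(tokens))
        ranked.append((doc_id, best))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

The same merge step would apply across subcollections: each subcollection returns its own ranked list, and the lists are sorted together into one final retrieval list.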