TREC-6 Interactive Track Specification

Goal
----

The high-level goal of the Interactive Track in TREC-6 is the investigation of searching as an interactive task by examining the process as well as the outcome. To this end an experiment has been designed, including:

  - an interactive search task and measures
  - 6 topics and a collection to be searched
  - 4 classes of data to be collected at each site and submitted to NIST
  - a procedure for evaluation at NIST of the submitted results

General Description
-------------------

A minimum of 4 participating searchers and one experimental system per site will be required. (See "E.4 Augmentation" in the detailed experimental design for information about how to use more than 4 searchers or more than one experimental system within this design.)

Each searcher will perform six searches on the Financial Times of London 1991-1994 collection (part of the TREC-6 adhoc collection), using 6 topics especially chosen from the TREC-6 adhoc topics and modified for use in the interactive track. Each searcher will perform three of the searches on the site's experimental system and three on a control system (the "simple_client" version of ZPRISE 2.0), the purpose of which is to aid in comparing systems across sites.

The detailed experimental design (see below) determines the order in which each searcher uses the systems (experimental or control) and the single order in which all participants at all sites will see the topics. Grouping the control and experimental searches, rather than alternating them, sacrifices balance across the 4 possible system->system sequences (and of any associated carry-over effects) in exchange for reduced time and complexity for the searcher, who must switch systems only once rather than several times.
Using a single ordering of topics for all participants rather than a distinct one for each set of 4 searchers limits the scope of the conclusions, but provides simpler, more precise comparisons of system effects between sites and within sites which run more than one experimental system and/or more than 4 participants.

In resolving experimental design questions not covered here (e.g. scheduling of tutorials and searches), participating sites should try to minimize the differences between the conditions under which a given searcher uses the control system and those under which s/he uses the experimental system. For example, running the 3 control searches for a participant on one day and the 3 searches on the experimental system on another invites unequal, confounding conditions.

Each of the topics describes an information need with many aspects - an aspect being roughly one of many possible answers to a question which the topic in effect poses. Here is an example interactive topic from TREC-5 - note the "Please save" paragraph, which has been added to the normal adhoc topic especially for the interactive track:

   Number: 274i

   Topic: Electric Automobiles

   Description:
   What are the latest developments in the production of electric automobiles?

   Please save at least one document that identifies EACH DIFFERENT recent development in the field of electric automobiles. If one document discusses several developments, then you need not save other documents that repeat those developments, since your goal is to identify the different ones that have been discussed.

   Narrative:
   The economic feasibility of electric automobiles appears to be limited by a number of factors, including the limited range of operation between recharges of batteries. What progress has been made in addressing these factors?

The task of the interactive searcher is to save relevant documents, which, taken together, cover as many different aspects of the topic as possible in the 20 minutes allowed per search.
Here are the aspects the NIST assessor identified within the documents saved by searchers for this topic in TREC-5:

   - government funding of electric car development programs
   - industrial development of high energy batteries
   - industrial development of hybrid electric cars
   - government regulatory encouragement of electric car development
   - industrial investment in electric car development/production/marketing
   - consortiums formed for electric car development
   - setbacks - planned developments dropped, difficulties
   - increased use of aluminum bodies
   - practical testing of electrical cars
   - development of fuel cell technology
   - development of alternating-current motor

Searchers will be encouraged to avoid saving documents which contribute no aspects beyond those in documents already saved, but will be told there is no scoring penalty for doing so.

Instructions to be given to searchers
-------------------------------------

The following introductory instructions are to be given once to each searcher before the first search:

"Imagine that you have just returned from a visit to your doctor during which it was discovered that you are suffering from high blood pressure. The doctor suggests that you take a new experimental drug, but you wonder what alternative treatments are currently available. You decide to investigate the literature on your own to learn what different alternatives are available to you for high blood pressure treatment. You really need only one document for each of the different treatments for high blood pressure. You find and save a single document that lists 4 treatment drugs. Then you find and save another 4 documents that each discusses a separate alternative treatment: one that discusses the use of calcium, one that talks about regular exercise, another that mentions biofeedback, and one that cites the snakeroot plant as a possible alternative treatment. In all, you have identified 8 different aspects for this topic in 5 documents.
Now we would like you to identify as many aspects as possible for each topic that will be presented to you. You will be given 20 minutes to search for each topic's aspects. Please save 1 relevant document for each of the aspects that you identify. If you save 1 document that contains many aspects, try not to save additional documents that contain only those aspects, unless a document contains additional aspects as well. As you identify an aspect, please write down a word or short phrase to identify the aspect - enough to help you keep track of which aspects you have found.

Carefully read each description and narrative for each topic since they provide information on which documents are relevant and because the interpretation of "aspects" changes from topic to topic. For example, aspects can refer to different developments in a field, to different instances in which an event can occur, or to different kinds of treatments, to names of persons, places or things, etc. -- as it did in our example above.

Do you have any questions about

   - what we mean by aspects
   - what we mean by relevant
   - the way in which you are to save nonredundant documents for each aspect?"

Text of the topics for TREC-6 Interactive by topic number
---------------------------------------------------------

Note: this is NOT the order in which participants will see the topics. The order of presentation to searchers is defined in section E.3 below.

<num> Number: 303i
<title> Hubble Telescope Achievements

<desc> Description:
Identify positive accomplishments of the Hubble telescope since it was launched in 1991.

<narr> Narrative:
Documents are relevant that show the Hubble telescope has produced new data, better quality data than previously available, data that has increased human knowledge of the universe, or data that has led to disproving previously existing theories or hypotheses. Documents limited to the shortcomings of the telescope would be irrelevant.
Details of repairs or modifications to the telescope without reference to positive achievements would not be relevant.

<aspects> Aspects:
Please save at least one RELEVANT document that identifies EACH DIFFERENT positive accomplishment of the sort described above. If one document discusses several such accomplishments, then you need not save other documents that repeat those aspects, since your goal is to identify different positive accomplishments of the sort described above.
--------

<num> Number: 307i
<title> New Hydroelectric Projects

<desc> Description:
Identify hydroelectric projects proposed or under construction by country and location. Detailed description of nature, extent, purpose, problems, and consequences is desirable.

<narr> Narrative:
Relevant documents would contain as a minimum a clear statement that a hydroelectric project is planned or construction is under way and the location of the project. Renovation of existing facilities would be judged not relevant unless plans call for a significant increase in acre-feet or reservoir or a marked change in the environmental impact of the project. Arguments for and against proposed projects are relevant as long as they are supported by specifics, including as a minimum the name or location of the project. A statement that an individual or organization is for or against such projects in general would not be relevant. Proposals or projects underway to dismantle existing facilities or drain existing reservoirs are not relevant, nor are articles reporting a decision to drop a proposed plan.

<aspects> Aspects:
Please save at least one RELEVANT document that identifies EACH DIFFERENT hydroelectric project of the sort described above. If one document discusses several such projects, then you need not save other documents that repeat those aspects, since your goal is to identify different hydroelectric projects of the sort described above.
--------

<num> Number: 322i
<title> International Art Crime

<desc> Description:
Isolate instances of fraud or embezzlement in the international art trade.

<narr> Narrative:
A relevant document is any report that identifies an instance of fraud or embezzlement in the international buying or selling of art objects. Objects include paintings, jewelry, sculptures and any other valuable works of art. Specific instances must be identified for a document to be relevant; generalities are not relevant.

<aspects> Aspects:
Please save at least one RELEVANT document that identifies EACH DIFFERENT instance of fraud or embezzlement of the sort described above. If one document discusses several such instances, then you need not save other documents that repeat those aspects, since your goal is to identify different instances of fraud or embezzlement of the sort described above.
--------

<num> Number: 326i
<title> Ferry Sinkings

<desc> Description:
Any report of a ferry sinking where 100 or more people lost their lives.

<narr> Narrative:
To be relevant, a document must identify a ferry that has sunk causing the death of 100 or more humans. It must identify the ferry by name or place where the sinking occurred. Details of the cause of the sinking would be helpful but are not necessary to be relevant. A reference to a ferry sinking without the number of deaths would not be relevant.

<aspects> Aspects:
Please save at least one RELEVANT document that identifies EACH DIFFERENT ferry sinking of the sort described above. If one document discusses several such sinkings, then you need not save other documents that repeat those aspects, since your goal is to identify different sinkings of the sort described above.
--------

<num> Number: 339i
<title> Alzheimer's Drug Treatment

<desc> Description:
What drugs are being used in the treatment of Alzheimer's Disease and how successful are they?

<narr> Narrative:
A relevant document should name a drug used in the treatment of Alzheimer's Disease and also its manufacturer, and should give some indication of the drug's success or failure.

<aspects> Aspects:
Please save at least one RELEVANT document that identifies EACH DIFFERENT drug of the sort described above. If one document discusses several such drugs, then you need not save other documents that repeat those aspects, since your goal is to identify different drugs of the sort described above.
--------

<num> Number: 347i
<title> Wildlife Extinction

<desc> Description:
The spotted owl episode in America highlighted U.S. efforts to prevent the extinction of wildlife species. What is not well known is the effort of other countries to prevent the demise of species native to their countries. What other countries have begun efforts to prevent such declines?

<narr> Narrative:
A relevant item will specify the country, the involved species, and steps taken to save the species.

<aspects> Aspects:
Please save at least one RELEVANT document that identifies EACH DIFFERENT country of the sort described above. If one document discusses several such countries, then you need not save other documents that repeat those aspects, since your goal is to identify different countries of the sort described above.

Data to be collected and submitted to NIST
------------------------------------------

4 sorts of result data will be collected for evaluation/analysis (for all searches unless otherwise specified):

===> Due at NIST by 1. September 1997:

   1. sparse format data

===> Due at NIST with site report for TREC-6:

   2. rich format data
   3. a full narrative description of one interactive session for whichever topic is designated as T1
   4. any further guidance or refinement of the task specification given to the searchers

Sparse format data for each search will comprise the list of documents saved and the elapsed clock time of the search.
The searcher's selection (choice) of items for the final output list must be identified in terms of each document's TREC document identifier (<DOCNO>). The elapsed (clock) time in seconds taken for the search, from the time the searcher first sees the topic until s/he declares the search to be finished, should be recorded. It is assumed that the interactive search takes place in one uninterrupted session. If a session is unavoidably interrupted, it is recommended that it be abandoned and the topic given to another searcher.

Sparse format data will be the basis for the summary evaluation at NIST, which will produce a triple for each search: aspectual precision, aspectual recall, and elapsed clock time.

Rich format data for each search will record:

- the word or phrase each searcher records to describe each aspect s/he identifies (no reference to the containing document(s))

- significant events in the course of the interaction and their timing. Rich format data are intended for analytical evaluation by the experimenters. All significant events and their timing in the course of the interaction should be recorded. The events listed below are those that seem to be fairly generally applicable to different systems and interactive environments; however, the list may need extending or modifying for specific systems and so should be taken as a suggestion rather than a requirement:

   o Intermediate search formulations: if appropriate to the system, these should be recorded.

   o Documents viewed: "viewing" is taken to mean the searcher seeing a title or some other brief information about a document; these events should be recorded.

   o Documents seen: "seeing" is taken to mean the searcher seeing the text of a document, or a substantial section of text; these events should be recorded.

   o Terms entered by the searcher: if appropriate to the system, these should be recorded.

   o Terms seen (offered by the system): if appropriate to the system, these should be recorded.

   o Selection/rejection: documents or terms selected by the user for any further stage of the search (in addition to the final selection of documents).

- A full narrative description of one interactive session for topic 326i.

- Any further guidance and/or refinement of the task specification given to the searchers should also be reported.

Format of sparse data to be submitted to NIST
---------------------------------------------

TWO files from each site

I. Search file

   Here a "search" is the interaction of a searcher given a topic and asked to carry out the interactive search task using a given system against the collection - lasting at most 20 minutes.

   One line for EACH SEARCH, each line containing the following blank-delimited items from left to right:

   1. Unique site ID
   2. Search ID      - site's choice (links search & document files)
   3. Searcher ID    - site's choice
   4. System ID      - ZPRISE as control: "ZP", others up to sites
   5. TREC topic number
   6. Elapsed time   - number of secs., fractions truncated
                       Clock time from the moment the searcher sees the topic until the moment the searcher indicates the search is complete or time is up.

II. Documents file

   One line for each document in a given search result, each line containing the following blank-delimited items from left to right:

   1. Chronological sequence number ("1", "2") within a search
      Use number of last time saved if saved multiple times.
   2. Search ID (from search file)
   3. TREC document identifier (DOCNO)

NOTE: Reported data items listed within each line must NOT contain whitespace.

Format of other data to be submitted to NIST
--------------------------------------------

Data other than that in sparse-format should be submitted as ASCII text files.

Evaluation of data submitted to NIST
------------------------------------

Evaluation by NIST of the sparse format data will proceed as follows.
For each topic, a pool will be formed containing the unique documents saved by at least one searcher for that topic regardless of site.

For each topic, the NIST assessor, normally the topic author, will be asked to:

   - read the topic carefully
   - read each of the documents from the pool for that topic and gradually:
      - create a list of the aspects found somewhere in the documents
      - select and record a short phrase describing each aspect found
      - determine which documents contain which aspects
      - bracket each aspect in the text of the document in which it was found

For each search (by a given participant for a given topic at a given site), NIST will use the submitted list of selected documents and the assessor's aspect-document mapping for the topic to calculate:

   - the fraction of total aspects (as determined by the assessor) for the topic that are covered by the submitted documents (i.e., aspectual recall)
   - the fraction of the submitted documents which contain one or more aspects (i.e., aspectual precision)

The third measure, elapsed clock time, will be taken directly from the submitted results for each search.

Detailed experimental design and motivation
-------------------------------------------

A. Starting points:

   1. There is interest in testing whether any good experimental design exists which can detect system differences of interest across sites - given the assumed restrictions, the following goals, and the nature of the problem set under investigation.

   2. The TREC-5 design can be built on and improved upon. If there are no interactions between factors (topic, participant, system), it allows uncorrelated estimation of a control-adjusted system effect (E-C) but not of the participant and topic effects. We would like to be able to estimate topic and participant effects at least uncontaminated by other main factors. Also, we suspect there is interaction between the factors.

   3. Experimental participants cannot be randomly assigned to experimental systems.
      In other words we can't currently:

      a. install all systems at one experimental site
      b. provide reliably usable network access to all systems from all sites
      c. transport one set of participants to all sites

B. Goals:

   The high-level goal of the TREC Interactive Track in TREC-6 is the investigation of searching as an interactive task by examining the process as well as the outcome. To that end an experiment has been designed with the following goals:

   1. Allow for clean comparison of the effect of experimental interactive IR systems across sites on a performance measure
   2. Allow some estimation of topic and participant effects as well as of interactions
   3. Be based on a common interactive search task (including a performance measure), which mirrors some interesting subset of real-world conditions
   4. Be based on an affordable minimal experimental unit
   5. Allow sites to
      i) add participants beyond the minimum
      ii) test more than one local experimental system
   6. Accommodate execution at geographically distant sites
   7. Take into account the likelihood of variation across participants and topics
   8. Reflect the likelihood of at least 2-factor interactions (topic-participant, topic-system, participant-system)
   9. Capture data which can be used in the design of follow-on experiments (e.g., variability of topics and participants)

B. User task

   Aspectual searching

C. Dependent variables:

   Aspectual recall
   Aspectual precision
   Elapsed clock time

D. Factors:

   Topics (T)
      Assume random sample from a set of a particular type designed/selected for TREC-6 from the TREC-6 adhoc topics

   Site (S)
      Fixed effect

      Nested factors:

      Experimental system within site (E)
         Fixed effect
         At least 1, but more if desired

      Participants within site (P)
         Random sample
         At least 4, but more if desired

   Held constant:

      Document collection (D)
         Financial Times of London 1991-1994

      Control system (C)
         ZPRISE 2.0 ("simple_client" version)

E. Specific Design:

   1.
   Constraints/Assumptions

      - (at least) 4 participants/site
      - 6 topics/participant
      - no participant sees the same topic more than once
      - control system is adequate for eliminating differences in responses between sites that are not due to the experimental systems

   2. Response

      Since multiple systems cannot be tested at the same site, we rely on a common control system run by all sites to remove site-to-site differences that are not due to the experimental system. The response of interest for any of the dependent variables given in part C. will be the difference in performance between the site i experimental system and the control, Ei-C; call this the control-adjusted response. The experiment design should provide good estimates of Ei-C for each site.

   3. Experiment Design

      Topics in the order all participants will see them:

         T1 = 326i Ferry sinkings
         T2 = 322i International art crime
         T3 = 307i New hydroelectric projects
         T4 = 347i Wildlife extinctions
         T5 = 303i Hubble telescope achievements
         T6 = 339i Alzheimer's drug treatment

                           T1    T2    T3    T4    T5    T6
                           326i  322i  307i  347i  303i  339i
                          __________________________________
                         |
         Site 1  P(1,1)  | E1    E1    E1    C     C     C
                 P(1,2)  | C     C     C     E1    E1    E1
                 P(1,3)  | E1    E1    E1    C     C     C
                 P(1,4)  | C     C     C     E1    E1    E1

         Site 2  P(2,1)  | E2    E2    E2    C     C     C
                 P(2,2)  | C     C     C     E2    E2    E2
                 P(2,3)  | E2    E2    E2    C     C     C
                 P(2,4)  | C     C     C     E2    E2    E2
            .
            .
            .

      - C = Control system = ZPRISE 2.0 ("simple_client" version)
      - Site i, for i = 1, ..., I sites
      - P(i,j) is participant j at site i, j = 1, ..., J = 4 (see part 4)
      - Tk is topic k, k = 1, ..., K = 6 topics. T1 is the first topic all participants see, T2 the second, and so on.

      Participants should be randomly assigned to the rows of the design for each site.

      Use of 6 topics here rather than the 12 topics of TREC-5 results from a combination of the desire to improve the estimation of effects by removing confounding (e.g. by having each topic searched by more than one participant) and the desire to stay within practical limits (e.g.
      for demands for and on participants).

      The full design will permit:

      1. comparison of systems between sites by comparing the control adjusted measurements, Ei-C.
      2. limited assessment of topic and participant effects unconfounded by main effects. Some confounding with interactions, as in the TREC-5 design, will remain.
      3. collection of limited information on interactions between the factors. (Full information on interactions is possible only if we allow participants to observe the same topic with both the control and experimental systems, i.e., by using a full factorial design.)

      For the purposes of analysis each 4-person-by-6-topic matrix defined above will in effect be rearranged by permuting the columns (topics) so that it looks like the following:

                           T1    T4    T2    T5    T3    T6
                          __________________________________
                         |
         Site 1  P(1,1)  | E1    C     E1    C     E1    C
                 P(1,2)  | C     E1    C     E1    C     E1
                 P(1,3)  | E1    C     E1    C     E1    C
                 P(1,4)  | C     E1    C     E1    C     E1

      Note that this matrix consists of the following 2x2 subdesign:

            E  C
            C  E

      This 2x2 design is a latin square design. It has the property that the "treatment effect", here E-C, the control-adjusted response, can be estimated free and clear of the main (additive) effects of participant and topic. Here, participant and topic are treated statistically as blocking factors. This means that even in the presence of differences between participants and topics, which clearly are anticipated, the design will provide estimates of E-C that are not contaminated by these differences.

      However, the estimate of E-C is contaminated by the presence of an interaction between topic and participant. Therefore, we replicate the 2x2 latin square six times, the maximum possible under the constraints in part 1, to get the full 4x6 design for each site. The contaminating effect of the topic by participant interaction is reduced by averaging the six estimates of E-C that are available, one for each 2x2 latin square.
      This is analogous to averaging replicate measurements of a single quantity in order to reduce the measurement uncertainty.

   4. Augmentation

      The design for a given site can be augmented in two ways:

      1. Participants can be added by repeating the 4x6 design with 4 additional participants.
      2. Systems can be added by repeating the 4x6 design with a new system. In this case, label the experimental systems E(i,1), E(i,2), ... for site i.

      Topics cannot be added individually for each site.

      All augmentations other than the two listed above, however interesting, are outside the scope of this design. If sites plan such adjunct experiments, they are encouraged to design them for maximal synergy with the track design.

   5. Sample size

      Is this design capable of detecting "significant differences" between experimental systems? The answer depends on several quantities:

      1. The minimum engineering significant difference d. This is the minimum size difference you would like to detect between systems and may differ for the different dependent variables under study. The smaller the difference you wish to detect, the larger the sample size required.

      2. The underlying experimental variation in response for each site. Note that (additive) differences in topic and participant have been eliminated from this experimental variation by using the latin square design (see part 3). This greatly improves the sensitivity of the experiment. The experimental variation here is due to interactions between factors and other sources of variation in response that have not been explicitly listed here.

      Since the experimental variation is unknown at this point and will most likely depend on site, a sequential experimental strategy is most reasonable. Consider the proposed design to be round 1 of the experiment. Round 1 alone may provide enough information to detect system differences of interest. If not, it will provide an estimate of the experimental variation for each site.
      This can be used to determine how many more experiments are required to detect the minimum engineering significant difference d.

F. Proposed Analysis

   The statistical analysis will be ANOVA, more specifically, an ANOVA for a mixed model with both crossed & nested factors - to be spelled out in detail later.

G. Problems identified and actions taken to address them:

   1. Effectiveness of control system in eliminating site effect

      a. Run pre-experiments
         - 3-site pre-experiment is being run but results will not be available before main experiment is begun.

   2. Practicality of installing and running control system

      a. Alternatives to PRISE?
         - none at moment
      b. If PRISE: identify and fix known problems, test early
         - done

   3. Variation in participants

      a. Require all participants share a set of characteristics?
         - all participants will share the characteristic that they have no prior experience of ZPRISE nor of the experimental system they use to perform searches.
         - beyond this, requiring any shared characteristics was deemed impractical or undesirable

   4. Variation in topics

      a. Use results from TREC-5 to choose and/or create topics less likely to cause variation?
         - new topics

   5. Variation in judgements of (aspectual) relevance among and between searchers and assessors

      a. Define a task and performance measure less subject to variation in judgement?
         - no new task was proposed
         - precision in TREC-6 to be based on aspectual assessment, not on standard TREC relevance assessment as in TREC-5
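
Illustration (not part of the track procedure): since precision in TREC-6 is based on aspectual assessment, the two measures defined under "Evaluation of data submitted to NIST" can be sketched in Python as below. This is only a sketch under stated assumptions, not NIST's evaluation code: the function name aspectual_scores, the dictionary layout of the assessor's aspect-document mapping, and the document identifiers and aspect labels are all hypothetical and chosen for the example.

```python
from typing import Dict, Iterable, Set, Tuple


def aspectual_scores(
    saved_docs: Iterable[str],
    doc_aspects: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Return (aspectual_recall, aspectual_precision) for one search.

    saved_docs  -- DOCNOs saved by the searcher (duplicates are ignored)
    doc_aspects -- assumed form of the assessor's mapping for the topic:
                   DOCNO -> set of aspects found in that pooled document
    """
    saved = set(saved_docs)

    # Total aspects for the topic, as determined by the assessor.
    all_aspects: Set[str] = set().union(*doc_aspects.values())

    # Aspects covered by the submitted documents taken together.
    covered: Set[str] = set()
    for docno in saved:
        covered |= doc_aspects.get(docno, set())

    # Submitted documents that contain one or more aspects.
    with_aspect = sum(1 for docno in saved if doc_aspects.get(docno))

    recall = len(covered) / len(all_aspects) if all_aspects else 0.0
    precision = with_aspect / len(saved) if saved else 0.0
    return recall, precision


# Hypothetical pool for one topic (identifiers and aspects are made up).
example_mapping = {
    "DOC-A": {"drug 1", "drug 2", "drug 3", "drug 4"},
    "DOC-B": {"calcium"},
    "DOC-C": {"regular exercise"},
    "DOC-D": set(),  # pooled document in which no aspect was found
}

recall, precision = aspectual_scores(["DOC-A", "DOC-B", "DOC-D"], example_mapping)
print(f"aspectual recall = {recall:.3f}, aspectual precision = {precision:.3f}")
```

Note that saving a redundant document (one adding no new aspect) leaves both values unchanged or raises precision, as long as the document contains at least one aspect; this matches the statement that searchers incur no scoring penalty for redundant saves.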