TREC-9 Interactive Track Guidelines
The high-level goal of the Interactive Track in TREC-9 remains the
investigation of searching as an interactive task by examining the
process as well as the outcome. To this end an experimental framework
has been designed with the following common features:
- an interactive search task: question answering
- 8 questions
- a minimum of 16 searchers
- a newspaper/wire document collection to be searched (the same as the Q&A track's)
- a required set of searcher questionnaires
- 5 classes of data to be collected at each site and submitted to NIST
- 1 summary measure to be calculated by NIST for use by participants
The framework will allow groups to estimate the effect of their
experimental manipulation free and clear of the main (additive)
effects of participant and topic.
In TREC-9 the emphasis will be on each group's exploration of
different approaches to supporting the common searcher task and
understanding the reasons for the results they get. No formal
coordination of hypotheses or comparison of systems across sites is
planned for TREC-9, but groups are encouraged to seek out and exploit
synergies. As a first step, groups are strongly encouraged to make the
focus of their planned investigations known to other track
participants as soon as possible, preferably via the track listserv at
email@example.com. Contact track coordinator Bill Hersh to join.
The Interactive Track will experiment for TREC-9 with 2 different
question types from previous years, a shorter time for each question,
and a shorter overall session time for each searcher.
The track looked at 4 sorts of questions:
1. Find any n Xs
e.g., Name 3 US Senators on committees regulating the nuclear
2. Find the largest/latest/... n Xs
e.g., What is the largest expenditure on a defense item by
3. Find the first or last X
e.g., Who was the last Republican to pull out of the nomination
race to be the candidate of his/her party for US president in
4. Comparison of 2 specific Xs
e.g., Do more people graduate with an MBA from Harvard Business
School or MIT Sloan?
After some pretesting we ended up with 8 questions, half of type 1
and half of type 4. Here are the questions. (NOTE that this is not the
order in which they will be presented to searchers.)
1. What are the names of three US national parks where one can find
2. Identify a site with Roman ruins in present-day France.
3. Name four films in which Orson Welles appeared.
4. Name three countries that imported Cuban sugar during the period
of time covered by the document collection.
5. Which children's TV program was on the air longer: the original
Mickey Mouse Club or the original Howdy Doody Show?
6. Which painting did Edvard Munch complete first: "Vampire" or
7. Which was the last dynasty of China: Qing or Ming?
8. Is Denmark larger or smaller in population than Norway?
We'll use the TREC-9 Q&A track data (all newspaper/wire data on the
TREC disks, about 2.5GB), which includes:
- AP from disks 1-3
- Wall Street Journal from disks 1-2
- San Jose Mercury News from disk 3
- Financial Times from disk 4
- Los Angeles Times from disk 5
- FBIS from disk 5
(NOTE: FBIS is included)
The searcher's task will be to answer each question and identify a
(minimal please) set of documents which supports the answer - within a
maximum of 5 minutes. Each answer may have multiple parts. The
searcher will be asked for the answer and how certain they are about
it both before and after searching. Sites should not submit more
parts to an answer than were requested, since additional ones will
be ignored.
Instructions to be given to searchers
The goal of this experiment is to determine how well an information
retrieval system can help you to answer questions you might ask when
searching newswire or newspaper data. The questions are of one of two
types:
- Find a given number of different answers
For example: Name 3 hydroelectric projects proposed or under
construction in the People's Republic of China.
- Choose between two given answers
For example: Which institution granted more MBAs in 1989 -
the Harvard Business School or MIT-Sloan?
You will be asked to search on four questions with one system and
four questions with another. You will have five minutes to
search on each question, so plan your search wisely. You will
be asked to answer the question and provide a measure of your
certainty of your answer both before and after searching.
You will also be asked to complete several additional questionnaires:
- Before the experiment - computer/searching experience and attitudes
- After each question
- After each four questions with the same system
- After the experiment - system comparison and experiment feedback
This is the minimal set of questionnaires.
Data to be collected and submitted to NIST (emailed to firstname.lastname@example.org)
Several sorts of result data will be collected for evaluation/analysis
(for all questions unless otherwise specified):
===> Due at NIST by 31 August 2000:
1. sparse-format data
===> Due at NIST by the time the site's conference notebook paper is due:
2. rich-format data
3. a full narrative description of one interactive session for
a question to be determined by each site
4. any further guidance or refinement of the task specification
given to the searchers
5. data from the common searcher questionnaires
Sparse-format data will comprise, for each question, the answer (with
possibly multiple parts) as well as the TREC DOCNO for each document
cited in support of the answer. Sparse-format data
will be the basis for an assessment of summary search effectiveness at
NIST: basically, whether the question was answered or not.
Rich format data for each question will record:
- the searcher's answer to the question before searching begins,
in case the searcher believes s/he already knows the answer.
- significant events in the course of the interaction and their
timing.
Rich format data are intended for analytical evaluation by the
participating groups.
All significant events and their timing in the course of the
interaction should be recorded. The events listed below are
those that seem to be fairly generally applicable to different
systems and interactive environments; however, the list may
need extending or modifying for specific systems and so should
be taken as a suggestion rather than a requirement:
o Intermediate search formulations: if appropriate to the
system, these should be recorded.
o Documents viewed: "viewing" is taken to mean the searcher
seeing a title or some other brief information about a
document; these events should be recorded.
o Documents seen: "seeing" is taken to mean the searcher
seeing the text of a document, or a substantial section of
text; these events should be recorded.
o Terms entered by the searcher: if appropriate to the
system, these should be recorded.
o Terms seen (offered by the system): if appropriate to the
system, these should be recorded.
o Selection/rejection: documents or terms selected by the
user for any further stage of the search (in addition to the
final selection of documents).
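The event types above can be captured with a minimal timestamped logger. The sketch below is illustrative only; the class, event names, and record layout are our own assumptions, not a track requirement, and the DOCNO shown is hypothetical.

```python
import json
import time

class SearchEventLog:
    """Minimal rich-format event logger (illustrative; field names are assumptions)."""

    def __init__(self, site_id, system_id, searcher_id, question_num):
        self.header = {"site": site_id, "system": system_id,
                       "searcher": searcher_id, "question": question_num}
        self.events = []
        self.start = time.monotonic()

    def record(self, event_type, detail):
        # Store each event with its offset (in seconds) from the start of the search.
        self.events.append({"t": round(time.monotonic() - self.start, 2),
                            "type": event_type, "detail": detail})

    def dump(self):
        # Serialize the whole session for later analytical evaluation.
        return json.dumps({"session": self.header, "events": self.events}, indent=2)

log = SearchEventLog("ACME", "sys1", "s03", 4)
log.record("query", "cuban sugar imports")   # intermediate search formulation
log.record("doc_viewed", "AP890512-0123")    # title/brief info seen (hypothetical DOCNO)
log.record("doc_seen", "AP890512-0123")      # full text seen
log.record("doc_selected", "AP890512-0123")  # selected in support of the answer
```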
Format of sparse data to be submitted to NIST
One ascii file from each site. One line for each question a searcher
works on, even if no answer is found. Each line contains the following
items, with intervening spaces and semicolons as indicated. Since
semicolons will be used to parse the lines, they may occur only as
the delimiters shown:
SiteID; SystemID; SearcherID; QuestionNum; ANSWERLIST; DOCNOLIST
SiteID - unique across sites
SystemID - unique within site to each of your IR systems
SearcherID - unique within site to each of your searchers
QuestionNum - a digit, the question number in the guidelines
ANSWERLIST - a list of answer parts separated by commas
Answer parts may contain spaces. The number of parts
will vary with the question. If no answer is found then
there will be just a space followed by a semicolon.
DOCNOLIST - a list of TREC DOCNOs as found in the documents,
separated by commas.
Sites determine SiteID, SystemID, and SearcherID. They are not allowed
to contain spaces.
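A line in this format can be parsed mechanically. The following sketch is our own illustration, not an official tool: it splits on semicolons and commas per the rules above, and the sample IDs and DOCNOs are hypothetical.

```python
def parse_sparse_line(line):
    """Parse one sparse-format result line into its six fields.

    Illustrative sketch only; the dictionary key names are our own choices.
    An empty ANSWERLIST (no answer found, i.e. just a space) yields an empty list.
    """
    fields = line.split(";")
    if len(fields) != 6:
        raise ValueError("expected 6 semicolon-separated fields, got %d" % len(fields))
    site, system, searcher, qnum, answers, docnos = (f.strip() for f in fields)
    return {
        "site": site,
        "system": system,
        "searcher": searcher,
        "question": int(qnum),
        # Answer parts and DOCNOs are comma-separated; interior spaces in an
        # answer part are preserved, and empties are dropped for "no answer".
        "answers": [a.strip() for a in answers.split(",") if a.strip()],
        "docnos": [d.strip() for d in docnos.split(",") if d.strip()],
    }

rec = parse_sparse_line(
    "ACME; sys1; s03; 4; Japan, Canada, Spain; FT921-1234, AP890512-0123")
```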
Evaluation of data submitted to NIST
The assessment procedure will check each question to see whether or
not it is fully answered and whether the answer (each of its parts) is
supported by the document(s) cited. Fully answered and supported
questions will be assigned a 1; otherwise a 0 will be assigned to the
question, i.e., no credit will be given for partially correct and/or
partially supported answers.
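The document does not specify NIST's summary computation beyond the 0/1 rule, but a per-system mean of the question scores is the natural aggregate. A sketch under that assumption (the function and input layout are our own):

```python
from collections import defaultdict

def summarize(scored):
    """Mean 0/1 score per system.

    `scored` is a list of (system_id, question_num, score) triples, where
    score is 1 only for a fully answered and fully supported question.
    This aggregation is an assumption; NIST's actual summary measure
    may differ.
    """
    by_system = defaultdict(list)
    for system, _question, score in scored:
        by_system[system].append(score)
    return {system: sum(s) / len(s) for system, s in by_system.items()}

means = summarize([("sys1", 1, 1), ("sys1", 2, 0),
                   ("sys2", 1, 1), ("sys2", 2, 1)])
```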
Experimental design in general
The design will be a within-subject design like that used for TREC-8
but with different numbers of questions and searchers.
Each user will search on all the questions. Questions will be presented
in a pseudo-random fashion, as last year, with 16 variations to
ensure each question is searched at a different position (1st through
8th) by each system. This means that one complete round of the experiment
will require 16 subjects. Contact Bill Hersh for allocation of the
experimental matrices.
The searching part of the experiment will also take about one hour.
Each question will take 7 minutes: 1 minute before, to find out if
they know the answer; 5 minutes to find the answer(s) by searching;
and 1 minute after, to answer questions about that specific search.
An example non-searching part of the experimental session would be:
Introductory stuff 10 minutes
Tutorials (2 systems) 30 minutes total
Post system questions 10 minutes total (5 for each system)
Exit questions 10 minutes
(Total non-searching 1 hour)
Experimental design for a site
1. Example minimal experimental matrix as run:
Reminder: Don't actually run this one. Contact Bill Hersh
(email@example.com) to request your own matrix.
Subject Block #1 Block #2
1 System 2: 4-7-5-8 System 1: 1-3-2-6
2 System 1: 3-5-7-1 System 2: 8-4-6-2
3 System 1: 1-3-4-6 System 2: 2-8-7-5
4 System 1: 5-2-6-3 System 2: 4-7-1-8
5 System 2: 7-6-2-4 System 1: 3-5-8-1
6 System 2: 8-4-3-2 System 1: 6-1-5-7
7 System 1: 6-1-8-7 System 2: 5-2-4-3
8 System 2: 2-8-1-5 System 1: 7-6-3-4
9 System 1: 4-7-5-8 System 2: 1-3-2-6
10 System 2: 3-5-7-1 System 1: 8-4-6-2
11 System 2: 1-3-4-6 System 1: 2-8-7-5
12 System 2: 5-2-6-3 System 1: 4-7-1-8
13 System 1: 7-6-2-4 System 2: 3-5-8-1
14 System 1: 8-4-3-2 System 2: 6-1-5-7
15 System 2: 6-1-8-7 System 1: 5-2-4-3
16 System 1: 2-8-1-5 System 2: 7-6-3-4
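The balance property claimed above (each question searched at each position by each system exactly once) can be checked mechanically. The sketch below encodes the example matrix and verifies it; this is illustrative code, not part of the track software.

```python
from collections import Counter

# Each row: (system for block 1, its four questions,
#            system for block 2, its four questions).
# Transcribed from the example matrix above, subjects 1-16.
ROWS = [
    (2, [4,7,5,8], 1, [1,3,2,6]),
    (1, [3,5,7,1], 2, [8,4,6,2]),
    (1, [1,3,4,6], 2, [2,8,7,5]),
    (1, [5,2,6,3], 2, [4,7,1,8]),
    (2, [7,6,2,4], 1, [3,5,8,1]),
    (2, [8,4,3,2], 1, [6,1,5,7]),
    (1, [6,1,8,7], 2, [5,2,4,3]),
    (2, [2,8,1,5], 1, [7,6,3,4]),
    (1, [4,7,5,8], 2, [1,3,2,6]),
    (2, [3,5,7,1], 1, [8,4,6,2]),
    (2, [1,3,4,6], 1, [2,8,7,5]),
    (2, [5,2,6,3], 1, [4,7,1,8]),
    (1, [7,6,2,4], 2, [3,5,8,1]),
    (1, [8,4,3,2], 2, [6,1,5,7]),
    (2, [6,1,8,7], 1, [5,2,4,3]),
    (1, [2,8,1,5], 2, [7,6,3,4]),
]

def is_balanced(rows):
    """Check that each (system, position 1-8) pair sees every question exactly once."""
    counts = Counter()
    for sys1, block1, sys2, block2 in rows:
        for pos, q in enumerate(block1, start=1):  # block 1 fills positions 1-4
            counts[(sys1, pos, q)] += 1
        for pos, q in enumerate(block2, start=5):  # block 2 fills positions 5-8
            counts[(sys2, pos, q)] += 1
    # 2 systems x 8 positions x 8 questions = 128 cells, each hit exactly once.
    return len(counts) == 128 and all(c == 1 for c in counts.values())
```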
The design for a given site can be augmented in two ways:
1. Participants can be added in groups of ? using the design
above. Additional blocks should be requested from Bill
Hersh.
2. Systems can be added by adding additional groups of ? users
with each new system. Additional blocks should be requested
from Bill Hersh.
Questions cannot be added/subtracted individually for each site.
All augmentations other than the two listed above, however interesting,
are outside the scope of this design. If sites plan such adjunct
experiments, they are encouraged to design them for maximal synergy
with the track design.
Analysis is up to each group, but all are strongly encouraged to take
advantage of the experimental design and undertake:
1. exploratory data analysis
to examine the patterns of correlation, interaction, etc.
involving the major factors. Some example plots for the TREC-6
interactive data (recall or precision by searcher or topic)
are available on the Interactive Track web site under
"Interactive Track History".
2. analysis of variance (ANOVA), where appropriate,
to estimate the separate contributions of searcher, topic and
system as a first step in understanding why the results of one
search are different from those of another.
All experiments must be done and sparse-format data sent to NIST by
31 August 2000. Rich-format data must be sent to NIST by the time the
conference notebook papers are due.
Last updated: Tuesday, 22-Sep-2015 07:55:30 MDT
Date created: Monday, 31-Jul-00