Assessor Opinions
TREC QA evaluations are based on the assumption that assessors' opinions will differ
- true in IR & true in QA
- comparative evaluation is stable, but only comparative evaluation is valid
- absolute scores do change when the assessors change
- must compare results from the exact same test
- gain confidence in conclusions by using more questions, repeating experiment, requiring larger gap between scores
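
The points above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the system names (`sys1`, `sys2`), the two assessors, the 0/1 judgments, and the `min_gap` threshold are hypothetical and are not TREC data or TREC's scoring procedure. It scores two systems on the same eight questions under two assessors, then checks whether the ranking agrees even though the absolute scores do not.

```python
# Hypothetical example: same questions, same system answers, two assessors.
# 1 = assessor judged the answer correct, 0 = judged incorrect.
assessor_a = {
    "sys1": [1, 1, 0, 1, 1, 0, 1, 1],
    "sys2": [1, 0, 0, 1, 0, 0, 1, 1],
}
assessor_b = {  # a stricter (hypothetical) assessor
    "sys1": [1, 0, 0, 1, 1, 0, 1, 0],
    "sys2": [1, 0, 0, 0, 0, 0, 1, 0],
}


def accuracy(judged):
    """Fraction of questions judged correct."""
    return sum(judged) / len(judged)


def score_and_rank(judgments):
    """Return (ranking, scores) for one assessor's judgments."""
    scores = {name: accuracy(j) for name, j in judgments.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


def gap_is_large_enough(scores, better, worse, min_gap):
    """Only trust a difference that exceeds a chosen threshold.

    min_gap is an assumed, user-chosen value, not a TREC constant.
    """
    return scores[better] - scores[worse] >= min_gap


ranking_a, scores_a = score_and_rank(assessor_a)
ranking_b, scores_b = score_and_rank(assessor_b)

# Absolute scores differ between assessors ...
print("assessor A:", scores_a)
print("assessor B:", scores_b)
# ... but the comparative result (which system is better) agrees.
print("same ranking:", ranking_a == ranking_b)
print("gap trusted under A:", gap_is_large_enough(scores_a, "sys1", "sys2", 0.1))
```

Scores taken from different assessors, or from different question sets, are not comparable with each other; only the ordering within one test is meaningful, and only when the gap is large enough for the number of questions used.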