QA Test Collection
Good news: single-judge qrels are equivalent to expensive adjudicated qrels for comparing systems (provided the number of questions is large enough)
Bad news: still not a true test collection
- answer strings are judged, not documents
- little overlap in answer strings across runs, so existing judgments rarely cover a new run
- current research: how to build a true equivalent to IR’s qrels
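
The reuse problem in the bullets above can be illustrated with a minimal sketch. It assumes judgments are stored per (question, answer string) pair and looked up by exact string match. The questions, answer strings, and the `score_run` helper are all hypothetical, invented for illustration, and are not taken from any actual evaluation data or tool.

```python
# Hypothetical judgments: (question id, answer string) -> judged correct?
judged = {
    ("Q1", "Mount Everest"): True,
    ("Q1", "K2"): False,
    ("Q2", "1969"): True,
}


def score_run(run, judged):
    """Count correct, incorrect, and unjudged answers by exact string lookup."""
    counts = {"correct": 0, "incorrect": 0, "unjudged": 0}
    for qid, answer in run.items():
        verdict = judged.get((qid, answer))
        if verdict is None:
            # The string was never seen by a judge, so nothing can be said.
            counts["unjudged"] += 1
        elif verdict:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


# A new run (hypothetical) whose answers are phrased differently
# from any string that was judged.
new_run = {"Q1": "Mt. Everest", "Q2": "July 1969"}
counts = score_run(new_run, judged)
print(counts)  # {'correct': 0, 'incorrect': 0, 'unjudged': 2}
```

Because the judged unit is a free-form string rather than a document id, a run that did not contribute to the judging pool tends to land mostly in the unjudged bucket, which is why the judgments do not yet behave like IR qrels.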