QA Judgments
Used human judges to assess correctness of response strings
- response pools consisted of unique [doc-id, answer-string] pairs from all submitted runs
- document must support answer
- “strict” evaluation counted Not Supported as wrong
- “lenient” evaluation counted Not Supported as correct
- each question judged by a single assessor
- all variants of a question judged by the same assessor
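
The pooling and the strict/lenient distinction above can be sketched roughly as follows. This is only an illustration, not the evaluation software actually used: the names `Judgment`, `build_pool`, and `is_correct` are hypothetical, the `CORRECT` and `WRONG` labels are assumed (only Not Supported is named above), and each response is assumed to be a (question id, doc id, answer string) tuple.

```python
from enum import Enum


class Judgment(Enum):
    # Label names are assumptions; only "Not Supported" appears in the source.
    CORRECT = "correct"              # right answer, supported by the document
    NOT_SUPPORTED = "not_supported"  # right answer, document does not support it
    WRONG = "wrong"


def build_pool(runs):
    """Collect the unique (question_id, doc_id, answer_string) tuples across runs."""
    pool = set()
    for run in runs:
        pool.update(run)  # duplicates submitted by several runs collapse to one entry
    return pool


def is_correct(judgment, strict=True):
    """Map an assessor judgment to right/wrong under strict or lenient scoring."""
    if judgment is Judgment.CORRECT:
        return True
    if judgment is Judgment.NOT_SUPPORTED:
        return not strict  # wrong when strict, correct when lenient
    return False
```

Under this sketch a response identical across runs is judged once, and a Not Supported judgment is the only case where the strict and lenient scores differ.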