QA Judgments
Used human judges to assess correctness of response strings
- response pools consisted of unique [document, answer string] pairs from all submitted runs
- document must support answer
- "strict" evaluation counted Not Supported wrong
- "lenient" evaluation counted Not Supported correct
- each question judged by single assessor
- all variants of a question judged by same assessor
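
The pooling and the strict/lenient distinction above can be sketched roughly as follows. This is a minimal illustration, not the evaluation software actually used: the `Judgment` labels, the `(question_id, doc_id, answer_string)` tuple layout, and the helper names `build_pool`, `is_correct`, and `fraction_correct` are all assumptions made for the example, and the fraction computed here is a plain share of judged pairs, not the official score.

```python
from enum import Enum


class Judgment(Enum):
    """Hypothetical labels an assessor could assign to one pooled pair."""
    CORRECT = "correct"              # answer is right and the document supports it
    NOT_SUPPORTED = "not supported"  # answer is right, document does not support it
    WRONG = "wrong"


def build_pool(runs):
    """Merge all submitted runs into a set of unique pairs to be judged.

    Each run is assumed to be an iterable of
    (question_id, doc_id, answer_string) tuples.
    """
    pool = set()
    for run in runs:
        pool.update(run)
    return pool


def is_correct(judgment, strict):
    """Map an assessor judgment to correct/incorrect under the chosen mode."""
    if judgment is Judgment.CORRECT:
        return True
    if judgment is Judgment.NOT_SUPPORTED:
        # Strict evaluation counts unsupported answers as wrong,
        # lenient evaluation counts them as correct.
        return not strict
    return False


def fraction_correct(judgments, strict):
    """Share of judged pairs counted correct (illustrative only)."""
    judgments = list(judgments)
    if not judgments:
        return 0.0
    return sum(is_correct(j, strict) for j in judgments) / len(judgments)
```

For example, a pair judged Not Supported contributes to `fraction_correct(..., strict=False)` but not to `fraction_correct(..., strict=True)`, so the lenient figure is never lower than the strict one.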