QA Judgments

Used human judges to assess correctness of response strings
- response pools consisted of unique [document id, answer string] pairs from all submitted runs
- document must support answer
  - “strict” evaluation counted Not Supported as wrong
  - “lenient” evaluation counted Not Supported as correct
- each question judged by single assessor
- all variants of a question judged by same assessor
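
The pooling and the two scoring modes above can be sketched as follows. This is a minimal illustration, not the track's actual judging software: the three-way `Judgment` label, the function names, and the sample data are all hypothetical, and a response is assumed to be a (question id, document id, answer string) triple.

```python
from collections import defaultdict
from enum import Enum


class Judgment(Enum):
    # Hypothetical labels an assessor might assign to one pooled response.
    CORRECT = "correct"              # answer is right and the document supports it
    NOT_SUPPORTED = "not_supported"  # answer is right, but the document does not support it
    WRONG = "wrong"


def build_pool(runs):
    """Merge all runs into one pool of unique (doc_id, answer) pairs per question."""
    pool = defaultdict(set)
    for run in runs:
        for question_id, doc_id, answer in run:
            pool[question_id].add((doc_id, answer))  # duplicates across runs collapse
    return dict(pool)


def is_correct(judgment, strict=True):
    """Strict counts Not Supported as wrong; lenient counts it as correct."""
    if judgment is Judgment.CORRECT:
        return True
    if judgment is Judgment.NOT_SUPPORTED:
        return not strict
    return False


def count_correct(judgments, strict=True):
    """Number of responses counted correct under the chosen evaluation."""
    return sum(is_correct(j, strict=strict) for j in judgments)


# Made-up example data: two runs that share one response.
runs = [
    [("q1", "d1", "answer a"), ("q1", "d2", "answer b")],
    [("q1", "d1", "answer a"), ("q2", "d3", "answer c")],
]
pool = build_pool(runs)

judgments = [Judgment.CORRECT, Judgment.NOT_SUPPORTED, Judgment.WRONG]
strict_total = count_correct(judgments, strict=True)
lenient_total = count_correct(judgments, strict=False)
```

Because the shared response is pooled once, the assessor judges it once and the judgment applies to every run that returned it. The two totals differ only by the number of Not Supported responses.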
