Human Judgments
NIST assessors judge each answer string for correctness
- 3-valued judgments: correct, unsupported, incorrect
- answer strings must be responsive
- appropriate units
- no answer stuffing
- match assessor’s interpretation of question
- e.g., location of Taj Mahal
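
The three-valued judgment above can be sketched as a small data structure. This is a minimal illustration, not NIST's tooling: the names `Judgment`, `tally`, and `fraction_correct` are hypothetical, and the `count_unsupported` switch (whether an unsupported answer counts toward the score) is an assumption that the slide does not state.

```python
from collections import Counter
from enum import Enum


class Judgment(Enum):
    """The three values an assessor can assign to an answer string."""
    CORRECT = "correct"
    UNSUPPORTED = "unsupported"
    INCORRECT = "incorrect"


def tally(judgments):
    """Count how many answer strings received each judgment."""
    counts = Counter(judgments)
    return {j: counts.get(j, 0) for j in Judgment}


def fraction_correct(judgments, count_unsupported=False):
    """Fraction of answer strings accepted as correct.

    Assumption: count_unsupported is a hypothetical switch that decides
    whether UNSUPPORTED answers are accepted alongside CORRECT ones.
    """
    judgments = list(judgments)
    if not judgments:
        return 0.0
    accepted = {Judgment.CORRECT}
    if count_unsupported:
        accepted.add(Judgment.UNSUPPORTED)
    return sum(j in accepted for j in judgments) / len(judgments)


sample = [
    Judgment.CORRECT,
    Judgment.UNSUPPORTED,
    Judgment.INCORRECT,
    Judgment.CORRECT,
]
print(tally(sample))
print(fraction_correct(sample))
print(fraction_correct(sample, count_unsupported=True))
```

Keeping the judgment as an enumeration rather than a boolean preserves the distinction between an incorrect answer and an unsupported one, so either scoring choice can be computed from the same assessor data.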