QA List Evaluation
Each list judged as a unit
- instances marked correct/unsupported/incorrect
- subset of correct & unsupported instances marked distinct
- leftmost instance always instance of record
China, Russia, Cuba agreed
China said
Evaluation metric is accuracy
accuracy = (# distinct instances) / (target # of instances)
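
The accuracy computation above can be sketched in a few lines of Python. This is an illustrative sketch, not the official scoring script: the `JudgedInstance` class, the `list_accuracy` function, and the sample judgments are all hypothetical names and data introduced here. It assumes that an instance counts toward the numerator only if the assessor marked it distinct and judged it correct or unsupported, as the bullets above describe.

```python
from dataclasses import dataclass


@dataclass
class JudgedInstance:
    """One instance in a submitted list, with the assessor's marks (hypothetical layout)."""
    text: str
    judgment: str           # "correct", "unsupported", or "incorrect"
    distinct: bool = False  # set by the assessor, only for correct/unsupported instances


def list_accuracy(instances, target_count):
    """Return (# distinct instances) / (target # of instances) for one list."""
    if target_count <= 0:
        raise ValueError("target_count must be positive")
    distinct = sum(
        1
        for inst in instances
        if inst.distinct and inst.judgment in ("correct", "unsupported")
    )
    return distinct / target_count


# Made-up judgments for a question that asks for 4 instances.
example = [
    JudgedInstance("China", "correct", distinct=True),
    JudgedInstance("China", "correct", distinct=False),  # repeat, not distinct
    JudgedInstance("Atlantis", "incorrect"),
]

print(list_accuracy(example, target_count=4))  # 0.25
```

Only one of the three submitted instances is marked distinct, so the list scores 1/4 regardless of how many repeats or incorrect instances it contains.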