Stability of Laboratory Tests
Mean Kendall τ between system rankings produced from different qrel sets: 0.938
Similar results held for:
- different query sets
- different evaluation measures
- different assessor types
- single opinion vs. group opinion judgments
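As a rough illustration of how such a rank correlation can be computed, the sketch below scores the agreement between two system rankings by counting concordant and discordant system pairs. It assumes the tau-a variant with no tied scores (the slide does not say which variant was used), and the system names and scores are made up for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two rankings of the same systems.

    scores_a, scores_b: dicts mapping system name -> effectiveness score
    (for example mean average precision) under two different qrel sets.
    Assumption: tau-a, i.e. tied pairs count as neither concordant
    nor discordant.
    """
    systems = sorted(scores_a)
    if set(systems) != set(scores_b):
        raise ValueError("both score sets must cover the same systems")
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # The sign of the product shows whether both qrel sets
        # order this pair of systems the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four systems under two qrel sets (made-up numbers).
qrels_1 = {"sysA": 0.31, "sysB": 0.27, "sysC": 0.22, "sysD": 0.18}
qrels_2 = {"sysA": 0.29, "sysB": 0.30, "sysC": 0.20, "sysD": 0.15}

tau = kendall_tau(qrels_1, qrels_2)
print(round(tau, 3))  # one swapped pair out of six -> 0.667
```

A value of 1 means the two qrel sets rank the systems identically and -1 means the rankings are exactly reversed. A mean like the one reported above would be obtained by averaging this value over many pairs of qrel sets.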
How is filtering (with strong learning component) affected?