
AI document review benchmark framework
A transparent framework for comparing document review systems: tasks, rubrics, and reporting rules. DraftLens does not publish numeric vendor rankings here without completed, disclosed runs.
Last updated 2026-05-11
What exists today
This page intentionally does not publish comparative vendor scores, win rates, or rankings. It defines what a fair benchmark would measure and how DraftLens will document results when real, completed runs exist.
Why raw scores mislead
What a headline number hides
- Prompt sensitivity: small wording changes can swing outcomes without changing real-world usefulness (a reporting sketch follows this list).
- Dataset leakage and memorization: models can appear “smart” on familiar text.
- Rubric gaming: systems can optimize for the automated scorer rather than for the risks a reader actually needs surfaced.
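
To make the first point concrete, here is a minimal sketch of how a prompt-sensitivity disclosure might look. The scorer is supplied by the caller, and the function name, prompt variants, and output fields are assumptions for illustration, not part of any published run: the same task is scored under several paraphrases and the full spread is reported rather than a single headline number.

```python
from statistics import mean, stdev
from typing import Callable, Sequence

def prompt_sensitivity_report(
    score_run: Callable[[str], float],   # caller supplies the real rubric scorer
    prompt_variants: Sequence[str],
) -> dict:
    """Score the same task under several prompt paraphrases and report the
    spread, so prompt sensitivity is visible instead of hidden behind one number."""
    scores = [score_run(p) for p in prompt_variants]
    return {
        "n_prompts": len(scores),
        "mean_score": round(mean(scores), 3),
        "score_spread": round(stdev(scores), 3) if len(scores) > 1 else 0.0,
        "per_prompt_scores": list(scores),  # publish every run, not only the best one
    }
```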
What good would include
Benchmark components
- Document types spanning legal, policy, technical, and executive narrative styles.
- Human adjudication protocol for a sampled subset to keep automated rubrics honest.
- Reporting rules for partial runs, abstentions, and provider unavailability.
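
One way to make those reporting rules auditable is a per-document run record that keeps abstentions and provider outages separate from scored results. The sketch below is a hypothetical schema under that assumption; the field and type names are illustrative, not a format DraftLens has committed to.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RunOutcome(Enum):
    SCORED = "scored"                              # rubric score available
    ABSTAINED = "abstained"                        # system declined to answer
    PROVIDER_UNAVAILABLE = "provider_unavailable"  # outage, not a quality signal

@dataclass
class RunRecord:
    document_id: str
    document_type: str         # e.g. "legal", "policy", "technical", "executive"
    system: str
    outcome: RunOutcome
    rubric_score: Optional[float] = None   # only set when outcome == SCORED
    human_adjudicated: bool = False        # True for the sampled audit subset

def completeness(records: list[RunRecord]) -> float:
    """Fraction of attempted documents that produced a scored result.
    Published alongside any aggregate score so partial runs stay visible."""
    scored = sum(1 for r in records if r.outcome is RunOutcome.SCORED)
    return scored / len(records) if records else 0.0
```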
Limitations
Explicitly out of scope here
- Fabricated win rates or unverifiable rankings.
- Undisclosed prompt sets presented as “neutral.”
- Claims of statistical significance without published sample design.
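
For contrast, the minimum a significance claim would need to disclose is the sample of document-level scores, the resampling procedure, and a fixed seed. The following percentile-bootstrap sketch is hypothetical; the parameter values are illustrative defaults, not DraftLens results.

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean document-level score."""
    rng = random.Random(seed)          # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return (lo, hi)
```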