
AI document review benchmark framework
A transparent framework for comparing document review systems: tasks, rubrics, and reporting rules. DraftLens does not publish numeric vendor rankings here without completed, disclosed runs.
Last updated 2026-05-11
What exists today
This page intentionally does not publish comparative vendor scores, win rates, or rankings. It defines what a fair benchmark would measure and how DraftLens will document results when real, completed runs exist.
Why raw scores mislead
What a headline number hides
- Prompt sensitivity: small wording changes can swing outcomes without changing real-world usefulness (a reporting sketch follows this list).
- Dataset leakage and memorization: models can appear “smart” on familiar text.
- Rubric gaming: systems can optimize for the automated scorer rather than for the risks a reader actually needs surfaced.
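
To make the first point concrete, here is a minimal sketch of how a prompt-sensitivity disclosure might look. The scorer is supplied by the caller, and the function name, prompt variants, and output fields are assumptions for illustration, not part of any published run: the same task is scored under several paraphrases and the full spread is reported rather than a single headline number.

```python
from statistics import mean, stdev
from typing import Callable, Sequence

def prompt_sensitivity_report(
    score_run: Callable[[str], float],   # caller supplies the real rubric scorer
    prompt_variants: Sequence[str],
) -> dict:
    """Score the same task under several prompt paraphrases and report the
    spread, so prompt sensitivity is visible instead of hidden behind one number."""
    scores = [score_run(p) for p in prompt_variants]
    return {
        "n_prompts": len(scores),
        "mean_score": round(mean(scores), 3),
        "score_spread": round(stdev(scores), 3) if len(scores) > 1 else 0.0,
        "per_prompt_scores": list(scores),  # publish every run, not only the best one
    }
```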
What good would include
Benchmark components
- Document types spanning legal, policy, technical, and executive narrative styles.
- Human adjudication protocol for a sampled subset to keep automated rubrics honest.
- Reporting rules for partial runs, abstentions, and provider unavailability.
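
One way to make those reporting rules auditable is a per-document run record that keeps abstentions and provider outages separate from scored results. The sketch below is a hypothetical schema under that assumption; the field and type names are illustrative, not a format DraftLens has committed to.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RunOutcome(Enum):
    SCORED = "scored"                              # rubric score available
    ABSTAINED = "abstained"                        # system declined to answer
    PROVIDER_UNAVAILABLE = "provider_unavailable"  # outage, not a quality signal

@dataclass
class RunRecord:
    document_id: str
    document_type: str         # e.g. "legal", "policy", "technical", "executive"
    system: str
    outcome: RunOutcome
    rubric_score: Optional[float] = None   # only set when outcome == SCORED
    human_adjudicated: bool = False        # True for the sampled audit subset

def completeness(records: list[RunRecord]) -> float:
    """Fraction of attempted documents that produced a scored result.
    Published alongside any aggregate score so partial runs stay visible."""
    scored = sum(1 for r in records if r.outcome is RunOutcome.SCORED)
    return scored / len(records) if records else 0.0
```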
Limitations
Explicitly out of scope here
- Fabricated win rates or unverifiable rankings.
- Undisclosed prompt sets presented as “neutral.”
- Claims of statistical significance without published sample design.
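
For contrast, the minimum a significance claim would need to disclose is the sample of document-level scores, the resampling procedure, and a fixed seed. The following percentile-bootstrap sketch is hypothetical; the parameter values are illustrative defaults, not DraftLens results.

```python
import random

def bootstrap_ci(scores: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean document-level score."""
    rng = random.Random(seed)          # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return (lo, hi)
```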