DraftLens

AI document review benchmark framework

A transparent framework for comparing document review systems: tasks, rubrics, and reporting rules. DraftLens does not publish numeric vendor rankings here without completed, disclosed runs.

Last updated 2026-05-11

What exists today

This page intentionally does not publish comparative vendor scores, win rates, or rankings. It defines what a fair benchmark would measure and how DraftLens will document results when real, completed runs exist.

Why raw scores mislead

What a headline number hides

  • Prompt sensitivity: small wording changes can swing outcomes without changing real-world usefulness (a measurement sketch follows this list).
  • Dataset leakage and memorization: models can appear “smart” on text they have effectively seen before.
  • Rubric gaming: optimizing for the automated scorer instead of for the reader’s actual risk.
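
A concrete way to surface the first failure mode is to score the same document under several paraphrased prompts and report the spread rather than a single headline number. The sketch below is a minimal illustration, not a DraftLens implementation: score_review stands in for whatever system is under test, and the prompt variants and toy scorer are assumptions.

```python
# Hypothetical sketch: quantify prompt sensitivity by scoring one document
# under several paraphrased instructions and reporting the spread of scores.
from statistics import mean, pstdev

PROMPT_VARIANTS = [
    "Review this draft and list the issues a careful reader would flag.",
    "Identify problems in the following document.",
    "What should the author fix before publishing this text?",
]

def score_review(prompt: str, document: str) -> float:
    """Placeholder for the system under test; returns a rubric score in [0, 1]."""
    # Toy stand-in so the sketch runs end to end; replace with a real client call.
    return round(0.6 + 0.05 * (len(prompt) % 3), 2)

def prompt_sensitivity(document: str) -> dict:
    scores = [score_review(p, document) for p in PROMPT_VARIANTS]
    return {"scores": scores, "mean": mean(scores), "spread": pstdev(scores)}

if __name__ == "__main__":
    # A large spread suggests the headline number is prompt-dependent.
    print(prompt_sensitivity("Example draft text under review."))
```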

What good would include

Benchmark components

  • Document types spanning legal, policy, technical, and executive narrative styles.
  • Human adjudication protocol for a sampled subset to keep automated rubrics honest.
  • Reporting rules for partial runs, abstentions, and provider unavailability (a record-keeping sketch follows this list).
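
To illustrate the last component, each run could carry an explicit status so that partial runs, abstentions, and provider outages appear in the report instead of disappearing into an average. This is a sketch only; RunStatus, RunRecord, and the field names are assumptions, not a committed DraftLens schema.

```python
# Hypothetical run record: partial runs, abstentions, and provider outages
# are recorded and disclosed rather than silently dropped from aggregates.
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class RunStatus(Enum):
    COMPLETED = "completed"
    PARTIAL = "partial"                            # some tasks finished, some did not
    ABSTAINED = "abstained"                        # the system declined to answer
    PROVIDER_UNAVAILABLE = "provider_unavailable"  # no response obtained

@dataclass
class RunRecord:
    system: str
    document_id: str
    status: RunStatus
    rubric_score: Optional[float]  # None unless the run completed
    notes: str = ""

def reportable(records: List[RunRecord]) -> dict:
    """Aggregate only completed runs, but disclose every other outcome alongside them."""
    return {
        "completed": sum(r.status is RunStatus.COMPLETED for r in records),
        "partial": sum(r.status is RunStatus.PARTIAL for r in records),
        "abstained": sum(r.status is RunStatus.ABSTAINED for r in records),
        "unavailable": sum(r.status is RunStatus.PROVIDER_UNAVAILABLE for r in records),
    }
```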

Limitations

Explicitly out of scope here

  • Fabricated win rates or unverifiable rankings.
  • Undisclosed prompt sets presented as “neutral.”
  • Claims of statistical significance without published sample design (for contrast, the sketch below shows what a disclosed design could look like).
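
A disclosed sample design can be as simple as stating the unit of resampling, the resample count, and the seed. The sketch below shows one such design, a paired bootstrap over documents for the score difference between two systems; it is illustrative only, uses no real scores, and is not a published DraftLens procedure.

```python
# Hypothetical disclosed design: paired bootstrap over documents for the
# mean score difference between two systems, with resample count and seed
# stated up front so the interval can be reproduced.
import random

def paired_bootstrap_ci(scores_a, scores_b, resamples=10_000, seed=0, alpha=0.05):
    """Confidence interval for mean(score_a - score_b), resampling whole documents."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * resamples)]
    hi = means[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi
```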
