DraftLens

Benchmark methodology

How DraftLens thinks about evaluating multi-model document review: datasets, leakage controls, scoring rubrics, and human adjudication — a living framework, not a leaderboard.

Last updated 2026-05-11

What this page is for

A credible document-review benchmark states tasks, datasets, rubrics, and adjudication rules up front—and reports failure modes and costs, not only headline accuracy. This page explains how DraftLens thinks about that design space without publishing comparative vendor scores.

Measurement

What should be measured

  • Task realism: whole-document behaviors (cross-references, definitions), not isolated toy sentences.
  • Severity calibration: material vs. nit, and whether “helpful” edits are meaning-preserving.
  • Evidence linkage: when claims require exhibits, does the system stay grounded or hallucinate anchors? (A grounding check is sketched after this list.)
  • Operational stress: long inputs, partial provider failures, and honest partial outputs.
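
To make the first three points concrete, the sketch below models a single reviewer finding with a severity tier and an evidence anchor, plus a minimal grounding check. The names here (Finding, Severity, is_grounded) are illustrative only, not DraftLens APIs; a real rubric would carry far more metadata.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    MATERIAL = "material"  # changes meaning or legal effect
    NIT = "nit"            # style or consistency only

@dataclass(frozen=True)
class Finding:
    doc_id: str
    span: tuple[int, int]        # character offsets in the source document
    severity: Severity
    claim: str                   # what the reviewer model asserts is wrong
    evidence_anchor: str | None  # exhibit/section the claim cites, if any

def is_grounded(finding: Finding, known_anchors: set[str]) -> bool:
    """Check that a cited anchor actually exists in the document set.

    A material finding with no anchor counts as ungrounded; a nit may
    legitimately omit evidence.
    """
    if finding.evidence_anchor is None:
        return finding.severity is not Severity.MATERIAL
    return finding.evidence_anchor in known_anchors
```

Making the anchor an explicit field, rather than free text inside the claim, is what lets a harness score evidence linkage mechanically instead of by eyeball.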

Design

Why benchmark design matters

Raw scores become misleading when prompts leak, datasets overlap with training data, or rubrics reward aggressive rewriting over conservative correctness. A serious methodology publishes enough detail that a third party could attempt to reproduce the harness, while acknowledging that any reproduction may still differ in implementation details.
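
One common leakage control is to flag benchmark items whose n-grams overlap heavily with candidate training text. A minimal sketch, assuming plain whitespace tokenization and an illustrative threshold:

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams; real harnesses normalize more carefully."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(item: str, corpus_sample: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams also present in the sample."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_sample, n)) / len(item_grams)

# Items above some threshold (say 0.2) get quarantined for manual review.
```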

DraftLens stance

How we think about methodology

DraftLens product behavior already emphasizes structured outputs, explicit stages, and honest partial status. Any external benchmark we publish should mirror those values: disclose what ran, what failed, and what humans adjudicated.
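
A published run report could take the same shape. The sketch below, with hypothetical names (StageResult, RunReport), records per-stage outcomes and derives an aggregate status that admits partial failure rather than hiding it:

```python
from dataclasses import dataclass, field

@dataclass
class StageResult:
    stage: str    # e.g. "retrieval", "review", "adjudication"
    status: str   # "ok", "failed", or "skipped"
    detail: str = ""

@dataclass
class RunReport:
    run_id: str
    stages: list[StageResult] = field(default_factory=list)

    @property
    def status(self) -> str:
        """Aggregate honestly: report 'partial' when some stages failed."""
        seen = {s.status for s in self.stages}
        if seen == {"ok"}:
            return "complete"
        if "ok" in seen:
            return "partial"
        return "failed"
```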

Future outputs

What may ship later (only with real runs)

  • Task-mix results with confidence intervals, not single-point leaderboards (a bootstrap sketch follows this list).
  • Failure galleries: examples where models disagreed or where partial quorum applied.
  • Open prompts and scoring notes sufficient for independent replication attempts.
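
For intervals on the first item, a percentile bootstrap is one standard approach. This sketch assumes per-item scores on a common scale and pins the random seed so that published intervals are reproducible:

```python
import random

def bootstrap_ci(scores: list[float], iters: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap interval for a mean per-item score."""
    if not scores:
        raise ValueError("no scores to resample")
    rng = random.Random(0)  # fixed seed so published intervals reproduce
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(iters)
    )
    return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]
```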
