# Atomic-statement extraction rubric

How a document is decomposed into the units everything else measures.
v1.0 · 2026-06. Pairs with `probe-protocol.md`.

## What counts as one atomic statement

The smallest span that can be judged true or false on its own:

- **Assertion of fact** — "GPT-5 Pro was released in October 2025."
- **Quantified claim** — "Commercial pages dropped below 12% retention within fourteen days."
- **Named-entity attribution** — "Liu et al. (2024) showed models under-weight mid-context tokens."
- **Recommendation with qualifier** — "Front-load the claim and its qualifier into the same sentence."

## Splitting rules

1. One subject–predicate core per statement. Compound sentences split at
   coordinating conjunctions when each side carries its own truth-value.
2. The qualifier travels with the claim it qualifies. "X improved 38%
   [across three regions over six weeks]" is ONE statement — splitting it
   manufactures an orphan.
3. Anaphora are resolved at extraction: "this study" becomes the named study.
   If the antecedent cannot be resolved from the same paragraph, tag the
   statement `antecedent: unresolved` (it is still scored, and usually fails,
   on chunking).
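
Rule 3 can be sketched as a lookup with a fallback. This is a minimal illustration, not the extractor itself: the `ANAPHOR` pattern, the `resolve_anaphora` function, and the idea of passing in a per-paragraph antecedent map are all assumptions made for the example. The returned tag uses the `antecedent` values from the tagging schema.

```python
from __future__ import annotations

import re

# Hypothetical anaphor pattern for illustration only; real text needs a
# far broader inventory ("it", "they", "the latter", and so on).
ANAPHOR = re.compile(r"\b(?:this|that) (?:study|report|test|survey)\b", re.IGNORECASE)


def resolve_anaphora(statement: str, antecedents: dict[str, str]) -> tuple[str, str]:
    """Return (rewritten statement, antecedent tag).

    `antecedents` maps a lowercased anaphor found in the same paragraph
    to the named thing it refers to.
    """
    status = "resolved"

    def substitute(match: re.Match) -> str:
        nonlocal status
        key = match.group(0).lower()
        if key in antecedents:
            return antecedents[key]
        # No antecedent in this paragraph: leave the text alone and flag it.
        status = "unresolved"
        return match.group(0)

    return ANAPHOR.sub(substitute, statement), status


text, tag = resolve_anaphora(
    "This study showed models under-weight mid-context tokens.",
    {"this study": "Liu et al. (2024)"},
)
print(tag, "|", text)
```

A statement with no anaphor at all comes back unchanged and tagged `resolved`, which matches the rule: only a dangling reference earns the `unresolved` tag.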

## Tagging schema

| Field | Values |
|---|---|
| `type` | `factual` · `evaluative` · `prescriptive` · `historical` |
| `measurable` | `measurable` · `slot` (too vague to match against model output) |
| `quantified` | boolean — carries a specific number/date/duration |
| `sourced` | boolean — traceable to a named primary source |
| `entity_anchored` | boolean — names specific company/product/person/place |
| `qualifier_in_sentence` | boolean — qualifier within the same sentence |
| `antecedent` | `resolved` · `unresolved` |
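
One way to hold a tagged statement in code is a small validated record. The sketch below mirrors the field names and allowed values in the table above; the class name `AtomicStatement`, the `text` field, and the validation approach are assumptions for illustration, not part of the rubric.

```python
from dataclasses import dataclass

# Allowed values copied from the tagging schema table.
STATEMENT_TYPES = {"factual", "evaluative", "prescriptive", "historical"}
MEASURABLE_VALUES = {"measurable", "slot"}
ANTECEDENT_VALUES = {"resolved", "unresolved"}


@dataclass(frozen=True)
class AtomicStatement:
    text: str  # hypothetical field holding the extracted span
    type: str
    measurable: str
    quantified: bool
    sourced: bool
    entity_anchored: bool
    qualifier_in_sentence: bool
    antecedent: str

    def __post_init__(self) -> None:
        # Reject any enumerated field whose value is outside the schema.
        checks = (
            ("type", STATEMENT_TYPES),
            ("measurable", MEASURABLE_VALUES),
            ("antecedent", ANTECEDENT_VALUES),
        )
        for field_name, allowed in checks:
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(f"{field_name}={value!r} not in {sorted(allowed)}")


# Tag values below are an illustrative reading of the quantified-claim
# example, not an official annotation.
example = AtomicStatement(
    text="Commercial pages dropped below 12% retention within fourteen days.",
    type="factual",
    measurable="measurable",
    quantified=True,
    sourced=False,
    entity_anchored=False,
    qualifier_in_sentence=True,
    antecedent="resolved",
)
print(example.type, example.measurable, example.quantified)
```

Validating at construction time means a typo such as `type="opinion"` fails immediately instead of surfacing later as an unexplained gap in the scores.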

## The "slot" test

A statement is a **slot** if a competitor could publish it verbatim without
lying. "Our platform delivers industry-leading performance" fits any vendor,
asserts nothing checkable, and is excluded from visibility scoring. Roughly
half the claims on a typical marketing page are slots, and that finding is
itself one of the corpus's most stable results.

## Worked example

Input paragraph:

> "We're the leading platform in the space. Independent testing in March
> 2025 measured a 38% reduction in egress latency versus the named
> comparison, and our customers love the experience."

Extraction:

| # | Statement | type | measurable | quantified | sourced | entity_anchored |
|---|---|---|---|---|---|---|
| 1 | "We're the leading platform in the space." | evaluative | **slot** | no | no | no |
| 2 | "Independent testing in March 2025 measured a 38% reduction in egress latency versus the named comparison." | factual | measurable | yes | yes | partial |
| 3 | "Our customers love the experience." | evaluative | **slot** | no | no | no |

One paragraph, three statements, one survivor. Losing half or more to the slot test is typical.
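
The tally for this example can be reproduced in a few lines. The sketch below encodes only the columns needed to apply the slot exclusion; the dictionary layout and the `survivors` helper are assumptions for illustration, not the scoring pipeline.

```python
# Rows copied from the extraction table; only the columns needed here.
rows = [
    {
        "id": 1,
        "text": "We're the leading platform in the space.",
        "type": "evaluative",
        "measurable": "slot",
    },
    {
        "id": 2,
        "text": (
            "Independent testing in March 2025 measured a 38% reduction "
            "in egress latency versus the named comparison."
        ),
        "type": "factual",
        "measurable": "measurable",
    },
    {
        "id": 3,
        "text": "Our customers love the experience.",
        "type": "evaluative",
        "measurable": "slot",
    },
]


def survivors(statements):
    """Keep only statements that pass on to visibility scoring (non-slots)."""
    return [s for s in statements if s["measurable"] == "measurable"]


kept = survivors(rows)
print(f"{len(kept)} of {len(rows)} statements survive")  # prints "1 of 3 statements survive"
```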
