# Paired-run rubric - v0.1 (draft, ships with first run) The rubric for scoring paired-run results in the cross-builder pilot. This is **v0.1**: the rubric is intentionally minimal because a maximal rubric written before the first run would be a paperwork artifact, not a measurement instrument. The recurring-skip ledger in WRAPUP.md names "pre-committed stub rubric files" as a failure mode; this rubric exists in draft form here so the first run has a measuring stick, and ships in its final form with the first run's results. **Status:** Proposed · **Recipient:** the pilot operator scoring six runs (3 builders × 2 conditions). **Last verified:** not yet · **Re-verify by:** empty (status is not yet Proven) --- ## Headline measurements The two primary numbers per pair. ### 1. Time-to-acceptance (T) Definition: clock minutes from contract-sign-off (the moment the VALUE.md passes the six-part gate, or the moment the plain description is locked) to the recipient's verdict. Recorded by the builder. Honor system. The builder logs: - **Start timestamp** - the moment the contract is locked. ISO 8601 with timezone. - **End timestamp** - the moment the recipient gives the verdict. ISO 8601 with timezone. - **Active build time within window** - a rough estimate, in hours, of how much of the elapsed clock was actually spent building. (For comparing across builders whose lives differ in interruption profile.) Calendar time is the primary measurement (it is what the standard's Q3 claim is denominated in: total time to a result the recipient accepts). Active build time is the secondary measurement for cross-builder sanity-checking. ### 2. Acceptance verdict (V) Definition: the recipient, blind to which pass was VALUE.md and which was description-only, gives one of three verdicts on each deliverable: - **Accept** - "yes, this does the change I needed." Verbatim quote captured. - **Partial** - "yes for X, no for Y." Verbatim quote captured. - **Reject** - "no, this didn't do the change." Verbatim quote captured. The verdict is recorded against the verbatim wording the recipient used at intake to describe the change they expected. If the recipient cannot articulate the expected change at intake, the pair is excluded from analysis (and that exclusion is itself a finding). ## Secondary measurements Two numbers per pair for sharpening hypothesis, not for scoring. ### 3. Re-work count (R) How many times the builder went back to the recipient between contract sign-off and acceptance to clarify the contract or re-negotiate scope. A higher re-work count is not necessarily a worse outcome (mid-stream re-negotiation can be healthy); it is captured for comparison. ### 4. Builder confidence at sign-off (C) A self-reported 1-7 Likert at the moment the contract is locked: "how confident are you that the deliverable, if built to this contract, will get an Accept from the recipient?" Captured before the builder starts building. The hypothesis: VALUE.md condition pulls C higher at sign-off than description-only does, and the gap between C-at-sign-off and V-at-delivery is smaller for VALUE.md. The pilot tests both halves directly. ## Aggregation n=3 pairs. The pilot reports per-pair values, not averages - averaging across n=3 misrepresents the data shape. Per pair, the standard reads as a directional pass for the VALUE.md condition if: - T(VALUE.md) ≤ T(description-only), AND - V(VALUE.md) ≥ V(description-only) on the Accept > Partial > Reject ordering. Per pair, the standard reads as a directional fail for the VALUE.md condition if either: - T(VALUE.md) > T(description-only) AND V(VALUE.md) ≤ V(description-only), OR - V(VALUE.md) = Reject and V(description-only) ≠ Reject. The pilot's verdict is composed across pairs: - **3 of 3 directional pass:** v1.0 → v1.1 gate clears for first cycle. Status of Q3 moves Proposed → Promised. Status of P2 moves Proposed → Promised. - **2 of 3 directional pass:** field-evidence signal in the right direction, not enough to clear the gate. Standard runs another pilot at n=6 before moving Status. - **1 of 3 or 0 of 3 directional pass:** the hypothesis is wrong in at least the recruitment shape the pilot ran. The standard names this in §Status; Q3 stays Proposed; the next move is a re-design of the experiment, not a re-run. ## What this rubric does NOT measure - **Builder satisfaction.** Whether the builder enjoyed using the standard is not on the gate. The standard is a tool, not a beverage. - **Recipient delight.** Whether the recipient felt good about the deliverable is not on the gate. Accept / Partial / Reject is binary-ish on purpose. - **Code or artifact quality.** The standard's claim is about whether the change happened for the recipient, not whether the artifact is well-built. A poorly-built artifact that does the change is an Accept. - **Long-term retention.** This pilot measures the moment of acceptance. Whether the recipient still uses the deliverable in three months is the v1.1 → v1.2 cycle's question. ## Open issues for v0.2 Captured here so they don't get lost between first run and rubric-finalization: - Should `R` (re-work) have a sub-classification (clarification vs scope change vs builder error)? - Is `C` (confidence) leaking the order-randomization to the builder via self-reflection? - Is `T` (clock time) the right denominator, or is the right denominator "calendar days the project was active"? All three settle on first contact with the first real run.