# Case Study 01 - The Standard's Own Conversation History

**Date:** 2026-06-04
**Method:** Conversation-transcript analysis across three matched-pair cohorts. Build-trap signal patterns counted per 1,000 user words and compared pre vs post VALUE.md adoption.
**Data:** All sessions in `~/.claude/projects/*` covering the relevant projects. Anonymization: company names and personal identifiers stripped.
**Result:** Mixed. One pair shows strong positive signal (the standard's own development). One pair shows no clear improvement (an established large project where VALUE.md was retrofitted late). One pair was structurally invalid (the post-cohort contained no interactive build sessions).
**What this is evidence for:** the hypothesis sharpens. VALUE.md improves conversational discipline when applied as a project starts or when the project explicitly commits to the standard's vocabulary. Retrofitting late into a deep existing project does not visibly move the same signals.

---

## The question being asked

The Keep a Value standard claims that builders using a VALUE.md reach recipient-accepted results faster than builders working from plain descriptions. The paired-run experiment that would prove this has not been run. This case study is the next-best evidence available: a retrospective analysis of the author's own conversation transcripts across projects, before and after the VALUE.md format existed.

Three pairs were attempted. Each pair compares the same project (or directly comparable projects) before and after the VALUE.md era. The comparison metric is the rate at which observable build-trap signals appear in the user-side transcript, expressed per 1,000 user words so that cohorts of different sizes can be compared.

---

## What was measured

Each pair was scanned for the following signal patterns in the user-typed text:

| Signal | What it indicates | Pattern examples |
|---|---|---|
| Restart language | The builder lost the thread mid-conversation | "let me start over", "scratch that", "actually wait", "forget that" |
| Lost-thread language | The builder forgot what they were doing | "what was I", "where were we", "as I mentioned", "remind me" |
| Premature commitment | The builder jumped to implementation before naming the problem | "let's build", "let's implement", "let's add" |
| Time pressure | The builder is rushing | "quickly", "fast", "real quick", "ASAP" |
| Surface fixes | The builder is patching without diagnosing | "small tweak", "one more change", "quick fix" |
| Frustration explicit | The builder is stuck and showing it | profanity, "this isn't working", "no that's wrong" |
| Recipient naming (specific) | The builder names a concrete role | "the on-call engineer", "the operator", "the customer", "the stakeholder" |
| Vague recipient | The builder uses category-level placeholder language | "users", "the team", "everyone", "developers", "people" |
| Value-discipline vocabulary | The builder uses the standard's terms | "value", "recipient", "changes for", "breaks if", "promise", "falsifiable" |

Each signal is counted in the user-side transcript only (Claude's responses are excluded) and reported as a rate: 1,000 x occurrences / user words in the cohort. A positive delta means the rate increased after the split date; a negative delta means it decreased.
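
The counting and normalization described above can be sketched as follows. This is a minimal illustration, not the study's actual scanner: the pattern lists are a subset of the table, and whole-phrase, case-insensitive matching is an assumption, since the exact matching rules are not documented here.

```python
import re

# Illustrative subset of the signal table. The study's full pattern list and
# matching rules are not given in this document, so these are assumptions.
SIGNAL_PATTERNS = {
    "restart": ["let me start over", "scratch that", "actually wait", "forget that"],
    "vague_recipient": ["users", "the team", "everyone", "developers", "people"],
    "time_pressure": ["quickly", "fast", "real quick", "asap"],
}


def count_signal(text: str, phrases: list[str]) -> int:
    """Count case-insensitive, whole-phrase matches of any phrase in text."""
    total = 0
    for phrase in phrases:
        pattern = r"\b" + re.escape(phrase) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def rate_per_1k(user_turns: list[str], phrases: list[str]) -> float:
    """Signal occurrences per 1,000 user words across a cohort's user turns."""
    words = sum(len(turn.split()) for turn in user_turns)
    if words == 0:
        return 0.0  # an empty cohort has no meaningful rate
    hits = sum(count_signal(turn, phrases) for turn in user_turns)
    return 1000.0 * hits / words
```

Only user-typed turns should be passed in as `user_turns`; assistant responses are excluded before counting. Whole-word matching keeps "fast" from firing inside "breakfast", but it cannot tell a stressed "quickly" from a routine one, which is the conflation the caveats in this study describe.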

---

## Pair 1 - the standard's own development

**Project:** the Keep a Value standard itself - the conversations that produced the standard, the runbook, the experiments, and this case study.
**Split date:** 2026-06-01 (the standard's v0.1.0 Genesis release - the first day the discipline was formalized as a published artifact).
**Pre cohort:** 1 interactive multi-turn session (24,750 user words).
**Post cohort:** 8 interactive multi-turn sessions (77,314 user words, 320 user turns).

### Headline numbers

| Signal | Pre rate | Post rate | Delta | Prediction - outcome |
|---|---|---|---|---|
| Recipient naming (specific) | 0.00 / 1k | 0.93 / 1k | from zero | predicted UP - YES |
| Vague recipient | 8.08 / 1k | 2.59 / 1k | -68% | predicted DOWN - YES |
| Frustration explicit | 1.21 / 1k | 0.39 / 1k | -68% | predicted DOWN - YES |
| Lost-thread language | 0 / 1k | 0 / 1k | flat | predicted DOWN - flat |
| Restart language | 0 / 1k | 0.09 / 1k | from zero | predicted DOWN - went UP |
| Premature commitment | 0 / 1k | 0.026 / 1k | from zero | predicted DOWN - went UP slightly |
| Time pressure | 0.40 / 1k | 0.82 / 1k | +102% | predicted DOWN - went UP |
| Value-discipline vocabulary | 23.4 / 1k | 16.4 / 1k | -30% | predicted UP - went DOWN (saturation) |
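
The Delta column is a relative change between the two rates. A minimal sketch of that arithmetic, assuming the "flat" and "from zero" labels are used whenever the pre rate is zero (the function name `percent_delta` is illustrative, not taken from the study's tooling):

```python
def percent_delta(pre_rate: float, post_rate: float) -> str:
    """Relative change from pre to post, formatted like the Delta column."""
    if pre_rate == 0:
        # Percent change from a zero baseline is undefined, so report it in words.
        return "flat" if post_rate == 0 else "from zero"
    change = 100.0 * (post_rate - pre_rate) / pre_rate
    return f"{change:+.0f}%"


print(percent_delta(8.08, 2.59))   # vague recipient
print(percent_delta(1.21, 0.39))   # frustration explicit
print(percent_delta(23.4, 16.4))   # value-discipline vocabulary
print(percent_delta(0.0, 0.93))    # recipient naming (specific)
```

Recomputing from the rounded rates shown in the table can differ from the published delta by a few points. Time pressure, for example, comes out at +105% from 0.40 and 0.82 rather than +102%, presumably because the published deltas were computed from unrounded rates.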

### Reading the numbers honestly

Three signals moved strongly in the predicted direction:

- **Specific recipient mentions** went from literally zero to nearly 1 per 1,000 words. The standard's own vocabulary ("the builder stuck in the build trap", "the on-call engineer", "the operator") took root.
- **Vague placeholder language** ("users", "the team", "everyone") dropped 68%. The cheap shorthand the standard explicitly bans appeared two-thirds less often.
- **Frustration markers** dropped 68%. Whatever else was true, the post-VALUE.md cohort showed less explicit stuck-and-showing-it language than the pre cohort.

Four signals moved against prediction. Premature commitment rose from zero to a negligible 0.026/1k; the other three deserve honest accounting:

- **Restart language** went UP (zero to 0.09/1k). The post cohort contains the editing rounds where revisions were proposed, redirected, applied, and revised - genuine restarts. The signal pattern conflates honest course corrections with build-trap thrashing.
- **Time pressure** went UP (+102%). The post cohort includes more sessions where dispatching agent jobs and waiting on cluster runs created legitimate "let's run this quickly" language. The pattern conflates time-pressured frustration with normal task dispatch.
- **Value-discipline vocabulary** went DOWN 30%. This is a saturation artifact, not a regression: the pre cohort was a single session devoted to formalizing the standard's vocabulary, so the rate of those words was unusually high. The post cohort uses the vocabulary more naturally and across more contexts.

### What this pair establishes

For the standard's own development, the two signals predicted to drop that had a nonzero baseline (vague recipient, frustration) each fell by about two-thirds, and specific recipient naming rose as predicted. The signals that moved against prediction have plausible explanations that do not involve the build trap (different work shape, different operational mode). The pair does not prove causation. It is one data point consistent with the hypothesis that the standard's discipline shows up in the conversational record after it is adopted.

---

## Pair 2 - a deep existing project (retrofitted late)

**Project:** an internal infrastructure project ("REDACTED_PROJECT_KAI") with 541 sessions of accumulated history.
**Split date:** 2026-05-28 (the date a VALUE.md was first added to the project root).
**Pre cohort:** 473 sessions, 359,997 user words.
**Post cohort:** 68 sessions, 223,589 user words.

### Headline numbers

| Signal | Pre rate | Post rate | Delta |
|---|---|---|---|
| Recipient naming (specific) | 0.10 / 1k | 0.08 / 1k | -26% |
| Vague recipient | 0.79 / 1k | 0.68 / 1k | -14% |
| Frustration explicit | 0.05 / 1k | 0 / 1k | -100% |
| Restart language | 0.05 / 1k | 0.20 / 1k | +302% |
| Time pressure | 0.19 / 1k | 0.27 / 1k | +41% |
| Value-discipline vocabulary | 4.04 / 1k | 4.21 / 1k | +4% |

### Reading this pair honestly

This pair does NOT show the clean picture Pair 1 shows. Specific recipient naming did not improve (0.10 to 0.08 per 1k). Restart language and time pressure both went UP after the VALUE.md was added. The drop in vague-recipient language is real but much smaller than in Pair 1 (-14% versus -68%).

Two readings of the negative result:

1. The hypothesis is too broad. VALUE.md does not improve every project's conversation discipline. Adding it to a project with 473 sessions of accumulated context did not visibly change how that project's conversations were conducted.
2. The signals are insufficient. The post cohort contains different kinds of work than the pre cohort - more agent-dispatched runs, more parallel-stream conversations - and the signal patterns measure both kinds of behavior the same way.

Both are likely true. The pair is honest counter-evidence to a naive "VALUE.md improves any project" claim.

---

## Pair 3 - SQA (invalid as constructed)

**Project:** an internal quality-assurance project ("REDACTED_PROJECT_SQA").
**Pre cohort:** 51 interactive sessions, 325,146 user words.
**Post cohort:** 5 sessions, 14,573 words - BUT all five were automated security-review jobs and noop pings, not interactive build sessions.

The pair was abandoned. The post-cohort did not contain comparable interactive build conversation. The structural mismatch prevents apples-to-apples comparison. This is preserved as an honest failure of the matched-pair method when the post-VALUE.md operational mode shifts away from interactive builds.
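
A pre-flight check of this kind could catch such a mismatch before any rates are computed. The sketch below is hypothetical: the `Session` record, its `kind` labels, and the `min_sessions` threshold are assumptions for illustration and do not describe the raw export format.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    kind: str  # hypothetical label, e.g. "interactive", "automated", "noop"
    user_words: int


def interactive_only(cohort: list[Session]) -> list[Session]:
    """Keep only the sessions that are interactive build conversations."""
    return [s for s in cohort if s.kind == "interactive"]


def pair_is_comparable(pre: list[Session], post: list[Session],
                       min_sessions: int = 1) -> bool:
    """A pair is usable only if both cohorts contain interactive sessions."""
    return (len(interactive_only(pre)) >= min_sessions
            and len(interactive_only(post)) >= min_sessions)
```

With a post cohort made up entirely of automated jobs and noop pings, `pair_is_comparable` returns `False` and the pair is flagged as invalid instead of producing misleading deltas.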

---

## What three pairs together suggest

The hypothesis is sharpened, not confirmed:

> VALUE.md improves conversational discipline when a project explicitly commits to the standard from the start or when its work surface is small enough that the discipline can permeate. Retrofitting the standard onto a deep existing project does not visibly move the same conversational signals in the same direction.

Pair 1 (the standard's own work) supports it on its three strongest signals. Pair 2 (a deep retrofit) returns mixed-to-negative results. Pair 3 was unusable as constructed.

The honest claim Keep a Value can make today, based on this evidence, is:

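- **Yes**, on the evidence of one pair, for greenfield work where the builder commits to the standard's discipline: vague-recipient language and frustration markers dropped by about two-thirds, and specific recipient naming rose from zero. Lost-thread language was absent from both cohorts, so it provides no evidence either way.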
- **Not yet established**, for retrofitting onto established projects. The evidence does not show the standard moves the same signals there. A different intervention may be needed for those cases.

The Q3 paired-run experiment remains the gold standard. This case study is field evidence that supports the hypothesis for the greenfield case and does not support it for the retrofit case. Both findings are useful.

---

## Caveats and limits

- **The signal patterns are imperfect.** "Restart language" conflates productive course corrections with build-trap thrashing. "Time pressure" conflates dispatch language with stress. Future runs should refine.
- **The cohort sizes are unbalanced.** Pair 1's pre cohort is one session; Pair 2's post cohort is smaller than its pre cohort. Some deltas reflect sample-size artifacts, not real change.
- **The author is the same person across all cohorts.** This is a single-author study. External-builder evidence still requires the paired-run experiment.
- **Anonymization stripped only the most obvious surface (company and personal names).** Patterns of work and project shape may still be inferrable to someone with context.
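
To make the sample-size caveat concrete, a published rate can be converted back into an approximate raw count. The sketch below is back-of-envelope arithmetic on the rounded rates and word counts reported in this case study, so the results are estimates and not the exact counts in the raw exports.

```python
def implied_count(rate_per_1k: float, user_words: int) -> float:
    """Approximate raw occurrences implied by a rounded per-1k rate."""
    return rate_per_1k * user_words / 1000.0


# Pair 1 post cohort: 77,314 user words.
premature = implied_count(0.026, 77_314)
restart = implied_count(0.09, 77_314)
print(f"premature commitment: about {premature:.1f} occurrences")
print(f"restart language: about {restart:.1f} occurrences")
```

On these figures, the "from zero" rises in premature commitment and restart language in Pair 1 correspond to roughly 2 and 7 occurrences in 77,314 user words, which is why they should not be over-read.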

---

## What this case study DOES NOT establish

- That VALUE.md helps every project. Pair 2 says it does not, at least not visibly.
- That the absolute magnitude of build-trap signal reduction is large. The frustration-marker drop from 1.21/1k to 0.39/1k is real but small in absolute terms.
- That the change in conversational signals causally produces better outcomes for recipients. We measured language patterns, not recipient acceptance. The paired-run experiment is still the falsifier the standard needs.

---

## Raw data

- `kav-meta-raw.json` - Pair 1 raw signal counts per session
- `kai-pair1-raw.json` - Pair 2 raw signal counts per session
- `sqa-pair1-raw.json` - Pair 3 (abandoned) raw signal counts per session

All anonymized. Session IDs preserved for traceability; user-typed content not included in raw exports.