From 17517ed7c987f6a5d24b18cf1307903ebaa4ffa8 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 23 Jun 2026 14:23:18 +0100 Subject: [PATCH 1/7] DOC-6771 added skill to address PR comments --- .claude/commands/docs/assess-comments.md | 249 ++++++++++++++++++++++ .claude/state/assess-comments.coverage.md | 63 ++++++ 2 files changed, 312 insertions(+) create mode 100644 .claude/commands/docs/assess-comments.md create mode 100644 .claude/state/assess-comments.coverage.md diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md new file mode 100644 index 0000000000..a6fac94100 --- /dev/null +++ b/.claude/commands/docs/assess-comments.md @@ -0,0 +1,249 @@ +--- +description: Reconcile and assess all comments on the current branch's PR across every tool and commenter — report only, no edits +argument-hint: "[commenter-login] (optional — defaults to all commenters)" +--- + +You're going to read **all** the comments on the pull request for the current +branch — from automation (Cursor Bugbot, the security scanner, the history bot, +the AI PR summary, CI) *and* from humans — and produce a single **reconciliation +report**: what each comment raises, where the tools **agree**, where they +**contradict each other**, and where they're chasing each other in circles. This +is the synthesis layer the individual tools don't have. + +**This command assesses and reports. It does not edit PR content, commit, push, +or reply** — the one and only file it may write is its own coverage ledger +(step 11). Fixing is a separate step — hand the safe clusters to `/docs:bugbot` +or fix them yourself once you've seen the report. + +It's also meant to be run **repeatedly** as the PR evolves: the bots tend to post +fresh comments after each round of fixes, so each run is a checkpoint — verify +the few highest-impact things deeply, leave a "story so far", and re-run to pick +up the rest. Don't try to settle the whole PR in one pass. + +The command **tracks its own test coverage** in +`.claude/state/assess-comments.coverage.md`. Read it at the start of every run +(step 1) and update it at the end (step 11) — it's how the command knows, +honestly, which of its checks have actually fired on real data and which are +still unproven (ping-pong loop detection, notably). It's a *coverage* record, +not a correctness proof. + +If `$ARGUMENTS` is given, treat it as a commenter login and assess **only** that +author's comments (match case-insensitively; `cursor` matches `cursor[bot]`). +With no arguments, assess **every** commenter — bots and humans alike, because +human intent ("this is a CTF example", "deliberate UK spelling in a quote") is +exactly what lets you adjudicate the tool findings. + +1. **Load your coverage ledger, then find the branch and its PR.** First read + `.claude/state/assess-comments.coverage.md` so you know which of your own + checks are proven vs unproven this run (it also tells you to watch hardest for + the untested ones — ping-pong loops especially). Then run + `git branch --show-current`, and: + + ``` + gh pr view --json number,url,title,headRefName,body + ``` + + If there's no PR for this branch, stop and tell me — there's nothing to + assess. Keep the PR `body`: the AI summary usually lives there and is + **intent context**, not a finding to triage. + +2. **Collect every comment, from every source.** Check all of these: + + - **Inline review comments** (tied to lines): + ``` + gh api repos/{owner}/{repo}/pulls//comments --paginate + ``` + - **Top-level PR / issue comments** (summaries, bot posts, human replies): + ``` + gh api repos/{owner}/{repo}/issues//comments --paginate + ``` + - **Review verdicts** (approve / request-changes bodies): + ``` + gh api repos/{owner}/{repo}/pulls//reviews --paginate + ``` + - **Thread resolution state** — *don't skip this.* The REST `comments` + endpoint's `position`/outdated field is **not** a reliable "addressed" + signal (a resolved thread routinely still reports `position` non-null). The + authoritative signal is the GraphQL `reviewThreads`: + ``` + gh api graphql -F query='query { repository(owner:"", name:"") + { pullRequest(number:) { reviewThreads(first:100) { nodes { + isResolved isOutdated comments(first:100){ nodes { + databaseId author{login} path body } } } } } } }' + ``` + Use `databaseId` to map each thread back to the inline comments from the + REST pull, so you know which are **resolved** vs **open**. + + For each comment capture: author `login`, `created_at`, the body, and for + inline comments the `path`, line, and its thread's `isResolved` / + `isOutdated`. If `$ARGUMENTS` was given, drop every comment whose author + doesn't match it before going further. + + If nothing is left after filtering, stop and tell me. + +3. **Tag each comment by its source role.** You're reconciling *roles*, not + usernames. Classify each author as one of: + + - **bugbot** — `cursor` / `cursor[bot]`, or any body mentioning "Bugbot" + - **security** — the security scanner + - **history** — the commit-history / semantic-similarity bot + - **summary** — the AI PR summary (usually the PR body or a bot comment) + - **ci** — `github-actions` / build / staging-link bots + - **human** — everyone else + + Treat **summary**, **ci**, and staging-build links as *context*, not + findings — they tell you what the author intended and whether the build is + healthy. Don't raise them as issues. + +4. **Cluster findings by what they touch, not by who said them.** Group the + actual findings (from bugbot, security, history, humans) by file + line, or + by shared theme when they're not line-specific. One cluster may hold comments + from several tools pointing at the same code — that overlap is the whole + point. + +5. **Split clusters into OPEN and RESOLVED — and treat them differently.** Using + the resolution state from step 2, mark each cluster open or resolved. An + active PR is almost always a mix; don't let resolved threads drown the open + ones. + + - **Open clusters** are your real work — they go through prioritise → + adjudicate (the steps below). + - **Resolved clusters** are *not* re-litigated as fresh findings. But they + aren't worthless either — do two cheap, high-value things with them: + - **Fix-quality spot-check.** For a sample of resolved findings (favour the + higher-impact ones), check that the committed change *actually addressed* + the point, not just silenced the comment — e.g. if a reviewer asked to + remove a term or fix a command, confirm it's genuinely gone/correct in + the current file. Flag any "fixed in name only." + - **"Resolved ≠ fixed" flag.** Call out any thread that is **resolved but + has no corresponding code change** (resolved, `isOutdated` false, and no + intervening commit touched the line). These are often legitimate + deferrals ("leaving as-is pending eng decision before GA") — but that's a + future obligation hiding inside a green checkmark, so surface it. + **Mandatory deep-verify:** any such thread that the bot rated **High (or + Critical) severity** must be opened and verified against the current code + *this round*, regardless of the depth cap in step 6 — a serious bug must + never hide behind a green checkmark just because the run hit its budget. + If it's genuinely fixed, say so; if it's resolved-but-still-present, + that's a headline. + - Resolved threads also feed **bot calibration**: the ratio of a bot's + findings that humans accepted-and-fixed vs dismissed tells you how much to + trust that bot's next *solo* finding. Note it when it's informative. + +6. **Prioritise, and cap how deep you go.** This command is meant to be run + **repeatedly** over a PR's life — the bots re-comment after each round of + fixes, so a single run is never the whole story. Don't try to verify every + cluster to the bottom. Rank the **open** clusters by impact — broken + `relref`/`image`/`embed-md` paths, wrong commands or code samples, incorrect + technical claims, and anything that breaks a feature the PR description calls + out, rank highest; pure style nits lowest — and pick **at most the top 3–5** + to verify in depth this round. The rest you list shallowly (see step 9) as + "not assessed this round" so the next run can pick them up. + + Three things always make the deep-verify set regardless of the cap, because + they're the failures a reconciler exists to catch: **contradictions**, + **ping-pong loops**, and **resolved-but-not-outdated High/Critical findings** + (step 5). Everything else competes for the remaining slots. + +7. **Adjudicate the chosen clusters — and verify, don't trust.** For each + cluster in your deep set, open the file it points at (and the underlying data + it depends on) and judge for yourself; these tools have false positives, and + a single tool can be *correct about the code but wrong about impact*, or + *plausible but refuted once you read the data*. Label each: + + - **Agreement** — multiple sources flag the same thing → high confidence it's + real. + - **Contradiction** — sources disagree, *or* a finding conflicts with stated + human intent or the PR summary (e.g. security flags a sample that a human + said is a deliberate CTF example; bugbot wants a refactor the history bot + says was reverted before). Call these out loudest — they're where blind + auto-fixing does damage. + - **Ping-pong** — using `created_at` order and the commits in between, spot + cases where a fix for one tool's comment re-triggered another tool, or the + same spot has been edited back and forth. Flag the loop rather than + extending it. + - **Solo** — a single source, no corroboration → judge it on its merits and + say how confident you are. A solo finding you've *verified against the code + and data* is high-confidence even with one source. + - **Stale / already-addressed** — the code no longer matches what the comment + describes; a later commit already handled it. + + US-vs-UK spelling counts as a real issue in this docs repo even though it's + small; pure style nits usually don't. + +8. **Cross-check review verdicts against open findings.** Separately from the + clusters, line up the PR's **review verdicts** (approve / request-changes) + against the findings still open. Flag the dangerous combination explicitly: + **a human approval (especially a low-confidence one — "skimmed it", "LGTM", + "didn't review in detail") sitting on top of an unresolved bot finding.** + Neither the reviewer nor the bot is wrong alone, but together they can let a + real issue merge — exactly the gap a reconciler should catch. Surface it even + when the PR is marked WIP / don't-merge. + +9. **Produce the reconciliation report.** One row per **open** cluster: + + | Cluster | File / line | Sources | Verdict | Recommended action | + |---|---|---|---|---| + + - **Verdict**: agreement / contradiction / ping-pong / solo / stale. + - **Recommended action**: one of — *safe to fix* (uncontested and low-risk → + hand to `/docs:bugbot`); *needs your call* (contradiction, approval-over- + finding, or history-bot warning); *dismiss* (false positive — say why); + *already handled* (stale). + + Then, in this order: + - the **headline** — contradictions, ping-pong loops, and any approval-over- + open-finding from step 8 (these need a human) first; + - the **safe-to-fix** list; + - what you're **dismissing** and why, so I can sanity-check your reasoning; + - a short **resolved-threads** note (from step 5) — only what's worth saying: + any "fixed in name only" or "resolved ≠ fixed / deferred" flags, and a bot- + calibration line if it's informative. Don't list cleanly-resolved threads + one by one; a count is enough; + - a **"Story so far"** line: which open clusters you deep-verified this round, + how many you deferred as "not assessed this round" (list them by one-line + title), how many threads were already resolved, and a reminder that + re-running after the next round of fixes will pick up the deferred ones plus + any new comments. This is a checkpoint, not a verdict on the whole PR. + +10. **Offer a second opinion via Codex, only if it's available.** Check whether + the Codex CLI is installed — run `command -v codex`. If it is, end by + suggesting the user get an independent review of the same diff the bots saw: + + ``` + /codex:review --base main --scope branch + ``` + + Frame it as optional and note it adds runtime, and that it's worth it mainly + for the contested or high-impact clusters — the ones where a second model + disagreeing (or agreeing) changes your confidence. Don't run it yourself: + `/codex:review` is user-invoked only, and this command is report-only. If + `command -v codex` finds nothing, say nothing about Codex. + +11. **Update your coverage ledger — automatically, but evidence-gated.** Open + `.claude/state/assess-comments.coverage.md` and update it to reflect what + *actually fired on real data this run*: + + - For each capability you genuinely exercised, bump its **encounters**, set + **last verified** to today, and add this PR (with a comment id or commit + SHA) to **Evidence**. Raise confidence per the scale — but **never raise a + confidence without a citable artifact.** No evidence, no bump. + - A capability counts as **🟢 corroborated** only on ≥2 *distinct* PRs, or + once with an independent `/codex:review` concurrence on the same finding. + - If you hit a **rare pattern for the first time** — a real ping-pong loop, a + resolved-but-still-broken thread — record its concrete signature in the + **Worked examples library** (tools involved, comment ids, commits). This is + how detection sharpens over time. + - Refresh ⏳ **decaying** entries you re-confirmed this run. + - If the run revealed that an *instruction* in this command should change + (a heuristic nearly missed something), **don't edit the command yourself** + — note the suggested refinement in your output and let me decide. + + The ledger update is the one file you *may* write. After writing it, **tell + me plainly that the command learned something and updated its own coverage + ledger**, name what changed in one line, and remind me I can include that + change in the PR or leave it out — my call. + +12. **Stop there.** Don't edit anything except the coverage ledger (step 11); + don't commit, push, or reply to the PR. Your output is the assessment — + I'll decide what to act on. diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md new file mode 100644 index 0000000000..66d721ddb9 --- /dev/null +++ b/.claude/state/assess-comments.coverage.md @@ -0,0 +1,63 @@ +# assess-comments — coverage ledger + +This file is the self-tracked **test coverage** of the `/docs:assess-comments` +command. It records, per capability, how confident we are that the logic has +actually *fired correctly on real data* — not how accurate its judgement is. + +**Read this honestly:** this is a *coverage* tracker, not a correctness proof. +It can only record capabilities it has *seen fire* (positive cases). It cannot +see false negatives — e.g. a ping-pong loop that was present but missed — so a +🟢 means "exercised on real PRs", never "guaranteed correct". + +The command reads this at the start of every run and updates it at the end +(step 1 and step 11). Updates are **automatic but evidence-gated**: no +confidence may rise without a citable artifact (PR number + comment id / commit +SHA). When the command updates this file it tells the user, who can choose +whether to commit the change. + +## Confidence scale +- ❓ **untested** — logic is written but has never fired on real data. +- 🟡 **seen once** — fired correctly on a single real PR. +- 🟢 **corroborated** — fired correctly on ≥2 *distinct* PRs, **or** once with an + independent `/codex:review` concurrence on the same finding. +- ⏳ **decaying** — was 🟡/🟢 but not re-confirmed in a long time (treat the API + shape, bot identities, and repo conventions as possibly drifted; re-verify). + +## Capabilities + +| Capability | Confidence | Real encounters | Last verified | Evidence | +|---|---|---|---|---| +| Branch/PR identification + arg handling | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 | +| Multi-source collection (inline + top-level + reviews) | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3510, #3374 | +| GraphQL thread-resolution pull (`isResolved`/`isOutdated`) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (12/12 resolved), #3374 (15/15) | +| Source-role tagging (bugbot/security/history/summary/ci/human) | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 | +| Open/resolved split | 🟢 corroborated | 2 | 2026-06-23 | #3510, #3374 | +| Fix-quality spot-check (genuinely fixed vs silenced) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (term removals landed), #3374 (`num_docs`, dropIndex landed) | +| "Resolved ≠ fixed" flag — **legitimate deferral** variant | 🟡 seen once | 1 | 2026-06-23 | #3510 (TS.BGET:122 left pending eng) | +| "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken | +| Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) | +| **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding); #3507 (bugbot vs author intent, off-branch) | +| **Ping-pong loop** detection | ❓ untested | 0 | — | never observed a real loop; all PRs so far had clean find→fix→resolve flow | +| Approval-over-open-finding cross-check | 🟢 corroborated | 2 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH) | +| Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) | +| Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run | +| Bot calibration (fixed-vs-dismissed ratio) | 🟡 seen once | 1 | 2026-06-23 | #3374 (bugbot signal mostly accepted) | +| Codex second-opinion availability gate | 🟢 corroborated | 2 | 2026-06-23 | #3415, #3374 (CLI on PATH; #3374 had a real Codex review) | + +## Worked examples library + +Concrete real-world signatures, so detection sharpens over time. Add to this +whenever a rare pattern is seen for the first time. + +### Ping-pong loops +*(none observed yet — when the first real loop appears, record the tool +sequence, the comment ids, and the commits between them here)* + +### Resolved-but-still-broken +*(none confirmed yet — record any thread marked resolved whose bug was still +present in the code)* + +### Cross-tool agreement +- **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot + ("Wrong FT.INFO document count field", inline, 06-17) independently flagged + the same `info.numDocs` → `num_docs` bug. Verified fixed in current code. From da7c04a2bf610f0fa3f591f6833af93a0361d846 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 23 Jun 2026 16:11:24 +0100 Subject: [PATCH 2/7] DOC-6771 bugbot fixes plus skill lessons learned --- .claude/commands/docs/assess-comments.md | 28 +++++++++++++++-------- .claude/state/assess-comments.coverage.md | 12 +++++----- 2 files changed, 25 insertions(+), 15 deletions(-) diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md index a6fac94100..0fc4ae73df 100644 --- a/.claude/commands/docs/assess-comments.md +++ b/.claude/commands/docs/assess-comments.md @@ -12,8 +12,11 @@ is the synthesis layer the individual tools don't have. **This command assesses and reports. It does not edit PR content, commit, push, or reply** — the one and only file it may write is its own coverage ledger -(step 11). Fixing is a separate step — hand the safe clusters to `/docs:bugbot` -or fix them yourself once you've seen the report. +(step 11). Fixing is a separate step. Note that `/docs:bugbot` only triages and +fixes **Cursor Bugbot's own** comments — so it's the right hand-off *only* for +bugbot-sourced clusters. Safe fixes that came from humans, Codex, the security +or history bots, etc. won't be touched by `/docs:bugbot`; apply those directly +(or ask me to) once you've seen the report. It's also meant to be run **repeatedly** as the PR evolves: the bots tend to post fresh comments after each round of fixes, so each run is a checkpoint — verify @@ -76,10 +79,15 @@ exactly what lets you adjudicate the tool findings. For each comment capture: author `login`, `created_at`, the body, and for inline comments the `path`, line, and its thread's `isResolved` / - `isOutdated`. If `$ARGUMENTS` was given, drop every comment whose author - doesn't match it before going further. + `isOutdated`. If `$ARGUMENTS` was given, it narrows which comments become + **findings you cluster and adjudicate** (steps 4–7) to the matching author — + but **don't discard the others**. Always retain every review verdict and the + full set of open findings as *context*, because step 8 (approval-over-open- + finding) needs to see approvals and bot findings together; filtering them out + would silently disable that safety check. In filtered mode, run step 8 + against the full comment set, not the filtered one. - If nothing is left after filtering, stop and tell me. + If no comment matches `$ARGUMENTS`, stop and tell me. 3. **Tag each comment by its source role.** You're reconciling *roles*, not usernames. Classify each author as one of: @@ -186,10 +194,12 @@ exactly what lets you adjudicate the tool findings. |---|---|---|---|---| - **Verdict**: agreement / contradiction / ping-pong / solo / stale. - - **Recommended action**: one of — *safe to fix* (uncontested and low-risk → - hand to `/docs:bugbot`); *needs your call* (contradiction, approval-over- - finding, or history-bot warning); *dismiss* (false positive — say why); - *already handled* (stale). + - **Recommended action**: one of — *safe to fix* (uncontested and low-risk; + hand **bugbot-sourced** clusters to `/docs:bugbot`, but apply safe fixes + from any other source — human, Codex, security, history — directly, since + `/docs:bugbot` won't touch those); *needs your call* (contradiction, + approval-over-finding, or history-bot warning); *dismiss* (false positive — + say why); *already handled* (stale). Then, in this order: - the **headline** — contradictions, ping-pong loops, and any approval-over- diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md index 66d721ddb9..a529c1f612 100644 --- a/.claude/state/assess-comments.coverage.md +++ b/.claude/state/assess-comments.coverage.md @@ -27,18 +27,18 @@ whether to commit the change. | Capability | Confidence | Real encounters | Last verified | Evidence | |---|---|---|---|---| -| Branch/PR identification + arg handling | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 | -| Multi-source collection (inline + top-level + reviews) | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3510, #3374 | -| GraphQL thread-resolution pull (`isResolved`/`isOutdated`) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (12/12 resolved), #3374 (15/15) | -| Source-role tagging (bugbot/security/history/summary/ci/human) | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 | -| Open/resolved split | 🟢 corroborated | 2 | 2026-06-23 | #3510, #3374 | +| Branch/PR identification + arg handling | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3374, #3536 | +| Multi-source collection (inline + top-level + reviews) | 🟢 corroborated | 5 | 2026-06-23 | #3415, #3507, #3510, #3374, #3536 | +| GraphQL thread-resolution pull (`isResolved`/`isOutdated`) | 🟢 corroborated | 3 | 2026-06-23 | #3510 (12/12 resolved), #3374 (15/15), #3536 (2/2 open) | +| Source-role tagging (bugbot/security/history/summary/ci/human) | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3374, #3536 | +| Open/resolved split | 🟢 corroborated | 3 | 2026-06-23 | #3510, #3374, #3536 (0 resolved / 2 open) | | Fix-quality spot-check (genuinely fixed vs silenced) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (term removals landed), #3374 (`num_docs`, dropIndex landed) | | "Resolved ≠ fixed" flag — **legitimate deferral** variant | 🟡 seen once | 1 | 2026-06-23 | #3510 (TS.BGET:122 left pending eng) | | "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken | | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) | | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding); #3507 (bugbot vs author intent, off-branch) | | **Ping-pong loop** detection | ❓ untested | 0 | — | never observed a real loop; all PRs so far had clean find→fix→resolve flow | -| Approval-over-open-finding cross-check | 🟢 corroborated | 2 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH) | +| Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) | | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) | | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run | | Bot calibration (fixed-vs-dismissed ratio) | 🟡 seen once | 1 | 2026-06-23 | #3374 (bugbot signal mostly accepted) | From b616cf79266b442ab17a9f49863b92aa986484e8 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 23 Jun 2026 16:21:00 +0100 Subject: [PATCH 3/7] DOC-6771 more from the Bugbot --- .claude/commands/docs/assess-comments.md | 24 ++++++++++++++++++----- .claude/state/assess-comments.coverage.md | 19 ++++++++++++++---- 2 files changed, 34 insertions(+), 9 deletions(-) diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md index 0fc4ae73df..32c1395a23 100644 --- a/.claude/commands/docs/assess-comments.md +++ b/.claude/commands/docs/assess-comments.md @@ -74,6 +74,11 @@ exactly what lets you adjudicate the tool findings. isResolved isOutdated comments(first:100){ nodes { databaseId author{login} path body } } } } } } }' ``` + `first:100` covers any normal PR, but unlike the `--paginate`d REST calls it + does **not** page — so if a PR genuinely has >100 review threads (or a thread + with >100 comments), check `pageInfo.hasNextPage` and follow the cursor, or + you'll silently miss resolution state and skew the open/resolved split. + Use `databaseId` to map each thread back to the inline comments from the REST pull, so you know which are **resolved** vs **open**. @@ -92,7 +97,11 @@ exactly what lets you adjudicate the tool findings. 3. **Tag each comment by its source role.** You're reconciling *roles*, not usernames. Classify each author as one of: - - **bugbot** — `cursor` / `cursor[bot]`, or any body mentioning "Bugbot" + - **bugbot** — author `login` is `cursor` / `cursor[bot]`. **Tag by author, + not body text:** a *human* reply that quotes or mentions "Bugbot" is still a + human comment — don't tag it bugbot (that would mis-route it to + `/docs:bugbot`). Only treat a body mention of "Bugbot" as a signal when the + author is some other bot relaying bugbot's output. - **security** — the security scanner - **history** — the commit-history / semantic-similarity bot - **summary** — the AI PR summary (usually the PR body or a bot comment) @@ -166,10 +175,15 @@ exactly what lets you adjudicate the tool findings. said is a deliberate CTF example; bugbot wants a refactor the history bot says was reverted before). Call these out loudest — they're where blind auto-fixing does damage. - - **Ping-pong** — using `created_at` order and the commits in between, spot - cases where a fix for one tool's comment re-triggered another tool, or the - same spot has been edited back and forth. Flag the loop rather than - extending it. + - **Ping-pong** — a genuine loop is narrower than it looks. Using `created_at` + order and the commits in between, flag it **only** when *the same concern is + reopened* (a finding you thought fixed comes back), or *a fix for tool A's + comment re-triggered tool B in a cycle* (A→fix→B→fix→A…). It is **not** + ping-pong when a fix round simply prompts a re-scan that surfaces **new, + independent findings** — even if they land in the region you just edited. + That's normal iteration; treat those as fresh clusters, not a loop. (Real + example of the non-loop case in the coverage ledger's near-miss log.) When + it *is* a true loop, flag it rather than extending it. - **Solo** — a single source, no corroboration → judge it on its merits and say how confident you are. A solo finding you've *verified against the code and data* is high-confidence even with one source. diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md index a529c1f612..fd481f22f5 100644 --- a/.claude/state/assess-comments.coverage.md +++ b/.claude/state/assess-comments.coverage.md @@ -36,12 +36,12 @@ whether to commit the change. | "Resolved ≠ fixed" flag — **legitimate deferral** variant | 🟡 seen once | 1 | 2026-06-23 | #3510 (TS.BGET:122 left pending eng) | | "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken | | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) | -| **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding); #3507 (bugbot vs author intent, off-branch) | -| **Ping-pong loop** detection | ❓ untested | 0 | — | never observed a real loop; all PRs so far had clean find→fix→resolve flow | +| **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* | +| **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop. #3536 was the first *post-fix re-scan* tested: round-1 fixes (commit `da7c04a`) re-triggered a bugbot re-review that raised 3 new findings, but they were independent/pre-existing, not a back-and-forth — correctly judged NOT a loop (near-miss recorded below) | | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) | | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) | | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run | -| Bot calibration (fixed-vs-dismissed ratio) | 🟡 seen once | 1 | 2026-06-23 | #3374 (bugbot signal mostly accepted) | +| Bot calibration (fixed-vs-dismissed ratio) | 🟢 corroborated | 2 | 2026-06-23 | #3374 (bugbot signal mostly accepted); #3536 (bugbot 5/5 findings valid across 2 rounds — high trust) | | Codex second-opinion availability gate | 🟢 corroborated | 2 | 2026-06-23 | #3415, #3374 (CLI on PATH; #3374 had a real Codex review) | ## Worked examples library @@ -50,9 +50,20 @@ Concrete real-world signatures, so detection sharpens over time. Add to this whenever a rare pattern is seen for the first time. ### Ping-pong loops -*(none observed yet — when the first real loop appears, record the tool +*(no true loop observed yet — when the first real one appears, record the tool sequence, the comment ids, and the commits between them here)* +**Near-miss (NOT a loop) — #3536, 2026-06-23.** Round-1 bugbot findings (ids +3460160413, 3460160429 on commit `17517ed`) were fixed in commit `da7c04a`; +bugbot re-reviewed and raised 3 *new* findings (ids 3460820452/470/474). One new +finding (GraphQL pagination, `assess-comments.md:72-75`) sat in the same step I'd +just edited — superficially loop-shaped. But the new findings were independent, +pre-existing issues exposed by the file changing, **not** the same spot toggled +back and forth or a fix being undone. Lesson for the detector: "fix → re-scan → +new findings in an edited region" is normal iteration, **not** ping-pong. A true +loop needs the *same concern* reopened, or tool A's fix re-triggering tool B in a +cycle. + ### Resolved-but-still-broken *(none confirmed yet — record any thread marked resolved whose bug was still present in the code)* From b76371830235081a06c4d326364f71495c8e17e4 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 23 Jun 2026 16:26:41 +0100 Subject: [PATCH 4/7] DOC-6771 skill self-improvement --- .claude/state/assess-comments.coverage.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md index fd481f22f5..09a497c50b 100644 --- a/.claude/state/assess-comments.coverage.md +++ b/.claude/state/assess-comments.coverage.md @@ -64,6 +64,11 @@ new findings in an edited region" is normal iteration, **not** ping-pong. A true loop needs the *same concern* reopened, or tool A's fix re-triggering tool B in a cycle. +*Confirmed convergence (round 3):* the fixes for those 3 new findings were +pushed, and bugbot's next re-scan came back **clean — no comments**. A real loop +would have spawned another round; this settled. So the "not a loop" judgement is +borne out by what happened next: assess → fix → re-scan reached a fixed point. + ### Resolved-but-still-broken *(none confirmed yet — record any thread marked resolved whose bug was still present in the code)* From 916dd3ecc039b13dc37020f6f9e17694611533a0 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 23 Jun 2026 16:40:01 +0100 Subject: [PATCH 5/7] DOC-6771 more Bugbot fixes --- .claude/commands/docs/assess-comments.md | 33 +++++++++++++++++------ .claude/state/assess-comments.coverage.md | 14 +++++++++- 2 files changed, 38 insertions(+), 9 deletions(-) diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md index 32c1395a23..7af5e52658 100644 --- a/.claude/commands/docs/assess-comments.md +++ b/.claude/commands/docs/assess-comments.md @@ -92,10 +92,17 @@ exactly what lets you adjudicate the tool findings. would silently disable that safety check. In filtered mode, run step 8 against the full comment set, not the filtered one. - If no comment matches `$ARGUMENTS`, stop and tell me. - -3. **Tag each comment by its source role.** You're reconciling *roles*, not - usernames. Classify each author as one of: + If no comment **or review verdict** matches `$ARGUMENTS`, stop and tell me — a + reviewer who left only an approve/request-changes verdict still counts as a + match. + +3. **Tag each comment _and review verdict_ by its source role.** You're + reconciling *roles*, not usernames. Reviews are first-class here: a + **request-changes** (or any review whose body raises specific, actionable + points) is a *finding* from its author's role, not just an approve/reject + signal for step 8 — tag and carry it forward like any comment. (A bare + approval with no actionable body stays a step-8 signal only.) Classify each + author as one of: - **bugbot** — author `login` is `cursor` / `cursor[bot]`. **Tag by author, not body text:** a *human* reply that quotes or mentions "Bugbot" is still a @@ -113,10 +120,10 @@ exactly what lets you adjudicate the tool findings. healthy. Don't raise them as issues. 4. **Cluster findings by what they touch, not by who said them.** Group the - actual findings (from bugbot, security, history, humans) by file + line, or - by shared theme when they're not line-specific. One cluster may hold comments - from several tools pointing at the same code — that overlap is the whole - point. + actual findings — from bugbot, security, history, humans, **and actionable + review-verdict bodies (step 3)** — by file + line, or by shared theme when + they're not line-specific. One cluster may hold comments from several tools + pointing at the same code — that overlap is the whole point. 5. **Split clusters into OPEN and RESOLVED — and treat them differently.** Using the resolution state from step 2, mark each cluster open or resolved. An @@ -184,6 +191,16 @@ exactly what lets you adjudicate the tool findings. That's normal iteration; treat those as fresh clusters, not a loop. (Real example of the non-loop case in the coverage ledger's near-miss log.) When it *is* a true loop, flag it rather than extending it. + - **Subsystem churn** — distinct from ping-pong, and the *earlier* warning + sign. Look across fix rounds: if several findings (even new, independent + ones) keep landing on the **same area of code you keep patching**, the + symptom isn't any single finding — it's that the subsystem is + under-specified and each patch just exposes the next adjacent gap. + Incrementally fixing the latest one tends to trigger another round. When you + see this, **recommend a single consolidated redesign of that subsystem + rather than another patch** — that's the move that actually ends the churn. + (Worked example in the ledger: PR #3536's review-handling findings across + rounds 1 and 4.) - **Solo** — a single source, no corroboration → judge it on its merits and say how confident you are. A solo finding you've *verified against the code and data* is high-confidence even with one source. diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md index 09a497c50b..89c0ef5455 100644 --- a/.claude/state/assess-comments.coverage.md +++ b/.claude/state/assess-comments.coverage.md @@ -37,7 +37,8 @@ whether to commit the change. | "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken | | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) | | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* | -| **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop. #3536 was the first *post-fix re-scan* tested: round-1 fixes (commit `da7c04a`) re-triggered a bugbot re-review that raised 3 new findings, but they were independent/pre-existing, not a back-and-forth — correctly judged NOT a loop (near-miss recorded below) | +| **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop across 4 bugbot rounds on #3536. Rounds 2 & 4 each raised new post-fix findings but none was a reopened concern or A↔B cycle — correctly judged NOT a loop both times. Round 4 instead revealed *subsystem churn* (next row) | +| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen once | 1 | 2026-06-23 | #3536 — findings 429 (r1), 862 + 874 (r4) all cluster on `$ARGUMENTS` filter + review-verdict handling; recommended a consolidated redesign over another patch (worked example below) | | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) | | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) | | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run | @@ -73,6 +74,17 @@ borne out by what happened next: assess → fix → re-scan reached a fixed poin *(none confirmed yet — record any thread marked resolved whose bug was still present in the code)* +### Subsystem churn (not a loop, but the precursor) +- **#3536 review-handling, 2026-06-23.** Bugbot finding 429 (round 1) flagged the + `$ARGUMENTS` filter dropping cross-check data; it was patched. Two rounds later + bugbot raised 862 ("filter abort ignores reviews") and 874 ("review bodies skip + clustering") — *new, independent* findings (so not ping-pong), but all three + cluster on the **same under-specified subsystem**: how review verdicts flow + through collect → filter → tag → cluster → abort. Root cause: reviews were + second-class vs comments. Signature: each incremental patch exposed the next + adjacent gap. Response that ended it: one **consolidated** fix promoting reviews + to first-class across steps 2–4, rather than patching 862/874 individually. + ### Cross-tool agreement - **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot ("Wrong FT.INFO document count field", inline, 06-17) independently flagged From ba7921d4013c1bda6b6b525b2c4a2bae6644076f Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 23 Jun 2026 16:46:42 +0100 Subject: [PATCH 6/7] DOC-6771 yet more from the Bugbot --- .claude/commands/docs/assess-comments.md | 13 ++++++++++--- .claude/state/assess-comments.coverage.md | 10 +++++++++- 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md index 7af5e52658..e0b5255112 100644 --- a/.claude/commands/docs/assess-comments.md +++ b/.claude/commands/docs/assess-comments.md @@ -169,6 +169,12 @@ exactly what lets you adjudicate the tool findings. **ping-pong loops**, and **resolved-but-not-outdated High/Critical findings** (step 5). Everything else competes for the remaining slots. + One assessment runs *outside* the cap entirely: **subsystem churn** (step 7). + Churn is a pattern across findings — including ones you deferred this round + and ones from prior rounds — so judge it against the **full** open+recent set, + never just the capped deep set. If a deferred cluster turns out to be part of + a churn pattern, pull it into deep-verify; the cap must not hide churn. + 7. **Adjudicate the chosen clusters — and verify, don't trust.** For each cluster in your deep set, open the file it points at (and the underlying data it depends on) and judge for yourself; these tools have false positives, and @@ -224,7 +230,7 @@ exactly what lets you adjudicate the tool findings. | Cluster | File / line | Sources | Verdict | Recommended action | |---|---|---|---|---| - - **Verdict**: agreement / contradiction / ping-pong / solo / stale. + - **Verdict**: agreement / contradiction / ping-pong / churn / solo / stale. - **Recommended action**: one of — *safe to fix* (uncontested and low-risk; hand **bugbot-sourced** clusters to `/docs:bugbot`, but apply safe fixes from any other source — human, Codex, security, history — directly, since @@ -233,8 +239,9 @@ exactly what lets you adjudicate the tool findings. say why); *already handled* (stale). Then, in this order: - - the **headline** — contradictions, ping-pong loops, and any approval-over- - open-finding from step 8 (these need a human) first; + - the **headline** — contradictions, ping-pong loops, **subsystem churn (with + its consolidate-don't-patch recommendation)**, and any approval-over-open- + finding from step 8 (these need a human) first; - the **safe-to-fix** list; - what you're **dismissing** and why, so I can sanity-check your reasoning; - a short **resolved-threads** note (from step 5) — only what's worth saying: diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md index 89c0ef5455..153b596781 100644 --- a/.claude/state/assess-comments.coverage.md +++ b/.claude/state/assess-comments.coverage.md @@ -38,7 +38,7 @@ whether to commit the change. | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) | | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* | | **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop across 4 bugbot rounds on #3536. Rounds 2 & 4 each raised new post-fix findings but none was a reopened concern or A↔B cycle — correctly judged NOT a loop both times. Round 4 instead revealed *subsystem churn* (next row) | -| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen once | 1 | 2026-06-23 | #3536 — findings 429 (r1), 862 + 874 (r4) all cluster on `$ARGUMENTS` filter + review-verdict handling; recommended a consolidated redesign over another patch (worked example below) | +| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen (1 PR, 2 instances) | 2 | 2026-06-23 | #3536 — (a) findings 429/862/874 on `$ARGUMENTS` filter + review handling; (b) round-5 findings 442/449 on the *churn feature itself* (added r4 in step 7, not threaded into steps 6 & 9). Both met with a consolidated fix, not a patch. Stays 🟡 — both on one PR; needs a 2nd PR for 🟢. Worked examples below | | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) | | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) | | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run | @@ -84,6 +84,14 @@ present in the code)* second-class vs comments. Signature: each incremental patch exposed the next adjacent gap. Response that ended it: one **consolidated** fix promoting reviews to first-class across steps 2–4, rather than patching 862/874 individually. +- **#3536 churn-on-the-churn-feature (round 5), 2026-06-23.** The *fix* for the + case above added "subsystem churn" to step 7 — but only there. Round 5 bugbot + raised 442 ("churn detection capped incorrectly" — gated behind step 6's depth + cap) and 449 ("report verdict omits churn" — missing from step 9's verdict list + and headline). Same signature as any churn: a feature added in one place, + adjacent integration gaps exposed next round. Cured by threading churn through + steps 6 and 9 in one pass. Lesson: adding a capability *is* a subsystem edit — + wire it through every dependent step at once, or it becomes its own churn. ### Cross-tool agreement - **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot From d788bce3b00b5e721f8d9f45468e3e35522c5c09 Mon Sep 17 00:00:00 2001 From: Andy Stark Date: Tue, 23 Jun 2026 16:52:55 +0100 Subject: [PATCH 7/7] DOC-6771 more Bugbot stuff --- .claude/commands/docs/assess-comments.md | 7 ++++++- .claude/state/assess-comments.coverage.md | 12 +++++++++++- 2 files changed, 17 insertions(+), 2 deletions(-) diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md index e0b5255112..3c3d30b84a 100644 --- a/.claude/commands/docs/assess-comments.md +++ b/.claude/commands/docs/assess-comments.md @@ -225,7 +225,12 @@ exactly what lets you adjudicate the tool findings. real issue merge — exactly the gap a reconciler should catch. Surface it even when the PR is marked WIP / don't-merge. -9. **Produce the reconciliation report.** One row per **open** cluster: +9. **Produce the reconciliation report.** One row per **open cluster you + deep-verified this round** (step 7) — those are the only clusters with a + real verdict. Open clusters you deferred under the step-6 cap do **not** get a + table row or a verdict (you haven't verified them); they're listed, plainly + marked unverified, in the "story so far" below. The table never promises more + coverage than you actually did. | Cluster | File / line | Sources | Verdict | Recommended action | |---|---|---|---|---| diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md index 153b596781..b54381cc05 100644 --- a/.claude/state/assess-comments.coverage.md +++ b/.claude/state/assess-comments.coverage.md @@ -38,7 +38,7 @@ whether to commit the change. | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) | | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* | | **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop across 4 bugbot rounds on #3536. Rounds 2 & 4 each raised new post-fix findings but none was a reopened concern or A↔B cycle — correctly judged NOT a loop both times. Round 4 instead revealed *subsystem churn* (next row) | -| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen (1 PR, 2 instances) | 2 | 2026-06-23 | #3536 — (a) findings 429/862/874 on `$ARGUMENTS` filter + review handling; (b) round-5 findings 442/449 on the *churn feature itself* (added r4 in step 7, not threaded into steps 6 & 9). Both met with a consolidated fix, not a patch. Stays 🟡 — both on one PR; needs a 2nd PR for 🟢. Worked examples below | +| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen (1 PR, 3 instances) | 3 | 2026-06-23 | #3536 — (a) 429/862/874 on `$ARGUMENTS` filter + review handling; (b) r5 442/449 on the *churn feature*; (c) r6 3461052859 on the *cap ↔ report contract* — i.e. (b)'s consolidation was too narrow. Pattern is robust on this PR; needs a 2nd PR for 🟢. Worked examples below | | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) | | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) | | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run | @@ -92,6 +92,16 @@ present in the code)* adjacent integration gaps exposed next round. Cured by threading churn through steps 6 and 9 in one pass. Lesson: adding a capability *is* a subsystem edit — wire it through every dependent step at once, or it becomes its own churn. +- **#3536 churn persisted — consolidation scoped too narrowly (round 6), + 2026-06-23.** The round-5 "consolidated" fix threaded the *churn feature* + through steps 6 & 9 but left the deeper contract under them unfixed, so round 6 + raised finding 3461052859 ("open table needs full verdicts": step 9 promised a + verdict per open cluster while step 6 only verifies 3–5). Same area a third + time. Real root: the **cap ↔ report-completeness contract**, not the churn + feature. Fixed by making the table cover only deep-verified clusters and + marking deferred ones unverified. Meta-lesson: when round N+1 finds another gap + in an area you *just* "consolidated", your consolidation boundary was wrong — + widen it to the true subsystem, don't re-patch the edge. ### Cross-tool agreement - **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot