From 17517ed7c987f6a5d24b18cf1307903ebaa4ffa8 Mon Sep 17 00:00:00 2001
From: Andy Stark <andrew.stark@redis.com>
Date: Tue, 23 Jun 2026 14:23:18 +0100
Subject: [PATCH 1/7] DOC-6771 added skill to address PR comments

---
 .claude/commands/docs/assess-comments.md  | 249 ++++++++++++++++++++++
 .claude/state/assess-comments.coverage.md |  63 ++++++
 2 files changed, 312 insertions(+)
 create mode 100644 .claude/commands/docs/assess-comments.md
 create mode 100644 .claude/state/assess-comments.coverage.md
diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md
new file mode 100644
index 0000000000..a6fac94100
--- /dev/null
+++ b/.claude/commands/docs/assess-comments.md
@@ -0,0 +1,249 @@
+---
+description: Reconcile and assess all comments on the current branch's PR across every tool and commenter — report only, no edits
+argument-hint: "[commenter-login] (optional — defaults to all commenters)"
+---
+
+You're going to read **all** the comments on the pull request for the current
+branch — from automation (Cursor Bugbot, the security scanner, the history bot,
+the AI PR summary, CI) *and* from humans — and produce a single **reconciliation
+report**: what each comment raises, where the tools **agree**, where they
+**contradict each other**, and where they're chasing each other in circles. This
+is the synthesis layer the individual tools don't have.
+
+**This command assesses and reports. It does not edit PR content, commit, push,
+or reply** — the one and only file it may write is its own coverage ledger
+(step 11). Fixing is a separate step — hand the safe clusters to `/docs:bugbot`
+or fix them yourself once you've seen the report.
+
+It's also meant to be run **repeatedly** as the PR evolves: the bots tend to post
+fresh comments after each round of fixes, so each run is a checkpoint — verify
+the few highest-impact things deeply, leave a "story so far", and re-run to pick
+up the rest. Don't try to settle the whole PR in one pass.
+
+The command **tracks its own test coverage** in
+`.claude/state/assess-comments.coverage.md`. Read it at the start of every run
+(step 1) and update it at the end (step 11) — it's how the command knows,
+honestly, which of its checks have actually fired on real data and which are
+still unproven (ping-pong loop detection, notably). It's a *coverage* record,
+not a correctness proof.
+
+If `$ARGUMENTS` is given, treat it as a commenter login and assess **only** that
+author's comments (match case-insensitively; `cursor` matches `cursor[bot]`).
+With no arguments, assess **every** commenter — bots and humans alike, because
+human intent ("this is a CTF example", "deliberate UK spelling in a quote") is
+exactly what lets you adjudicate the tool findings.
+
+1. **Load your coverage ledger, then find the branch and its PR.** First read
+   `.claude/state/assess-comments.coverage.md` so you know which of your own
+   checks are proven vs unproven this run (it also tells you to watch hardest for
+   the untested ones — ping-pong loops especially). Then run
+   `git branch --show-current`, and:
+
+   ```
+   gh pr view --json number,url,title,headRefName,body
+   ```
+
+   If there's no PR for this branch, stop and tell me — there's nothing to
+   assess. Keep the PR `body`: the AI summary usually lives there and is
+   **intent context**, not a finding to triage.
+
+2. **Collect every comment, from every source.** Check all of these:
+
+   - **Inline review comments** (tied to lines):
+     ```
+     gh api repos/{owner}/{repo}/pulls/<number>/comments --paginate
+     ```
+   - **Top-level PR / issue comments** (summaries, bot posts, human replies):
+     ```
+     gh api repos/{owner}/{repo}/issues/<number>/comments --paginate
+     ```
+   - **Review verdicts** (approve / request-changes bodies):
+     ```
+     gh api repos/{owner}/{repo}/pulls/<number>/reviews --paginate
+     ```
+   - **Thread resolution state** — *don't skip this.* The REST `comments`
+     endpoint's `position`/outdated field is **not** a reliable "addressed"
+     signal (a resolved thread routinely still reports `position` non-null). The
+     authoritative signal is the GraphQL `reviewThreads`:
+     ```
+     gh api graphql -F query='query { repository(owner:"<owner>", name:"<repo>")
+       { pullRequest(number:<number>) { reviewThreads(first:100) { nodes {
+         isResolved isOutdated comments(first:100){ nodes {
+           databaseId author{login} path body } } } } } } }'
+     ```
+     Use `databaseId` to map each thread back to the inline comments from the
+     REST pull, so you know which are **resolved** vs **open**.
+
+   For each comment capture: author `login`, `created_at`, the body, and for
+   inline comments the `path`, line, and its thread's `isResolved` /
+   `isOutdated`. If `$ARGUMENTS` was given, drop every comment whose author
+   doesn't match it before going further.
+
+   If nothing is left after filtering, stop and tell me.
+
+3. **Tag each comment by its source role.** You're reconciling *roles*, not
+   usernames. Classify each author as one of:
+
+   - **bugbot** — `cursor` / `cursor[bot]`, or any body mentioning "Bugbot"
+   - **security** — the security scanner
+   - **history** — the commit-history / semantic-similarity bot
+   - **summary** — the AI PR summary (usually the PR body or a bot comment)
+   - **ci** — `github-actions` / build / staging-link bots
+   - **human** — everyone else
+
+   Treat **summary**, **ci**, and staging-build links as *context*, not
+   findings — they tell you what the author intended and whether the build is
+   healthy. Don't raise them as issues.
+
+4. **Cluster findings by what they touch, not by who said them.** Group the
+   actual findings (from bugbot, security, history, humans) by file + line, or
+   by shared theme when they're not line-specific. One cluster may hold comments
+   from several tools pointing at the same code — that overlap is the whole
+   point.
+
+5. **Split clusters into OPEN and RESOLVED — and treat them differently.** Using
+   the resolution state from step 2, mark each cluster open or resolved. An
+   active PR is almost always a mix; don't let resolved threads drown the open
+   ones.
+
+   - **Open clusters** are your real work — they go through prioritise →
+     adjudicate (the steps below).
+   - **Resolved clusters** are *not* re-litigated as fresh findings. But they
+     aren't worthless either — do two cheap, high-value things with them:
+     - **Fix-quality spot-check.** For a sample of resolved findings (favour the
+       higher-impact ones), check that the committed change *actually addressed*
+       the point, not just silenced the comment — e.g. if a reviewer asked to
+       remove a term or fix a command, confirm it's genuinely gone/correct in
+       the current file. Flag any "fixed in name only."
+     - **"Resolved ≠ fixed" flag.** Call out any thread that is **resolved but
+       has no corresponding code change** (resolved, `isOutdated` false, and no
+       intervening commit touched the line). These are often legitimate
+       deferrals ("leaving as-is pending eng decision before GA") — but that's a
+       future obligation hiding inside a green checkmark, so surface it.
+       **Mandatory deep-verify:** any such thread that the bot rated **High (or
+       Critical) severity** must be opened and verified against the current code
+       *this round*, regardless of the depth cap in step 6 — a serious bug must
+       never hide behind a green checkmark just because the run hit its budget.
+       If it's genuinely fixed, say so; if it's resolved-but-still-present,
+       that's a headline.
+   - Resolved threads also feed **bot calibration**: the ratio of a bot's
+     findings that humans accepted-and-fixed vs dismissed tells you how much to
+     trust that bot's next *solo* finding. Note it when it's informative.
+
+6. **Prioritise, and cap how deep you go.** This command is meant to be run
+   **repeatedly** over a PR's life — the bots re-comment after each round of
+   fixes, so a single run is never the whole story. Don't try to verify every
+   cluster to the bottom. Rank the **open** clusters by impact — broken
+   `relref`/`image`/`embed-md` paths, wrong commands or code samples, incorrect
+   technical claims, and anything that breaks a feature the PR description calls
+   out, rank highest; pure style nits lowest — and pick **at most the top 3–5**
+   to verify in depth this round. The rest you list shallowly (see step 9) as
+   "not assessed this round" so the next run can pick them up.
+
+   Three things always make the deep-verify set regardless of the cap, because
+   they're the failures a reconciler exists to catch: **contradictions**,
+   **ping-pong loops**, and **resolved-but-not-outdated High/Critical findings**
+   (step 5). Everything else competes for the remaining slots.
+
+7. **Adjudicate the chosen clusters — and verify, don't trust.** For each
+   cluster in your deep set, open the file it points at (and the underlying data
+   it depends on) and judge for yourself; these tools have false positives, and
+   a single tool can be *correct about the code but wrong about impact*, or
+   *plausible but refuted once you read the data*. Label each:
+
+   - **Agreement** — multiple sources flag the same thing → high confidence it's
+     real.
+   - **Contradiction** — sources disagree, *or* a finding conflicts with stated
+     human intent or the PR summary (e.g. security flags a sample that a human
+     said is a deliberate CTF example; bugbot wants a refactor the history bot
+     says was reverted before). Call these out loudest — they're where blind
+     auto-fixing does damage.
+   - **Ping-pong** — using `created_at` order and the commits in between, spot
+     cases where a fix for one tool's comment re-triggered another tool, or the
+     same spot has been edited back and forth. Flag the loop rather than
+     extending it.
+   - **Solo** — a single source, no corroboration → judge it on its merits and
+     say how confident you are. A solo finding you've *verified against the code
+     and data* is high-confidence even with one source.
+   - **Stale / already-addressed** — the code no longer matches what the comment
+     describes; a later commit already handled it.
+
+   US-vs-UK spelling counts as a real issue in this docs repo even though it's
+   small; pure style nits usually don't.
+
+8. **Cross-check review verdicts against open findings.** Separately from the
+   clusters, line up the PR's **review verdicts** (approve / request-changes)
+   against the findings still open. Flag the dangerous combination explicitly:
+   **a human approval (especially a low-confidence one — "skimmed it", "LGTM",
+   "didn't review in detail") sitting on top of an unresolved bot finding.**
+   Neither the reviewer nor the bot is wrong alone, but together they can let a
+   real issue merge — exactly the gap a reconciler should catch. Surface it even
+   when the PR is marked WIP / don't-merge.
+
+9. **Produce the reconciliation report.** One row per **open** cluster:
+
+   | Cluster | File / line | Sources | Verdict | Recommended action |
+   |---|---|---|---|---|
+
+   - **Verdict**: agreement / contradiction / ping-pong / solo / stale.
+   - **Recommended action**: one of — *safe to fix* (uncontested and low-risk →
+     hand to `/docs:bugbot`); *needs your call* (contradiction, approval-over-
+     finding, or history-bot warning); *dismiss* (false positive — say why);
+     *already handled* (stale).
+
+   Then, in this order:
+   - the **headline** — contradictions, ping-pong loops, and any approval-over-
+     open-finding from step 8 (these need a human) first;
+   - the **safe-to-fix** list;
+   - what you're **dismissing** and why, so I can sanity-check your reasoning;
+   - a short **resolved-threads** note (from step 5) — only what's worth saying:
+     any "fixed in name only" or "resolved ≠ fixed / deferred" flags, and a bot-
+     calibration line if it's informative. Don't list cleanly-resolved threads
+     one by one; a count is enough;
+   - a **"Story so far"** line: which open clusters you deep-verified this round,
+     how many you deferred as "not assessed this round" (list them by one-line
+     title), how many threads were already resolved, and a reminder that
+     re-running after the next round of fixes will pick up the deferred ones plus
+     any new comments. This is a checkpoint, not a verdict on the whole PR.
+
+10. **Offer a second opinion via Codex, only if it's available.** Check whether
+   the Codex CLI is installed — run `command -v codex`. If it is, end by
+   suggesting the user get an independent review of the same diff the bots saw:
+
+   ```
+   /codex:review --base main --scope branch
+   ```
+
+   Frame it as optional and note it adds runtime, and that it's worth it mainly
+   for the contested or high-impact clusters — the ones where a second model
+   disagreeing (or agreeing) changes your confidence. Don't run it yourself:
+   `/codex:review` is user-invoked only, and this command is report-only. If
+   `command -v codex` finds nothing, say nothing about Codex.
+
+11. **Update your coverage ledger — automatically, but evidence-gated.** Open
+    `.claude/state/assess-comments.coverage.md` and update it to reflect what
+    *actually fired on real data this run*:
+
+    - For each capability you genuinely exercised, bump its **encounters**, set
+      **last verified** to today, and add this PR (with a comment id or commit
+      SHA) to **Evidence**. Raise confidence per the scale — but **never raise a
+      confidence without a citable artifact.** No evidence, no bump.
+    - A capability counts as **🟢 corroborated** only on ≥2 *distinct* PRs, or
+      once with an independent `/codex:review` concurrence on the same finding.
+    - If you hit a **rare pattern for the first time** — a real ping-pong loop, a
+      resolved-but-still-broken thread — record its concrete signature in the
+      **Worked examples library** (tools involved, comment ids, commits). This is
+      how detection sharpens over time.
+    - Refresh ⏳ **decaying** entries you re-confirmed this run.
+    - If the run revealed that an *instruction* in this command should change
+      (a heuristic nearly missed something), **don't edit the command yourself**
+      — note the suggested refinement in your output and let me decide.
+
+    The ledger update is the one file you *may* write. After writing it, **tell
+    me plainly that the command learned something and updated its own coverage
+    ledger**, name what changed in one line, and remind me I can include that
+    change in the PR or leave it out — my call.
+
+12. **Stop there.** Don't edit anything except the coverage ledger (step 11);
+    don't commit, push, or reply to the PR. Your output is the assessment —
+    I'll decide what to act on.
diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md
new file mode 100644
index 0000000000..66d721ddb9
--- /dev/null
+++ b/.claude/state/assess-comments.coverage.md
@@ -0,0 +1,63 @@
+# assess-comments — coverage ledger
+
+This file is the self-tracked **test coverage** of the `/docs:assess-comments`
+command. It records, per capability, how confident we are that the logic has
+actually *fired correctly on real data* — not how accurate its judgement is.
+
+**Read this honestly:** this is a *coverage* tracker, not a correctness proof.
+It can only record capabilities it has *seen fire* (positive cases). It cannot
+see false negatives — e.g. a ping-pong loop that was present but missed — so a
+🟢 means "exercised on real PRs", never "guaranteed correct".
+
+The command reads this at the start of every run and updates it at the end
+(step 1 and step 11). Updates are **automatic but evidence-gated**: no
+confidence may rise without a citable artifact (PR number + comment id / commit
+SHA). When the command updates this file it tells the user, who can choose
+whether to commit the change.
+
+## Confidence scale
+- ❓ **untested** — logic is written but has never fired on real data.
+- 🟡 **seen once** — fired correctly on a single real PR.
+- 🟢 **corroborated** — fired correctly on ≥2 *distinct* PRs, **or** once with an
+  independent `/codex:review` concurrence on the same finding.
+- ⏳ **decaying** — was 🟡/🟢 but not re-confirmed in a long time (treat the API
+  shape, bot identities, and repo conventions as possibly drifted; re-verify).
+
+## Capabilities
+
+| Capability | Confidence | Real encounters | Last verified | Evidence |
+|---|---|---|---|---|
+| Branch/PR identification + arg handling | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 |
+| Multi-source collection (inline + top-level + reviews) | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3510, #3374 |
+| GraphQL thread-resolution pull (`isResolved`/`isOutdated`) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (12/12 resolved), #3374 (15/15) |
+| Source-role tagging (bugbot/security/history/summary/ci/human) | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 |
+| Open/resolved split | 🟢 corroborated | 2 | 2026-06-23 | #3510, #3374 |
+| Fix-quality spot-check (genuinely fixed vs silenced) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (term removals landed), #3374 (`num_docs`, dropIndex landed) |
+| "Resolved ≠ fixed" flag — **legitimate deferral** variant | 🟡 seen once | 1 | 2026-06-23 | #3510 (TS.BGET:122 left pending eng) |
+| "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken |
+| Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) |
+| **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding); #3507 (bugbot vs author intent, off-branch) |
+| **Ping-pong loop** detection | ❓ untested | 0 | — | never observed a real loop; all PRs so far had clean find→fix→resolve flow |
+| Approval-over-open-finding cross-check | 🟢 corroborated | 2 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH) |
+| Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) |
+| Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run |
+| Bot calibration (fixed-vs-dismissed ratio) | 🟡 seen once | 1 | 2026-06-23 | #3374 (bugbot signal mostly accepted) |
+| Codex second-opinion availability gate | 🟢 corroborated | 2 | 2026-06-23 | #3415, #3374 (CLI on PATH; #3374 had a real Codex review) |
+
+## Worked examples library
+
+Concrete real-world signatures, so detection sharpens over time. Add to this
+whenever a rare pattern is seen for the first time.
+
+### Ping-pong loops
+*(none observed yet — when the first real loop appears, record the tool
+sequence, the comment ids, and the commits between them here)*
+
+### Resolved-but-still-broken
+*(none confirmed yet — record any thread marked resolved whose bug was still
+present in the code)*
+
+### Cross-tool agreement
+- **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot
+  ("Wrong FT.INFO document count field", inline, 06-17) independently flagged
+  the same `info.numDocs` → `num_docs` bug. Verified fixed in current code.

From da7c04a2bf610f0fa3f591f6833af93a0361d846 Mon Sep 17 00:00:00 2001
From: Andy Stark <andrew.stark@redis.com>
Date: Tue, 23 Jun 2026 16:11:24 +0100
Subject: [PATCH 2/7] DOC-6771 bugbot fixes plus skill lessons learned

---
 .claude/commands/docs/assess-comments.md  | 28 +++++++++++++++--------
 .claude/state/assess-comments.coverage.md | 12 +++++-----
 2 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md
index a6fac94100..0fc4ae73df 100644
--- a/.claude/commands/docs/assess-comments.md
+++ b/.claude/commands/docs/assess-comments.md
@@ -12,8 +12,11 @@ is the synthesis layer the individual tools don't have.
 
 **This command assesses and reports. It does not edit PR content, commit, push,
 or reply** — the one and only file it may write is its own coverage ledger
-(step 11). Fixing is a separate step — hand the safe clusters to `/docs:bugbot`
-or fix them yourself once you've seen the report.
+(step 11). Fixing is a separate step. Note that `/docs:bugbot` only triages and
+fixes **Cursor Bugbot's own** comments — so it's the right hand-off *only* for
+bugbot-sourced clusters. Safe fixes that came from humans, Codex, the security
+or history bots, etc. won't be touched by `/docs:bugbot`; apply those directly
+(or ask me to) once you've seen the report.
 
 It's also meant to be run **repeatedly** as the PR evolves: the bots tend to post
 fresh comments after each round of fixes, so each run is a checkpoint — verify
@@ -76,10 +79,15 @@ exactly what lets you adjudicate the tool findings.
 
    For each comment capture: author `login`, `created_at`, the body, and for
    inline comments the `path`, line, and its thread's `isResolved` /
-   `isOutdated`. If `$ARGUMENTS` was given, drop every comment whose author
-   doesn't match it before going further.
+   `isOutdated`. If `$ARGUMENTS` was given, it narrows which comments become
+   **findings you cluster and adjudicate** (steps 4–7) to the matching author —
+   but **don't discard the others**. Always retain every review verdict and the
+   full set of open findings as *context*, because step 8 (approval-over-open-
+   finding) needs to see approvals and bot findings together; filtering them out
+   would silently disable that safety check. In filtered mode, run step 8
+   against the full comment set, not the filtered one.
 
-   If nothing is left after filtering, stop and tell me.
+   If no comment matches `$ARGUMENTS`, stop and tell me.
 
 3. **Tag each comment by its source role.** You're reconciling *roles*, not
    usernames. Classify each author as one of:
@@ -186,10 +194,12 @@ exactly what lets you adjudicate the tool findings.
    |---|---|---|---|---|
 
    - **Verdict**: agreement / contradiction / ping-pong / solo / stale.
-   - **Recommended action**: one of — *safe to fix* (uncontested and low-risk →
-     hand to `/docs:bugbot`); *needs your call* (contradiction, approval-over-
-     finding, or history-bot warning); *dismiss* (false positive — say why);
-     *already handled* (stale).
+   - **Recommended action**: one of — *safe to fix* (uncontested and low-risk;
+     hand **bugbot-sourced** clusters to `/docs:bugbot`, but apply safe fixes
+     from any other source — human, Codex, security, history — directly, since
+     `/docs:bugbot` won't touch those); *needs your call* (contradiction,
+     approval-over-finding, or history-bot warning); *dismiss* (false positive —
+     say why); *already handled* (stale).
 
    Then, in this order:
    - the **headline** — contradictions, ping-pong loops, and any approval-over-
diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md
index 66d721ddb9..a529c1f612 100644
--- a/.claude/state/assess-comments.coverage.md
+++ b/.claude/state/assess-comments.coverage.md
@@ -27,18 +27,18 @@ whether to commit the change.
 
 | Capability | Confidence | Real encounters | Last verified | Evidence |
 |---|---|---|---|---|
-| Branch/PR identification + arg handling | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 |
-| Multi-source collection (inline + top-level + reviews) | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3510, #3374 |
-| GraphQL thread-resolution pull (`isResolved`/`isOutdated`) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (12/12 resolved), #3374 (15/15) |
-| Source-role tagging (bugbot/security/history/summary/ci/human) | 🟢 corroborated | 3 | 2026-06-23 | #3415, #3507, #3374 |
-| Open/resolved split | 🟢 corroborated | 2 | 2026-06-23 | #3510, #3374 |
+| Branch/PR identification + arg handling | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3374, #3536 |
+| Multi-source collection (inline + top-level + reviews) | 🟢 corroborated | 5 | 2026-06-23 | #3415, #3507, #3510, #3374, #3536 |
+| GraphQL thread-resolution pull (`isResolved`/`isOutdated`) | 🟢 corroborated | 3 | 2026-06-23 | #3510 (12/12 resolved), #3374 (15/15), #3536 (2/2 open) |
+| Source-role tagging (bugbot/security/history/summary/ci/human) | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3374, #3536 |
+| Open/resolved split | 🟢 corroborated | 3 | 2026-06-23 | #3510, #3374, #3536 (0 resolved / 2 open) |
 | Fix-quality spot-check (genuinely fixed vs silenced) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (term removals landed), #3374 (`num_docs`, dropIndex landed) |
 | "Resolved ≠ fixed" flag — **legitimate deferral** variant | 🟡 seen once | 1 | 2026-06-23 | #3510 (TS.BGET:122 left pending eng) |
 | "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken |
 | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) |
 | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding); #3507 (bugbot vs author intent, off-branch) |
 | **Ping-pong loop** detection | ❓ untested | 0 | — | never observed a real loop; all PRs so far had clean find→fix→resolve flow |
-| Approval-over-open-finding cross-check | 🟢 corroborated | 2 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH) |
+| Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) |
 | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) |
 | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run |
 | Bot calibration (fixed-vs-dismissed ratio) | 🟡 seen once | 1 | 2026-06-23 | #3374 (bugbot signal mostly accepted) |

From b616cf79266b442ab17a9f49863b92aa986484e8 Mon Sep 17 00:00:00 2001
From: Andy Stark <andrew.stark@redis.com>
Date: Tue, 23 Jun 2026 16:21:00 +0100
Subject: [PATCH 3/7] DOC-6771 more from the Bugbot

---
 .claude/commands/docs/assess-comments.md  | 24 ++++++++++++++++++-----
 .claude/state/assess-comments.coverage.md | 19 ++++++++++++++----
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md
index 0fc4ae73df..32c1395a23 100644
--- a/.claude/commands/docs/assess-comments.md
+++ b/.claude/commands/docs/assess-comments.md
@@ -74,6 +74,11 @@ exactly what lets you adjudicate the tool findings.
          isResolved isOutdated comments(first:100){ nodes {
            databaseId author{login} path body } } } } } } }'
      ```
+     `first:100` covers any normal PR, but unlike the `--paginate`d REST calls it
+     does **not** page — so if a PR genuinely has >100 review threads (or a thread
+     with >100 comments), check `pageInfo.hasNextPage` and follow the cursor, or
+     you'll silently miss resolution state and skew the open/resolved split.
+
      Use `databaseId` to map each thread back to the inline comments from the
      REST pull, so you know which are **resolved** vs **open**.
 
@@ -92,7 +97,11 @@ exactly what lets you adjudicate the tool findings.
 3. **Tag each comment by its source role.** You're reconciling *roles*, not
    usernames. Classify each author as one of:
 
-   - **bugbot** — `cursor` / `cursor[bot]`, or any body mentioning "Bugbot"
+   - **bugbot** — author `login` is `cursor` / `cursor[bot]`. **Tag by author,
+     not body text:** a *human* reply that quotes or mentions "Bugbot" is still a
+     human comment — don't tag it bugbot (that would mis-route it to
+     `/docs:bugbot`). Only treat a body mention of "Bugbot" as a signal when the
+     author is some other bot relaying bugbot's output.
    - **security** — the security scanner
    - **history** — the commit-history / semantic-similarity bot
    - **summary** — the AI PR summary (usually the PR body or a bot comment)
@@ -166,10 +175,15 @@ exactly what lets you adjudicate the tool findings.
      said is a deliberate CTF example; bugbot wants a refactor the history bot
      says was reverted before). Call these out loudest — they're where blind
      auto-fixing does damage.
-   - **Ping-pong** — using `created_at` order and the commits in between, spot
-     cases where a fix for one tool's comment re-triggered another tool, or the
-     same spot has been edited back and forth. Flag the loop rather than
-     extending it.
+   - **Ping-pong** — a genuine loop is narrower than it looks. Using `created_at`
+     order and the commits in between, flag it **only** when *the same concern is
+     reopened* (a finding you thought fixed comes back), or *a fix for tool A's
+     comment re-triggered tool B in a cycle* (A→fix→B→fix→A…). It is **not**
+     ping-pong when a fix round simply prompts a re-scan that surfaces **new,
+     independent findings** — even if they land in the region you just edited.
+     That's normal iteration; treat those as fresh clusters, not a loop. (Real
+     example of the non-loop case in the coverage ledger's near-miss log.) When
+     it *is* a true loop, flag it rather than extending it.
    - **Solo** — a single source, no corroboration → judge it on its merits and
      say how confident you are. A solo finding you've *verified against the code
      and data* is high-confidence even with one source.
diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md
index a529c1f612..fd481f22f5 100644
--- a/.claude/state/assess-comments.coverage.md
+++ b/.claude/state/assess-comments.coverage.md
@@ -36,12 +36,12 @@ whether to commit the change.
 | "Resolved ≠ fixed" flag — **legitimate deferral** variant | 🟡 seen once | 1 | 2026-06-23 | #3510 (TS.BGET:122 left pending eng) |
 | "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken |
 | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) |
-| **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding); #3507 (bugbot vs author intent, off-branch) |
-| **Ping-pong loop** detection | ❓ untested | 0 | — | never observed a real loop; all PRs so far had clean find→fix→resolve flow |
+| **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* |
+| **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop. #3536 was the first *post-fix re-scan* tested: round-1 fixes (commit `da7c04a`) re-triggered a bugbot re-review that raised 3 new findings, but they were independent/pre-existing, not a back-and-forth — correctly judged NOT a loop (near-miss recorded below) |
 | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) |
 | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) |
 | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run |
-| Bot calibration (fixed-vs-dismissed ratio) | 🟡 seen once | 1 | 2026-06-23 | #3374 (bugbot signal mostly accepted) |
+| Bot calibration (fixed-vs-dismissed ratio) | 🟢 corroborated | 2 | 2026-06-23 | #3374 (bugbot signal mostly accepted); #3536 (bugbot 5/5 findings valid across 2 rounds — high trust) |
 | Codex second-opinion availability gate | 🟢 corroborated | 2 | 2026-06-23 | #3415, #3374 (CLI on PATH; #3374 had a real Codex review) |
 
 ## Worked examples library
@@ -50,9 +50,20 @@ Concrete real-world signatures, so detection sharpens over time. Add to this
 whenever a rare pattern is seen for the first time.
 
 ### Ping-pong loops
-*(none observed yet — when the first real loop appears, record the tool
+*(no true loop observed yet — when the first real one appears, record the tool
 sequence, the comment ids, and the commits between them here)*
 
+**Near-miss (NOT a loop) — #3536, 2026-06-23.** Round-1 bugbot findings (ids
+3460160413, 3460160429 on commit `17517ed`) were fixed in commit `da7c04a`;
+bugbot re-reviewed and raised 3 *new* findings (ids 3460820452/470/474). One new
+finding (GraphQL pagination, `assess-comments.md:72-75`) sat in the same step I'd
+just edited — superficially loop-shaped. But the new findings were independent,
+pre-existing issues exposed by the file changing, **not** the same spot toggled
+back and forth or a fix being undone. Lesson for the detector: "fix → re-scan →
+new findings in an edited region" is normal iteration, **not** ping-pong. A true
+loop needs the *same concern* reopened, or tool A's fix re-triggering tool B in a
+cycle.
+
 ### Resolved-but-still-broken
 *(none confirmed yet — record any thread marked resolved whose bug was still
 present in the code)*

From b76371830235081a06c4d326364f71495c8e17e4 Mon Sep 17 00:00:00 2001
From: Andy Stark <andrew.stark@redis.com>
Date: Tue, 23 Jun 2026 16:26:41 +0100
Subject: [PATCH 4/7] DOC-6771 skill self-improvement

---
 .claude/state/assess-comments.coverage.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md
index fd481f22f5..09a497c50b 100644
--- a/.claude/state/assess-comments.coverage.md
+++ b/.claude/state/assess-comments.coverage.md
@@ -64,6 +64,11 @@ new findings in an edited region" is normal iteration, **not** ping-pong. A true
 loop needs the *same concern* reopened, or tool A's fix re-triggering tool B in a
 cycle.
 
+*Confirmed convergence (round 3):* the fixes for those 3 new findings were
+pushed, and bugbot's next re-scan came back **clean — no comments**. A real loop
+would have spawned another round; this settled. So the "not a loop" judgement is
+borne out by what happened next: assess → fix → re-scan reached a fixed point.
+
 ### Resolved-but-still-broken
 *(none confirmed yet — record any thread marked resolved whose bug was still
 present in the code)*

From 916dd3ecc039b13dc37020f6f9e17694611533a0 Mon Sep 17 00:00:00 2001
From: Andy Stark <andrew.stark@redis.com>
Date: Tue, 23 Jun 2026 16:40:01 +0100
Subject: [PATCH 5/7] DOC-6771 more Bugbot fixes

---
 .claude/commands/docs/assess-comments.md  | 33 +++++++++++++++++------
 .claude/state/assess-comments.coverage.md | 14 +++++++++-
 2 files changed, 38 insertions(+), 9 deletions(-)

diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md
index 32c1395a23..7af5e52658 100644
--- a/.claude/commands/docs/assess-comments.md
+++ b/.claude/commands/docs/assess-comments.md
@@ -92,10 +92,17 @@ exactly what lets you adjudicate the tool findings.
    would silently disable that safety check. In filtered mode, run step 8
    against the full comment set, not the filtered one.
 
-   If no comment matches `$ARGUMENTS`, stop and tell me.
-
-3. **Tag each comment by its source role.** You're reconciling *roles*, not
-   usernames. Classify each author as one of:
+   If no comment **or review verdict** matches `$ARGUMENTS`, stop and tell me — a
+   reviewer who left only an approve/request-changes verdict still counts as a
+   match.
+
+3. **Tag each comment _and review verdict_ by its source role.** You're
+   reconciling *roles*, not usernames. Reviews are first-class here: a
+   **request-changes** (or any review whose body raises specific, actionable
+   points) is a *finding* from its author's role, not just an approve/reject
+   signal for step 8 — tag and carry it forward like any comment. (A bare
+   approval with no actionable body stays a step-8 signal only.) Classify each
+   author as one of:
 
    - **bugbot** — author `login` is `cursor` / `cursor[bot]`. **Tag by author,
      not body text:** a *human* reply that quotes or mentions "Bugbot" is still a
@@ -113,10 +120,10 @@ exactly what lets you adjudicate the tool findings.
    healthy. Don't raise them as issues.
 
 4. **Cluster findings by what they touch, not by who said them.** Group the
-   actual findings (from bugbot, security, history, humans) by file + line, or
-   by shared theme when they're not line-specific. One cluster may hold comments
-   from several tools pointing at the same code — that overlap is the whole
-   point.
+   actual findings — from bugbot, security, history, humans, **and actionable
+   review-verdict bodies (step 3)** — by file + line, or by shared theme when
+   they're not line-specific. One cluster may hold comments from several tools
+   pointing at the same code — that overlap is the whole point.
 
 5. **Split clusters into OPEN and RESOLVED — and treat them differently.** Using
    the resolution state from step 2, mark each cluster open or resolved. An
@@ -184,6 +191,16 @@ exactly what lets you adjudicate the tool findings.
      That's normal iteration; treat those as fresh clusters, not a loop. (Real
      example of the non-loop case in the coverage ledger's near-miss log.) When
      it *is* a true loop, flag it rather than extending it.
+   - **Subsystem churn** — distinct from ping-pong, and the *earlier* warning
+     sign. Look across fix rounds: if several findings (even new, independent
+     ones) keep landing on the **same area of code you keep patching**, the
+     symptom isn't any single finding — it's that the subsystem is
+     under-specified and each patch just exposes the next adjacent gap.
+     Incrementally fixing the latest one tends to trigger another round. When you
+     see this, **recommend a single consolidated redesign of that subsystem
+     rather than another patch** — that's the move that actually ends the churn.
+     (Worked example in the ledger: PR #3536's review-handling findings across
+     rounds 1 and 4.)
    - **Solo** — a single source, no corroboration → judge it on its merits and
      say how confident you are. A solo finding you've *verified against the code
      and data* is high-confidence even with one source.
diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md
index 09a497c50b..89c0ef5455 100644
--- a/.claude/state/assess-comments.coverage.md
+++ b/.claude/state/assess-comments.coverage.md
@@ -37,7 +37,8 @@ whether to commit the change.
 | "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken |
 | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) |
 | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* |
-| **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop. #3536 was the first *post-fix re-scan* tested: round-1 fixes (commit `da7c04a`) re-triggered a bugbot re-review that raised 3 new findings, but they were independent/pre-existing, not a back-and-forth — correctly judged NOT a loop (near-miss recorded below) |
+| **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop across 4 bugbot rounds on #3536. Rounds 2 & 4 each raised new post-fix findings but none was a reopened concern or A↔B cycle — correctly judged NOT a loop both times. Round 4 instead revealed *subsystem churn* (next row) |
+| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen once | 1 | 2026-06-23 | #3536 — findings 429 (r1), 862 + 874 (r4) all cluster on `$ARGUMENTS` filter + review-verdict handling; recommended a consolidated redesign over another patch (worked example below) |
 | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) |
 | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) |
 | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run |
@@ -73,6 +74,17 @@ borne out by what happened next: assess → fix → re-scan reached a fixed poin
 *(none confirmed yet — record any thread marked resolved whose bug was still
 present in the code)*
 
+### Subsystem churn (not a loop, but the precursor)
+- **#3536 review-handling, 2026-06-23.** Bugbot finding 429 (round 1) flagged the
+  `$ARGUMENTS` filter dropping cross-check data; it was patched. Two rounds later
+  bugbot raised 862 ("filter abort ignores reviews") and 874 ("review bodies skip
+  clustering") — *new, independent* findings (so not ping-pong), but all three
+  cluster on the **same under-specified subsystem**: how review verdicts flow
+  through collect → filter → tag → cluster → abort. Root cause: reviews were
+  second-class vs comments. Signature: each incremental patch exposed the next
+  adjacent gap. Response that ended it: one **consolidated** fix promoting reviews
+  to first-class across steps 2–4, rather than patching 862/874 individually.
+
 ### Cross-tool agreement
 - **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot
   ("Wrong FT.INFO document count field", inline, 06-17) independently flagged

From ba7921d4013c1bda6b6b525b2c4a2bae6644076f Mon Sep 17 00:00:00 2001
From: Andy Stark <andrew.stark@redis.com>
Date: Tue, 23 Jun 2026 16:46:42 +0100
Subject: [PATCH 6/7] DOC-6771 yet more from the Bugbot

---
 .claude/commands/docs/assess-comments.md  | 13 ++++++++++---
 .claude/state/assess-comments.coverage.md | 10 +++++++++-
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md
index 7af5e52658..e0b5255112 100644
--- a/.claude/commands/docs/assess-comments.md
+++ b/.claude/commands/docs/assess-comments.md
@@ -169,6 +169,12 @@ exactly what lets you adjudicate the tool findings.
    **ping-pong loops**, and **resolved-but-not-outdated High/Critical findings**
    (step 5). Everything else competes for the remaining slots.
 
+   One assessment runs *outside* the cap entirely: **subsystem churn** (step 7).
+   Churn is a pattern across findings — including ones you deferred this round
+   and ones from prior rounds — so judge it against the **full** open+recent set,
+   never just the capped deep set. If a deferred cluster turns out to be part of
+   a churn pattern, pull it into deep-verify; the cap must not hide churn.
+
 7. **Adjudicate the chosen clusters — and verify, don't trust.** For each
    cluster in your deep set, open the file it points at (and the underlying data
    it depends on) and judge for yourself; these tools have false positives, and
@@ -224,7 +230,7 @@ exactly what lets you adjudicate the tool findings.
    | Cluster | File / line | Sources | Verdict | Recommended action |
    |---|---|---|---|---|
 
-   - **Verdict**: agreement / contradiction / ping-pong / solo / stale.
+   - **Verdict**: agreement / contradiction / ping-pong / churn / solo / stale.
    - **Recommended action**: one of — *safe to fix* (uncontested and low-risk;
      hand **bugbot-sourced** clusters to `/docs:bugbot`, but apply safe fixes
      from any other source — human, Codex, security, history — directly, since
@@ -233,8 +239,9 @@ exactly what lets you adjudicate the tool findings.
      say why); *already handled* (stale).
 
    Then, in this order:
-   - the **headline** — contradictions, ping-pong loops, and any approval-over-
-     open-finding from step 8 (these need a human) first;
+   - the **headline** — contradictions, ping-pong loops, **subsystem churn (with
+     its consolidate-don't-patch recommendation)**, and any approval-over-open-
+     finding from step 8 (these need a human) first;
    - the **safe-to-fix** list;
    - what you're **dismissing** and why, so I can sanity-check your reasoning;
    - a short **resolved-threads** note (from step 5) — only what's worth saying:
diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md
index 89c0ef5455..153b596781 100644
--- a/.claude/state/assess-comments.coverage.md
+++ b/.claude/state/assess-comments.coverage.md
@@ -38,7 +38,7 @@ whether to commit the change.
 | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) |
 | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* |
 | **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop across 4 bugbot rounds on #3536. Rounds 2 & 4 each raised new post-fix findings but none was a reopened concern or A↔B cycle — correctly judged NOT a loop both times. Round 4 instead revealed *subsystem churn* (next row) |
-| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen once | 1 | 2026-06-23 | #3536 — findings 429 (r1), 862 + 874 (r4) all cluster on `$ARGUMENTS` filter + review-verdict handling; recommended a consolidated redesign over another patch (worked example below) |
+| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen (1 PR, 2 instances) | 2 | 2026-06-23 | #3536 — (a) findings 429/862/874 on `$ARGUMENTS` filter + review handling; (b) round-5 findings 442/449 on the *churn feature itself* (added r4 in step 7, not threaded into steps 6 & 9). Both met with a consolidated fix, not a patch. Stays 🟡 — both on one PR; needs a 2nd PR for 🟢. Worked examples below |
 | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) |
 | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) |
 | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run |
@@ -84,6 +84,14 @@ present in the code)*
   second-class vs comments. Signature: each incremental patch exposed the next
   adjacent gap. Response that ended it: one **consolidated** fix promoting reviews
   to first-class across steps 2–4, rather than patching 862/874 individually.
+- **#3536 churn-on-the-churn-feature (round 5), 2026-06-23.** The *fix* for the
+  case above added "subsystem churn" to step 7 — but only there. Round 5 bugbot
+  raised 442 ("churn detection capped incorrectly" — gated behind step 6's depth
+  cap) and 449 ("report verdict omits churn" — missing from step 9's verdict list
+  and headline). Same signature as any churn: a feature added in one place,
+  adjacent integration gaps exposed next round. Cured by threading churn through
+  steps 6 and 9 in one pass. Lesson: adding a capability *is* a subsystem edit —
+  wire it through every dependent step at once, or it becomes its own churn.
 
 ### Cross-tool agreement
 - **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot

From d788bce3b00b5e721f8d9f45468e3e35522c5c09 Mon Sep 17 00:00:00 2001
From: Andy Stark <andrew.stark@redis.com>
Date: Tue, 23 Jun 2026 16:52:55 +0100
Subject: [PATCH 7/7] DOC-6771  more Bugbot stuff

---
 .claude/commands/docs/assess-comments.md  |  7 ++++++-
 .claude/state/assess-comments.coverage.md | 12 +++++++++++-
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md
index e0b5255112..3c3d30b84a 100644
--- a/.claude/commands/docs/assess-comments.md
+++ b/.claude/commands/docs/assess-comments.md
@@ -225,7 +225,12 @@ exactly what lets you adjudicate the tool findings.
    real issue merge — exactly the gap a reconciler should catch. Surface it even
    when the PR is marked WIP / don't-merge.
 
-9. **Produce the reconciliation report.** One row per **open** cluster:
+9. **Produce the reconciliation report.** One row per **open cluster you
+   deep-verified this round** (step 7) — those are the only clusters with a
+   real verdict. Open clusters you deferred under the step-6 cap do **not** get a
+   table row or a verdict (you haven't verified them); they're listed, plainly
+   marked unverified, in the "story so far" below. The table never promises more
+   coverage than you actually did.
 
    | Cluster | File / line | Sources | Verdict | Recommended action |
    |---|---|---|---|---|
diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md
index 153b596781..b54381cc05 100644
--- a/.claude/state/assess-comments.coverage.md
+++ b/.claude/state/assess-comments.coverage.md
@@ -38,7 +38,7 @@ whether to commit the change.
 | Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) |
 | **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* |
 | **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop across 4 bugbot rounds on #3536. Rounds 2 & 4 each raised new post-fix findings but none was a reopened concern or A↔B cycle — correctly judged NOT a loop both times. Round 4 instead revealed *subsystem churn* (next row) |
-| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen (1 PR, 2 instances) | 2 | 2026-06-23 | #3536 — (a) findings 429/862/874 on `$ARGUMENTS` filter + review handling; (b) round-5 findings 442/449 on the *churn feature itself* (added r4 in step 7, not threaded into steps 6 & 9). Both met with a consolidated fix, not a patch. Stays 🟡 — both on one PR; needs a 2nd PR for 🟢. Worked examples below |
+| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen (1 PR, 3 instances) | 3 | 2026-06-23 | #3536 — (a) 429/862/874 on `$ARGUMENTS` filter + review handling; (b) r5 442/449 on the *churn feature*; (c) r6 3461052859 on the *cap ↔ report contract* — i.e. (b)'s consolidation was too narrow. Pattern is robust on this PR; needs a 2nd PR for 🟢. Worked examples below |
 | Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) |
 | Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) |
 | Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run |
@@ -92,6 +92,16 @@ present in the code)*
   adjacent integration gaps exposed next round. Cured by threading churn through
   steps 6 and 9 in one pass. Lesson: adding a capability *is* a subsystem edit —
   wire it through every dependent step at once, or it becomes its own churn.
+- **#3536 churn persisted — consolidation scoped too narrowly (round 6),
+  2026-06-23.** The round-5 "consolidated" fix threaded the *churn feature*
+  through steps 6 & 9 but left the deeper contract under them unfixed, so round 6
+  raised finding 3461052859 ("open table needs full verdicts": step 9 promised a
+  verdict per open cluster while step 6 only verifies 3–5). Same area a third
+  time. Real root: the **cap ↔ report-completeness contract**, not the churn
+  feature. Fixed by making the table cover only deep-verified clusters and
+  marking deferred ones unverified. Meta-lesson: when round N+1 finds another gap
+  in an area you *just* "consolidated", your consolidation boundary was wrong —
+  widen it to the true subsystem, don't re-patch the edge.
 
 ### Cross-tool agreement
 - **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot