diff --git a/.claude/commands/docs/assess-comments.md b/.claude/commands/docs/assess-comments.md new file mode 100644 index 0000000000..3c3d30b84a --- /dev/null +++ b/.claude/commands/docs/assess-comments.md @@ -0,0 +1,302 @@ +--- +description: Reconcile and assess all comments on the current branch's PR across every tool and commenter — report only, no edits +argument-hint: "[commenter-login] (optional — defaults to all commenters)" +--- + +You're going to read **all** the comments on the pull request for the current +branch — from automation (Cursor Bugbot, the security scanner, the history bot, +the AI PR summary, CI) *and* from humans — and produce a single **reconciliation +report**: what each comment raises, where the tools **agree**, where they +**contradict each other**, and where they're chasing each other in circles. This +is the synthesis layer the individual tools don't have. + +**This command assesses and reports. It does not edit PR content, commit, push, +or reply** — the one and only file it may write is its own coverage ledger +(step 11). Fixing is a separate step. Note that `/docs:bugbot` only triages and +fixes **Cursor Bugbot's own** comments — so it's the right hand-off *only* for +bugbot-sourced clusters. Safe fixes that came from humans, Codex, the security +or history bots, etc. won't be touched by `/docs:bugbot`; apply those directly +(or ask me to) once you've seen the report. + +It's also meant to be run **repeatedly** as the PR evolves: the bots tend to post +fresh comments after each round of fixes, so each run is a checkpoint — verify +the few highest-impact things deeply, leave a "story so far", and re-run to pick +up the rest. Don't try to settle the whole PR in one pass. + +The command **tracks its own test coverage** in +`.claude/state/assess-comments.coverage.md`. Read it at the start of every run +(step 1) and update it at the end (step 11) — it's how the command knows, +honestly, which of its checks have actually fired on real data and which are +still unproven (ping-pong loop detection, notably). It's a *coverage* record, +not a correctness proof. + +If `$ARGUMENTS` is given, treat it as a commenter login and assess **only** that +author's comments (match case-insensitively; `cursor` matches `cursor[bot]`). +With no arguments, assess **every** commenter — bots and humans alike, because +human intent ("this is a CTF example", "deliberate UK spelling in a quote") is +exactly what lets you adjudicate the tool findings. + +1. **Load your coverage ledger, then find the branch and its PR.** First read + `.claude/state/assess-comments.coverage.md` so you know which of your own + checks are proven vs unproven this run (it also tells you to watch hardest for + the untested ones — ping-pong loops especially). Then run + `git branch --show-current`, and: + + ``` + gh pr view --json number,url,title,headRefName,body + ``` + + If there's no PR for this branch, stop and tell me — there's nothing to + assess. Keep the PR `body`: the AI summary usually lives there and is + **intent context**, not a finding to triage. + +2. **Collect every comment, from every source.** Check all of these: + + - **Inline review comments** (tied to lines): + ``` + gh api repos/{owner}/{repo}/pulls//comments --paginate + ``` + - **Top-level PR / issue comments** (summaries, bot posts, human replies): + ``` + gh api repos/{owner}/{repo}/issues//comments --paginate + ``` + - **Review verdicts** (approve / request-changes bodies): + ``` + gh api repos/{owner}/{repo}/pulls//reviews --paginate + ``` + - **Thread resolution state** — *don't skip this.* The REST `comments` + endpoint's `position`/outdated field is **not** a reliable "addressed" + signal (a resolved thread routinely still reports `position` non-null). The + authoritative signal is the GraphQL `reviewThreads`: + ``` + gh api graphql -F query='query { repository(owner:"", name:"") + { pullRequest(number:) { reviewThreads(first:100) { nodes { + isResolved isOutdated comments(first:100){ nodes { + databaseId author{login} path body } } } } } } }' + ``` + `first:100` covers any normal PR, but unlike the `--paginate`d REST calls it + does **not** page — so if a PR genuinely has >100 review threads (or a thread + with >100 comments), check `pageInfo.hasNextPage` and follow the cursor, or + you'll silently miss resolution state and skew the open/resolved split. + + Use `databaseId` to map each thread back to the inline comments from the + REST pull, so you know which are **resolved** vs **open**. + + For each comment capture: author `login`, `created_at`, the body, and for + inline comments the `path`, line, and its thread's `isResolved` / + `isOutdated`. If `$ARGUMENTS` was given, it narrows which comments become + **findings you cluster and adjudicate** (steps 4–7) to the matching author — + but **don't discard the others**. Always retain every review verdict and the + full set of open findings as *context*, because step 8 (approval-over-open- + finding) needs to see approvals and bot findings together; filtering them out + would silently disable that safety check. In filtered mode, run step 8 + against the full comment set, not the filtered one. + + If no comment **or review verdict** matches `$ARGUMENTS`, stop and tell me — a + reviewer who left only an approve/request-changes verdict still counts as a + match. + +3. **Tag each comment _and review verdict_ by its source role.** You're + reconciling *roles*, not usernames. Reviews are first-class here: a + **request-changes** (or any review whose body raises specific, actionable + points) is a *finding* from its author's role, not just an approve/reject + signal for step 8 — tag and carry it forward like any comment. (A bare + approval with no actionable body stays a step-8 signal only.) Classify each + author as one of: + + - **bugbot** — author `login` is `cursor` / `cursor[bot]`. **Tag by author, + not body text:** a *human* reply that quotes or mentions "Bugbot" is still a + human comment — don't tag it bugbot (that would mis-route it to + `/docs:bugbot`). Only treat a body mention of "Bugbot" as a signal when the + author is some other bot relaying bugbot's output. + - **security** — the security scanner + - **history** — the commit-history / semantic-similarity bot + - **summary** — the AI PR summary (usually the PR body or a bot comment) + - **ci** — `github-actions` / build / staging-link bots + - **human** — everyone else + + Treat **summary**, **ci**, and staging-build links as *context*, not + findings — they tell you what the author intended and whether the build is + healthy. Don't raise them as issues. + +4. **Cluster findings by what they touch, not by who said them.** Group the + actual findings — from bugbot, security, history, humans, **and actionable + review-verdict bodies (step 3)** — by file + line, or by shared theme when + they're not line-specific. One cluster may hold comments from several tools + pointing at the same code — that overlap is the whole point. + +5. **Split clusters into OPEN and RESOLVED — and treat them differently.** Using + the resolution state from step 2, mark each cluster open or resolved. An + active PR is almost always a mix; don't let resolved threads drown the open + ones. + + - **Open clusters** are your real work — they go through prioritise → + adjudicate (the steps below). + - **Resolved clusters** are *not* re-litigated as fresh findings. But they + aren't worthless either — do two cheap, high-value things with them: + - **Fix-quality spot-check.** For a sample of resolved findings (favour the + higher-impact ones), check that the committed change *actually addressed* + the point, not just silenced the comment — e.g. if a reviewer asked to + remove a term or fix a command, confirm it's genuinely gone/correct in + the current file. Flag any "fixed in name only." + - **"Resolved ≠ fixed" flag.** Call out any thread that is **resolved but + has no corresponding code change** (resolved, `isOutdated` false, and no + intervening commit touched the line). These are often legitimate + deferrals ("leaving as-is pending eng decision before GA") — but that's a + future obligation hiding inside a green checkmark, so surface it. + **Mandatory deep-verify:** any such thread that the bot rated **High (or + Critical) severity** must be opened and verified against the current code + *this round*, regardless of the depth cap in step 6 — a serious bug must + never hide behind a green checkmark just because the run hit its budget. + If it's genuinely fixed, say so; if it's resolved-but-still-present, + that's a headline. + - Resolved threads also feed **bot calibration**: the ratio of a bot's + findings that humans accepted-and-fixed vs dismissed tells you how much to + trust that bot's next *solo* finding. Note it when it's informative. + +6. **Prioritise, and cap how deep you go.** This command is meant to be run + **repeatedly** over a PR's life — the bots re-comment after each round of + fixes, so a single run is never the whole story. Don't try to verify every + cluster to the bottom. Rank the **open** clusters by impact — broken + `relref`/`image`/`embed-md` paths, wrong commands or code samples, incorrect + technical claims, and anything that breaks a feature the PR description calls + out, rank highest; pure style nits lowest — and pick **at most the top 3–5** + to verify in depth this round. The rest you list shallowly (see step 9) as + "not assessed this round" so the next run can pick them up. + + Three things always make the deep-verify set regardless of the cap, because + they're the failures a reconciler exists to catch: **contradictions**, + **ping-pong loops**, and **resolved-but-not-outdated High/Critical findings** + (step 5). Everything else competes for the remaining slots. + + One assessment runs *outside* the cap entirely: **subsystem churn** (step 7). + Churn is a pattern across findings — including ones you deferred this round + and ones from prior rounds — so judge it against the **full** open+recent set, + never just the capped deep set. If a deferred cluster turns out to be part of + a churn pattern, pull it into deep-verify; the cap must not hide churn. + +7. **Adjudicate the chosen clusters — and verify, don't trust.** For each + cluster in your deep set, open the file it points at (and the underlying data + it depends on) and judge for yourself; these tools have false positives, and + a single tool can be *correct about the code but wrong about impact*, or + *plausible but refuted once you read the data*. Label each: + + - **Agreement** — multiple sources flag the same thing → high confidence it's + real. + - **Contradiction** — sources disagree, *or* a finding conflicts with stated + human intent or the PR summary (e.g. security flags a sample that a human + said is a deliberate CTF example; bugbot wants a refactor the history bot + says was reverted before). Call these out loudest — they're where blind + auto-fixing does damage. + - **Ping-pong** — a genuine loop is narrower than it looks. Using `created_at` + order and the commits in between, flag it **only** when *the same concern is + reopened* (a finding you thought fixed comes back), or *a fix for tool A's + comment re-triggered tool B in a cycle* (A→fix→B→fix→A…). It is **not** + ping-pong when a fix round simply prompts a re-scan that surfaces **new, + independent findings** — even if they land in the region you just edited. + That's normal iteration; treat those as fresh clusters, not a loop. (Real + example of the non-loop case in the coverage ledger's near-miss log.) When + it *is* a true loop, flag it rather than extending it. + - **Subsystem churn** — distinct from ping-pong, and the *earlier* warning + sign. Look across fix rounds: if several findings (even new, independent + ones) keep landing on the **same area of code you keep patching**, the + symptom isn't any single finding — it's that the subsystem is + under-specified and each patch just exposes the next adjacent gap. + Incrementally fixing the latest one tends to trigger another round. When you + see this, **recommend a single consolidated redesign of that subsystem + rather than another patch** — that's the move that actually ends the churn. + (Worked example in the ledger: PR #3536's review-handling findings across + rounds 1 and 4.) + - **Solo** — a single source, no corroboration → judge it on its merits and + say how confident you are. A solo finding you've *verified against the code + and data* is high-confidence even with one source. + - **Stale / already-addressed** — the code no longer matches what the comment + describes; a later commit already handled it. + + US-vs-UK spelling counts as a real issue in this docs repo even though it's + small; pure style nits usually don't. + +8. **Cross-check review verdicts against open findings.** Separately from the + clusters, line up the PR's **review verdicts** (approve / request-changes) + against the findings still open. Flag the dangerous combination explicitly: + **a human approval (especially a low-confidence one — "skimmed it", "LGTM", + "didn't review in detail") sitting on top of an unresolved bot finding.** + Neither the reviewer nor the bot is wrong alone, but together they can let a + real issue merge — exactly the gap a reconciler should catch. Surface it even + when the PR is marked WIP / don't-merge. + +9. **Produce the reconciliation report.** One row per **open cluster you + deep-verified this round** (step 7) — those are the only clusters with a + real verdict. Open clusters you deferred under the step-6 cap do **not** get a + table row or a verdict (you haven't verified them); they're listed, plainly + marked unverified, in the "story so far" below. The table never promises more + coverage than you actually did. + + | Cluster | File / line | Sources | Verdict | Recommended action | + |---|---|---|---|---| + + - **Verdict**: agreement / contradiction / ping-pong / churn / solo / stale. + - **Recommended action**: one of — *safe to fix* (uncontested and low-risk; + hand **bugbot-sourced** clusters to `/docs:bugbot`, but apply safe fixes + from any other source — human, Codex, security, history — directly, since + `/docs:bugbot` won't touch those); *needs your call* (contradiction, + approval-over-finding, or history-bot warning); *dismiss* (false positive — + say why); *already handled* (stale). + + Then, in this order: + - the **headline** — contradictions, ping-pong loops, **subsystem churn (with + its consolidate-don't-patch recommendation)**, and any approval-over-open- + finding from step 8 (these need a human) first; + - the **safe-to-fix** list; + - what you're **dismissing** and why, so I can sanity-check your reasoning; + - a short **resolved-threads** note (from step 5) — only what's worth saying: + any "fixed in name only" or "resolved ≠ fixed / deferred" flags, and a bot- + calibration line if it's informative. Don't list cleanly-resolved threads + one by one; a count is enough; + - a **"Story so far"** line: which open clusters you deep-verified this round, + how many you deferred as "not assessed this round" (list them by one-line + title), how many threads were already resolved, and a reminder that + re-running after the next round of fixes will pick up the deferred ones plus + any new comments. This is a checkpoint, not a verdict on the whole PR. + +10. **Offer a second opinion via Codex, only if it's available.** Check whether + the Codex CLI is installed — run `command -v codex`. If it is, end by + suggesting the user get an independent review of the same diff the bots saw: + + ``` + /codex:review --base main --scope branch + ``` + + Frame it as optional and note it adds runtime, and that it's worth it mainly + for the contested or high-impact clusters — the ones where a second model + disagreeing (or agreeing) changes your confidence. Don't run it yourself: + `/codex:review` is user-invoked only, and this command is report-only. If + `command -v codex` finds nothing, say nothing about Codex. + +11. **Update your coverage ledger — automatically, but evidence-gated.** Open + `.claude/state/assess-comments.coverage.md` and update it to reflect what + *actually fired on real data this run*: + + - For each capability you genuinely exercised, bump its **encounters**, set + **last verified** to today, and add this PR (with a comment id or commit + SHA) to **Evidence**. Raise confidence per the scale — but **never raise a + confidence without a citable artifact.** No evidence, no bump. + - A capability counts as **🟢 corroborated** only on ≥2 *distinct* PRs, or + once with an independent `/codex:review` concurrence on the same finding. + - If you hit a **rare pattern for the first time** — a real ping-pong loop, a + resolved-but-still-broken thread — record its concrete signature in the + **Worked examples library** (tools involved, comment ids, commits). This is + how detection sharpens over time. + - Refresh ⏳ **decaying** entries you re-confirmed this run. + - If the run revealed that an *instruction* in this command should change + (a heuristic nearly missed something), **don't edit the command yourself** + — note the suggested refinement in your output and let me decide. + + The ledger update is the one file you *may* write. After writing it, **tell + me plainly that the command learned something and updated its own coverage + ledger**, name what changed in one line, and remind me I can include that + change in the PR or leave it out — my call. + +12. **Stop there.** Don't edit anything except the coverage ledger (step 11); + don't commit, push, or reply to the PR. Your output is the assessment — + I'll decide what to act on. diff --git a/.claude/state/assess-comments.coverage.md b/.claude/state/assess-comments.coverage.md new file mode 100644 index 0000000000..b54381cc05 --- /dev/null +++ b/.claude/state/assess-comments.coverage.md @@ -0,0 +1,109 @@ +# assess-comments — coverage ledger + +This file is the self-tracked **test coverage** of the `/docs:assess-comments` +command. It records, per capability, how confident we are that the logic has +actually *fired correctly on real data* — not how accurate its judgement is. + +**Read this honestly:** this is a *coverage* tracker, not a correctness proof. +It can only record capabilities it has *seen fire* (positive cases). It cannot +see false negatives — e.g. a ping-pong loop that was present but missed — so a +🟢 means "exercised on real PRs", never "guaranteed correct". + +The command reads this at the start of every run and updates it at the end +(step 1 and step 11). Updates are **automatic but evidence-gated**: no +confidence may rise without a citable artifact (PR number + comment id / commit +SHA). When the command updates this file it tells the user, who can choose +whether to commit the change. + +## Confidence scale +- ❓ **untested** — logic is written but has never fired on real data. +- 🟡 **seen once** — fired correctly on a single real PR. +- 🟢 **corroborated** — fired correctly on ≥2 *distinct* PRs, **or** once with an + independent `/codex:review` concurrence on the same finding. +- ⏳ **decaying** — was 🟡/🟢 but not re-confirmed in a long time (treat the API + shape, bot identities, and repo conventions as possibly drifted; re-verify). + +## Capabilities + +| Capability | Confidence | Real encounters | Last verified | Evidence | +|---|---|---|---|---| +| Branch/PR identification + arg handling | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3374, #3536 | +| Multi-source collection (inline + top-level + reviews) | 🟢 corroborated | 5 | 2026-06-23 | #3415, #3507, #3510, #3374, #3536 | +| GraphQL thread-resolution pull (`isResolved`/`isOutdated`) | 🟢 corroborated | 3 | 2026-06-23 | #3510 (12/12 resolved), #3374 (15/15), #3536 (2/2 open) | +| Source-role tagging (bugbot/security/history/summary/ci/human) | 🟢 corroborated | 4 | 2026-06-23 | #3415, #3507, #3374, #3536 | +| Open/resolved split | 🟢 corroborated | 3 | 2026-06-23 | #3510, #3374, #3536 (0 resolved / 2 open) | +| Fix-quality spot-check (genuinely fixed vs silenced) | 🟢 corroborated | 2 | 2026-06-23 | #3510 (term removals landed), #3374 (`num_docs`, dropIndex landed) | +| "Resolved ≠ fixed" flag — **legitimate deferral** variant | 🟡 seen once | 1 | 2026-06-23 | #3510 (TS.BGET:122 left pending eng) | +| "Resolved ≠ fixed" flag — **still-broken** variant | ❓ untested | 0 | — | never confirmed a resolved thread that was actually still broken | +| Cross-tool **agreement** | 🟡 seen once | 1 | 2026-06-23 | #3374 (Claude + bugbot independently on `num_docs`) | +| **Contradiction** detection | 🟡 seen once | 1 | 2026-06-23 | #3415 (approval vs open bugbot finding). *(#3507 bugbot-vs-author was an off-branch manual demo — illustrative, not counted toward encounters.)* | +| **Ping-pong loop** detection | ❓ untested | 0 | 2026-06-23 | still no real loop across 4 bugbot rounds on #3536. Rounds 2 & 4 each raised new post-fix findings but none was a reopened concern or A↔B cycle — correctly judged NOT a loop both times. Round 4 instead revealed *subsystem churn* (next row) | +| **Subsystem churn** detection (repeated findings on one patched area) | 🟡 seen (1 PR, 3 instances) | 3 | 2026-06-23 | #3536 — (a) 429/862/874 on `$ARGUMENTS` filter + review handling; (b) r5 442/449 on the *churn feature*; (c) r6 3461052859 on the *cap ↔ report contract* — i.e. (b)'s consolidation was too narrow. Pattern is robust on this PR; needs a 2nd PR for 🟢. Worked examples below | +| Approval-over-open-finding cross-check | 🟢 corroborated | 3 | 2026-06-23 | #3415 (dwdougherty), #3374 (dwdougherty low-confidence over open HIGH), #3536 (dwdougherty high-confidence — tested — over 2 open Mediums: benign variant) | +| Depth cap / prioritisation under load | 🟡 seen once | 1 | 2026-06-23 | #3374 (19 candidate findings → 4 deep-verified) | +| Mandatory deep-verify of resolved+not-outdated HIGH | ❓ untested | 0 | — | rule added 2026-06-23; not yet fired on a fresh run | +| Bot calibration (fixed-vs-dismissed ratio) | 🟢 corroborated | 2 | 2026-06-23 | #3374 (bugbot signal mostly accepted); #3536 (bugbot 5/5 findings valid across 2 rounds — high trust) | +| Codex second-opinion availability gate | 🟢 corroborated | 2 | 2026-06-23 | #3415, #3374 (CLI on PATH; #3374 had a real Codex review) | + +## Worked examples library + +Concrete real-world signatures, so detection sharpens over time. Add to this +whenever a rare pattern is seen for the first time. + +### Ping-pong loops +*(no true loop observed yet — when the first real one appears, record the tool +sequence, the comment ids, and the commits between them here)* + +**Near-miss (NOT a loop) — #3536, 2026-06-23.** Round-1 bugbot findings (ids +3460160413, 3460160429 on commit `17517ed`) were fixed in commit `da7c04a`; +bugbot re-reviewed and raised 3 *new* findings (ids 3460820452/470/474). One new +finding (GraphQL pagination, `assess-comments.md:72-75`) sat in the same step I'd +just edited — superficially loop-shaped. But the new findings were independent, +pre-existing issues exposed by the file changing, **not** the same spot toggled +back and forth or a fix being undone. Lesson for the detector: "fix → re-scan → +new findings in an edited region" is normal iteration, **not** ping-pong. A true +loop needs the *same concern* reopened, or tool A's fix re-triggering tool B in a +cycle. + +*Confirmed convergence (round 3):* the fixes for those 3 new findings were +pushed, and bugbot's next re-scan came back **clean — no comments**. A real loop +would have spawned another round; this settled. So the "not a loop" judgement is +borne out by what happened next: assess → fix → re-scan reached a fixed point. + +### Resolved-but-still-broken +*(none confirmed yet — record any thread marked resolved whose bug was still +present in the code)* + +### Subsystem churn (not a loop, but the precursor) +- **#3536 review-handling, 2026-06-23.** Bugbot finding 429 (round 1) flagged the + `$ARGUMENTS` filter dropping cross-check data; it was patched. Two rounds later + bugbot raised 862 ("filter abort ignores reviews") and 874 ("review bodies skip + clustering") — *new, independent* findings (so not ping-pong), but all three + cluster on the **same under-specified subsystem**: how review verdicts flow + through collect → filter → tag → cluster → abort. Root cause: reviews were + second-class vs comments. Signature: each incremental patch exposed the next + adjacent gap. Response that ended it: one **consolidated** fix promoting reviews + to first-class across steps 2–4, rather than patching 862/874 individually. +- **#3536 churn-on-the-churn-feature (round 5), 2026-06-23.** The *fix* for the + case above added "subsystem churn" to step 7 — but only there. Round 5 bugbot + raised 442 ("churn detection capped incorrectly" — gated behind step 6's depth + cap) and 449 ("report verdict omits churn" — missing from step 9's verdict list + and headline). Same signature as any churn: a feature added in one place, + adjacent integration gaps exposed next round. Cured by threading churn through + steps 6 and 9 in one pass. Lesson: adding a capability *is* a subsystem edit — + wire it through every dependent step at once, or it becomes its own churn. +- **#3536 churn persisted — consolidation scoped too narrowly (round 6), + 2026-06-23.** The round-5 "consolidated" fix threaded the *churn feature* + through steps 6 & 9 but left the deeper contract under them unfixed, so round 6 + raised finding 3461052859 ("open table needs full verdicts": step 9 promised a + verdict per open cluster while step 6 only verifies 3–5). Same area a third + time. Real root: the **cap ↔ report-completeness contract**, not the churn + feature. Fixed by making the table cover only deep-verified clusters and + marking deferred ones unverified. Meta-lesson: when round N+1 finds another gap + in an area you *just* "consolidated", your consolidation boundary was wrong — + widen it to the true subsystem, don't re-patch the edge. + +### Cross-tool agreement +- **#3374 `num_docs`** — Claude (Critical #2, top-level review) and bugbot + ("Wrong FT.INFO document count field", inline, 06-17) independently flagged + the same `info.numDocs` → `num_docs` bug. Verified fixed in current code.