Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
302 changes: 302 additions & 0 deletions .claude/commands/docs/assess-comments.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,302 @@
---
description: Reconcile and assess all comments on the current branch's PR across every tool and commenter — report only, no edits
argument-hint: "[commenter-login] (optional — defaults to all commenters)"
---

You're going to read **all** the comments on the pull request for the current
branch — from automation (Cursor Bugbot, the security scanner, the history bot,
the AI PR summary, CI) *and* from humans — and produce a single **reconciliation
report**: what each comment raises, where the tools **agree**, where they
**contradict each other**, and where they're chasing each other in circles. This
is the synthesis layer the individual tools don't have.

**This command assesses and reports. It does not edit PR content, commit, push,
or reply** — the one and only file it may write is its own coverage ledger
(step 11). Fixing is a separate step. Note that `/docs:bugbot` only triages and
fixes **Cursor Bugbot's own** comments — so it's the right hand-off *only* for
bugbot-sourced clusters. Safe fixes that came from humans, Codex, the security
or history bots, etc. won't be touched by `/docs:bugbot`; apply those directly
(or ask me to) once you've seen the report.

It's also meant to be run **repeatedly** as the PR evolves: the bots tend to post
fresh comments after each round of fixes, so each run is a checkpoint — verify
the few highest-impact things deeply, leave a "story so far", and re-run to pick
up the rest. Don't try to settle the whole PR in one pass.

The command **tracks its own test coverage** in
`.claude/state/assess-comments.coverage.md`. Read it at the start of every run
(step 1) and update it at the end (step 11) — it's how the command knows,
honestly, which of its checks have actually fired on real data and which are
still unproven (ping-pong loop detection, notably). It's a *coverage* record,
not a correctness proof.

If `$ARGUMENTS` is given, treat it as a commenter login and assess **only** that
author's comments (match case-insensitively; `cursor` matches `cursor[bot]`).
With no arguments, assess **every** commenter — bots and humans alike, because
human intent ("this is a CTF example", "deliberate UK spelling in a quote") is
exactly what lets you adjudicate the tool findings.

1. **Load your coverage ledger, then find the branch and its PR.** First read
`.claude/state/assess-comments.coverage.md` so you know which of your own
checks are proven vs unproven this run (it also tells you to watch hardest for
the untested ones — ping-pong loops especially). Then run
`git branch --show-current`, and:

```
gh pr view --json number,url,title,headRefName,body
```

If there's no PR for this branch, stop and tell me — there's nothing to
assess. Keep the PR `body`: the AI summary usually lives there and is
**intent context**, not a finding to triage.

2. **Collect every comment, from every source.** Check all of these:

- **Inline review comments** (tied to lines):
```
gh api repos/{owner}/{repo}/pulls/<number>/comments --paginate
```
- **Top-level PR / issue comments** (summaries, bot posts, human replies):
```
gh api repos/{owner}/{repo}/issues/<number>/comments --paginate
```
- **Review verdicts** (approve / request-changes bodies):
```
gh api repos/{owner}/{repo}/pulls/<number>/reviews --paginate
```
- **Thread resolution state** — *don't skip this.* The REST `comments`
endpoint's `position`/outdated field is **not** a reliable "addressed"
signal (a resolved thread routinely still reports `position` non-null). The
authoritative signal is the GraphQL `reviewThreads`:
```
gh api graphql -F query='query { repository(owner:"<owner>", name:"<repo>")
{ pullRequest(number:<number>) { reviewThreads(first:100) { nodes {
isResolved isOutdated comments(first:100){ nodes {
databaseId author{login} path body } } } } } } }'
Comment thread
cursor[bot] marked this conversation as resolved.
```
`first:100` covers any normal PR, but unlike the `--paginate`d REST calls it
does **not** page — so if a PR genuinely has >100 review threads (or a thread
with >100 comments), check `pageInfo.hasNextPage` and follow the cursor, or
you'll silently miss resolution state and skew the open/resolved split.

Use `databaseId` to map each thread back to the inline comments from the
REST pull, so you know which are **resolved** vs **open**.

For each comment capture: author `login`, `created_at`, the body, and for
inline comments the `path`, line, and its thread's `isResolved` /
`isOutdated`. If `$ARGUMENTS` was given, it narrows which comments become
**findings you cluster and adjudicate** (steps 4–7) to the matching author —
but **don't discard the others**. Always retain every review verdict and the
full set of open findings as *context*, because step 8 (approval-over-open-
finding) needs to see approvals and bot findings together; filtering them out
would silently disable that safety check. In filtered mode, run step 8
against the full comment set, not the filtered one.

If no comment **or review verdict** matches `$ARGUMENTS`, stop and tell me — a
reviewer who left only an approve/request-changes verdict still counts as a
match.

3. **Tag each comment _and review verdict_ by its source role.** You're
reconciling *roles*, not usernames. Reviews are first-class here: a
**request-changes** (or any review whose body raises specific, actionable
points) is a *finding* from its author's role, not just an approve/reject
signal for step 8 — tag and carry it forward like any comment. (A bare
approval with no actionable body stays a step-8 signal only.) Classify each
author as one of:

- **bugbot** — author `login` is `cursor` / `cursor[bot]`. **Tag by author,
not body text:** a *human* reply that quotes or mentions "Bugbot" is still a
human comment — don't tag it bugbot (that would mis-route it to
`/docs:bugbot`). Only treat a body mention of "Bugbot" as a signal when the
author is some other bot relaying bugbot's output.
- **security** — the security scanner
- **history** — the commit-history / semantic-similarity bot
- **summary** — the AI PR summary (usually the PR body or a bot comment)
- **ci** — `github-actions` / build / staging-link bots
- **human** — everyone else
Comment thread
cursor[bot] marked this conversation as resolved.

Treat **summary**, **ci**, and staging-build links as *context*, not
findings — they tell you what the author intended and whether the build is
healthy. Don't raise them as issues.

4. **Cluster findings by what they touch, not by who said them.** Group the
actual findings — from bugbot, security, history, humans, **and actionable
review-verdict bodies (step 3)** — by file + line, or by shared theme when
they're not line-specific. One cluster may hold comments from several tools
pointing at the same code — that overlap is the whole point.

5. **Split clusters into OPEN and RESOLVED — and treat them differently.** Using
the resolution state from step 2, mark each cluster open or resolved. An
active PR is almost always a mix; don't let resolved threads drown the open
ones.

- **Open clusters** are your real work — they go through prioritise →
adjudicate (the steps below).
- **Resolved clusters** are *not* re-litigated as fresh findings. But they
aren't worthless either — do two cheap, high-value things with them:
- **Fix-quality spot-check.** For a sample of resolved findings (favour the
higher-impact ones), check that the committed change *actually addressed*
the point, not just silenced the comment — e.g. if a reviewer asked to
remove a term or fix a command, confirm it's genuinely gone/correct in
the current file. Flag any "fixed in name only."
- **"Resolved ≠ fixed" flag.** Call out any thread that is **resolved but
has no corresponding code change** (resolved, `isOutdated` false, and no
intervening commit touched the line). These are often legitimate
deferrals ("leaving as-is pending eng decision before GA") — but that's a
future obligation hiding inside a green checkmark, so surface it.
**Mandatory deep-verify:** any such thread that the bot rated **High (or
Critical) severity** must be opened and verified against the current code
*this round*, regardless of the depth cap in step 6 — a serious bug must
never hide behind a green checkmark just because the run hit its budget.
If it's genuinely fixed, say so; if it's resolved-but-still-present,
that's a headline.
- Resolved threads also feed **bot calibration**: the ratio of a bot's
findings that humans accepted-and-fixed vs dismissed tells you how much to
trust that bot's next *solo* finding. Note it when it's informative.

6. **Prioritise, and cap how deep you go.** This command is meant to be run
**repeatedly** over a PR's life — the bots re-comment after each round of
fixes, so a single run is never the whole story. Don't try to verify every
cluster to the bottom. Rank the **open** clusters by impact — broken
`relref`/`image`/`embed-md` paths, wrong commands or code samples, incorrect
technical claims, and anything that breaks a feature the PR description calls
out, rank highest; pure style nits lowest — and pick **at most the top 3–5**
to verify in depth this round. The rest you list shallowly (see step 9) as
"not assessed this round" so the next run can pick them up.

Three things always make the deep-verify set regardless of the cap, because
they're the failures a reconciler exists to catch: **contradictions**,
**ping-pong loops**, and **resolved-but-not-outdated High/Critical findings**
(step 5). Everything else competes for the remaining slots.

One assessment runs *outside* the cap entirely: **subsystem churn** (step 7).
Churn is a pattern across findings — including ones you deferred this round
and ones from prior rounds — so judge it against the **full** open+recent set,
never just the capped deep set. If a deferred cluster turns out to be part of
a churn pattern, pull it into deep-verify; the cap must not hide churn.

7. **Adjudicate the chosen clusters — and verify, don't trust.** For each
cluster in your deep set, open the file it points at (and the underlying data
it depends on) and judge for yourself; these tools have false positives, and
a single tool can be *correct about the code but wrong about impact*, or
*plausible but refuted once you read the data*. Label each:

- **Agreement** — multiple sources flag the same thing → high confidence it's
real.
- **Contradiction** — sources disagree, *or* a finding conflicts with stated
human intent or the PR summary (e.g. security flags a sample that a human
said is a deliberate CTF example; bugbot wants a refactor the history bot
says was reverted before). Call these out loudest — they're where blind
auto-fixing does damage.
- **Ping-pong** — a genuine loop is narrower than it looks. Using `created_at`
order and the commits in between, flag it **only** when *the same concern is
reopened* (a finding you thought fixed comes back), or *a fix for tool A's
comment re-triggered tool B in a cycle* (A→fix→B→fix→A…). It is **not**
ping-pong when a fix round simply prompts a re-scan that surfaces **new,
independent findings** — even if they land in the region you just edited.
That's normal iteration; treat those as fresh clusters, not a loop. (Real
example of the non-loop case in the coverage ledger's near-miss log.) When
it *is* a true loop, flag it rather than extending it.
- **Subsystem churn** — distinct from ping-pong, and the *earlier* warning
sign. Look across fix rounds: if several findings (even new, independent
ones) keep landing on the **same area of code you keep patching**, the
symptom isn't any single finding — it's that the subsystem is
under-specified and each patch just exposes the next adjacent gap.
Incrementally fixing the latest one tends to trigger another round. When you
see this, **recommend a single consolidated redesign of that subsystem
rather than another patch** — that's the move that actually ends the churn.
(Worked example in the ledger: PR #3536's review-handling findings across
rounds 1 and 4.)
Comment thread
cursor[bot] marked this conversation as resolved.
- **Solo** — a single source, no corroboration → judge it on its merits and
say how confident you are. A solo finding you've *verified against the code
and data* is high-confidence even with one source.
- **Stale / already-addressed** — the code no longer matches what the comment
describes; a later commit already handled it.

US-vs-UK spelling counts as a real issue in this docs repo even though it's
small; pure style nits usually don't.

8. **Cross-check review verdicts against open findings.** Separately from the
clusters, line up the PR's **review verdicts** (approve / request-changes)
against the findings still open. Flag the dangerous combination explicitly:
**a human approval (especially a low-confidence one — "skimmed it", "LGTM",
"didn't review in detail") sitting on top of an unresolved bot finding.**
Neither the reviewer nor the bot is wrong alone, but together they can let a
real issue merge — exactly the gap a reconciler should catch. Surface it even
when the PR is marked WIP / don't-merge.

9. **Produce the reconciliation report.** One row per **open cluster you
deep-verified this round** (step 7) — those are the only clusters with a
real verdict. Open clusters you deferred under the step-6 cap do **not** get a
table row or a verdict (you haven't verified them); they're listed, plainly
marked unverified, in the "story so far" below. The table never promises more
coverage than you actually did.

| Cluster | File / line | Sources | Verdict | Recommended action |
|---|---|---|---|---|

- **Verdict**: agreement / contradiction / ping-pong / churn / solo / stale.
Comment thread
cursor[bot] marked this conversation as resolved.
- **Recommended action**: one of — *safe to fix* (uncontested and low-risk;
hand **bugbot-sourced** clusters to `/docs:bugbot`, but apply safe fixes
from any other source — human, Codex, security, history — directly, since
`/docs:bugbot` won't touch those); *needs your call* (contradiction,
approval-over-finding, or history-bot warning); *dismiss* (false positive —
say why); *already handled* (stale).

Then, in this order:
- the **headline** — contradictions, ping-pong loops, **subsystem churn (with
its consolidate-don't-patch recommendation)**, and any approval-over-open-
finding from step 8 (these need a human) first;
- the **safe-to-fix** list;
- what you're **dismissing** and why, so I can sanity-check your reasoning;
- a short **resolved-threads** note (from step 5) — only what's worth saying:
any "fixed in name only" or "resolved ≠ fixed / deferred" flags, and a bot-
calibration line if it's informative. Don't list cleanly-resolved threads
one by one; a count is enough;
- a **"Story so far"** line: which open clusters you deep-verified this round,
how many you deferred as "not assessed this round" (list them by one-line
title), how many threads were already resolved, and a reminder that
re-running after the next round of fixes will pick up the deferred ones plus
any new comments. This is a checkpoint, not a verdict on the whole PR.

10. **Offer a second opinion via Codex, only if it's available.** Check whether
the Codex CLI is installed — run `command -v codex`. If it is, end by
suggesting the user get an independent review of the same diff the bots saw:

```
/codex:review --base main --scope branch
```

Frame it as optional and note it adds runtime, and that it's worth it mainly
for the contested or high-impact clusters — the ones where a second model
disagreeing (or agreeing) changes your confidence. Don't run it yourself:
`/codex:review` is user-invoked only, and this command is report-only. If
`command -v codex` finds nothing, say nothing about Codex.

11. **Update your coverage ledger — automatically, but evidence-gated.** Open
`.claude/state/assess-comments.coverage.md` and update it to reflect what
*actually fired on real data this run*:

- For each capability you genuinely exercised, bump its **encounters**, set
**last verified** to today, and add this PR (with a comment id or commit
SHA) to **Evidence**. Raise confidence per the scale — but **never raise a
confidence without a citable artifact.** No evidence, no bump.
- A capability counts as **🟢 corroborated** only on ≥2 *distinct* PRs, or
once with an independent `/codex:review` concurrence on the same finding.
- If you hit a **rare pattern for the first time** — a real ping-pong loop, a
resolved-but-still-broken thread — record its concrete signature in the
**Worked examples library** (tools involved, comment ids, commits). This is
how detection sharpens over time.
- Refresh ⏳ **decaying** entries you re-confirmed this run.
- If the run revealed that an *instruction* in this command should change
(a heuristic nearly missed something), **don't edit the command yourself**
— note the suggested refinement in your output and let me decide.

The ledger update is the one file you *may* write. After writing it, **tell
me plainly that the command learned something and updated its own coverage
ledger**, name what changed in one line, and remind me I can include that
change in the PR or leave it out — my call.

12. **Stop there.** Don't edit anything except the coverage ledger (step 11);
don't commit, push, or reply to the PR. Your output is the assessment —
I'll decide what to act on.
Loading
Loading