-
Notifications
You must be signed in to change notification settings - Fork 335
DOC-6771 added skill and slash command to address PR comments #3536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
andy-stark-redis
merged 7 commits into
main
from
DOC-6771-skill-to-assess-all-specific-pr-comments
Jun 23, 2026
+411
−0
Merged
Changes from all commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
17517ed
DOC-6771 added skill to address PR comments
andy-stark-redis da7c04a
DOC-6771 bugbot fixes plus skill lessons learned
andy-stark-redis b616cf7
DOC-6771 more from the Bugbot
andy-stark-redis b763718
DOC-6771 skill self-improvement
andy-stark-redis 916dd3e
DOC-6771 more Bugbot fixes
andy-stark-redis ba7921d
DOC-6771 yet more from the Bugbot
andy-stark-redis d788bce
DOC-6771
andy-stark-redis File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,302 @@ | ||
| --- | ||
| description: Reconcile and assess all comments on the current branch's PR across every tool and commenter — report only, no edits | ||
| argument-hint: "[commenter-login] (optional — defaults to all commenters)" | ||
| --- | ||
|
|
||
| You're going to read **all** the comments on the pull request for the current | ||
| branch — from automation (Cursor Bugbot, the security scanner, the history bot, | ||
| the AI PR summary, CI) *and* from humans — and produce a single **reconciliation | ||
| report**: what each comment raises, where the tools **agree**, where they | ||
| **contradict each other**, and where they're chasing each other in circles. This | ||
| is the synthesis layer the individual tools don't have. | ||
|
|
||
| **This command assesses and reports. It does not edit PR content, commit, push, | ||
| or reply** — the one and only file it may write is its own coverage ledger | ||
| (step 11). Fixing is a separate step. Note that `/docs:bugbot` only triages and | ||
| fixes **Cursor Bugbot's own** comments — so it's the right hand-off *only* for | ||
| bugbot-sourced clusters. Safe fixes that came from humans, Codex, the security | ||
| or history bots, etc. won't be touched by `/docs:bugbot`; apply those directly | ||
| (or ask me to) once you've seen the report. | ||
|
|
||
| It's also meant to be run **repeatedly** as the PR evolves: the bots tend to post | ||
| fresh comments after each round of fixes, so each run is a checkpoint — verify | ||
| the few highest-impact things deeply, leave a "story so far", and re-run to pick | ||
| up the rest. Don't try to settle the whole PR in one pass. | ||
|
|
||
| The command **tracks its own test coverage** in | ||
| `.claude/state/assess-comments.coverage.md`. Read it at the start of every run | ||
| (step 1) and update it at the end (step 11) — it's how the command knows, | ||
| honestly, which of its checks have actually fired on real data and which are | ||
| still unproven (ping-pong loop detection, notably). It's a *coverage* record, | ||
| not a correctness proof. | ||
|
|
||
| If `$ARGUMENTS` is given, treat it as a commenter login and assess **only** that | ||
| author's comments (match case-insensitively; `cursor` matches `cursor[bot]`). | ||
| With no arguments, assess **every** commenter — bots and humans alike, because | ||
| human intent ("this is a CTF example", "deliberate UK spelling in a quote") is | ||
| exactly what lets you adjudicate the tool findings. | ||
|
|
||
| 1. **Load your coverage ledger, then find the branch and its PR.** First read | ||
| `.claude/state/assess-comments.coverage.md` so you know which of your own | ||
| checks are proven vs unproven this run (it also tells you to watch hardest for | ||
| the untested ones — ping-pong loops especially). Then run | ||
| `git branch --show-current`, and: | ||
|
|
||
| ``` | ||
| gh pr view --json number,url,title,headRefName,body | ||
| ``` | ||
|
|
||
| If there's no PR for this branch, stop and tell me — there's nothing to | ||
| assess. Keep the PR `body`: the AI summary usually lives there and is | ||
| **intent context**, not a finding to triage. | ||
|
|
||
| 2. **Collect every comment, from every source.** Check all of these: | ||
|
|
||
| - **Inline review comments** (tied to lines): | ||
| ``` | ||
| gh api repos/{owner}/{repo}/pulls/<number>/comments --paginate | ||
| ``` | ||
| - **Top-level PR / issue comments** (summaries, bot posts, human replies): | ||
| ``` | ||
| gh api repos/{owner}/{repo}/issues/<number>/comments --paginate | ||
| ``` | ||
| - **Review verdicts** (approve / request-changes bodies): | ||
| ``` | ||
| gh api repos/{owner}/{repo}/pulls/<number>/reviews --paginate | ||
| ``` | ||
| - **Thread resolution state** — *don't skip this.* The REST `comments` | ||
| endpoint's `position`/outdated field is **not** a reliable "addressed" | ||
| signal (a resolved thread routinely still reports `position` non-null). The | ||
| authoritative signal is the GraphQL `reviewThreads`: | ||
| ``` | ||
| gh api graphql -F query='query { repository(owner:"<owner>", name:"<repo>") | ||
| { pullRequest(number:<number>) { reviewThreads(first:100) { nodes { | ||
| isResolved isOutdated comments(first:100){ nodes { | ||
| databaseId author{login} path body } } } } } } }' | ||
| ``` | ||
| `first:100` covers any normal PR, but unlike the `--paginate`d REST calls it | ||
| does **not** page — so if a PR genuinely has >100 review threads (or a thread | ||
| with >100 comments), check `pageInfo.hasNextPage` and follow the cursor, or | ||
| you'll silently miss resolution state and skew the open/resolved split. | ||
|
|
||
| Use `databaseId` to map each thread back to the inline comments from the | ||
| REST pull, so you know which are **resolved** vs **open**. | ||
|
|
||
| For each comment capture: author `login`, `created_at`, the body, and for | ||
| inline comments the `path`, line, and its thread's `isResolved` / | ||
| `isOutdated`. If `$ARGUMENTS` was given, it narrows which comments become | ||
| **findings you cluster and adjudicate** (steps 4–7) to the matching author — | ||
| but **don't discard the others**. Always retain every review verdict and the | ||
| full set of open findings as *context*, because step 8 (approval-over-open- | ||
| finding) needs to see approvals and bot findings together; filtering them out | ||
| would silently disable that safety check. In filtered mode, run step 8 | ||
| against the full comment set, not the filtered one. | ||
|
|
||
| If no comment **or review verdict** matches `$ARGUMENTS`, stop and tell me — a | ||
| reviewer who left only an approve/request-changes verdict still counts as a | ||
| match. | ||
|
|
||
| 3. **Tag each comment _and review verdict_ by its source role.** You're | ||
| reconciling *roles*, not usernames. Reviews are first-class here: a | ||
| **request-changes** (or any review whose body raises specific, actionable | ||
| points) is a *finding* from its author's role, not just an approve/reject | ||
| signal for step 8 — tag and carry it forward like any comment. (A bare | ||
| approval with no actionable body stays a step-8 signal only.) Classify each | ||
| author as one of: | ||
|
|
||
| - **bugbot** — author `login` is `cursor` / `cursor[bot]`. **Tag by author, | ||
| not body text:** a *human* reply that quotes or mentions "Bugbot" is still a | ||
| human comment — don't tag it bugbot (that would mis-route it to | ||
| `/docs:bugbot`). Only treat a body mention of "Bugbot" as a signal when the | ||
| author is some other bot relaying bugbot's output. | ||
| - **security** — the security scanner | ||
| - **history** — the commit-history / semantic-similarity bot | ||
| - **summary** — the AI PR summary (usually the PR body or a bot comment) | ||
| - **ci** — `github-actions` / build / staging-link bots | ||
| - **human** — everyone else | ||
|
cursor[bot] marked this conversation as resolved.
|
||
|
|
||
| Treat **summary**, **ci**, and staging-build links as *context*, not | ||
| findings — they tell you what the author intended and whether the build is | ||
| healthy. Don't raise them as issues. | ||
|
|
||
| 4. **Cluster findings by what they touch, not by who said them.** Group the | ||
| actual findings — from bugbot, security, history, humans, **and actionable | ||
| review-verdict bodies (step 3)** — by file + line, or by shared theme when | ||
| they're not line-specific. One cluster may hold comments from several tools | ||
| pointing at the same code — that overlap is the whole point. | ||
|
|
||
| 5. **Split clusters into OPEN and RESOLVED — and treat them differently.** Using | ||
| the resolution state from step 2, mark each cluster open or resolved. An | ||
| active PR is almost always a mix; don't let resolved threads drown the open | ||
| ones. | ||
|
|
||
| - **Open clusters** are your real work — they go through prioritise → | ||
| adjudicate (the steps below). | ||
| - **Resolved clusters** are *not* re-litigated as fresh findings. But they | ||
| aren't worthless either — do two cheap, high-value things with them: | ||
| - **Fix-quality spot-check.** For a sample of resolved findings (favour the | ||
| higher-impact ones), check that the committed change *actually addressed* | ||
| the point, not just silenced the comment — e.g. if a reviewer asked to | ||
| remove a term or fix a command, confirm it's genuinely gone/correct in | ||
| the current file. Flag any "fixed in name only." | ||
| - **"Resolved ≠ fixed" flag.** Call out any thread that is **resolved but | ||
| has no corresponding code change** (resolved, `isOutdated` false, and no | ||
| intervening commit touched the line). These are often legitimate | ||
| deferrals ("leaving as-is pending eng decision before GA") — but that's a | ||
| future obligation hiding inside a green checkmark, so surface it. | ||
| **Mandatory deep-verify:** any such thread that the bot rated **High (or | ||
| Critical) severity** must be opened and verified against the current code | ||
| *this round*, regardless of the depth cap in step 6 — a serious bug must | ||
| never hide behind a green checkmark just because the run hit its budget. | ||
| If it's genuinely fixed, say so; if it's resolved-but-still-present, | ||
| that's a headline. | ||
| - Resolved threads also feed **bot calibration**: the ratio of a bot's | ||
| findings that humans accepted-and-fixed vs dismissed tells you how much to | ||
| trust that bot's next *solo* finding. Note it when it's informative. | ||
|
|
||
| 6. **Prioritise, and cap how deep you go.** This command is meant to be run | ||
| **repeatedly** over a PR's life — the bots re-comment after each round of | ||
| fixes, so a single run is never the whole story. Don't try to verify every | ||
| cluster to the bottom. Rank the **open** clusters by impact — broken | ||
| `relref`/`image`/`embed-md` paths, wrong commands or code samples, incorrect | ||
| technical claims, and anything that breaks a feature the PR description calls | ||
| out, rank highest; pure style nits lowest — and pick **at most the top 3–5** | ||
| to verify in depth this round. The rest you list shallowly (see step 9) as | ||
| "not assessed this round" so the next run can pick them up. | ||
|
|
||
| Three things always make the deep-verify set regardless of the cap, because | ||
| they're the failures a reconciler exists to catch: **contradictions**, | ||
| **ping-pong loops**, and **resolved-but-not-outdated High/Critical findings** | ||
| (step 5). Everything else competes for the remaining slots. | ||
|
|
||
| One assessment runs *outside* the cap entirely: **subsystem churn** (step 7). | ||
| Churn is a pattern across findings — including ones you deferred this round | ||
| and ones from prior rounds — so judge it against the **full** open+recent set, | ||
| never just the capped deep set. If a deferred cluster turns out to be part of | ||
| a churn pattern, pull it into deep-verify; the cap must not hide churn. | ||
|
|
||
| 7. **Adjudicate the chosen clusters — and verify, don't trust.** For each | ||
| cluster in your deep set, open the file it points at (and the underlying data | ||
| it depends on) and judge for yourself; these tools have false positives, and | ||
| a single tool can be *correct about the code but wrong about impact*, or | ||
| *plausible but refuted once you read the data*. Label each: | ||
|
|
||
| - **Agreement** — multiple sources flag the same thing → high confidence it's | ||
| real. | ||
| - **Contradiction** — sources disagree, *or* a finding conflicts with stated | ||
| human intent or the PR summary (e.g. security flags a sample that a human | ||
| said is a deliberate CTF example; bugbot wants a refactor the history bot | ||
| says was reverted before). Call these out loudest — they're where blind | ||
| auto-fixing does damage. | ||
| - **Ping-pong** — a genuine loop is narrower than it looks. Using `created_at` | ||
| order and the commits in between, flag it **only** when *the same concern is | ||
| reopened* (a finding you thought fixed comes back), or *a fix for tool A's | ||
| comment re-triggered tool B in a cycle* (A→fix→B→fix→A…). It is **not** | ||
| ping-pong when a fix round simply prompts a re-scan that surfaces **new, | ||
| independent findings** — even if they land in the region you just edited. | ||
| That's normal iteration; treat those as fresh clusters, not a loop. (Real | ||
| example of the non-loop case in the coverage ledger's near-miss log.) When | ||
| it *is* a true loop, flag it rather than extending it. | ||
| - **Subsystem churn** — distinct from ping-pong, and the *earlier* warning | ||
| sign. Look across fix rounds: if several findings (even new, independent | ||
| ones) keep landing on the **same area of code you keep patching**, the | ||
| symptom isn't any single finding — it's that the subsystem is | ||
| under-specified and each patch just exposes the next adjacent gap. | ||
| Incrementally fixing the latest one tends to trigger another round. When you | ||
| see this, **recommend a single consolidated redesign of that subsystem | ||
| rather than another patch** — that's the move that actually ends the churn. | ||
| (Worked example in the ledger: PR #3536's review-handling findings across | ||
| rounds 1 and 4.) | ||
|
cursor[bot] marked this conversation as resolved.
|
||
| - **Solo** — a single source, no corroboration → judge it on its merits and | ||
| say how confident you are. A solo finding you've *verified against the code | ||
| and data* is high-confidence even with one source. | ||
| - **Stale / already-addressed** — the code no longer matches what the comment | ||
| describes; a later commit already handled it. | ||
|
|
||
| US-vs-UK spelling counts as a real issue in this docs repo even though it's | ||
| small; pure style nits usually don't. | ||
|
|
||
| 8. **Cross-check review verdicts against open findings.** Separately from the | ||
| clusters, line up the PR's **review verdicts** (approve / request-changes) | ||
| against the findings still open. Flag the dangerous combination explicitly: | ||
| **a human approval (especially a low-confidence one — "skimmed it", "LGTM", | ||
| "didn't review in detail") sitting on top of an unresolved bot finding.** | ||
| Neither the reviewer nor the bot is wrong alone, but together they can let a | ||
| real issue merge — exactly the gap a reconciler should catch. Surface it even | ||
| when the PR is marked WIP / don't-merge. | ||
|
|
||
| 9. **Produce the reconciliation report.** One row per **open cluster you | ||
| deep-verified this round** (step 7) — those are the only clusters with a | ||
| real verdict. Open clusters you deferred under the step-6 cap do **not** get a | ||
| table row or a verdict (you haven't verified them); they're listed, plainly | ||
| marked unverified, in the "story so far" below. The table never promises more | ||
| coverage than you actually did. | ||
|
|
||
| | Cluster | File / line | Sources | Verdict | Recommended action | | ||
| |---|---|---|---|---| | ||
|
|
||
| - **Verdict**: agreement / contradiction / ping-pong / churn / solo / stale. | ||
|
cursor[bot] marked this conversation as resolved.
|
||
| - **Recommended action**: one of — *safe to fix* (uncontested and low-risk; | ||
| hand **bugbot-sourced** clusters to `/docs:bugbot`, but apply safe fixes | ||
| from any other source — human, Codex, security, history — directly, since | ||
| `/docs:bugbot` won't touch those); *needs your call* (contradiction, | ||
| approval-over-finding, or history-bot warning); *dismiss* (false positive — | ||
| say why); *already handled* (stale). | ||
|
|
||
| Then, in this order: | ||
| - the **headline** — contradictions, ping-pong loops, **subsystem churn (with | ||
| its consolidate-don't-patch recommendation)**, and any approval-over-open- | ||
| finding from step 8 (these need a human) first; | ||
| - the **safe-to-fix** list; | ||
| - what you're **dismissing** and why, so I can sanity-check your reasoning; | ||
| - a short **resolved-threads** note (from step 5) — only what's worth saying: | ||
| any "fixed in name only" or "resolved ≠ fixed / deferred" flags, and a bot- | ||
| calibration line if it's informative. Don't list cleanly-resolved threads | ||
| one by one; a count is enough; | ||
| - a **"Story so far"** line: which open clusters you deep-verified this round, | ||
| how many you deferred as "not assessed this round" (list them by one-line | ||
| title), how many threads were already resolved, and a reminder that | ||
| re-running after the next round of fixes will pick up the deferred ones plus | ||
| any new comments. This is a checkpoint, not a verdict on the whole PR. | ||
|
|
||
| 10. **Offer a second opinion via Codex, only if it's available.** Check whether | ||
| the Codex CLI is installed — run `command -v codex`. If it is, end by | ||
| suggesting the user get an independent review of the same diff the bots saw: | ||
|
|
||
| ``` | ||
| /codex:review --base main --scope branch | ||
| ``` | ||
|
|
||
| Frame it as optional and note it adds runtime, and that it's worth it mainly | ||
| for the contested or high-impact clusters — the ones where a second model | ||
| disagreeing (or agreeing) changes your confidence. Don't run it yourself: | ||
| `/codex:review` is user-invoked only, and this command is report-only. If | ||
| `command -v codex` finds nothing, say nothing about Codex. | ||
|
|
||
| 11. **Update your coverage ledger — automatically, but evidence-gated.** Open | ||
| `.claude/state/assess-comments.coverage.md` and update it to reflect what | ||
| *actually fired on real data this run*: | ||
|
|
||
| - For each capability you genuinely exercised, bump its **encounters**, set | ||
| **last verified** to today, and add this PR (with a comment id or commit | ||
| SHA) to **Evidence**. Raise confidence per the scale — but **never raise a | ||
| confidence without a citable artifact.** No evidence, no bump. | ||
| - A capability counts as **🟢 corroborated** only on ≥2 *distinct* PRs, or | ||
| once with an independent `/codex:review` concurrence on the same finding. | ||
| - If you hit a **rare pattern for the first time** — a real ping-pong loop, a | ||
| resolved-but-still-broken thread — record its concrete signature in the | ||
| **Worked examples library** (tools involved, comment ids, commits). This is | ||
| how detection sharpens over time. | ||
| - Refresh ⏳ **decaying** entries you re-confirmed this run. | ||
| - If the run revealed that an *instruction* in this command should change | ||
| (a heuristic nearly missed something), **don't edit the command yourself** | ||
| — note the suggested refinement in your output and let me decide. | ||
|
|
||
| The ledger update is the one file you *may* write. After writing it, **tell | ||
| me plainly that the command learned something and updated its own coverage | ||
| ledger**, name what changed in one line, and remind me I can include that | ||
| change in the PR or leave it out — my call. | ||
|
|
||
| 12. **Stop there.** Don't edit anything except the coverage ledger (step 11); | ||
| don't commit, push, or reply to the PR. Your output is the assessment — | ||
| I'll decide what to act on. | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.