fix(pr-digest): retry transient 5xx on gh + Anthropic + Slack calls #295
Merged
GitHub's GraphQL endpoint (which `gh pr list` uses internally) flaps with 502/503 periodically, especially at the top of the hour when cron workflows fire en masse. A single failure aborted the whole digest. Same risk on the Anthropic and Slack POSTs.

Wraps all three external calls in retry-with-exponential-backoff:

- `gh pr list`: 5 attempts, 2s/4s/8s/16s/32s
- `api.anthropic.com`: 4 attempts, 2s/4s/8s/16s
- `slack.com` (main + thread): 4/3 attempts, 2s/4s/8s

Same pattern we already use in scripts/fetch_historical_prs.py in the companion pr-digest-model repo (which retries fine).

Caught after the 9:07 cron failed silently this week with the now-familiar HTTP 502: https://github.com/monta-app/monorepo-typescript/actions/runs/26748322383
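For readers who want the shape of the retry loop without opening the workflow, here is a minimal Python sketch. It is not the workflow's bash code: `retry_with_backoff` and `TransientError` are hypothetical names, and the base delay of 2 seconds with doubling is assumed from the schedules listed above. The `sleep` parameter is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure such as an HTTP 502/503."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with a doubling delay.

    A sleep happens only between attempts, so N attempts produce
    at most N - 1 sleeps.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```

Any non-transient exception propagates immediately, which keeps genuine bugs from being masked by the retry loop.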
Pull request overview
Adds retry-with-exponential-backoff around external calls in the pr-digest reusable workflow to make the digest more resilient to transient failures (GitHub GraphQL via gh, Anthropic messages API, and Slack posting).
Changes:
- Wrap `gh pr list` in an exponential-backoff retry loop and surface failures with warnings/errors.
- Add retry loops for the Anthropic call (non-blocking step) and Slack `chat.postMessage` calls (main + thread), with warnings per retry.
- Bump max_attempts so the actual sleeps match the documented backoff schedule. Was off-by-one: 5 attempts only yields 4 sleeps (2/4/8/16s), never the 32s the comment claimed.
  · gh pr list: 5 → 6 (4 → 5 sleeps, max 32s)
  · Anthropic: 4 → 5 (3 → 4 sleeps, max 16s)
  · Slack main: 4 → 5 (3 → 4 sleeps, max 16s)
  · Slack thread: 3 → 4 (2 → 3 sleeps, max 8s)
- Anthropic success check: don't treat unparseable responses (e.g. an edge-returned HTML 502 page) as success. The old `! jq -e '.error'` was true both when (a) the response was JSON without an error field, AND when (b) jq could not parse the response at all (jq exits non-zero either way). On an HTML 502 from the Anthropic edge we'd break out of the retry loop, then fail downstream on `.content[0].text`.
  New check: explicitly require the response to be a JSON object with a non-empty `.content[0].text`. Anything else triggers retry/backoff, then graceful fallback to deterministic curation.
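Both fixes in this commit can be checked with a few lines of Python. This is an illustrative sketch, not the workflow's jq/bash: `sleep_schedule`, `legacy_check_passes` and `has_usable_text` are hypothetical helpers, and `legacy_check_passes` only approximates the exit-status behaviour of `! jq -e '.error'` described above.

```python
import json
from typing import List


def sleep_schedule(max_attempts: int, base_delay: int = 2) -> List[int]:
    """Delays slept between attempts: always one fewer than the attempts."""
    return [base_delay * 2**i for i in range(max_attempts - 1)]


def legacy_check_passes(body: str) -> bool:
    """Approximates `! jq -e '.error'`: passes whenever jq exits non-zero."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return True  # parse failure also exits non-zero, so the old check passed
    error = data.get("error") if isinstance(data, dict) else None
    return error is None or error is False


def has_usable_text(body: str) -> bool:
    """Stricter check: a JSON object with a non-empty .content[0].text."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    content = data.get("content")
    if not isinstance(content, list) or not content:
        return False
    first = content[0]
    if not isinstance(first, dict):
        return False
    text = first.get("text")
    return isinstance(text, str) and text != ""
```

With these helpers, `sleep_schedule(5)` stops at 16 while `sleep_schedule(6)` reaches 32, and an HTML error page passes `legacy_check_passes` but fails `has_usable_text`, so the retry loop keeps going instead of breaking out early.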
…ection limit

The thread breakdown rendered all stale PRs for one area into a single Slack section's mrkdwn text. With ~22 stale PRs in monorepo-typescript and ~150 chars per prLine, that's ~3300 chars, over Slack's per-section hard limit of 3000 chars. Slack rejected the post with `invalid_blocks`.

Fix: chunk PR lines into groups of 15 per section (~2250 chars max, comfortably under 3000). Each chunk becomes its own section block; the first chunk per area carries the "*`<area>`* — N stale" header, subsequent chunks are bare continuations.

Also caps total block count at 48 (Slack's per-message hard limit is 50), with a graceful "…and N more" context block if we'd otherwise exceed.

Caught after 2026-06-01 manual trigger: https://github.com/monta-app/monorepo-typescript/actions/runs/26750162432
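The chunking can be sketched in Python as follows. The workflow builds these blocks in jq, so this is only an illustration: `area_sections` and `cap_blocks` are hypothetical names, and two details are assumptions rather than facts from the commit: that the cap of 48 includes the trailing context block, and that N in "…and N more" counts omitted blocks.

```python
from typing import Dict, List

CHUNK_SIZE = 15   # PR lines per section block, from the commit message
MAX_BLOCKS = 48   # cap chosen below Slack's 50-block limit


def area_sections(area: str, pr_lines: List[str],
                  chunk_size: int = CHUNK_SIZE) -> List[Dict]:
    """Split one area's PR lines into section blocks of `chunk_size` lines."""
    blocks = []
    for start in range(0, len(pr_lines), chunk_size):
        text = "\n".join(pr_lines[start:start + chunk_size])
        if start == 0:
            # Only the first chunk carries the area header.
            text = f"*`{area}`* — {len(pr_lines)} stale\n{text}"
        blocks.append({"type": "section",
                       "text": {"type": "mrkdwn", "text": text}})
    return blocks


def cap_blocks(blocks: List[Dict], max_blocks: int = MAX_BLOCKS) -> List[Dict]:
    """Truncate to `max_blocks`, ending with a context note if anything is cut."""
    if len(blocks) <= max_blocks:
        return blocks
    kept = blocks[:max_blocks - 1]  # assumption: the note counts toward the cap
    omitted = len(blocks) - len(kept)
    note = {"type": "context",
            "elements": [{"type": "mrkdwn", "text": f"…and {omitted} more"}]}
    return kept + [note]
```

For the case in the commit message, 22 lines of about 150 characters become two sections (15 lines and 7 lines), each well under the 3000-character limit.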
…uoted jq block

The "section's" and "we'd" apostrophes inside comment lines were breaking shellcheck's parser: apostrophes inside a bash '...' string terminate the quoted region. shellcheck (and bash itself) saw the single quote close the jq program early. In practice bash slurped the jq program back together at runtime, but actionlint correctly flagged this as a real ambiguity.

Moved the explanatory comments above the jq call (where they're bash comments, not inside the quoted string) and rephrased to avoid stray apostrophes in the remaining jq context block.
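The quoting rule can be illustrated with Python's `shlex`, which follows POSIX-style shell quoting and so approximates, but is not identical to, bash. The command strings below are made-up stand-ins, not the workflow's real jq program, and `shell_words` is a hypothetical helper.

```python
import shlex
from typing import List


def shell_words(command: str) -> List[str]:
    """Tokenize a command line using POSIX-style quoting rules."""
    return shlex.split(command)


# A comment containing apostrophes inside a single-quoted jq program:
# each apostrophe toggles the quoted region, so the program is split.
broken = "jq '# section's limit we'd cap .blocks'"

# The same program with the apostrophes rephrased away stays one argument.
fixed = "jq '# section limit, cap applied .blocks'"

print(shell_words(broken))
print(shell_words(fixed))
```

In the first command the tokenizer yields four words instead of two, because the quoted region closes at the apostrophe in "section's" and reopens at the one in "we'd". In the second, the whole jq program arrives as a single argument.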
joscdk
approved these changes
Jun 1, 2026
What?
Wraps the three external API calls in the `pr-digest` reusable workflow with retry-with-exponential-backoff.

The same retry pattern is already used in `scripts/fetch_historical_prs.py` in the companion `pr-digest-model` project (which retries fine).
Why?
GitHub's GraphQL endpoint flaps with 502/503 periodically — particularly at the top of the hour when many cron workflows fire at once. From GitHub's own docs:
A single `HTTP 502` from `api.github.com/graphql` was aborting the whole digest. We hit it during initial smoke-testing and again on this week's manual trigger:
Anthropic and Slack POSTs got the same defensive treatment for the same reason.
Behaviour