fix(pr-digest): retry transient 5xx on gh + Anthropic + Slack calls by gabrielanhaia · Pull Request #295 · monta-app/github-workflows

gabrielanhaia · 2026-06-01T10:11:33Z

What?

Wraps the three external API calls in the pr-digest reusable workflow with retry-with-exponential-backoff:

| Call | Attempts | Backoff |
| --- | --- | --- |
| `gh pr list` (uses GitHub GraphQL internally) | 5 | 2s / 4s / 8s / 16s / 32s |
| `POST api.anthropic.com/v1/messages` | 4 | 2s / 4s / 8s / 16s |
| `POST slack.com/api/chat.postMessage` (main message) | 4 | 2s / 4s / 8s / 16s |
| `POST slack.com/api/chat.postMessage` (thread reply) | 3 | 2s / 4s / 8s |

The same retry pattern is already used in `scripts/fetch_historical_prs.py` in the companion `pr-digest-model` project (which retries fine).
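A minimal Python sketch of the retry-with-exponential-backoff pattern described above. It is not the workflow's actual bash nor the contents of `scripts/fetch_historical_prs.py`; `TransientError`, `retry_with_backoff`, and the injectable `sleep` parameter are hypothetical names chosen for illustration. Note that a loop of this shape sleeps only between attempts, so N attempts produce at most N - 1 delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for a retryable failure such as an HTTP 502/503."""


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on TransientError with a delay that doubles each time."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError as exc:
            if attempt == max_attempts:
                raise  # retries exhausted: the caller decides whether this is fatal
            # Similar in spirit to the `::warning::` line the workflow logs per retry.
            print(f"::warning::attempt {attempt}/{max_attempts} failed ({exc}); "
                  f"retrying in {delay:g}s")
            sleep(delay)
            delay *= 2
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller that omits it gets `time.sleep`.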

Why?

GitHub's GraphQL endpoint flaps with 502/503 periodically — particularly at the top of the hour when many cron workflows fire at once. From GitHub's own docs:

> Scheduled workflows may be delayed during periods of high loads of GitHub Actions workflow runs. High load times include the start of every hour.

A single `HTTP 502` from `api.github.com/graphql` was aborting the whole digest. We hit it during initial smoke-testing and again on this week's manual trigger:

- This week's failure: https://github.com/monta-app/monorepo-typescript/actions/runs/26748322383
- Original smoke-test failures (May 30): runs 26661895893 + 26661969276

Anthropic and Slack POSTs got the same defensive treatment for the same reason.

Behaviour

- All retries log a `::warning::` with the attempt count, so transient blips are visible in the run log.
- `gh pr list` and the main Slack post hard-fail once retries are exhausted (both are required for a meaningful digest).
- Anthropic is non-blocking (the job already has `continue-on-error: true`); once retries are exhausted, the digest falls back to deterministic curation.
- The thread breakdown is non-blocking too (it always exits 0, since the main digest has already posted).
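The split between hard-fail and non-blocking steps can be sketched as two small wrappers. This is an illustrative Python sketch only; the real workflow expresses the same idea with shell exit codes and `continue-on-error: true`, and `run_required` and `run_best_effort` are hypothetical names.

```python
from typing import Callable


def run_required(step: Callable[[], str], name: str) -> str:
    """Required step: if it still fails after its retries, fail the whole job."""
    try:
        return step()
    except Exception as exc:
        print(f"::error::{name} failed after retries: {exc}")
        raise SystemExit(1)


def run_best_effort(step: Callable[[], str], fallback: Callable[[], str], name: str) -> str:
    """Non-blocking step: on failure, log a warning and use the fallback result."""
    try:
        return step()
    except Exception as exc:
        print(f"::warning::{name} unavailable ({exc}); using fallback")
        return fallback()
```

Under these assumptions, `gh pr list` and the main Slack post would go through `run_required`, while the Anthropic call would go through `run_best_effort` with deterministic curation as the fallback.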

GitHub's GraphQL endpoint (which `gh pr list` uses internally) flaps with 502/503 periodically, especially at the top of the hour when cron workflows fire en masse. A single failure aborted the whole digest. Same risk on the Anthropic and Slack POSTs.

Wraps all three external calls in retry-with-exponential-backoff:

- `gh pr list`: 5 attempts, 2s/4s/8s/16s/32s
- `api.anthropic.com`: 4 attempts, 2s/4s/8s/16s
- `slack.com` (main + thread): 4/3 attempts, 2s/4s/8s

Same pattern we already use in scripts/fetch_historical_prs.py in the companion pr-digest-model repo (which retries fine).

Caught after the 9:07 cron failed silently this week with the now-familiar HTTP 502: https://github.com/monta-app/monorepo-typescript/actions/runs/26748322383

Copilot

Pull request overview

Adds retry-with-exponential-backoff around external calls in the pr-digest reusable workflow to make the digest more resilient to transient failures (GitHub GraphQL via gh, Anthropic messages API, and Slack posting).

Changes:

Wrap gh pr list in an exponential-backoff retry loop and surface failures with warnings/errors.
Add retry loops for the Anthropic call (non-blocking step) and Slack chat.postMessage calls (main + thread), with warnings per retry.

- Bump max_attempts so the actual sleeps match the documented backoff schedule. Was off-by-one: 5 attempts only yields 4 sleeps (2/4/8/16s), never the 32s the comment claimed.
  · gh pr list: 5 → 6 (4 → 5 sleeps, max 32s)
  · Anthropic: 4 → 5 (3 → 4 sleeps, max 16s)
  · Slack main: 4 → 5 (3 → 4 sleeps, max 16s)
  · Slack thread: 3 → 4 (2 → 3 sleeps, max 8s)
- Anthropic success check: don't treat unparseable responses (e.g. an edge-returned HTML 502 page) as success. The old `! jq -e '.error'` was true both when (a) the response was JSON without an error field, AND when (b) jq could not parse the response at all (jq exits non-zero either way). On an HTML 502 from the Anthropic edge we'd break out of the retry loop, then fail downstream on `.content[0].text`. New check: explicitly require the response to be a JSON object with a non-empty `.content[0].text`. Anything else triggers retry/backoff, then graceful fallback to deterministic curation.
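Both fixes in this commit can be illustrated with a short Python sketch. It is not the workflow's jq or bash; `sleep_schedule` and `has_usable_text` are hypothetical helpers. The first shows why N attempts leave room for only N - 1 sleeps; the second is a positive success check that requires a JSON object with a non-empty `.content[0].text`, rather than inferring success from the absence of `.error`.

```python
import json
from typing import List


def sleep_schedule(max_attempts: int, base_delay: int = 2) -> List[int]:
    """Delays between attempts: N attempts allow at most N - 1 sleeps."""
    return [base_delay * 2**i for i in range(max(max_attempts - 1, 0))]


def has_usable_text(raw_body: str) -> bool:
    """True only for a JSON object whose content[0].text is a non-empty string."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return False  # e.g. an HTML 502 page returned by an edge proxy
    if not isinstance(payload, dict):
        return False
    content = payload.get("content")
    if not isinstance(content, list) or not content:
        return False
    first = content[0]
    text = first.get("text") if isinstance(first, dict) else None
    return isinstance(text, str) and text != ""
```

With this sketch, `sleep_schedule(5)` yields `[2, 4, 8, 16]` and `sleep_schedule(6)` yields `[2, 4, 8, 16, 32]`, which matches the 5 → 6 bump for `gh pr list`.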

…ection limit

The thread breakdown rendered all stale PRs for one area into a single Slack section's mrkdwn text. With ~22 stale PRs in monorepo-typescript and ~150 chars per prLine, that's ~3300 chars, over Slack's per-section hard limit of 3000 chars. Slack rejected the post with `invalid_blocks`.

Fix: chunk PR lines into groups of 15 per section (~2250 chars max, comfortably under 3000). Each chunk becomes its own section block; the first chunk per area carries the "*`<area>`* — N stale" header, subsequent chunks are bare continuations.

Also caps total block count at 48 (Slack's per-message hard limit is 50), with a graceful "…and N more" context block if we'd otherwise exceed.

Caught after 2026-06-01 manual trigger: https://github.com/monta-app/monorepo-typescript/actions/runs/26750162432
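The chunking described in this commit might look roughly like the following Python sketch. The workflow itself builds the blocks in jq, so everything here is illustrative: `build_blocks`, `section`, and `context` are hypothetical names, and two details are assumptions not confirmed by the commit message, namely that the cap of 48 includes the trailing context block and that N counts hidden PR lines.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 15  # PR lines per section block, as stated in the commit message
MAX_BLOCKS = 48  # cap kept below Slack's limit of 50 blocks per message


def section(text: str) -> Dict:
    return {"type": "section", "text": {"type": "mrkdwn", "text": text}}


def context(text: str) -> Dict:
    return {"type": "context", "elements": [{"type": "mrkdwn", "text": text}]}


def build_blocks(stale_by_area: Dict[str, List[str]]) -> List[Dict]:
    """Chunk each area's PR lines into sections, then cap the total block count."""
    sections: List[Tuple[Dict, int]] = []  # (block, number of PR lines it carries)
    for area, pr_lines in stale_by_area.items():
        for start in range(0, len(pr_lines), CHUNK_SIZE):
            chunk = pr_lines[start:start + CHUNK_SIZE]
            # Only the first chunk of an area carries the header line.
            header = [f"*`{area}`* — {len(pr_lines)} stale"] if start == 0 else []
            sections.append((section("\n".join(header + chunk)), len(chunk)))
    if len(sections) <= MAX_BLOCKS:
        return [block for block, _ in sections]
    # Assumption: reserve the last slot for the "…and N more" context block.
    kept = sections[:MAX_BLOCKS - 1]
    hidden = sum(count for _, count in sections[MAX_BLOCKS - 1:])
    return [block for block, _ in kept] + [context(f"…and {hidden} more")]
```

For the case in the commit message, 22 stale PRs in one area would produce two section blocks: one with the header plus 15 lines, and one with the remaining 7.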

…uoted jq block

The "section's" and "we'd" apostrophes inside comment lines were breaking shellcheck's parser: an apostrophe inside a bash '...' string terminates the quoted region, so shellcheck (and bash itself) saw the single quote close the jq program early. In practice bash slurped the jq program back together at runtime, but actionlint correctly flagged this as a real ambiguity.

Moved the explanatory comments above the jq call (where they're bash comments, not inside the quoted string) and rephrased to avoid stray apostrophes in the remaining jq context block.
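The quoting problem can be reproduced outside bash with Python's `shlex` module, which approximates POSIX shell word splitting (it is not bash, shellcheck, or actionlint). The command strings below are made up for illustration and are not the workflow's actual jq program; `shell_words` is a hypothetical helper.

```python
import shlex
from typing import List, Optional


def shell_words(command: str) -> Optional[List[str]]:
    """Split like a POSIX shell would; None if a quote is left unclosed."""
    try:
        return shlex.split(command)
    except ValueError:  # shlex reports "No closing quotation"
        return None


# One stray apostrophe inside '...' leaves a quote unclosed.
unbalanced = shell_words("jq '. # cap each section's text'")

# Two stray apostrophes balance again, but the program is no longer one word.
rebalanced = shell_words("jq '. # section's limit, we'd overflow'")

# Without apostrophes the whole jq program stays a single argument.
clean = shell_words("jq '. # cap the text of each section'")
```

In this sketch `unbalanced` is `None`, `clean` has exactly two words, and `rebalanced` splits into more than two, which shows how an even number of stray apostrophes can look balanced while still breaking the quoted program apart.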

gabrielanhaia marked this pull request as ready for review June 1, 2026 10:20

gabrielanhaia requested a review from a team as a code owner June 1, 2026 10:20

gabrielanhaia requested review from maoanran and removed request for a team June 1, 2026 10:20

gabrielanhaia self-assigned this Jun 1, 2026

gabrielanhaia requested a review from Copilot June 1, 2026 10:32

Copilot started reviewing on behalf of gabrielanhaia June 1, 2026 10:32

Copilot AI reviewed Jun 1, 2026

Comment thread .github/workflows/pr-digest.yml Outdated

Comment thread .github/workflows/pr-digest.yml

gabrielanhaia added 3 commits June 1, 2026 12:42

joscdk approved these changes Jun 1, 2026

gabrielanhaia merged commit 814e768 into main Jun 1, 2026
1 check passed

gabrielanhaia deleted the fix/pr-digest-retries branch June 1, 2026 11:09

gabrielanhaia merged 4 commits into main from fix/pr-digest-retries

3 participants
