fix(pr-digest): retry transient 5xx on gh + Anthropic + Slack calls#295

Merged
gabrielanhaia merged 4 commits into main from fix/pr-digest-retries on Jun 1, 2026

@gabrielanhaia

What?

Wraps the three external API calls in the pr-digest reusable workflow with retry-with-exponential-backoff:

| Call | Attempts | Backoff |
|---|---|---|
| `gh pr list` (uses GitHub GraphQL internally) | 5 | 2s / 4s / 8s / 16s / 32s |
| `POST api.anthropic.com/v1/messages` | 4 | 2s / 4s / 8s / 16s |
| `POST slack.com/api/chat.postMessage` (main message) | 4 | 2s / 4s / 8s / 16s |
| `POST slack.com/api/chat.postMessage` (thread reply) | 3 | 2s / 4s / 8s |

The same retry pattern is already used in `scripts/fetch_historical_prs.py` in the companion `pr-digest-model` project (which retries fine).
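
As a rough illustration of the pattern (not the workflow's actual shell code), a retry loop with exponential backoff might look like the sketch below. It is written in Python because the companion script is a `.py` file. `TransientError`, `retry_with_backoff`, and the injectable `sleep` parameter are hypothetical names introduced here, not taken from either repository.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a retryable failure such as an HTTP 502/503."""


def retry_with_backoff(call, max_attempts, base_delay=2, sleep=time.sleep):
    """Run call() up to max_attempts times, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except TransientError as exc:
            if attempt == max_attempts:
                raise  # retries exhausted: let the caller decide what to do
            # Warning annotation so the blip is visible in the run log.
            print(f"::warning::attempt {attempt}/{max_attempts} failed ({exc}); retrying in {delay}s")
            sleep(delay)
            delay *= 2
```

Passing `sleep` as a parameter is only a convenience for exercising the loop without real waiting. A production version would also need to decide which failures count as transient.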

Why?

GitHub's GraphQL endpoint flaps with 502/503 periodically — particularly at the top of the hour when many cron workflows fire at once. From GitHub's own docs:

> Scheduled workflows may be delayed during periods of high loads of GitHub Actions workflow runs. High load times include the start of every hour.

A single `HTTP 502` from `api.github.com/graphql` was aborting the whole digest. We hit it during initial smoke-testing and again on this week's manual trigger:

Anthropic and Slack POSTs got the same defensive treatment for the same reason.

Behaviour

  • All retries log a `::warning::` with attempt count so transient blips are visible in the run log.
  • `gh pr list` and the main Slack post hard-fail once retries are exhausted (both are required for a meaningful digest).
  • Anthropic is non-blocking (the job already has `continue-on-error: true`); once retries are exhausted, the digest falls back to deterministic curation.
  • The thread breakdown is non-blocking too (it always exits 0, since the main digest has already posted).
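
The split between hard-fail and non-blocking steps can be pictured as a mapping from "retries exhausted" to an exit code. This is a minimal Python sketch under assumptions: `run_step` and its `required` flag are hypothetical, and the real workflow expresses this with shell exit codes and `continue-on-error` rather than a helper like this.

```python
def run_step(name, call, required):
    """Return a process-style exit code for a step whose call() raises once retries are exhausted."""
    try:
        call()
        return 0
    except Exception as exc:
        if required:
            # Required steps (e.g. listing PRs, the main Slack post) fail the run.
            print(f"::error::{name} failed after retries: {exc}")
            return 1
        # Optional steps (e.g. the thread reply) only leave a warning behind.
        print(f"::warning::{name} failed after retries, continuing: {exc}")
        return 0
```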

GitHub's GraphQL endpoint (which `gh pr list` uses internally) flaps
with 502/503 periodically — especially at the top of the hour when
cron workflows fire en masse. A single failure aborted the whole
digest. Same risk on the Anthropic and Slack POSTs.

Wraps all three external calls in retry-with-exponential-backoff:

- `gh pr list`              : 5 attempts, 2s/4s/8s/16s/32s
- `api.anthropic.com`       : 4 attempts, 2s/4s/8s/16s
- `slack.com` (main + thread): 4/3 attempts, 2s/4s/8s/16s (main), 2s/4s/8s (thread)

Same pattern we already use in scripts/fetch_historical_prs.py in
the companion pr-digest-model repo (which retries fine).

Caught after the 9:07 cron failed silently this week with the now-
familiar HTTP 502: https://github.com/monta-app/monorepo-typescript/actions/runs/26748322383
@gabrielanhaia gabrielanhaia marked this pull request as ready for review June 1, 2026 10:20
@gabrielanhaia gabrielanhaia requested a review from a team as a code owner June 1, 2026 10:20
@gabrielanhaia gabrielanhaia requested review from maoanran and removed request for a team June 1, 2026 10:20
@gabrielanhaia gabrielanhaia self-assigned this Jun 1, 2026
@gabrielanhaia gabrielanhaia requested a review from Copilot June 1, 2026 10:32

Copilot AI left a comment

Pull request overview

Adds retry-with-exponential-backoff around external calls in the pr-digest reusable workflow to make the digest more resilient to transient failures (GitHub GraphQL via gh, Anthropic messages API, and Slack posting).

Changes:

  • Wrap gh pr list in an exponential-backoff retry loop and surface failures with warnings/errors.
  • Add retry loops for the Anthropic call (non-blocking step) and Slack chat.postMessage calls (main + thread), with warnings per retry.

Comment thread .github/workflows/pr-digest.yml Outdated
Comment thread .github/workflows/pr-digest.yml
- Bump max_attempts so the actual sleeps match the documented
  backoff schedule. Was off-by-one: 5 attempts only yields 4
  sleeps (2/4/8/16s), never the 32s the comment claimed.
  · gh pr list:     5 → 6  (4 → 5 sleeps, max 32s)
  · Anthropic:      4 → 5  (3 → 4 sleeps, max 16s)
  · Slack main:     4 → 5  (3 → 4 sleeps, max 16s)
  · Slack thread:   3 → 4  (2 → 3 sleeps, max  8s)
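
The off-by-one is easy to check: with no sleep after the final attempt, N attempts produce N - 1 sleeps. A small Python sketch, where `sleep_schedule` is a hypothetical helper and the 2s base delay is taken from the schedule above:

```python
def sleep_schedule(max_attempts, base_delay=2):
    """Delays slept between attempts, assuming no sleep follows the last attempt."""
    return [base_delay * 2**i for i in range(max_attempts - 1)]


print(sleep_schedule(5))  # [2, 4, 8, 16]
print(sleep_schedule(6))  # [2, 4, 8, 16, 32]
```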

- Anthropic success check: don't treat unparseable responses
  (e.g. an edge-returned HTML 502 page) as success.

  The old `! jq -e '.error'` was true both when (a) the response
  was JSON without an error field, AND when (b) jq could not
  parse the response at all (jq exits non-zero either way). On
  an HTML 502 from the Anthropic edge we'd break out of the
  retry loop, then fail downstream on `.content[0].text`.

  New check: explicitly require the response to be a JSON object
  with a non-empty `.content[0].text`. Anything else triggers
  retry/backoff, then graceful fallback to deterministic curation.
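
The stricter check amounts to "accept only a JSON object whose `.content[0].text` is a non-empty string". The workflow does this with jq; the Python sketch below only illustrates the same rule, and `is_usable_response` is a hypothetical name.

```python
import json


def is_usable_response(body):
    """True only for a JSON object carrying a non-empty content[0].text string."""
    try:
        data = json.loads(body)
    except ValueError:
        return False  # e.g. an HTML error page: not JSON at all
    if not isinstance(data, dict):
        return False
    content = data.get("content")
    if not isinstance(content, list) or not content:
        return False  # covers error payloads that have no content array
    first = content[0]
    if not isinstance(first, dict):
        return False
    text = first.get("text")
    return isinstance(text, str) and text != ""
```

Anything that returns `False` here would be treated as a failed attempt and retried, rather than breaking out of the loop and failing later.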
…ection limit

The thread breakdown rendered all stale PRs for one area into a single
Slack section's mrkdwn text. With ~22 stale PRs in monorepo-typescript
and ~150 chars per prLine, that's ~3300 chars — over Slack's per-section
hard limit of 3000 chars. Slack rejected the post with `invalid_blocks`.

Fix: chunk PR lines into groups of 15 per section (~2250 chars max,
comfortably under 3000). Each chunk becomes its own section block;
the first chunk per area carries the "*`<area>`* — N stale" header,
subsequent chunks are bare continuations.

Also caps total block count at 48 (Slack's per-message hard limit is 50),
with a graceful "…and N more" context block if we'd otherwise exceed.

Caught after 2026-06-01 manual trigger: https://github.com/monta-app/monorepo-typescript/actions/runs/26750162432
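
A Python sketch of the chunking and capping logic, for illustration only (the workflow builds these blocks in jq). `build_blocks` and its parameters are hypothetical. Counting the "N more" in PR lines rather than blocks, and appending the context block after the 48 kept sections, are both assumptions.

```python
def build_blocks(areas, chunk_size=15, max_blocks=48):
    """areas maps an area name to its list of mrkdwn PR lines."""
    sections = []  # pairs of (section block, number of PR lines it holds)
    for area, lines in areas.items():
        for start in range(0, len(lines), chunk_size):
            chunk = lines[start:start + chunk_size]
            # Only the first chunk of an area carries the header (\u2014 is the dash used above).
            header = f"*`{area}`* \u2014 {len(lines)} stale\n" if start == 0 else ""
            block = {
                "type": "section",
                "text": {"type": "mrkdwn", "text": header + "\n".join(chunk)},
            }
            sections.append((block, len(chunk)))

    kept = [block for block, _ in sections[:max_blocks]]
    hidden = sum(count for _, count in sections[max_blocks:])
    if hidden:
        # Summarise whatever was cut instead of sending an oversized message.
        kept.append({
            "type": "context",
            "elements": [{"type": "mrkdwn", "text": f"\u2026and {hidden} more"}],
        })
    return kept
```

With 22 lines of about 150 characters in one area, this yields two section blocks (15 and 7 lines), each under the 3000-character section limit.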
…uoted jq block

The "section's" and "we'd" apostrophes inside comment lines were
breaking shellcheck's parser — apostrophes inside a bash '...' string
terminate the quoted region. shellcheck (and bash itself) saw the
single quote close the jq program early.

In practice bash slurped the jq program back together at runtime, but
actionlint correctly flagged this as a real ambiguity. Moved the
explanatory comments above the jq call (where they're bash comments,
not inside the quoted string) and rephrased to avoid stray apostrophes
in the remaining jq context block.
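
The quoting rule can be reproduced with Python's `shlex`, which tokenizes using POSIX shell quoting rules. This is a toy command, not the workflow's jq call, and `shell_words` is a hypothetical helper; the exact effect in a real script depends on where the apostrophes fall.

```python
import shlex


def shell_words(command):
    """Tokenize with POSIX quoting rules; return None if a quote is left open."""
    try:
        return shlex.split(command)
    except ValueError:
        return None


# No apostrophes: the whole jq program stays one argument.
clean = shell_words("jq '.items # per section limit' f.json")

# One apostrophe: it closes the quoted region early and the final quote is left unmatched.
one = shell_words("jq '.items # section's limit' f.json")

# Two apostrophes: the quotes pair up again, but the program is split into several words.
two = shell_words("jq '.items # section's limit, we'd cap' f.json")
```

Moving the comments outside the quoted string, as this commit does, avoids the problem without any escaping.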
@gabrielanhaia gabrielanhaia merged commit 814e768 into main Jun 1, 2026
1 check passed
@gabrielanhaia gabrielanhaia deleted the fix/pr-digest-retries branch June 1, 2026 11:09