ci: retry and alert on GitHub release creation failure#8944

Open
bitgo-ai-agent-dev[bot] wants to merge 2 commits into master from vl-6353-github-release-retry

@bitgo-ai-agent-dev

What

  • Replace continue-on-error: true on the Create GitHub release step in .github/workflows/npmjs-release.yml with a 3-attempt bash retry loop (30s, 60s, 90s linear backoff). The step exits non-zero, failing the job, if all attempts fail.
  • Add a follow-up Notify on GitHub release failure step gated on failure() && steps.create-github-release.outcome == 'failure' that emits a ::error:: annotation and POSTs a Slack message to secrets.SLACK_RELEASE_WEBHOOK_URL (no-ops with a warning if the secret is unset). The retry step's exit code is what fails the job; the notify step only sends the alert.
  • Happy path is unchanged: the step still runs gh release create exactly once on success.
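
The notify behaviour described above can be sketched in bash roughly as follows. This is an illustrative sketch, not the step body from the PR: the function name `notify_release_failure`, its argument order, and the message wording are assumptions. It relies only on the `::error::` and `::warning::` workflow commands and on a Slack incoming webhook accepting a JSON body with a `text` field.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the notify step's logic; the function name,
# arguments, and message text are assumptions, not the PR's code.
notify_release_failure() {
  local version="$1" run_url="$2" webhook="${3:-}"

  # Workflow command: shows up as an error annotation on the run.
  echo "::error::GitHub release creation failed for ${version}. Run: ${run_url}"

  # No-op with a warning when the webhook secret is not provisioned.
  if [ -z "$webhook" ]; then
    echo "::warning::SLACK_RELEASE_WEBHOOK_URL is not set; skipping Slack alert."
    return 0
  fi

  # Build the JSON payload. Assumes version and run_url contain no
  # double quotes or backslashes, so no JSON escaping is done here.
  local payload
  payload=$(printf '{"text":"GitHub release creation failed for %s. Run: %s"}' \
    "$version" "$run_url")

  curl --fail --silent --show-error -X POST \
    -H 'Content-Type: application/json' \
    --data "$payload" "$webhook"
}
```

In a workflow, the third argument would presumably come from an `env:` mapping of the secret rather than being interpolated into the script text, which keeps the value out of the rendered command line.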

Why

In VL-5474, gh release create hit a transient GitHub API rate limit after npm publish had already succeeded. Because the step was continue-on-error: true, the workflow only turned yellow, no GitHub release was created, and nobody was paged. Manual remediation was required and the failure was only discovered by reading the Actions log. NPM publish is irreversible, so this class of failure must be loud, retried, and alerted.

Test plan

  • Trigger the workflow with dry-run=true and confirm the new step is skipped (matches existing behavior).
  • Trigger a real release and confirm a first-attempt success produces the same workflow result as before (no Slack post, no ::error:: annotation, job green).
  • Force a failure scenario in a fork (e.g. by creating a tag conflict or revoking the PAT scope) and confirm: 3 attempts are made with 30s/60s/90s sleeps, the job fails, the ::error:: annotation lists the run URL and version, and a Slack message is posted to the webhook channel.
  • Confirm the alert step is skipped when SLACK_RELEASE_WEBHOOK_URL is unset (logs a ::warning:: and exits 0, leaving the job-failure signal coming from the retry step).

Ticket: VL-6353

The Create GitHub release step in npmjs-release.yml runs after npm
publish, which is irreversible. Previously it was marked
continue-on-error: true, so a failure (e.g. the GitHub API rate limit
seen in VL-5474) silently turned the job yellow with no alert and no
GitHub release.

Wrap the gh release create call in a 3-attempt retry loop with 30s,
60s, 90s backoff and drop continue-on-error so a final failure fails
the job. Add a follow-up step that fires only when the release step
fails, posting a Slack notification (via SLACK_RELEASE_WEBHOOK_URL
webhook secret) that identifies the run and version requiring manual
remediation.

Ticket: VL-6353
Session-Id: b035fd16-81c0-4327-b03e-fb4e0dce6501
Task-Id: f5755602-d6f6-4284-b891-f02b996e3188
@linear-code

linear-code Bot commented Jun 4, 2026

VL-6353

@bitgo-ai-agent-dev bitgo-ai-agent-dev Bot force-pushed the vl-6353-github-release-retry branch from 1adfd46 to 5e15341 Compare June 4, 2026 10:36
@roshan-bitgo roshan-bitgo marked this pull request as ready for review June 4, 2026 10:39
@roshan-bitgo roshan-bitgo requested review from a team as code owners June 4, 2026 10:39

@bitgo-ai-agent-dev bitgo-ai-agent-dev Bot left a comment

Review: ci/retry-and-alert-github-release-step

Overall the implementation is solid and correctly addresses the VL-6353 requirements. One must-fix bug, one should-fix, and a few nitpicks.


Must Fix

Unnecessary 90s sleep after the final retry attempt

The loop body sleeps $((attempt * 30)) seconds after every failed attempt, including attempt 3. The job therefore waits an extra 90 seconds before exit 1, even though no further retry follows.

.github/workflows/npmjs-release.yml — the Create GitHub release retry loop:

for attempt in 1 2 3; do
  if gh release create ...; then
    exit 0
  fi
  delay=$((attempt * 30))
  echo "Attempt $attempt failed. Retrying in ${delay}s..."
  sleep "$delay"   # runs on attempt 3 too, wasting 90s
done
exit 1

Fix: skip the sleep on the last attempt, e.g.:

for attempt in 1 2 3; do
  if gh release create ...; then
    exit 0
  fi
  # Only back off when another attempt will follow.
  if [ "$attempt" -lt 3 ]; then
    delay=$((attempt * 30))
    echo "Attempt $attempt failed. Retrying in ${delay}s..."
    sleep "$delay"
  fi
done
exit 1
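
To check the guarded loop locally without waiting out the real delays, the same logic can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `retry_with_backoff` and the `SLEEP_CMD` override are hypothetical and not part of the workflow; the command to retry is passed as arguments in place of the elided `gh release create` call.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command with linear backoff
# (attempt * base_delay seconds), skipping the sleep after the last attempt.
# SLEEP_CMD is an assumed test hook; it defaults to the real `sleep`.
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local attempt delay
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if "$@"; then
      return 0
    fi
    # Back off only when another attempt will follow.
    if [ "$attempt" -lt "$max_attempts" ]; then
      delay=$((attempt * base_delay))
      echo "Attempt $attempt failed. Retrying in ${delay}s..."
      "${SLEEP_CMD:-sleep}" "$delay"
    fi
  done
  echo "All $max_attempts attempts failed."
  return 1
}
```

Running `SLEEP_CMD=true; retry_with_backoff 3 30 false` exercises the failure path immediately: it reports the 30s and 60s waits, never a 90s one, and returns 1.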

Should Fix

SLACK_RELEASE_WEBHOOK_URL is a new secret with no provisioning path documented in the repo

No other workflow in this repo uses a Slack webhook secret, so this secret doesn't exist yet. The PR description mentions it needs to be set up, but the workflow silently no-ops (with a ::warning::) if it's unset — meaning the Slack alert half of the acceptance criteria is unfulfilled until ops provisions it. At minimum, the PR should confirm the secret has been added to the npmjs-release environment, or the ::warning:: should be promoted to an ::error:: to make the gap obvious.


Optional / Nitpicks
  • The Notify on GitHub release failure step emits ::error:: and then exits 0. GitHub Actions will show this step as green (success) in the UI while the annotation appears as a red error. This is a minor UX confusion — the step name and annotation message are clear enough that it's not misleading, but having the step itself exit 1 (and relying on continue-on-error: true only on the notify step) would make the step UI match the intent more accurately. Low priority.
  • The inline comment block added above the step (# NPM publish has already happened by this point...) is genuinely useful operational context here — no objection, just noting it as a deliberate choice.

Existing org-level Slack notifications already fire on job failure for
this repo, so a dedicated SLACK_RELEASE_WEBHOOK_URL step was redundant.
Removing it eliminates the unprovisioned secret and the silent no-op
fallback. The job still fails non-zero on exhausted retries (no
continue-on-error), which is what the existing notifier hooks into.

Also guard the retry-loop sleep with `attempt -lt 3` so the loop no
longer sleeps 90s after the final failed attempt before exit 1.

Ticket: VL-6353
Session-Id: 171773d2-746b-4f79-8b23-5ad8c11f2e5e
Task-Id: d6793c61-de39-41c3-80f1-5352025fb58c
@bitgo-ai-agent-dev

Addressed in 7d4ee25:

  • Removed the dedicated Slack webhook step / SLACK_RELEASE_WEBHOOK_URL secret. Per @roshanmaind194 — release success/failure notifications are already routed to Slack at the org level. Now that continue-on-error: true is gone, an exhausted-retry failure surfaces as a non-zero job failure, which the existing notifier picks up. No new secret to provision.
  • Fixed the must-fix from the prior review. The retry loop's sleep is now guarded by if [ "$attempt" -lt 3 ], so we no longer waste 90s after the final failed attempt before exit 1.

@bitgo-ai-agent-dev bitgo-ai-agent-dev Bot force-pushed the vl-6353-github-release-retry branch from 7d4ee25 to 659778b Compare June 4, 2026 10:49