ci: retry and alert on GitHub release creation failure #8944
bitgo-ai-agent-dev[bot] wants to merge 2 commits
The Create GitHub release step in npmjs-release.yml runs after npm publish, which is irreversible. Previously it was marked continue-on-error: true, so a failure (e.g. the GitHub API rate limit seen in VL-5474) silently turned the job yellow with no alert and no GitHub release.

Wrap the gh release create call in a 3-attempt retry loop with 30s, 60s, 90s backoff and drop continue-on-error so a final failure fails the job.

Add a follow-up step that fires only when the release step fails, posting a Slack notification (via SLACK_RELEASE_WEBHOOK_URL webhook secret) that identifies the run and version requiring manual remediation.

Ticket: VL-6353
Session-Id: b035fd16-81c0-4327-b03e-fb4e0dce6501
Task-Id: f5755602-d6f6-4284-b891-f02b996e3188
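As a rough illustration of a notification that "identifies the run and version", the sketch below builds a JSON body in the shape Slack incoming webhooks accept (`{"text": ...}`). The function name `build_release_alert` and its version argument are hypothetical and not taken from the PR; `GITHUB_SERVER_URL`, `GITHUB_REPOSITORY` and `GITHUB_RUN_ID` are default GitHub Actions environment variables.

```shell
# Hypothetical helper, not the PR's actual step: prints a Slack webhook
# payload naming the version and the workflow run that needs remediation.
build_release_alert() {
  local version="$1"
  # Run URL assembled from default GitHub Actions environment variables.
  local run_url="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"
  # Assumption: neither value contains characters that need JSON escaping.
  printf '{"text":"GitHub release creation failed for %s after npm publish. Run: %s"}\n' \
    "$version" "$run_url"
}
```

The output could then be passed to `curl` as the request body; that call is left out here so the sketch stays free of network access.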
1adfd46 to 5e15341
Review: ci/retry-and-alert-github-release-step
Overall the implementation is solid and correctly addresses the VL-6353 requirements. One must-fix bug, one should-fix, and a few nitpicks.
Must Fix
Unnecessary 90s sleep after the final retry attempt
The loop body sleeps $((attempt * 30)) seconds after every failed attempt, including attempt 3. Since no retry follows attempt 3, the job waits an extra 90 seconds for nothing before reaching exit 1.
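To make the cost concrete, here is a small sketch that adds up the sleep time for a run in which every attempt fails, with and without the final sleep. The helper name `total_sleep` is made up for this illustration; the 30-second multiplier comes from the loop under review.

```shell
# total_sleep MAX_ATTEMPTS SKIP_LAST
# Sums the seconds slept when every attempt fails.
# SKIP_LAST=1 models a loop that does not sleep after the final attempt.
total_sleep() {
  local max="$1"
  local skip_last="$2"
  local total=0
  local attempt=1
  while [ "$attempt" -le "$max" ]; do
    if [ "$skip_last" -eq 0 ] || [ "$attempt" -lt "$max" ]; then
      total=$((total + attempt * 30))
    fi
    attempt=$((attempt + 1))
  done
  echo "$total"
}

total_sleep 3 0   # unguarded loop: 30 + 60 + 90 = 180
total_sleep 3 1   # guarded loop:   30 + 60      = 90
```

In the all-attempts-fail case, the unguarded loop spends twice as long sleeping as the guarded one.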
.github/workflows/npmjs-release.yml — the Create GitHub release retry loop:
    for attempt in 1 2 3; do
      if gh release create ...; then
        exit 0
      fi
      delay=$((attempt * 30))
      echo "Attempt $attempt failed. Retrying in ${delay}s..."
      sleep "$delay"  # runs on attempt 3 too, wasting 90s
    done
    exit 1

Fix: skip the sleep on the last attempt, e.g.:
    for attempt in 1 2 3; do
      if gh release create ...; then
        exit 0
      fi
      # Only back off when another attempt will follow.
      if [ "$attempt" -lt 3 ]; then
        delay=$((attempt * 30))
        echo "Attempt $attempt failed. Retrying in ${delay}s..."
        sleep "$delay"
      fi
    done
    exit 1  # all attempts failed; keep the step failing

Should Fix
SLACK_RELEASE_WEBHOOK_URL is a new secret with no provisioning path documented in the repo
No other workflow in this repo uses a Slack webhook secret, so this secret doesn't exist yet. The PR description mentions it needs to be set up, but the workflow silently no-ops (with a ::warning::) if it's unset — meaning the Slack alert half of the acceptance criteria is unfulfilled until ops provisions it. At minimum, the PR should confirm the secret has been added to the npmjs-release environment, or the ::warning:: should be promoted to an ::error:: to make the gap obvious.
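One way the suggested promotion from `::warning::` to `::error::` could look is sketched below. The function name `require_webhook_url` is hypothetical, and the sketch assumes the workflow exposes the secret to the step as an environment variable of the same name.

```shell
# Hypothetical guard: fail loudly when the webhook secret is not provisioned.
# Assumes the secret is mapped to the env var SLACK_RELEASE_WEBHOOK_URL.
require_webhook_url() {
  if [ -z "${SLACK_RELEASE_WEBHOOK_URL:-}" ]; then
    # ::error:: is a GitHub Actions workflow command that creates an error annotation.
    echo "::error::SLACK_RELEASE_WEBHOOK_URL is not set; the release failure alert cannot be sent."
    return 1
  fi
  return 0
}
```

A notify step could call this first and stop with a non-zero exit code when it fails, so a missing secret shows up as a red step rather than a warning that is easy to miss.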
Optional / Nitpicks
- The `Notify on GitHub release failure` step emits `::error::` and then exits 0. GitHub Actions will show this step as green (success) in the UI while the annotation appears as a red error. This is a minor UX confusion: the step name and annotation message are clear enough that it's not misleading, but having the step itself exit 1 (and relying on `continue-on-error: true` only on the notify step) would make the step UI match the intent more accurately. Low priority.
- The inline comment block added above the step (`# NPM publish has already happened by this point...`) is genuinely useful operational context here. No objection, just noting it as a deliberate choice.
Existing org-level Slack notifications already fire on job failure for this repo, so a dedicated SLACK_RELEASE_WEBHOOK_URL step was redundant. Removing it eliminates the unprovisioned secret and the silent no-op fallback. The job still fails non-zero on exhausted retries (no continue-on-error), which is what the existing notifier hooks into.

Also guard the retry-loop sleep with `attempt -lt 3` so the loop no longer sleeps 90s after the final failed attempt before exit 1.

Ticket: VL-6353
Session-Id: 171773d2-746b-4f79-8b23-5ad8c11f2e5e
Task-Id: d6793c61-de39-41c3-80f1-5352025fb58c
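The guarded loop described in this commit message can be sketched as a reusable function. This is an illustration under assumptions, not the workflow's actual script: the name `retry_with_backoff` and the `SLEEP_CMD` override (which lets the loop be exercised without real waiting) are both hypothetical.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached. After failed
# attempt N it sleeps N * BASE_DELAY seconds, except after the final attempt.
retry_with_backoff() {
  local max="$1"
  local base="$2"
  shift 2
  local attempt=1
  local delay
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Only back off when another attempt will follow.
    if [ "$attempt" -lt "$max" ]; then
      delay=$((attempt * base))
      echo "Attempt $attempt failed. Retrying in ${delay}s..." >&2
      # SLEEP_CMD is a hypothetical hook; it defaults to the real sleep.
      "${SLEEP_CMD:-sleep}" "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "All $max attempts failed." >&2
  return 1
}
```

With `3` and `30` as the first two arguments this reproduces the 30s and 60s waits between three attempts. The non-zero return on exhaustion is what would let a job-level failure notifier pick up the problem.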
Addressed in 7d4ee25.
7d4ee25 to 659778b
What
- Replace `continue-on-error: true` on the `Create GitHub release` step in `.github/workflows/npmjs-release.yml` with a 3-attempt bash retry loop (30s, 60s, 90s linearly increasing backoff). The step fails the job non-zero if all attempts fail.
- Add a `Notify on GitHub release failure` step gated on `failure() && steps.create-github-release.outcome == 'failure'` that emits a `::error::` annotation and POSTs a Slack message to `secrets.SLACK_RELEASE_WEBHOOK_URL` (no-ops with a warning if the secret is unset). The retry step's exit code is what fails the job; the notify step only sends the alert.
- […] `gh release create` exactly once on success.

Why
In VL-5474, `gh release create` hit a transient GitHub API rate limit after npm publish had already succeeded. Because the step was `continue-on-error: true`, the workflow only turned yellow, no GitHub release was created, and nobody was paged. Manual remediation was required and the failure was only discovered by reading the Actions log. NPM publish is irreversible, so this class of failure must be loud, retried, and alerted on.

Test plan
- […] `dry-run=true` and confirm the new step is skipped (matches existing behavior).
- […] `::error::` annotation, job green).
- […] `::error::` annotation lists the run URL and version, and a Slack message is posted to the webhook channel.
- […] `SLACK_RELEASE_WEBHOOK_URL` is unset (logs a `::warning::` and exits 0, leaving the job-failure signal coming from the retry step).

Ticket: VL-6353