feat(supervisor): verify warm-start delivery, cold-start silently lost dispatches #3918

Merged
myftija merged 3 commits into main from tri-10659-warm-start-delivery-verification on Jun 16, 2026
Conversation

@myftija myftija commented Jun 12, 2026

Collaborator

Problem

Firestarter's didWarmStart: true means only that the response was written to a long-poll socket, not that the runner received it. A silently dead poller (no FIN, e.g. a VM torn down mid-poll) leaves the dispatched run stuck in PENDING_EXECUTING until the run engine's heartbeat redrive minutes later, and each redrive burns a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES.

Change

After a warm-start hit, the supervisor retains the DequeuedMessage for a delay (scheduled on a TimerWheel, default 10s), then probes the existing getLatestSnapshot API. If the run is still on the exact snapshot that was dequeued, no runner ever started the attempt, so the run falls through to the regular cold-create path. Recovery takes ~10s plus a cold start, with no new APIs and no CLI changes.

  • Double-start safe: startRunAttempt runs under a per-run lock and 409s stale snapshot ids, so a reviving runner and the fallback workload can't both execute; the loser exits before running anything.
  • Probe errors → do nothing: healthy runners legitimately act late during platform brownouts (nested attempt-start retries), so falling back on uncertainty would stampede duplicates. The heartbeat redrive stays as the backstop (also covers supervisor restarts dropping timers).
  • Off by default: TRIGGER_WARM_START_VERIFY_ENABLED (+ TRIGGER_WARM_START_VERIFY_DELAY_MS, 1–60s, default 10s). Disabled = complete no-op. Works for all workload managers (compute/k8s/docker) since it hooks the shared dequeue path.
  • Emits warmstart.verify wide events (outcome: delivered | fallback | probe_error), making the silent-loss rate directly measurable.
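
The decision made once the delay elapses can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ProbeResult`, `decideOutcome`, `verifyDelivery`, and the injected `probe` / `coldCreate` callbacks are hypothetical names, and only the three outcome labels are taken from the description above.

```typescript
// Hypothetical sketch of the post-delay decision; names are illustrative,
// not the supervisor's real API.
type Outcome = "delivered" | "fallback" | "probe_error";

type ProbeResult =
  | { status: "ok"; latestSnapshotId: string }
  | { status: "error"; error: string };

// Pure decision: compare the snapshot that was dequeued with the latest one.
function decideOutcome(dequeuedSnapshotId: string, probe: ProbeResult): Outcome {
  if (probe.status === "error") {
    return "probe_error"; // uncertainty: never fall back
  }
  return probe.latestSnapshotId === dequeuedSnapshotId
    ? "fallback" // run never moved: no runner acted on the dispatch
    : "delivered"; // run advanced: a runner picked it up
}

// Assumed wiring: `probe` stands in for the getLatestSnapshot call and
// `coldCreate` for the regular cold-create path.
async function verifyDelivery(
  dequeuedSnapshotId: string,
  probe: () => Promise<string>,
  coldCreate: () => Promise<void>
): Promise<Outcome> {
  let result: ProbeResult;
  try {
    result = { status: "ok", latestSnapshotId: await probe() };
  } catch (err) {
    result = { status: "error", error: String(err) };
  }
  const outcome = decideOutcome(dequeuedSnapshotId, result);
  if (outcome === "fallback") {
    await coldCreate();
  }
  return outcome;
}
```

Keeping the comparison in a pure function makes the "probe errors do nothing" rule easy to cover in a unit test without timers or network calls.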

changeset-bot Bot commented Jun 12, 2026

⚠️ No Changeset found

Latest commit: 7d4c6e0

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai Bot commented Jun 12, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: f00bd9cb-a56d-47d7-bfd1-5787890fe78a

📥 Commits

Reviewing files that changed from the base of the PR and between 58cef9a and 7d4c6e0.

📒 Files selected for processing (5)
  • .server-changes/warm-start-delivery-verification.md
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
✅ Files skipped from review due to trivial changes (1)
  • .server-changes/warm-start-delivery-verification.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • apps/supervisor/src/env.ts
  • apps/supervisor/src/services/warmStartVerificationService.test.ts
  • apps/supervisor/src/index.ts
  • apps/supervisor/src/services/warmStartVerificationService.ts
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: typecheck / typecheck
  • GitHub Check: audit
  • GitHub Check: audit
  • GitHub Check: Build and publish previews
  • GitHub Check: Analyze (javascript-typescript)

Walkthrough

This pull request adds an opt-in warm-start delivery verification feature to the supervisor. The feature validates whether warm-start dispatches reached runners and automatically falls back to cold-start workload creation if delivery is not confirmed within a configurable delay window. Configuration is gated by TRIGGER_WARM_START_VERIFY_ENABLED (default false) with a configurable probe delay between 1 and 60 seconds (default 10 seconds). The new WarmStartVerificationService uses timer-wheel scheduling and limits concurrent snapshot probes to 10. Integration into the supervisor includes conditional service initialization, scheduling verification on successful warm-start, cancellation when a run connects, and graceful shutdown ordering. A createWorkload helper was extracted to centralize cold-create validation and logging.
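
As a rough illustration of the configuration gate described in the walkthrough, the two settings could be read like this. The variable names, the 1 to 60 second range, and the defaults come from the PR text; the `parseWarmStartVerifyConfig` helper, the accepted truthy strings, and the choice to reject (rather than clamp) out-of-range values are assumptions made for the sketch, and the real `env.ts` may validate differently.

```typescript
// Hypothetical sketch: parse the two settings from an env-like record.
// Only the variable names, range, and defaults come from the PR text.
interface WarmStartVerifyConfig {
  enabled: boolean;
  delayMs: number;
}

const MIN_DELAY_MS = 1000; // 1s lower bound
const MAX_DELAY_MS = 60000; // 60s upper bound
const DEFAULT_DELAY_MS = 10000; // 10s default

function parseWarmStartVerifyConfig(
  env: Record<string, string | undefined>
): WarmStartVerifyConfig {
  // Assumption: "true" or "1" enables the feature; anything else (or unset) is off.
  const rawEnabled = env["TRIGGER_WARM_START_VERIFY_ENABLED"];
  const enabled = rawEnabled === "true" || rawEnabled === "1";

  const rawDelay = env["TRIGGER_WARM_START_VERIFY_DELAY_MS"];
  if (rawDelay === undefined || rawDelay === "") {
    return { enabled, delayMs: DEFAULT_DELAY_MS };
  }

  const delayMs = Number(rawDelay);
  // Assumption: invalid values are rejected at startup instead of being clamped.
  if (!Number.isInteger(delayMs) || delayMs < MIN_DELAY_MS || delayMs > MAX_DELAY_MS) {
    throw new Error(
      `TRIGGER_WARM_START_VERIFY_DELAY_MS must be an integer between ${MIN_DELAY_MS} and ${MAX_DELAY_MS}`
    );
  }
  return { enabled, delayMs };
}
```

With an empty environment this yields the disabled, 10-second default, which matches the "off by default" behaviour described above.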

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding warm-start delivery verification and cold-start fallback for silently lost dispatches, which is the core objective of the PR.
Description check | ✅ Passed | The description is well-structured, explains the problem clearly, details the solution comprehensively, and addresses key safety considerations. However, it diverges from the template by omitting the standard checklist, testing steps, changelog section, and screenshots section.
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

devin-ai-integration[bot]

This comment was marked as resolved.

@myftija myftija force-pushed the tri-10659-warm-start-delivery-verification branch from b6c35ac to 58cef9a on June 12, 2026 14:11

pkg-pr-new Bot commented Jun 12, 2026

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@7d4c6e0

trigger.dev

npm i https://pkg.pr.new/trigger.dev@7d4c6e0

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@7d4c6e0

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@7d4c6e0

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@7d4c6e0

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@7d4c6e0

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@7d4c6e0

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@7d4c6e0

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@7d4c6e0

commit: 7d4c6e0

myftija added 3 commits June 16, 2026 13:59
Firestarter's didWarmStart: true means the response was written to a
socket, not that the runner received it. A silently dead poller (no FIN,
e.g. a VM torn down mid-poll) leaves the dispatched run stuck in
PENDING_EXECUTING until the run engine's heartbeat redrive minutes
later, burning a queue redelivery toward TASK_RUN_DEQUEUED_MAX_RETRIES
each time.

After a warm-start hit the supervisor now retains the DequeuedMessage,
waits TRIGGER_WARM_START_VERIFY_DELAY_MS (default 10s), then asks the
platform for the run's latest snapshot. If it is still the exact
snapshot that was dequeued, no runner ever started the attempt - the
run falls through to the regular cold-create path. Double-starts are
prevented by the engine: startRunAttempt runs under a per-run lock and
rejects stale snapshot ids, so a reviving runner and the fallback
workload can't both execute. On probe errors nothing happens - during
platform brownouts healthy runners legitimately act late, and falling
back on uncertainty would stampede duplicates; the heartbeat redrive
stays as the backstop.
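
To make the double-start argument concrete, here is a toy in-memory model of a snapshot-guarded attempt start. It is only a sketch under stated assumptions: `RunStore`, `StartResult`, and the `snap_N` id scheme are invented for illustration, and the real run engine's locking and storage are not shown in this PR. The single synchronous compare-and-advance step stands in for the per-run lock.

```typescript
// Hypothetical in-memory model of the engine-side guard; not the real run engine.
type StartResult =
  | { kind: "started"; snapshotId: string }
  | { kind: "conflict"; status: 409 };

class RunStore {
  private latestSnapshot = new Map<string, string>();
  private nextId = 1;

  // Register a run and return the snapshot id it was dequeued with.
  dequeue(runId: string): string {
    const snapshotId = `snap_${this.nextId++}`;
    this.latestSnapshot.set(runId, snapshotId);
    return snapshotId;
  }

  // Compare-and-advance in one synchronous step, which plays the role of
  // the per-run lock in this single-threaded sketch.
  startRunAttempt(runId: string, snapshotId: string): StartResult {
    if (this.latestSnapshot.get(runId) !== snapshotId) {
      return { kind: "conflict", status: 409 }; // caller holds a stale snapshot
    }
    const executing = `snap_${this.nextId++}`;
    this.latestSnapshot.set(runId, executing);
    return { kind: "started", snapshotId: executing };
  }
}

// A reviving warm runner and the fallback workload both hold the dequeued id.
const store = new RunStore();
const dequeued = store.dequeue("run_1");
const first = store.startRunAttempt("run_1", dequeued); // wins
const second = store.startRunAttempt("run_1", dequeued); // loses with a conflict
```

Whichever caller arrives second sees the conflict and can exit before executing any task code, which is the property the fallback relies on.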

Off by default; enable with TRIGGER_WARM_START_VERIFY_ENABLED. When
disabled the code path is a no-op. Emits warmstart.verify wide events
(outcome: delivered / fallback / probe_error). Resolves TRI-10659.

Review follow-ups: the workload-create error log now carries the run id
(fallback creates run outside the dequeue wide event, so the log was the
only attribution), and the verifier stops before the workload server and
session so its timer can't cold-create a workload mid-shutdown.
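
The timer bookkeeping behind cancellation and the shutdown ordering can be sketched as below. This is an assumption-laden illustration: the PR uses a TimerWheel, whereas plain `setTimeout` is swapped in here to keep the example short, and `PendingVerifications` with its `schedule` / `cancel` / `stop` methods is a hypothetical shape rather than the actual `WarmStartVerificationService` API.

```typescript
// Hypothetical sketch of the verifier's timer bookkeeping. The PR uses a
// TimerWheel; plain setTimeout is swapped in here to keep the example small.
class PendingVerifications {
  private timers = new Map<string, ReturnType<typeof setTimeout>>();
  private stopped = false;

  // Returns false once stopped, so nothing new is armed during shutdown.
  schedule(runId: string, delayMs: number, onFire: (runId: string) => void): boolean {
    if (this.stopped) return false;
    this.cancel(runId); // keep at most one pending check per run
    const timer = setTimeout(() => {
      this.timers.delete(runId);
      onFire(runId);
    }, delayMs);
    this.timers.set(runId, timer);
    return true;
  }

  // Called when the run connects: delivery is confirmed, no probe needed.
  cancel(runId: string): boolean {
    const timer = this.timers.get(runId);
    if (timer === undefined) return false;
    clearTimeout(timer);
    this.timers.delete(runId);
    return true;
  }

  // Called first during shutdown: drop every pending check so no timer
  // can trigger a cold-create afterwards.
  stop(): void {
    this.stopped = true;
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
  }

  get size(): number {
    return this.timers.size;
  }
}
```

In a shutdown sequence, calling `stop()` before tearing down the workload server and session mirrors the ordering described in the follow-up commit.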
@myftija myftija force-pushed the tri-10659-warm-start-delivery-verification branch from 58cef9a to 7d4c6e0 on June 16, 2026 13:05
@myftija myftija merged commit 002b845 into main Jun 16, 2026
20 checks passed
@myftija myftija deleted the tri-10659-warm-start-delivery-verification branch June 16, 2026 13:14