Skip to content

fix(observability): bound OTel provider shutdown to keep exit fast#366

Open
taitelee wants to merge 3 commits into
mainfrom
otel-shutdown-latency
Open

fix(observability): bound OTel provider shutdown to keep exit fast#366
taitelee wants to merge 3 commits into
mainfrom
otel-shutdown-latency

Conversation

@taitelee

@taitelee taitelee commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary

When no OpenTelemetry collector is reachable — the e2e default, and any production collector outage/misconfiguration — WaveHouse's graceful shutdown flushed the traces, metrics, and logs providers serially, each sitting in gRPC dial backoff for its full timeout before giving up. That stacked to ~15s of hang on every clean exit. It stayed hidden until #287 removed the embedded-NATS os.Exit(0) that had been short-circuiting shutdown; once WaveHouse ran its full shutdown path, the ~15s surfaced.

That overruns the e2e orchestrator's SIGINT→exit budget. e2e runs a coverage-instrumented binary and SIGINTs it so Go flushes GOCOVERDIR on a clean exit; if shutdown outruns the budget the orchestrator SIGKILLs it, and a SIGKILL'd Go process writes no coverage — surfacing as the e2e suite reporting 0.0% coverage and failing make ci. The stopgap was bumping the budget
10s→30s, so every clean e2e run then paid the full ~15s.

This bounds shutdown two ways, both required:

  • Fan the providers out concurrently (observability.InitProvider) so their dial-backoff windows overlap instead of summing.
  • Return at the caller's context deadline instead of waiting for every provider. Concurrency alone isn't enough: the experimental logs SDK's BatchProcessor.Shutdown (go.opentelemetry.io/otel/sdk/log v0.20.0) ignores ctx and blocks for the exporter's full ~10s timeout during backoff, so a plain wg.Wait() still drags shutdown to ~10s. A select on done-vs-ctx.Done() returns ctx.Err() once the deadline passes; the straggler goroutine is harmless and reaped by the imminent os.Exit. errs is read only on the done path, so the deadline path can't race a still-running goroutine.
  • Restore the budgets: the shutdown bound in cmd/wavehouse goes 5s→3s and the orchestrator's SIGINT→kill budget 30s→10s, with a comment pinning why it must stay above the bounded shutdown time.
  • Add an early no-op guard so an empty/repeat shutdown returns nil before the deadline machinery (which would otherwise misreport an already-expired context as a failure).

Behavior with a collector reachable is unchanged — flushes still complete normally, now concurrently.

Related Issues

Closes #288

@taitelee taitelee requested review from a team and EricAndrechek July 2, 2026 14:08
@github-actions github-actions Bot added go Pull requests that update go code area/observability Metrics, logs, traces, health, profiling area/infra CI, build, deploy, Docker, release labels Jul 2, 2026
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@taitelee, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 7 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 97eeeaaa-531f-4255-858e-44e992a5b097

📥 Commits

Reviewing files that changed from the base of the PR and between af7461d and 2f44f9d.

📒 Files selected for processing (1)
  • CHANGELOG.md
📝 Walkthrough

Walkthrough

The observability provider's shutdown function now runs registered shutdown callbacks concurrently instead of sequentially, bounding completion by context cancellation, and returns early when no callbacks exist. Correspondingly, the main binary's shutdown timeout was reduced from 5s to 3s, the orchestrator's kill grace period reduced from 30s to 10s, and a new regression test validates bounded shutdown timing.

Changes

Bounded Parallel OTel Shutdown

Layer / File(s) Summary
Concurrent, context-bounded shutdown implementation
internal/observability/provider.go
InitProvider's shutdown function runs registered shutdown funcs concurrently via a WaitGroup, joins errors after all complete, returns ctx.Err() if the context is done first, and short-circuits to nil when no shutdown funcs remain.
Regression test for bounded shutdown timing
internal/observability/provider_test.go
Adds OTel log imports and a new TestInitProvider_ShutdownParallelBounded test that configures an unreachable OTLP endpoint, emits traces/metrics/logs, invokes shutdown with a 2s deadline, and asserts completion under 3 seconds.
Caller timeout adjustments
cmd/wavehouse/main.go, scripts/orchestrator/main.go
The main binary's otelShutdown deadline is reduced from 5s to 3s, and the orchestrator's post-SIGINT grace period for wavehouse-cov is reduced from 30s to 10s, with updated comments/log text.

Estimated code review effort: 3 (Moderate) | ~20 minutes

Possibly related PRs

  • Wave-RF/WaveHouse#116: Related changes to observability provider shutdown semantics for unreachable OTEL collectors.

Suggested reviewers: EricAndrechek

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly matches the main change: bounding OpenTelemetry shutdown to keep exits fast.
Description check ✅ Passed The description is directly related to the shutdown-latency fix and matches the implemented changes.
Linked Issues check ✅ Passed The PR implements the linked issue's parallel shutdown and reduced exit budgets, satisfying the acceptance target.
Out of Scope Changes check ✅ Passed The changes stay within the shutdown-latency fix and the orchestrator timeout adjustment required by the issue.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch otel-shutdown-latency
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch otel-shutdown-latency

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@taitelee taitelee moved this from Backlog to In progress in WaveHouse Task Board Jul 2, 2026
@github-code-quality

github-code-quality Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Code Coverage Overview

Languages: Go

Go

The overall coverage in the branch remains at 90%, unchanged from the branch.

Show a code coverage summary of the most impacted files.
File 029fe1f 2f44f9d +/-
cmd/wavehouse/main.go 69% 69% 0%
internal/observ...ity/provider.go 79% 83% +4%

Updated July 02, 2026 15:03 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 8f6dae72-0b00-440e-a51f-8b3da46cbb8b

📥 Commits

Reviewing files that changed from the base of the PR and between 029fe1f and af7461d.

📒 Files selected for processing (4)
  • cmd/wavehouse/main.go
  • internal/observability/provider.go
  • internal/observability/provider_test.go
  • scripts/orchestrator/main.go
📜 Review details
⏰ Context from checks skipped due to timeout. (4)
  • GitHub Check: E2E tests
  • GitHub Check: Coverage
  • GitHub Check: Integration tests
  • GitHub Check: Lint
⚠️ CI failures not shown inline (2)

GitHub Actions: PR housekeeping / PR housekeeping: fix(observability): bound OTel provider shutdown to keep exit fast

Conclusion: failure

View job details

##[group]Run # Single source of truth for the rule: scripts/lint-pr-title.sh — the
 �[36;1m# Single source of truth for the rule: scripts/lint-pr-title.sh — the�[0m
 �[36;1m# SAME validator the local agent gate runs (.claude/hooks/agent-bash-gate.sh),�[0m
 �[36;1m# so CI and local can't drift. The checkout above is ref: main, so this is�[0m
 �[36;1m# always the default-branch script. Dependabot's grouped-update titles�[0m
 �[36;1m# routinely exceed the 72-char subject cap and the format isn't�[0m
 �[36;1m# configurable, so Dependabot PRs are exempt from the length check�[0m
 �[36;1m# (the format check still applies).�[0m
 �[36;1mif [[ "$PR_AUTHOR" == "dependabot[bot]" || "$PR_AUTHOR" == "app/dependabot" ]]; then�[0m
 �[36;1m  export PR_TITLE_SKIP_LENGTH=1�[0m
 �[36;1mfi�[0m
 �[36;1m�[0m
 �[36;1mif reason=$(bash scripts/lint-pr-title.sh "$PR_TITLE" 2>&1); then�[0m
 �[36;1m  echo "passed=true" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  echo "PR title OK: $PR_TITLE"�[0m
 �[36;1melse�[0m
 �[36;1m  echo "passed=false" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  printf '%s\n' "$reason"�[0m
 �[36;1m  echo "::error::$(printf '%s' "$reason" | head -1)"�[0m

GitHub Actions: PR housekeeping / 0_PR housekeeping.txt: fix(observability): bound OTel provider shutdown to keep exit fast

Conclusion: failure

View job details

##[group]Run # Single source of truth for the rule: scripts/lint-pr-title.sh — the
 �[36;1m# Single source of truth for the rule: scripts/lint-pr-title.sh — the�[0m
 �[36;1m# SAME validator the local agent gate runs (.claude/hooks/agent-bash-gate.sh),�[0m
 �[36;1m# so CI and local can't drift. The checkout above is ref: main, so this is�[0m
 �[36;1m# always the default-branch script. Dependabot's grouped-update titles�[0m
 �[36;1m# routinely exceed the 72-char subject cap and the format isn't�[0m
 �[36;1m# configurable, so Dependabot PRs are exempt from the length check�[0m
 �[36;1m# (the format check still applies).�[0m
 �[36;1mif [[ "$PR_AUTHOR" == "dependabot[bot]" || "$PR_AUTHOR" == "app/dependabot" ]]; then�[0m
 �[36;1m  export PR_TITLE_SKIP_LENGTH=1�[0m
 �[36;1mfi�[0m
 �[36;1m�[0m
 �[36;1mif reason=$(bash scripts/lint-pr-title.sh "$PR_TITLE" 2>&1); then�[0m
 �[36;1m  echo "passed=true" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  echo "PR title OK: $PR_TITLE"�[0m
 �[36;1melse�[0m
 �[36;1m  echo "passed=false" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  printf '%s\n' "$reason"�[0m
 �[36;1m  echo "::error::$(printf '%s' "$reason" | head -1)"�[0m
🧰 Additional context used
📓 Path-based instructions (2)
cmd/wavehouse/**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

Wire new dependencies in cmd/wavehouse/main.go rather than relying on global state or hidden initialization.

Files:

  • cmd/wavehouse/main.go
internal/**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

Core packages should be interface-first: behaviors such as cache, dedupe, publisher, and subscriber should be defined behind interfaces to keep implementations swappable.

Files:

  • internal/observability/provider.go
  • internal/observability/provider_test.go
🧠 Learnings (2)
📚 Learning: 2026-05-25T11:25:08.794Z
Learnt from: EricAndrechek
Repo: Wave-RF/WaveHouse PR: 180
File: internal/observability/instruments.go:40-117
Timestamp: 2026-05-25T11:25:08.794Z
Learning: In WaveHouse (Go), it is idiomatic to declare OpenTelemetry (OTel) metric instruments as package-level `var` values created via the OTel Go SDK global proxy pattern (e.g., `var h metric.Float64Histogram = Meter().Float64Histogram(...)`, `var c metric.Int64Counter = Meter().Int64Counter(...)`). When reviewing, do NOT flag these as “global state” violations under the AGENTS.md constructor-injection rule; that rule is intended for swappable application-level dependencies (Cache/Publisher/Subscriber/Deduplicator), not OTel proxy instrument declarations. Do not recommend refactoring these instruments behind an `Instruments` struct for dependency injection.

Applied to files:

  • internal/observability/provider.go
  • internal/observability/provider_test.go
📚 Learning: 2026-06-26T12:23:22.696Z
Learnt from: EricAndrechek
Repo: Wave-RF/WaveHouse PR: 346
File: internal/stream/subscriber_test.go:9-28
Timestamp: 2026-06-26T12:23:22.696Z
Learning: In this Go repository, prefer table-driven tests (e.g., `[]struct{...}` with `t.Run(...)`) only for tests that cover multiple scenarios/inputs and can be cleanly enumerated. Do not artificially rewrite a clear single-scenario sequential behavioral-flow test into a table-driven form just to fit the pattern; if there’s only one meaningful scenario, keep the test as a straightforward linear flow (as in `TestSubscriber_SendDeliversThenDropsWhenFull`).

Applied to files:

  • internal/observability/provider_test.go
🔇 Additional comments (3)
internal/observability/provider_test.go (1)

11-12: LGTM!

Also applies to: 59-115

cmd/wavehouse/main.go (1)

124-131: LGTM!

internal/observability/provider.go (1)

79-118: 🎯 Functional Correctness

No issue: sync.WaitGroup.Go is supported by the module’s Go 1.26.4 target.

			> Likely an incorrect or invalid review comment.

Comment thread internal/observability/provider.go
Comment thread scripts/orchestrator/main.go
@github-project-automation github-project-automation Bot moved this from In progress to In review in WaveHouse Task Board Jul 2, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes Jul 2, 2026
@github-actions github-actions Bot added the area/docs Documentation, site/, README label Jul 2, 2026
@taitelee taitelee moved this from In review to Ready in WaveHouse Task Board Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/docs Documentation, site/, README area/infra CI, build, deploy, Docker, release area/observability Metrics, logs, traces, health, profiling go Pull requests that update go code

Projects

Status: Ready

Development

Successfully merging this pull request may close these issues.

bug(observability): graceful shutdown stalls ~15s when the OTel collector is unreachable

2 participants