fix(observability): bound OTel provider shutdown to keep exit fast#366
fix(observability): bound OTel provider shutdown to keep exit fast#366taitelee wants to merge 3 commits into
Conversation
|
Warning Review limit reached
Next review available in: 7 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe observability provider's shutdown function now runs registered shutdown callbacks concurrently instead of sequentially, bounding completion by context cancellation, and returns early when no callbacks exist. Correspondingly, the main binary's shutdown timeout was reduced from 5s to 3s, the orchestrator's kill grace period reduced from 30s to 10s, and a new regression test validates bounded shutdown timing. ChangesBounded Parallel OTel Shutdown
Estimated code review effort: 3 (Moderate) | ~20 minutes Possibly related PRs
Suggested reviewers: 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Code Coverage OverviewLanguages: Go GoThe overall coverage in the branch remains at 90%, unchanged from the branch. Show a code coverage summary of the most impacted files.
Updated |
There was a problem hiding this comment.
Actionable comments posted: 2
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 8f6dae72-0b00-440e-a51f-8b3da46cbb8b
📒 Files selected for processing (4)
cmd/wavehouse/main.gointernal/observability/provider.gointernal/observability/provider_test.goscripts/orchestrator/main.go
📜 Review details
⏰ Context from checks skipped due to timeout. (4)
- GitHub Check: E2E tests
- GitHub Check: Coverage
- GitHub Check: Integration tests
- GitHub Check: Lint
⚠️ CI failures not shown inline (2)
GitHub Actions: PR housekeeping / PR housekeeping: fix(observability): bound OTel provider shutdown to keep exit fast
Conclusion: failure
##[group]Run # Single source of truth for the rule: scripts/lint-pr-title.sh — the
�[36;1m# Single source of truth for the rule: scripts/lint-pr-title.sh — the�[0m
�[36;1m# SAME validator the local agent gate runs (.claude/hooks/agent-bash-gate.sh),�[0m
�[36;1m# so CI and local can't drift. The checkout above is ref: main, so this is�[0m
�[36;1m# always the default-branch script. Dependabot's grouped-update titles�[0m
�[36;1m# routinely exceed the 72-char subject cap and the format isn't�[0m
�[36;1m# configurable, so Dependabot PRs are exempt from the length check�[0m
�[36;1m# (the format check still applies).�[0m
�[36;1mif [[ "$PR_AUTHOR" == "dependabot[bot]" || "$PR_AUTHOR" == "app/dependabot" ]]; then�[0m
�[36;1m export PR_TITLE_SKIP_LENGTH=1�[0m
�[36;1mfi�[0m
�[36;1m�[0m
�[36;1mif reason=$(bash scripts/lint-pr-title.sh "$PR_TITLE" 2>&1); then�[0m
�[36;1m echo "passed=true" >> "$GITHUB_OUTPUT"�[0m
�[36;1m echo "PR title OK: $PR_TITLE"�[0m
�[36;1melse�[0m
�[36;1m echo "passed=false" >> "$GITHUB_OUTPUT"�[0m
�[36;1m printf '%s\n' "$reason"�[0m
�[36;1m echo "::error::$(printf '%s' "$reason" | head -1)"�[0m
GitHub Actions: PR housekeeping / 0_PR housekeeping.txt: fix(observability): bound OTel provider shutdown to keep exit fast
Conclusion: failure
##[group]Run # Single source of truth for the rule: scripts/lint-pr-title.sh — the
�[36;1m# Single source of truth for the rule: scripts/lint-pr-title.sh — the�[0m
�[36;1m# SAME validator the local agent gate runs (.claude/hooks/agent-bash-gate.sh),�[0m
�[36;1m# so CI and local can't drift. The checkout above is ref: main, so this is�[0m
�[36;1m# always the default-branch script. Dependabot's grouped-update titles�[0m
�[36;1m# routinely exceed the 72-char subject cap and the format isn't�[0m
�[36;1m# configurable, so Dependabot PRs are exempt from the length check�[0m
�[36;1m# (the format check still applies).�[0m
�[36;1mif [[ "$PR_AUTHOR" == "dependabot[bot]" || "$PR_AUTHOR" == "app/dependabot" ]]; then�[0m
�[36;1m export PR_TITLE_SKIP_LENGTH=1�[0m
�[36;1mfi�[0m
�[36;1m�[0m
�[36;1mif reason=$(bash scripts/lint-pr-title.sh "$PR_TITLE" 2>&1); then�[0m
�[36;1m echo "passed=true" >> "$GITHUB_OUTPUT"�[0m
�[36;1m echo "PR title OK: $PR_TITLE"�[0m
�[36;1melse�[0m
�[36;1m echo "passed=false" >> "$GITHUB_OUTPUT"�[0m
�[36;1m printf '%s\n' "$reason"�[0m
�[36;1m echo "::error::$(printf '%s' "$reason" | head -1)"�[0m
🧰 Additional context used
📓 Path-based instructions (2)
cmd/wavehouse/**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
Wire new dependencies in
cmd/wavehouse/main.gorather than relying on global state or hidden initialization.
Files:
cmd/wavehouse/main.go
internal/**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
Core packages should be interface-first: behaviors such as cache, dedupe, publisher, and subscriber should be defined behind interfaces to keep implementations swappable.
Files:
internal/observability/provider.gointernal/observability/provider_test.go
🧠 Learnings (2)
📚 Learning: 2026-05-25T11:25:08.794Z
Learnt from: EricAndrechek
Repo: Wave-RF/WaveHouse PR: 180
File: internal/observability/instruments.go:40-117
Timestamp: 2026-05-25T11:25:08.794Z
Learning: In WaveHouse (Go), it is idiomatic to declare OpenTelemetry (OTel) metric instruments as package-level `var` values created via the OTel Go SDK global proxy pattern (e.g., `var h metric.Float64Histogram = Meter().Float64Histogram(...)`, `var c metric.Int64Counter = Meter().Int64Counter(...)`). When reviewing, do NOT flag these as “global state” violations under the AGENTS.md constructor-injection rule; that rule is intended for swappable application-level dependencies (Cache/Publisher/Subscriber/Deduplicator), not OTel proxy instrument declarations. Do not recommend refactoring these instruments behind an `Instruments` struct for dependency injection.
Applied to files:
internal/observability/provider.gointernal/observability/provider_test.go
📚 Learning: 2026-06-26T12:23:22.696Z
Learnt from: EricAndrechek
Repo: Wave-RF/WaveHouse PR: 346
File: internal/stream/subscriber_test.go:9-28
Timestamp: 2026-06-26T12:23:22.696Z
Learning: In this Go repository, prefer table-driven tests (e.g., `[]struct{...}` with `t.Run(...)`) only for tests that cover multiple scenarios/inputs and can be cleanly enumerated. Do not artificially rewrite a clear single-scenario sequential behavioral-flow test into a table-driven form just to fit the pattern; if there’s only one meaningful scenario, keep the test as a straightforward linear flow (as in `TestSubscriber_SendDeliversThenDropsWhenFull`).
Applied to files:
internal/observability/provider_test.go
🔇 Additional comments (3)
internal/observability/provider_test.go (1)
11-12: LGTM!Also applies to: 59-115
cmd/wavehouse/main.go (1)
124-131: LGTM!internal/observability/provider.go (1)
79-118: 🎯 Functional CorrectnessNo issue:
sync.WaitGroup.Gois supported by the module’s Go 1.26.4 target.> Likely an incorrect or invalid review comment.
Summary
When no OpenTelemetry collector is reachable — the e2e default, and any production collector outage/misconfiguration — WaveHouse's graceful shutdown flushed the traces, metrics, and logs providers serially, each sitting in gRPC dial backoff for its full timeout before giving up. That stacked to ~15s of hang on every clean exit. It stayed hidden until #287 removed the embedded-NATS
os.Exit(0)that had been short-circuiting shutdown; once WaveHouse ran its full shutdown path, the ~15s surfaced.That overruns the e2e orchestrator's SIGINT→exit budget. e2e runs a coverage-instrumented binary and SIGINTs it so Go flushes
GOCOVERDIRon a clean exit; if shutdown outruns the budget the orchestrator SIGKILLs it, and a SIGKILL'd Go process writes no coverage — surfacing as the e2e suite reporting 0.0% coverage and failingmake ci. The stopgap was bumping the budget10s→30s, so every clean e2e run then paid the full ~15s.
This bounds shutdown two ways, both required:
observability.InitProvider) so their dial-backoff windows overlap instead of summing.BatchProcessor.Shutdown(go.opentelemetry.io/otel/sdk/logv0.20.0) ignoresctxand blocks for the exporter's full ~10s timeout during backoff, so a plainwg.Wait()still drags shutdown to ~10s. Aselecton done-vs-ctx.Done()returnsctx.Err()once the deadline passes; the straggler goroutine is harmless and reaped by the imminentos.Exit.errsis read only on the done path, so the deadline path can't race a still-running goroutine.cmd/wavehousegoes 5s→3s and the orchestrator's SIGINT→kill budget 30s→10s, with a comment pinning why it must stay above the bounded shutdown time.nilbefore the deadline machinery (which would otherwise misreport an already-expired context as a failure).Behavior with a collector reachable is unchanged — flushes still complete normally, now concurrently.
Related Issues
Closes #288