fix(observability): bound OTel provider shutdown to keep exit fast by taitelee · Pull Request #366 · Wave-RF/WaveHouse

taitelee · 2026-07-02T14:08:57Z

Summary

When no OpenTelemetry collector is reachable — the e2e default, and any production collector outage/misconfiguration — WaveHouse's graceful shutdown flushed the traces, metrics, and logs providers serially, each sitting in gRPC dial backoff for its full timeout before giving up. That stacked to ~15s of hang on every clean exit. It stayed hidden until #287 removed the embedded-NATS os.Exit(0) that had been short-circuiting shutdown; once WaveHouse ran its full shutdown path, the ~15s surfaced.

That overruns the e2e orchestrator's SIGINT→exit budget. e2e runs a coverage-instrumented binary and SIGINTs it so Go flushes GOCOVERDIR on a clean exit; if shutdown outruns the budget the orchestrator SIGKILLs it, and a SIGKILL'd Go process writes no coverage — surfacing as the e2e suite reporting 0.0% coverage and failing make ci. The stopgap was bumping the budget
10s→30s, so every clean e2e run then paid the full ~15s.

This bounds shutdown two ways, both required:

Fan the providers out concurrently (observability.InitProvider) so their dial-backoff windows overlap instead of summing.
Return at the caller's context deadline instead of waiting for every provider. Concurrency alone isn't enough: the experimental logs SDK's BatchProcessor.Shutdown (go.opentelemetry.io/otel/sdk/log v0.20.0) ignores ctx and blocks for the exporter's full ~10s timeout during backoff, so a plain wg.Wait() still drags shutdown to ~10s. A select on done-vs-ctx.Done() returns ctx.Err() once the deadline passes; the straggler goroutine is harmless and reaped by the imminent os.Exit. errs is read only on the done path, so the deadline path can't race a still-running goroutine.
Restore the budgets: the shutdown bound in cmd/wavehouse goes 5s→3s and the orchestrator's SIGINT→kill budget 30s→10s, with a comment pinning why it must stay above the bounded shutdown time.
Add an early no-op guard so an empty/repeat shutdown returns nil before the deadline machinery (which would otherwise misreport an already-expired context as a failure).

Behavior with a collector reachable is unchanged — flushes still complete normally, now concurrently.

Related Issues

Closes #288

…it latency

…to keep exit fast

coderabbitai · 2026-07-02T14:09:11Z

Warning

Review limit reached

@taitelee, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 7 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 97eeeaaa-531f-4255-858e-44e992a5b097

📥 Commits

Reviewing files that changed from the base of the PR and between af7461d and 2f44f9d.

📒 Files selected for processing (1)

CHANGELOG.md

📝 Walkthrough

Walkthrough

The observability provider's shutdown function now runs registered shutdown callbacks concurrently instead of sequentially, bounding completion by context cancellation, and returns early when no callbacks exist. Correspondingly, the main binary's shutdown timeout was reduced from 5s to 3s, the orchestrator's kill grace period reduced from 30s to 10s, and a new regression test validates bounded shutdown timing.

Changes

Bounded Parallel OTel Shutdown

Layer / File(s)	Summary
Concurrent, context-bounded shutdown implementation `internal/observability/provider.go`	`InitProvider`'s shutdown function runs registered shutdown funcs concurrently via a WaitGroup, joins errors after all complete, returns `ctx.Err()` if the context is done first, and short-circuits to `nil` when no shutdown funcs remain.
Regression test for bounded shutdown timing `internal/observability/provider_test.go`	Adds OTel log imports and a new `TestInitProvider_ShutdownParallelBounded` test that configures an unreachable OTLP endpoint, emits traces/metrics/logs, invokes shutdown with a 2s deadline, and asserts completion under 3 seconds.
Caller timeout adjustments `cmd/wavehouse/main.go`, `scripts/orchestrator/main.go`	The main binary's `otelShutdown` deadline is reduced from 5s to 3s, and the orchestrator's post-SIGINT grace period for `wavehouse-cov` is reduced from 30s to 10s, with updated comments/log text.

Estimated code review effort: 3 (Moderate) | ~20 minutes

Possibly related PRs

Wave-RF/WaveHouse#116: Related changes to observability provider shutdown semantics for unreachable OTEL collectors.

Suggested reviewers: EricAndrechek

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly matches the main change: bounding OpenTelemetry shutdown to keep exits fast.
Description check	✅ Passed	The description is directly related to the shutdown-latency fix and matches the implemented changes.
Linked Issues check	✅ Passed	The PR implements the linked issue's parallel shutdown and reduced exit budgets, satisfying the acceptance target.
Out of Scope Changes check	✅ Passed	The changes stay within the shutdown-latency fix and the orchestrator timeout adjustment required by the issue.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch otel-shutdown-latency

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch otel-shutdown-latency

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

github-code-quality · 2026-07-02T14:12:27Z

Code Coverage Overview

Languages: Go

Go

The overall coverage in the otel-shutdown-latenc... branch remains at 90%, unchanged from the main branch.

Show a code coverage summary of the most impacted files.

File	main `029fe1f`	otel-shutdown-latenc... `2f44f9d`	+/-
`cmd/wavehouse/main.go`	69%	69%	0%
`internal/observ...ity/provider.go`	79%	83%	+4%

_{Updated July 02, 2026 15:03 UTC
Code Coverage is in Public Preview. Learn more and provide us with your feedback.}

coderabbitai

Actionable comments posted: 2

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 8f6dae72-0b00-440e-a51f-8b3da46cbb8b

📥 Commits

Reviewing files that changed from the base of the PR and between 029fe1f and af7461d.

📒 Files selected for processing (4)

cmd/wavehouse/main.go
internal/observability/provider.go
internal/observability/provider_test.go
scripts/orchestrator/main.go

📜 Review details

⏰ Context from checks skipped due to timeout. (4)

GitHub Check: E2E tests
GitHub Check: Coverage
GitHub Check: Integration tests
GitHub Check: Lint

⚠️ CI failures not shown inline (2)

GitHub Actions: PR housekeeping / PR housekeeping: fix(observability): bound OTel provider shutdown to keep exit fast

Conclusion: failure

View job details

##[group]Run # Single source of truth for the rule: scripts/lint-pr-title.sh — the
 �[36;1m# Single source of truth for the rule: scripts/lint-pr-title.sh — the�[0m
 �[36;1m# SAME validator the local agent gate runs (.claude/hooks/agent-bash-gate.sh),�[0m
 �[36;1m# so CI and local can't drift. The checkout above is ref: main, so this is�[0m
 �[36;1m# always the default-branch script. Dependabot's grouped-update titles�[0m
 �[36;1m# routinely exceed the 72-char subject cap and the format isn't�[0m
 �[36;1m# configurable, so Dependabot PRs are exempt from the length check�[0m
 �[36;1m# (the format check still applies).�[0m
 �[36;1mif [[ "$PR_AUTHOR" == "dependabot[bot]" || "$PR_AUTHOR" == "app/dependabot" ]]; then�[0m
 �[36;1m  export PR_TITLE_SKIP_LENGTH=1�[0m
 �[36;1mfi�[0m
 �[36;1m�[0m
 �[36;1mif reason=$(bash scripts/lint-pr-title.sh "$PR_TITLE" 2>&1); then�[0m
 �[36;1m  echo "passed=true" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  echo "PR title OK: $PR_TITLE"�[0m
 �[36;1melse�[0m
 �[36;1m  echo "passed=false" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  printf '%s\n' "$reason"�[0m
 �[36;1m  echo "::error::$(printf '%s' "$reason" | head -1)"�[0m

GitHub Actions: PR housekeeping / 0_PR housekeeping.txt: fix(observability): bound OTel provider shutdown to keep exit fast

Conclusion: failure

View job details

##[group]Run # Single source of truth for the rule: scripts/lint-pr-title.sh — the
 �[36;1m# Single source of truth for the rule: scripts/lint-pr-title.sh — the�[0m
 �[36;1m# SAME validator the local agent gate runs (.claude/hooks/agent-bash-gate.sh),�[0m
 �[36;1m# so CI and local can't drift. The checkout above is ref: main, so this is�[0m
 �[36;1m# always the default-branch script. Dependabot's grouped-update titles�[0m
 �[36;1m# routinely exceed the 72-char subject cap and the format isn't�[0m
 �[36;1m# configurable, so Dependabot PRs are exempt from the length check�[0m
 �[36;1m# (the format check still applies).�[0m
 �[36;1mif [[ "$PR_AUTHOR" == "dependabot[bot]" || "$PR_AUTHOR" == "app/dependabot" ]]; then�[0m
 �[36;1m  export PR_TITLE_SKIP_LENGTH=1�[0m
 �[36;1mfi�[0m
 �[36;1m�[0m
 �[36;1mif reason=$(bash scripts/lint-pr-title.sh "$PR_TITLE" 2>&1); then�[0m
 �[36;1m  echo "passed=true" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  echo "PR title OK: $PR_TITLE"�[0m
 �[36;1melse�[0m
 �[36;1m  echo "passed=false" >> "$GITHUB_OUTPUT"�[0m
 �[36;1m  printf '%s\n' "$reason"�[0m
 �[36;1m  echo "::error::$(printf '%s' "$reason" | head -1)"�[0m

🧰 Additional context used

📓 Path-based instructions (2)

cmd/wavehouse/**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

Wire new dependencies in cmd/wavehouse/main.go rather than relying on global state or hidden initialization.

Files:

cmd/wavehouse/main.go

internal/**/*.go

📄 CodeRabbit inference engine (AGENTS.md)

Core packages should be interface-first: behaviors such as cache, dedupe, publisher, and subscriber should be defined behind interfaces to keep implementations swappable.

Files:

internal/observability/provider.go
internal/observability/provider_test.go

🧠 Learnings (2)

📚 Learning: 2026-05-25T11:25:08.794Z

Learnt from: EricAndrechek
Repo: Wave-RF/WaveHouse PR: 180
File: internal/observability/instruments.go:40-117
Timestamp: 2026-05-25T11:25:08.794Z
Learning: In WaveHouse (Go), it is idiomatic to declare OpenTelemetry (OTel) metric instruments as package-level `var` values created via the OTel Go SDK global proxy pattern (e.g., `var h metric.Float64Histogram = Meter().Float64Histogram(...)`, `var c metric.Int64Counter = Meter().Int64Counter(...)`). When reviewing, do NOT flag these as “global state” violations under the AGENTS.md constructor-injection rule; that rule is intended for swappable application-level dependencies (Cache/Publisher/Subscriber/Deduplicator), not OTel proxy instrument declarations. Do not recommend refactoring these instruments behind an `Instruments` struct for dependency injection.

Applied to files:

internal/observability/provider.go
internal/observability/provider_test.go

📚 Learning: 2026-06-26T12:23:22.696Z

Learnt from: EricAndrechek
Repo: Wave-RF/WaveHouse PR: 346
File: internal/stream/subscriber_test.go:9-28
Timestamp: 2026-06-26T12:23:22.696Z
Learning: In this Go repository, prefer table-driven tests (e.g., `[]struct{...}` with `t.Run(...)`) only for tests that cover multiple scenarios/inputs and can be cleanly enumerated. Do not artificially rewrite a clear single-scenario sequential behavioral-flow test into a table-driven form just to fit the pattern; if there’s only one meaningful scenario, keep the test as a straightforward linear flow (as in `TestSubscriber_SendDeliversThenDropsWhenFull`).

Applied to files:

internal/observability/provider_test.go

🔇 Additional comments (3)

internal/observability/provider_test.go (1)

11-12: LGTM!

Also applies to: 59-115

cmd/wavehouse/main.go (1)

124-131: LGTM!
internal/observability/provider.go (1)
79-118: 🎯 Functional Correctness

No issue: sync.WaitGroup.Go is supported by the module’s Go 1.26.4 target.
			> Likely an incorrect or invalid review comment.

taitelee added 2 commits July 1, 2026 22:14

fix(observability): shut down OTel providers concurrently to bound ex…

1a6e56c

…it latency

fix(observability): bound OTel provider shutdown to its ctx deadline …

af7461d

…to keep exit fast

taitelee requested review from a team and EricAndrechek July 2, 2026 14:08

github-project-automation Bot added this to WaveHouse Task Board Jul 2, 2026

github-project-automation Bot moved this to Backlog in WaveHouse Task Board Jul 2, 2026

github-actions Bot added go Pull requests that update go code area/observability Metrics, logs, traces, health, profiling area/infra CI, build, deploy, Docker, release labels Jul 2, 2026

taitelee moved this from Backlog to In progress in WaveHouse Task Board Jul 2, 2026

coderabbitai Bot requested changes Jul 2, 2026

View reviewed changes

Comment thread internal/observability/provider.go

Comment thread scripts/orchestrator/main.go

github-project-automation Bot moved this from In progress to In review in WaveHouse Task Board Jul 2, 2026

coderabbitai Bot previously approved these changes Jul 2, 2026

View reviewed changes

changelog edits

2f44f9d

taitelee dismissed coderabbitai[bot]’s stale review via 2f44f9d July 2, 2026 14:59

github-actions Bot added the area/docs Documentation, site/, README label Jul 2, 2026

taitelee moved this from In review to Ready in WaveHouse Task Board Jul 2, 2026

taitelee assigned EricAndrechek Jul 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(observability): bound OTel provider shutdown to keep exit fast#366

fix(observability): bound OTel provider shutdown to keep exit fast#366
taitelee wants to merge 3 commits into
mainfrom
otel-shutdown-latency

taitelee commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Review limit reached

Walkthrough

Changes

Uh oh!

github-code-quality Bot commented Jul 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

taitelee commented Jul 2, 2026

Summary

Related Issues

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Uh oh!

github-code-quality Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Coverage Overview

Go

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

github-code-quality Bot commented Jul 2, 2026 •

edited

Loading