Skip to content

feat: opt-in DLQ on terminal failure (issue #26)#40

Merged
lesnik512 merged 3 commits into
mainfrom
feat/dlq-on-terminal-failure
May 30, 2026
Merged

feat: opt-in DLQ on terminal failure (issue #26)#40
lesnik512 merged 3 commits into
mainfrom
feat/dlq-on-terminal-failure

Conversation

@lesnik512
Copy link
Copy Markdown
Member

Closes #26.

Summary

  • Opt-in DLQ. Pass OutboxBroker(..., dlq_table=make_dlq_table(metadata)); default behavior is bit-for-bit identical to today.
  • Atomic via single Postgres CTE. OutboxClient.delete_with_lease(..., dlq_payload=...) issues WITH deleted AS (DELETE … RETURNING …) INSERT INTO <dlq> SELECT … FROM deleted so the audit row commits with the DELETE or not at all (DLQ INSERT failure rolls the whole statement back — outbox row stays leased, gets reclaimed/retried). Preserves the writer-connection autocommit fast path (single statement); preserves the lease-token guard.
  • Three failure paths, each tagged on the row via OutboxInnerMessage.terminal_failure_reason so _flush_terminal knows whether to build a DLQ payload:
    • allow_delivery False ⇒ "max_deliveries"
    • _nack exhausted (delay None) ⇒ "retry_terminal"
    • _reject"rejected" (diverges from issue text — see "Design decisions" below)
    • _ack ⇒ stays None, never touches DLQ
  • New dlq_written metric event emitted from _flush_terminal after the CTE commits; tags {queue, subscriber, deliveries_count, failure_reason, exception_type}. Skipped on lease-lost.
  • validate_schema() checks the DLQ table when configured (separate _validate_dlq_schema_sync pass alongside the existing outbox one).
  • FakeOutboxClient accumulates dlq_rows in-memory so TestOutboxBroker tests can assert audit content without Postgres.

Design decisions (worth flagging for review)

  1. _reject writes to DLQ. Issue E3: DLQ / archive on terminal failure (opt-in) #26 lists _reject as opt-out. We took the opposite call after discussion: AckPolicy.REJECT_ON_ERROR is a real failure operators want audited, and manual await msg.reject() still signals "poison, drop with audit." last_exception may be None on manual reject — DLQ column is nullable.
  2. dlq_written metric fires in this PR, not deferred. The issue's "wait for E4" note is stale; the metrics-recorder seam already exists, so the emission is a 5-line addition consistent with the other six subscriber events.
  3. failure_reason stored as String(32) literals, not int enum or PG ENUM. Human-readable in ad-hoc operator queries; storage delta is irrelevant for an audit table; no Python-side enum-mapping required.
  4. CTE composed via text() with quoted identifiers, not SQLAlchemy core composition. The table names come from application-owned Table objects and are passed through dialect.identifier_preparer.quote(); bind parameters carry the dynamic values. # noqa: S608 justified inline. Trade-off: SQLAlchemy core's delete(...).returning(...).cte() + insert().from_select() works but is harder to read for a Postgres-only library.

Test plan

  • just lint — ruff format, ruff check, ty check all green
  • just test — 383 passed, 100.00% coverage gate satisfied
  • Unit tests cover the 4 failure-reason wiring paths + 7 _flush_terminal dispatch behaviors + make_dlq_table factory
  • Fake-broker end-to-end exercises all three failure modes + unconfigured no-op + metric fire/no-fire
  • Integration tests cover atomic CTE happy path, INSERT-failure rollback, lease-lost no-op, and validate_schema with/without DLQ
  • Default path (no dlq_table kwarg) is bit-for-bit identical — existing 240+ tests continue to pass unmodified

Notes for reviewers

  • The four FakeOutboxClient.delete_with_lease overrides in test files (test_unit.py × 2, test_fake.py × 2) were updated to (*args, **kwargs) or explicit dlq_payload kwarg to track the abstract signature change.
  • One pre-existing test was extended with _warnings.catch_warnings() around the REJECT_ON_ERROR + retry_strategy misconfig warning — matches the existing pattern at tests/test_fake.py:971.
  • Docs (README.md, CLAUDE.md) intentionally not touched in this PR — happy to add a follow-up if you'd like the architecture notes in tree.

🤖 Generated with Claude Code

Adds make_dlq_table() and OutboxBroker(..., dlq_table=...) so terminal
failures copy payload + headers + last-exception summary into a DLQ
table in the same Postgres statement as the outbox DELETE (CTE), with
a dlq_written metric. Off by default; bit-for-bit identical when unset.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lesnik512 lesnik512 self-assigned this May 30, 2026
lesnik512 and others added 2 commits May 30, 2026 11:00
…n, bundled adapters)

Address review findings on the opt-in DLQ feature:

- nacked_terminal metric now branches on terminal_failure_reason before
  last_exception, fixing manual msg.reject() being mis-tagged as "acked"
  and REJECT_ON_ERROR exceptions being mis-tagged as reason="retry_terminal".
- dlq_written event surfaced in both bundled adapters
  (faststream_outbox_dlq_written_total in Prometheus, messaging.outbox.dlq_written
  in OpenTelemetry) and added to the vocabulary docstring.
- last_exception repr() bounded at 8 KiB via _LAST_EXCEPTION_MAX_CHARS so
  pathological exceptions (MB-scale validation errors, asyncpg DataError) can't
  stall the writer round-trip.
- DLQ failure_reason column bumped from String(32) to String(64); added
  DLQFailureReason Literal type as the public contract for the reason set.
- Hoisted the inline _to_inner import in tests/test_fake.py to module top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New docs/usage/dlq.md page (~150 lines) covering quickstart, the three
  failure_reason values, schema reference, atomic CTE guarantees,
  last_exception truncation, validate_schema integration, the dlq_written
  metric (Prometheus + OTel), retention guidance, and a test-broker example.
- Wire it into mkdocs.yml nav and docs/index.md between Subscriber and
  Publisher (sequel to subscriber-side failure handling).
- README: one-line DLQ pointer in "How it works"; fix Acknowledgements
  paragraph that incorrectly claimed "no archive table".
- CLAUDE.md: cross-link to the new usage page from the existing
  Opt-in DLQ architecture section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lesnik512 lesnik512 merged commit 41d5da6 into main May 30, 2026
3 checks passed
@lesnik512 lesnik512 deleted the feat/dlq-on-terminal-failure branch May 30, 2026 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

E3: DLQ / archive on terminal failure (opt-in)

1 participant