feat: opt-in DLQ on terminal failure (issue #26)#40
Merged
Conversation
Adds make_dlq_table() and OutboxBroker(..., dlq_table=...) so terminal failures copy payload + headers + last-exception summary into a DLQ table in the same Postgres statement as the outbox DELETE (CTE), with a dlq_written metric. Off by default; bit-for-bit identical when unset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n, bundled adapters) Address review findings on the opt-in DLQ feature: - nacked_terminal metric now branches on terminal_failure_reason before last_exception, fixing manual msg.reject() being mis-tagged as "acked" and REJECT_ON_ERROR exceptions being mis-tagged as reason="retry_terminal". - dlq_written event surfaced in both bundled adapters (faststream_outbox_dlq_written_total in Prometheus, messaging.outbox.dlq_written in OpenTelemetry) and added to the vocabulary docstring. - last_exception repr() bounded at 8 KiB via _LAST_EXCEPTION_MAX_CHARS so pathological exceptions (MB-scale validation errors, asyncpg DataError) can't stall the writer round-trip. - DLQ failure_reason column bumped from String(32) to String(64); added DLQFailureReason Literal type as the public contract for the reason set. - Hoisted the inline _to_inner import in tests/test_fake.py to module top. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New docs/usage/dlq.md page (~150 lines) covering quickstart, the three failure_reason values, schema reference, atomic CTE guarantees, last_exception truncation, validate_schema integration, the dlq_written metric (Prometheus + OTel), retention guidance, and a test-broker example. - Wire it into mkdocs.yml nav and docs/index.md between Subscriber and Publisher (sequel to subscriber-side failure handling). - README: one-line DLQ pointer in "How it works"; fix Acknowledgements paragraph that incorrectly claimed "no archive table". - CLAUDE.md: cross-link to the new usage page from the existing Opt-in DLQ architecture section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #26.
Summary
OutboxBroker(..., dlq_table=make_dlq_table(metadata)); default behavior is bit-for-bit identical to today.OutboxClient.delete_with_lease(..., dlq_payload=...)issuesWITH deleted AS (DELETE … RETURNING …) INSERT INTO <dlq> SELECT … FROM deletedso the audit row commits with the DELETE or not at all (DLQ INSERT failure rolls the whole statement back — outbox row stays leased, gets reclaimed/retried). Preserves the writer-connection autocommit fast path (single statement); preserves the lease-token guard.OutboxInnerMessage.terminal_failure_reasonso_flush_terminalknows whether to build a DLQ payload:allow_deliveryFalse ⇒"max_deliveries"_nackexhausted (delay None) ⇒"retry_terminal"_reject⇒"rejected"(diverges from issue text — see "Design decisions" below)_ack⇒ staysNone, never touches DLQdlq_writtenmetric event emitted from_flush_terminalafter the CTE commits; tags{queue, subscriber, deliveries_count, failure_reason, exception_type}. Skipped on lease-lost.validate_schema()checks the DLQ table when configured (separate_validate_dlq_schema_syncpass alongside the existing outbox one).FakeOutboxClientaccumulatesdlq_rowsin-memory soTestOutboxBrokertests can assert audit content without Postgres.Design decisions (worth flagging for review)
_rejectwrites to DLQ. Issue E3: DLQ / archive on terminal failure (opt-in) #26 lists_rejectas opt-out. We took the opposite call after discussion:AckPolicy.REJECT_ON_ERRORis a real failure operators want audited, and manualawait msg.reject()still signals "poison, drop with audit."last_exceptionmay beNoneon manual reject — DLQ column is nullable.dlq_writtenmetric fires in this PR, not deferred. The issue's "wait for E4" note is stale; the metrics-recorder seam already exists, so the emission is a 5-line addition consistent with the other six subscriber events.failure_reasonstored asString(32)literals, not int enum or PG ENUM. Human-readable in ad-hoc operator queries; storage delta is irrelevant for an audit table; no Python-side enum-mapping required.text()with quoted identifiers, not SQLAlchemy core composition. The table names come from application-ownedTableobjects and are passed throughdialect.identifier_preparer.quote(); bind parameters carry the dynamic values.# noqa: S608justified inline. Trade-off: SQLAlchemy core'sdelete(...).returning(...).cte()+insert().from_select()works but is harder to read for a Postgres-only library.Test plan
just lint— ruff format, ruff check, ty check all greenjust test— 383 passed, 100.00% coverage gate satisfied_flush_terminaldispatch behaviors +make_dlq_tablefactoryvalidate_schemawith/without DLQdlq_tablekwarg) is bit-for-bit identical — existing 240+ tests continue to pass unmodifiedNotes for reviewers
FakeOutboxClient.delete_with_leaseoverrides in test files (test_unit.py × 2, test_fake.py × 2) were updated to(*args, **kwargs)or explicitdlq_payloadkwarg to track the abstract signature change._warnings.catch_warnings()around theREJECT_ON_ERROR + retry_strategymisconfig warning — matches the existing pattern attests/test_fake.py:971.README.md,CLAUDE.md) intentionally not touched in this PR — happy to add a follow-up if you'd like the architecture notes in tree.🤖 Generated with Claude Code