Problem
When queue messages (e.g., QueueMessage from enqueue_event) arrive in the orchestrator queue before StartOrchestration for a new instance, the runtime's orchestration dispatcher encounters a batch with no instance, no history, and no StartOrchestration/ContinueAsNew message. The current behavior:
- fetch_orchestration_item returns the batch with orchestration_name="Unknown"
- The runtime logs "completion messages for unstarted instance" and "empty effective batch"
- The runtime acks the batch, which permanently deletes the queue rows
- The events are lost forever
This was discovered via the sample_config_hot_reload_persistent_events_fs e2e test, which enqueues events before starting an orchestration.
Current Provider-Side Workaround
Both duroxide-pg and duroxide-pg-opt have implemented a provider-side fix in their fetch_orchestration_item stored procedure:
- Scan ALL messages for StartOrchestration/ContinueAsNew (not just messages[0]), matching the SQLite provider's work_items.iter().find() behavior
- If no start item is found: release locks and return nothing, leaving messages in the queue until StartOrchestration arrives
This works but pushes responsibility to the provider, which:
- Is fragile (providers must each implement this correctly)
- Cannot add a visible_at delay to prevent tight re-fetching (any delay risks events being lost if the orchestration completes before the delay expires)
- Relies on LISTEN/NOTIFY for backpressure to prevent tight-looping
Proposed Runtime-Level Fix
The runtime's orchestration dispatcher should handle this case explicitly:
- When fetch_orchestration_item returns a batch with no instance and no StartOrchestration/ContinueAsNew in the messages, the runtime should abandon the batch (not ack it)
- The abandon should use a reasonable delay (e.g., 500ms) so items become available again later
- This keeps the contract simple: providers return whatever is in the queue, and the runtime decides what to do
This would also allow removing the provider-side workarounds.
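The decision described above can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the actual dispatcher code: WorkItem, BatchDecision, decide, and ORPHAN_ABANDON_DELAY are hypothetical names, and the 500ms value is only the example delay suggested above.

```rust
use std::time::Duration;

/// Hypothetical, simplified stand-in for the runtime's work item type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum WorkItem {
    StartOrchestration { instance: String },
    ContinueAsNew { instance: String },
    QueueMessage { instance: String, payload: String },
}

/// Hypothetical outcome of inspecting a fetched batch.
#[derive(Debug, PartialEq)]
enum BatchDecision {
    /// Process the batch as usual (and ack it afterwards).
    Process,
    /// Do not ack; release the batch so it becomes visible again after `delay`.
    Abandon { delay: Duration },
}

/// Assumed delay, taken from the "e.g., 500ms" suggestion.
const ORPHAN_ABANDON_DELAY: Duration = Duration::from_millis(500);

/// Decide what to do with a fetched batch.
/// `instance_exists` is true when the instance already has state or history.
fn decide(instance_exists: bool, messages: &[WorkItem]) -> BatchDecision {
    // Scan every message, not just the first one.
    let has_start = messages.iter().any(|m| {
        matches!(
            m,
            WorkItem::StartOrchestration { .. } | WorkItem::ContinueAsNew { .. }
        )
    });

    if instance_exists || has_start {
        BatchDecision::Process
    } else {
        // Orphan batch: the messages arrived before the instance was started.
        BatchDecision::Abandon { delay: ORPHAN_ABANDON_DELAY }
    }
}
```

In the real dispatcher, the Abandon branch would presumably map to a call to abandon_orchestration_item with the chosen delay; the exact signature is not shown here.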
Provider Validation Test
This issue was only caught by an e2e sample test (sample_config_hot_reload_persistent_events_fs), not by any provider validation test. A dedicated test should be added to duroxide::provider_validation that:
- Enqueues one or more QueueMessage events for an instance before calling start_orchestration
- Then starts the orchestration
- Verifies that all pre-enqueued events are delivered to the orchestration (not silently dropped)
This would ensure all provider implementations are validated against this scenario without requiring full e2e tests.
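The shape of such a test can be illustrated against a toy in-memory queue. This sketch does not use the real duroxide Provider trait or the duroxide::provider_validation harness: Item, ToyQueue, and pre_start_events_survive are hypothetical names, and the queue models a single instance only.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified queue row for a single instance.
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Start,
    Event(String),
}

/// Toy in-memory stand-in for a provider's orchestrator queue.
#[derive(Default)]
struct ToyQueue {
    rows: VecDeque<Item>,
}

impl ToyQueue {
    fn enqueue(&mut self, item: Item) {
        self.rows.push_back(item);
    }

    /// Returns the whole batch only once a start item is present;
    /// otherwise leaves every row in the queue and returns None.
    fn fetch(&mut self) -> Option<Vec<Item>> {
        if self.rows.iter().any(|i| *i == Item::Start) {
            Some(self.rows.drain(..).collect())
        } else {
            None
        }
    }
}

/// The validation scenario: enqueue events, then start, then check delivery.
fn pre_start_events_survive(events: &[&str]) -> bool {
    let mut queue = ToyQueue::default();

    // 1. Enqueue events before the orchestration is started.
    for e in events {
        queue.enqueue(Item::Event(e.to_string()));
    }

    // 2. A fetch at this point must not consume (and lose) the events.
    if queue.fetch().is_some() {
        return false;
    }

    // 3. Start the orchestration.
    queue.enqueue(Item::Start);

    // 4. Every pre-enqueued event must be in the delivered batch.
    let batch = match queue.fetch() {
        Some(b) => b,
        None => return false,
    };
    events
        .iter()
        .all(|e| batch.contains(&Item::Event(e.to_string())))
}
```

A real validation test would drive an actual provider and assert on what the orchestration observes, but the four steps would be the same.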
Affected Code
- Runtime: dispatchers/orchestration.rs - the "completion messages for unstarted instance" code path
- Provider trait: abandon_orchestration_item is already available for this purpose
- Test suite: duroxide::provider_validation - add orphan message handling test
References
- duroxide-pg-opt migration 0006_fix_orphan_queue_messages.sql
- duroxide-pg migration 0016_fix_orphan_queue_messages.sql
- Test: sample_config_hot_reload_persistent_events_fs