A sub-orchestration reports its parent's execution id back on completion (SubOrchCompleted / SubOrchFailed) via Runtime::get_execution_id_for_instance. That resolver consults a per-Runtime in-memory map (current_execution_ids) and, on a miss, falls back to INITIAL_EXECUTION_ID (1). The map is populated only when the parent runs a turn on that same node.
This is correct on a single node. In a multi-node deployment (multiple Runtimes sharing one provider, e.g. several AKS pods) the child may complete on a node where the parent never ran. The map misses, the fallback resolves the parent execution to 1, and for a parent past its first execution (i.e. after continue_as_new) the completion is filtered out as belonging to a stale execution. The parent then hangs awaiting a completion that never arrives.
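The failure mode can be reproduced in miniature. The sketch below is a simplified model, not the runtime's actual code: `Node`, `record_turn`, and `is_delivered` are hypothetical stand-ins, and only the names `current_execution_ids`, `get_execution_id_for_instance`, and `INITIAL_EXECUTION_ID` come from the description above.

```rust
use std::collections::HashMap;

const INITIAL_EXECUTION_ID: u64 = 1;

/// Hypothetical stand-in for the in-memory state of one Runtime (one node).
struct Node {
    current_execution_ids: HashMap<String, u64>,
}

impl Node {
    fn new() -> Self {
        Node { current_execution_ids: HashMap::new() }
    }

    /// Models the parent running a turn on this node, which seeds the map.
    fn record_turn(&mut self, instance: &str, execution_id: u64) {
        self.current_execution_ids
            .insert(instance.to_string(), execution_id);
    }

    /// Map lookup with the fallback described above: a miss yields 1.
    fn get_execution_id_for_instance(&self, instance: &str) -> u64 {
        self.current_execution_ids
            .get(instance)
            .copied()
            .unwrap_or(INITIAL_EXECUTION_ID)
    }
}

/// Hypothetical parent-side filter: a completion is delivered only if it
/// targets the parent's current execution; anything else is treated as stale.
fn is_delivered(completion_execution_id: u64, parent_current_execution_id: u64) -> bool {
    completion_execution_id == parent_current_execution_id
}
```

If the parent is on execution 3 and ran its turns on node A, a child completing on node A resolves 3 and the completion is delivered. The same child completing on node B resolves 1 and the completion is dropped.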
The single-node case is covered by tests/scenarios/suborch_id_collision.rs::parent_with_suborch_survives_continue_as_new, which seeds the map at turn start. The cross-node case is not exercised: the existing multi-node tests (sessions.rs, rolling_deployment.rs, timer_tests.rs) run multiple in-process Runtimes but none drive a sub-orchestration whose parent has continued-as-new across nodes, and CI runs a single ubuntu-latest job, so there is no distributed coverage.
Likely fix: on a map miss, resolve the parent execution from the provider. Provider::latest_execution_id(instance) already exists for this; get_execution_id_for_instance previously queried it and the lookup was removed in favour of the in-memory map. A test that schedules a sub-orchestration in a continue-as-new loop while pinning parent and child to different nodes would close the coverage gap.
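A minimal sketch of the proposed fallback, under stated assumptions: the `Provider` trait here is a simplified, synchronous stand-in with an assumed `Option<u64>` return type (the real signature may be async and may return a different type), and `InMemoryProvider` and the generic `Runtime<P>` are hypothetical. It illustrates the lookup order only, not the actual implementation.

```rust
use std::collections::HashMap;

const INITIAL_EXECUTION_ID: u64 = 1;

/// Simplified stand-in for the provider; the signature is an assumption.
trait Provider {
    fn latest_execution_id(&self, instance: &str) -> Option<u64>;
}

/// Hypothetical provider backed by a map, playing the role of the shared store.
struct InMemoryProvider {
    latest: HashMap<String, u64>,
}

impl Provider for InMemoryProvider {
    fn latest_execution_id(&self, instance: &str) -> Option<u64> {
        self.latest.get(instance).copied()
    }
}

/// Hypothetical model of a Runtime holding the per-node map and a provider.
struct Runtime<P: Provider> {
    current_execution_ids: HashMap<String, u64>,
    provider: P,
}

impl<P: Provider> Runtime<P> {
    fn get_execution_id_for_instance(&self, instance: &str) -> u64 {
        // Fast path: the parent ran a turn on this node.
        if let Some(id) = self.current_execution_ids.get(instance) {
            return *id;
        }
        // Map miss: ask the shared provider before assuming execution 1.
        self.provider
            .latest_execution_id(instance)
            .unwrap_or(INITIAL_EXECUTION_ID)
    }
}
```

With this ordering the in-memory map remains the fast path and the provider is only consulted on a miss, so the single-node case pays no extra cost. Whether the latest execution is always the one that scheduled the child is a separate question this sketch does not address.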