feat(conversation): multi-agent + distributed primitives + cross-gateway protocol#64
Phase 2 — distributed-systems primitives on top of the Phase 1 multi-agent
primitive.
turn-id: Deterministic turnId(runId, index, speaker), stable across
    retries; caching gateways and trace backends can dedupe.
journal: ConversationJournal interface + InMemoryConversationJournal
    + FileConversationJournal (JSONL on disk, fsync per write).
    Reusing a runId against the same journal resumes from the
    last committed turn, so a driver process crash loses zero
    acknowledged turns.
call-policy: Per-turn deadline + retry-with-backoff + per-participant
    circuit breaker. Retries replay the same logical turn
    (same turnId); the retry loop lives in the outer generator
    so deltas yield naturally (no cross-coroutine buffering).
headers: X-Tangle-Forwarded-Authorization / -Depth / RunId / TurnId
    / Parent-TurnId / Speaker, auto-stamped on every outbound
    participant call. AgentBackendContext extended with
    propagatedHeaders, runId, turnId, parentTurnId so backends
    can opt in to reading them; createOpenAICompatibleBackend
    already merges them into outbound HTTP.
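A minimal sketch of a deterministic turn id, assuming the `<runId>.t<index>.<speaker-slug>` shape quoted in the PR description. The `slugify` rule (lowercase, non-alphanumerics collapsed to `-`) is an assumption made for illustration and may differ from what agent-runtime actually does.

```typescript
// Hypothetical slug rule: lowercase, runs of non-alphanumerics become "-",
// leading and trailing dashes are trimmed.
function slugify(speaker: string): string {
  return speaker
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Pure function of its inputs: no clock and no randomness, so a retry of the
// same logical turn produces the same id.
function turnId(runId: string, index: number, speaker: string): string {
  return `${runId}.t${index}.${slugify(speaker)}`;
}

const firstAttempt = turnId("run-42", 3, "Code Reviewer");
const retryAttempt = turnId("run-42", 3, "Code Reviewer");
console.log(firstAttempt, firstAttempt === retryAttempt);
```

Because the id depends only on `(runId, index, speaker)`, a gateway or trace backend could use it as a dedupe key without coordinating with the driver.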
Phase 3 — the cross-gateway protocol spec.
docs/agent-bus-protocol.md: Normative spec covering header names, the
    depth-monotonicity invariant, the authorization-preservation invariant,
    the runId-immutability invariant, refusal granularity (HTTP 413), the
    idempotency advisory, and a worked example.
    agent-runtime emits; agent-gateway enforces.
Bug fixed in flight: createConversationBackend was minting a fresh runId on
recursion, violating the protocol's runId-immutability invariant. Nested
conversations now inherit the parent's runId and stamp the enclosing turn as
parentTurnId.
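The fix can be pictured with a small sketch. The `BackendContext` shape and the `resolveNestedIds` helper are hypothetical, trimmed-down stand-ins for illustration; only the behavior (inherit the caller's runId, record the enclosing turn as the parent) comes from the description above.

```typescript
// Hypothetical, trimmed-down context shape for illustration only.
interface BackendContext {
  runId?: string;
  turnId?: string;
}

interface NestedIds {
  runId: string;
  parentTurnId: string | undefined;
}

// Inherit the caller's runId when one exists; mint a new one only at the root.
// The enclosing turn, if any, becomes the parentTurnId of the nested run.
function resolveNestedIds(context: BackendContext, mint: () => string): NestedIds {
  return {
    runId: context.runId ?? mint(),
    parentTurnId: context.turnId,
  };
}

const nested = resolveNestedIds(
  { runId: "run-42", turnId: "run-42.t1.planner" },
  () => "fresh-run",
);
const root = resolveNestedIds({}, () => "fresh-run");
console.log(nested, root);
```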
Test surface: 31 new conversation tests (turn-id, journal incl. resume +
halted-replay + clash + halted-append-refusal, retries + retry exhaustion +
turn_retry events, deadlines + retryable classification, circuit breaker
open/cooldown/reset, header propagation + runId preservation + no-leak,
recursion-depth math, protocol parse/build).
All 183 tests pass; typecheck + biome clean; tsup build clean. Stack of
commits on feat/conversation; PR #64 grows to encompass Phase 1+2+3.
Adds defineConversation + runConversation + runConversationStream + createConversationBackend: N participants driven turn-by-turn through their own AgentExecutionBackends, with maxTurns / maxCreditsCents / haltOn policy and per-turn stream events.

The recursion is the killer feature: createConversationBackend wraps a Conversation as an AgentExecutionBackend, so a conversation IS an agent. A swarm can be a participant in a higher-order conversation, and the whole thing publishes behind a single agent-gateway endpoint with the same paid / scoped / traced semantics as any other agent.

Backends are unchanged from runAgentTaskStream, so the same driver works against in-process iterables, local cli-bridge, sandboxes, the router, or a remote agent-gateway: location-transparent distributed agent driving across machines and clouds.

13 new tests cover: validation (duplicate names, <2 participants, turnOrder='alternate' with a participant count other than 2, non-positive maxTurns), happy-path alternation + round-robin defaults, all five halt reasons (max_turns, max_credits, predicate, abort, participant_error), event-stream ordering, and recursive composition (conversation-as-backend nested in another conversation). 156/156 total tests pass; biome + typecheck clean.
76f4338 to df871f1
tangletools left a comment
APPROVE. Reviewed all four conversation modules + the protocol spec + 40 new tests + the rebase resolution. The architecture is the right call — multi-agent belongs in agent-runtime (runConversation is the natural sibling of runAgentTask), not a sidecar tool, because the recursion (createConversationBackend exposing a conversation as an AgentExecutionBackend) only composes if the primitive lives inside the SDK.
Load-bearing invariants I checked
- Journal append-before-yield ordering (`run-conversation.ts`). The turn is `appendTurn`'d before `turn_end` is yielded. A subscriber that flushes UI state on `turn_end` and a journal that crashes mid-call cannot diverge: the turn is durably committed before any external observer sees it. ✓
- runId immutability across nesting (`conversation-backend.ts`). Caught and fixed in flight: nested `runConversationStream` now inherits `context.runId` instead of minting a fresh one. Without this, every recursive call would break trace correlation in violation of the protocol spec. Test `phase3.test.ts` covers it explicitly. ✓
- Depth monotonicity (`headers.ts:buildForwardHeaders`). `inboundDepth + 1` always; no path resets. The runtime trusts the caller's `inboundDepth` (gateway enforces honesty on inbound), which is correct: every intermediate runtime adds +1 to whatever it received. ✓
- Authorization preservation (`run-conversation.ts:89`). `forwardedAuthorization` is read once at run start from the caller's headers and passed verbatim into every per-turn `buildForwardHeaders` call. The runtime never substitutes its own credentials. ✓
- Idempotent retries (`call-policy.ts` + `run-conversation.ts`). The `turnId` derives deterministically from `(runId, index, speakerSlug)` before the attempt loop opens; every attempt within the loop carries the same id. A caching gateway can safely dedupe by `(runId, turnId)`. ✓
- Per-participant breaker isolation (`run-conversation.ts:91-97`). A `Map` keyed on participant name; A's failures cannot open B's circuit. Constructed once per run, not module-global. ✓
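To make the depth and authorization invariants concrete, here is a minimal sketch of a header builder. It is not the code in `headers.ts`: the function name `buildForwardHeadersSketch` and the `ForwardInput` shape are assumptions, and only the header names that appear in the PR description are used.

```typescript
// Hypothetical input shape; field names are illustrative.
interface ForwardInput {
  inboundDepth: number;            // depth received from the caller (0 at the root)
  forwardedAuthorization?: string; // caller's credential, passed through untouched
  runId: string;
  turnId: string;
  parentTurnId?: string;           // set only for nested conversations
}

function buildForwardHeadersSketch(input: ForwardInput): Record<string, string> {
  const headers: Record<string, string> = {
    // Depth only ever grows: received value plus one, with no reset path.
    "X-Tangle-Forwarded-Depth": String(input.inboundDepth + 1),
    "X-Tangle-RunId": input.runId,
    "X-Tangle-TurnId": input.turnId,
  };
  // Authorization is copied verbatim, never replaced with a local credential.
  if (input.forwardedAuthorization !== undefined) {
    headers["X-Tangle-Forwarded-Authorization"] = input.forwardedAuthorization;
  }
  // Optional headers are omitted rather than sent empty.
  if (input.parentTurnId !== undefined) {
    headers["X-Tangle-Parent-TurnId"] = input.parentTurnId;
  }
  return headers;
}

const outbound = buildForwardHeadersSketch({
  inboundDepth: 2,
  forwardedAuthorization: "Bearer sk-tan-USER",
  runId: "run-42",
  turnId: "run-42.t0.planner",
});
console.log(outbound);
```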
Things I deliberately probed and accepted
- Credit cap is between-turns, not mid-stream. Documented in the module header. A turn that overshoots completes; the cap halts the next turn. Pragmatic: a mid-stream abort during the SSE drain would orphan token usage that the backend already burned but didn't get to report.
- Per-attempt deadline aborts via `AbortSignal.reason = DeadlineExceededError`. Backends that respect `signal` (which the in-tree `createOpenAICompatibleBackend` does) tear down the HTTP request cleanly. Backends that don't will hang until the deadline + a `for await` iteration; the runner's outer `try/catch` still recovers. Acceptable.
- Idempotency is advisory, not enforced. A gateway that doesn't dedupe charges N× for a retry; that's the caller's choice, called out in the protocol spec. Right call to not bake it into the contract.
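The between-turns credit cap can be illustrated with a small simulation. This is a sketch under assumptions: `simulateRun`, the `"done"` halt value, and the `>=` comparison are hypothetical and chosen only to show that an overshooting turn completes and the next turn is refused.

```typescript
type HaltReason = "max_turns" | "max_credits" | "done";

interface RunResult {
  turns: number;
  spentCents: number;
  halt: HaltReason;
}

// Hypothetical driver loop over pre-computed per-turn costs (in cents).
function simulateRun(turnCosts: number[], maxCreditsCents: number, maxTurns: number): RunResult {
  let spentCents = 0;
  let turns = 0;
  for (const cost of turnCosts) {
    // The cap is evaluated only here, before a turn starts.
    if (spentCents >= maxCreditsCents) return { turns, spentCents, halt: "max_credits" };
    if (turns >= maxTurns) return { turns, spentCents, halt: "max_turns" };
    spentCents += cost; // the turn runs to completion, even if it overshoots
    turns += 1;
  }
  return { turns, spentCents, halt: "done" };
}

// Second turn pushes spend from 40 to 110 against a cap of 100; it still
// completes, and the third turn never starts.
const capped = simulateRun([40, 70, 30], 100, 10);
console.log(capped);
```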
Deliberately deferred (named in the PR description, agreed)
- D1/R2/postgres journal adapter (in-tree adapters cover scratch + on-disk durability).
- agent-gateway middleware enforcing the depth limit on inbound (`agent-runtime` emits; agent-gateway will enforce in a separate PR).
- cli-bridge inbound-header propagation (forward `X-Tangle-Forwarded-*` from incoming router request into outbound backend calls).
Rollout watch-items
- The `'alternate'` turn-order default for two-party + auto round-robin for N is intuitive but worth documenting in the user-facing README before the SDK release. The header docstring covers it; a quickstart snippet would help.
- `FileConversationJournal` does an `fs.open(path, 'a')` per turn. Fine at 10s of turns/sec; at 1000s of turns/sec a downstream user would want a batching adapter. That is outside this PR's scope, but worth a follow-up note when we publish a D1 adapter.
- `agent-bus-protocol.md` is currently versioned `v0`. When agent-gateway lands and we have a second runtime implementing the contract, bumping to `v1` + adding an `x-tangle-protocol-version` header is the right move (already called out in the doc).
Verified locally
pnpm typecheck clean • pnpm test 354/354 (38 files) • biome check clean (one pre-existing warning in tests/mcp/in-process-executor.test.ts:116 from main, untouched by this PR) • pnpm build (tsup ESM + DTS) clean • rebase onto origin/main resolved 2 conflicts (package.json version → 0.26.0, index.ts re-exports merged alongside main's mcp + otel additions).
Approving.
…ehavior (#65)

Investigation surfaced two doc inaccuracies that the freshly-merged spec doc inherited from my own assumption rather than measurement:

1. Refusal status code is 429 + body code 'bridge_depth_exceeded' (live in tangle-router app/api/chat/route.ts:1390-1410), not the 413 the spec claimed. Updated header table + invariant #5 accordingly.
2. The spec read as fully shipped end-to-end. Added an Implementation status table making the per-layer reality explicit:
   - agent-runtime emits all six headers (this is the work that shipped in #64).
   - tangle-router enforces depth + forwards auth (already live).
   - cli-bridge forwards authorization to sandbox backends (already live); does not enforce depth locally, inherits via router.
   - agent-gateway middleware: NOT YET. Deferred to a real consumer.

No code changes. The agent-runtime header builders and emitters are already correct (they emit the header; refusal is the gateway's job). The doc was the only thing out of step.
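For orientation, a gateway-side depth check consistent with the corrected status code might look like the sketch below. It is not the tangle-router code: `readDepthSketch`, `checkDepth`, and the `Refusal` shape are hypothetical, the limit of 4 is taken from the `DEFAULT_MAX_DEPTH=4` mentioned in the #64 description, and refusing only when the depth is strictly greater than the limit is an assumption.

```typescript
const DEFAULT_MAX_DEPTH = 4;

interface Refusal {
  status: number;
  code: string;
  observedDepth: number;
}

// Absent or empty header means depth 0; a non-integer value fails loudly;
// for a multi-valued header the first value wins.
function readDepthSketch(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return 0;
  const first = (raw.split(",")[0] ?? "").trim();
  if (!/^\d+$/.test(first)) throw new Error(`invalid depth header: ${raw}`);
  return Number(first);
}

// Returns null when the request may proceed, or a refusal descriptor.
function checkDepth(raw: string | undefined, maxDepth: number = DEFAULT_MAX_DEPTH): Refusal | null {
  const depth = readDepthSketch(raw);
  // Assumption: refuse only when depth is strictly greater than the limit.
  if (depth > maxDepth) {
    return { status: 429, code: "bridge_depth_exceeded", observedDepth: depth };
  }
  return null;
}

console.log(checkDepth("4"), checkDepth("5"));
```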
Summary
Three stacked commits, each one phase, turning agent-runtime into a real distributed agent runtime.
- c7ac6f1: `defineConversation` / `runConversation` / `runConversationStream` / `createConversationBackend`. N participants driven turn-by-turn via any `AgentExecutionBackend`, with `maxTurns` / `maxCreditsCents` / `haltOn` policy and per-event stream markers. Recursion: a conversation is an agent.
- 76f4338: `turnId`, durable `ConversationJournal` (in-memory + JSONL on disk with fsync), per-turn deadline + retry + per-participant circuit breaker, automatic cross-gateway header propagation.
- 76f4338: `docs/agent-bus-protocol.md`, the normative cross-gateway header contract + 9 protocol-level tests.

Why this is real distributed-agent infrastructure
Every participant is an `AgentExecutionBackend`, so the same driver works against any reachable endpoint: in-process iterable, local cli-bridge, sandbox, router, remote agent-gateway. The same code drives same-machine, same-cluster, and cross-cloud orchestration; only the backend's `baseURL` changes.

Layered on top:

- Deterministic turn ids: `<runId>.t<index>.<speaker-slug>`; stable across retries so a caching gateway can dedupe by `(runId, turnId)`.
- Durable journal: each turn is committed before `turn_end` yields. Reusing a runId against the same journal resumes from the last committed turn; a driver crash mid-run loses zero acknowledged turns.
- Call policy: `perAttemptDeadlineMs` aborts a hung upstream; `maxRetries` + jittered backoff replay the same logical turn (same `turnId`); per-participant circuit breaker opens after N consecutive failures with cooldown.
- Header propagation: `X-Tangle-Forwarded-Authorization` (downstream gateways bill the right wallet), an incrementing `X-Tangle-Forwarded-Depth` (refused by agent-gateway at `DEFAULT_MAX_DEPTH=4`), a stable `X-Tangle-RunId`, an `X-Tangle-TurnId`, and, under recursion, `X-Tangle-Parent-TurnId`. `createOpenAICompatibleBackend` already merges `context.propagatedHeaders` into its outbound HTTP.

Surface added (additive only)
Plus 13 exported types.
Recursion is composable across all of it
`createConversationBackend` wraps a `Conversation` as an `AgentExecutionBackend`. A conversation IS an agent. A swarm can be a participant in another swarm; published behind a single agent-gateway endpoint, the recursion is invisible to the caller. The protocol's runId-immutability invariant guarantees all nested hops correlate to one trace. (Caught and fixed a self-inflicted protocol violation during testing: recursion was minting fresh runIds; nested runs now inherit.)

Distributed-systems concerns made explicit
`docs/agent-bus-protocol.md` is normative for any gateway implementer:

- Authorization preservation (`sk-tan-USER` verbatim, never substitute)
- Depth refusal (`413 Payload Too Large` with the observed depth)
- Idempotency advisory (dedupe by `(runId, turnId)`)

Test coverage
40 new tests, 183/183 total.
- Phase 1: validation, happy-path alternation + round-robin defaults, halt reasons (`max_turns`, `max_credits`, `predicate`, `abort`, `participant_error`), event-stream ordering, recursion.
- Phase 2: turn-id, journal, retries + retry exhaustion + `turn_retry` events, deadline timeout + retryable classification, circuit-breaker open/cooldown/reset, header propagation + runId preservation + no-leak.
- Phase 3: `readDepth` parsing (empty → 0, integer, fail-loud non-integer, multi-valued first-wins), `isDepthExceeded` boundary, `buildForwardHeaders` increment + identity + run/turn + parent + speaker + omits absent optionals, runId immutability across recursion, depth math through nesting.

Version
`0.17.2` → `0.18.0` (additive minor; no breaking changes).

Test plan
- `pnpm typecheck`: clean
- `pnpm test`: 183/183 (14 files)
- `pnpm build` (tsup ESM + DTS): clean

What stays out (deliberately)
tangle-network/agent-gateway. agent-runtime emits; agent-gateway enforces.