
feat(conversation): multi-agent + distributed primitives + cross-gateway protocol #64

Merged: drewstone merged 2 commits into main from feat/conversation on May 26, 2026

@drewstone drewstone commented May 26, 2026

Summary

Three stacked commits, each one phase, turning agent-runtime into a real distributed agent runtime.

| Phase | Commit | What |
| --- | --- | --- |
| 1 | c7ac6f1 | defineConversation / runConversation / runConversationStream / createConversationBackend: N participants driven turn-by-turn via any AgentExecutionBackend, with maxTurns / maxCreditsCents / haltOn policy and per-event stream markers. Recursion: a conversation is an agent. |
| 2 | 76f4338 | Deterministic turnId, durable ConversationJournal (in-memory + JSONL on disk with fsync), per-turn deadline + retry + per-participant circuit breaker, automatic cross-gateway header propagation. |
| 3 | 76f4338 | docs/agent-bus-protocol.md: normative cross-gateway header contract + 9 protocol-level tests. |

Why this is real distributed-agent infrastructure

Every participant is an AgentExecutionBackend, so the same driver works against any reachable endpoint — in-process iterable, local cli-bridge, sandbox, router, remote agent-gateway. Same code drives same-machine, same-cluster, and cross-cloud orchestration; only the backend's baseURL changes.

Layered on top:

  • Idempotent turn ids: <runId>.t<index>.<speaker-slug>; stable across retries so a caching gateway can dedupe by (runId, turnId).
  • Durable journal — every committed turn fsynced before turn_end yields. Reusing a runId against the same journal resumes from the last committed turn; a driver crash mid-run loses zero acknowledged turns.
  • Per-turn call policy: perAttemptDeadlineMs aborts a hung upstream; maxRetries + jittered backoff replay the same logical turn (same turnId); per-participant circuit breaker opens after N consecutive failures with cooldown.
  • Cross-gateway header propagation — every outbound participant call carries the original user's X-Tangle-Forwarded-Authorization (downstream gateways bill the right wallet), an incrementing X-Tangle-Forwarded-Depth (refused by agent-gateway at DEFAULT_MAX_DEPTH=4), a stable X-Tangle-RunId, a X-Tangle-TurnId, and — under recursion — X-Tangle-Parent-TurnId. createOpenAICompatibleBackend already merges context.propagatedHeaders into its outbound HTTP.
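
To make the turn-id format above concrete, here is a minimal sketch of how `<runId>.t<index>.<speaker-slug>` could be derived. The function names mirror the exported `turnId` and `slugifySpeaker`, but the bodies are assumptions: the slug rules (lowercase, dashes for non-alphanumeric runs, a fallback for empty slugs) are illustrative and not taken from the PR.

```typescript
// Assumed slug rules for illustration; the real slugifySpeaker may differ.
function slugifySpeaker(speaker: string): string {
  const slug = speaker
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse each non-alphanumeric run into one dash
    .replace(/^-+|-+$/g, ""); // trim leading and trailing dashes
  return slug.length > 0 ? slug : "speaker"; // hypothetical fallback for empty slugs
}

// Pure function of (runId, index, speaker): no clock, no randomness.
function turnId(runId: string, index: number, speaker: string): string {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError(`turn index must be a non-negative integer, got ${index}`);
  }
  return `${runId}.t${index}.${slugifySpeaker(speaker)}`;
}

// A retry of the same logical turn recomputes the same id.
const firstAttempt = turnId("run-42", 3, "Code Reviewer");
const retryAttempt = turnId("run-42", 3, "Code Reviewer");
// Both are "run-42.t3.code-reviewer".
```

Because the id depends only on its inputs, a gateway that caches by (runId, turnId) sees every retry as the same request.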

Surface added (additive only)

// Phase 1
defineConversation, runConversation, runConversationStream, createConversationBackend

// Phase 2
turnId, slugifySpeaker
ConversationJournal, InMemoryConversationJournal, FileConversationJournal
BackendCallPolicy, CircuitBreakerState, CircuitBreakerConfig
CircuitOpenError, DeadlineExceededError
defaultIsRetryable, computeBackoff, makePerAttemptSignal, sleep

// Phase 3
FORWARD_HEADERS, ForwardHeaderName, DEFAULT_MAX_DEPTH
readDepth, isDepthExceeded, buildForwardHeaders, PropagatedHeaders

Plus 13 exported types.
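
The circuit-breaker exports listed above are easier to picture with a small model. The sketch below is an assumption-laden illustration, not the shipped code: `BreakerConfig`, `ParticipantBreakers`, and the injectable clock are hypothetical names, and the "open after N consecutive failures, close after a cooldown" rule is taken from the PR summary.

```typescript
// Hypothetical names; the real CircuitBreakerConfig / CircuitBreakerState may differ.
interface BreakerConfig {
  failureThreshold: number; // consecutive failures before the circuit opens
  cooldownMs: number; // how long an open circuit refuses calls
}

interface BreakerState {
  consecutiveFailures: number;
  openedAt: number | null;
}

class ParticipantBreakers {
  private readonly states = new Map<string, BreakerState>();
  private readonly config: BreakerConfig;
  private readonly now: () => number;

  constructor(config: BreakerConfig, now: () => number = () => Date.now()) {
    this.config = config;
    this.now = now;
  }

  // One state object per participant name, created lazily.
  private stateFor(name: string): BreakerState {
    let state = this.states.get(name);
    if (state === undefined) {
      state = { consecutiveFailures: 0, openedAt: null };
      this.states.set(name, state);
    }
    return state;
  }

  canCall(name: string): boolean {
    const state = this.stateFor(name);
    if (state.openedAt === null) return true;
    if (this.now() - state.openedAt >= this.config.cooldownMs) {
      state.openedAt = null; // cooldown elapsed: close the circuit again
      state.consecutiveFailures = 0;
      return true;
    }
    return false;
  }

  recordSuccess(name: string): void {
    this.stateFor(name).consecutiveFailures = 0;
  }

  recordFailure(name: string): void {
    const state = this.stateFor(name);
    state.consecutiveFailures += 1;
    if (state.consecutiveFailures >= this.config.failureThreshold) {
      state.openedAt = this.now();
    }
  }
}

// Usage with a fake clock so the cooldown is deterministic.
let clock = 0;
const breakers = new ParticipantBreakers({ failureThreshold: 2, cooldownMs: 1000 }, () => clock);
breakers.recordFailure("alice");
breakers.recordFailure("alice");
const aliceBlocked = !breakers.canCall("alice"); // true: alice's circuit is open
const bobAllowed = breakers.canCall("bob"); // true: bob is unaffected
clock = 1000;
const aliceAfterCooldown = breakers.canCall("alice"); // true: cooldown elapsed
```

Keeping the map inside an instance, rather than at module scope, is what lets each run start with fresh breakers.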

Recursion is composable across all of it

createConversationBackend wraps a Conversation as an AgentExecutionBackend. A conversation IS an agent. A swarm can be a participant in another swarm; published behind a single agent-gateway endpoint, the recursion is invisible to the caller. The protocol's runId-immutability invariant guarantees that all nested hops correlate to one trace. (Caught and fixed a self-inflicted protocol violation during testing: recursion was minting fresh runIds; nested runs now inherit the parent's runId.)
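
The "a conversation is an agent" idea can be shown with a deliberately simplified, synchronous model. Everything in this sketch is hypothetical (`Backend`, `RunContext`, `runConversationSketch`, `conversationAsBackend`); the real backends stream asynchronously. It only illustrates the composition and the runId inheritance described above.

```typescript
// Hypothetical stand-ins for illustration; not the agent-runtime API.
interface RunContext {
  runId: string;
  parentTurnId: string | null;
}

interface Backend {
  reply(prompt: string, context: RunContext): string;
}

interface Participant {
  name: string;
  backend: Backend;
}

interface TurnRecord {
  turnId: string;
  parentTurnId: string | null;
  text: string;
}

function runConversationSketch(
  participants: Participant[],
  opening: string,
  maxTurns: number,
  context: RunContext,
  sink: TurnRecord[],
): string {
  let lastText = opening;
  for (let index = 0; index < maxTurns; index += 1) {
    const speaker = participants[index % participants.length];
    if (speaker === undefined) throw new Error("a conversation needs participants");
    const turnId = `${context.runId}.t${index}.${speaker.name}`;
    // The participant receives the same runId; this turn becomes its parent turn.
    lastText = speaker.backend.reply(lastText, { runId: context.runId, parentTurnId: turnId });
    sink.push({ turnId, parentTurnId: context.parentTurnId, text: lastText });
  }
  return lastText;
}

// Wrapping a conversation as a Backend: the nested run inherits context.runId.
function conversationAsBackend(participants: Participant[], maxTurns: number, sink: TurnRecord[]): Backend {
  return {
    reply(prompt: string, context: RunContext): string {
      return runConversationSketch(participants, prompt, maxTurns, context, sink);
    },
  };
}

const echo = (tag: string): Backend => ({ reply: (prompt: string) => `${tag}(${prompt})` });

const trace: TurnRecord[] = [];
const swarm = conversationAsBackend(
  [{ name: "researcher", backend: echo("researcher") }, { name: "critic", backend: echo("critic") }],
  2,
  trace,
);
const finalText = runConversationSketch(
  [{ name: "planner", backend: echo("planner") }, { name: "swarm", backend: swarm }],
  "hello",
  2,
  { runId: "run-1", parentTurnId: null },
  trace,
);
// finalText is "critic(researcher(planner(hello)))"
```

All four recorded turns share the `run-1` prefix, and the two nested turns carry `run-1.t1.swarm` as their parent, which is the correlation property the runId-immutability rule is meant to protect.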

Distributed-systems concerns made explicit

docs/agent-bus-protocol.md is normative for any gateway implementer:

  • Depth monotonicity (gateways MUST NOT reset the counter)
  • Authorization preservation (forward sk-tan-USER verbatim, never substitute)
  • runId immutability through nested conversations
  • 413 refusal granularity (gateways MUST refuse with 413 Payload Too Large with the observed depth)
  • Idempotency advisory: gateways MAY dedupe by (runId, turnId)
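
The invariants above can be sketched as three small functions. The header names and the limit of 4 come from this PR; the signatures, the `ForwardInput` shape, and the choice to treat `depth >= max` as exceeded are assumptions made for illustration. The parsing rules (empty means 0, non-integer fails loudly, first value wins) follow the Phase 3 test list.

```typescript
const MAX_DEPTH = 4; // mirrors DEFAULT_MAX_DEPTH=4 quoted in the PR

type HeaderValue = string | string[] | undefined;

// Parse the inbound depth header.
function readDepth(raw: HeaderValue): number {
  const candidate = Array.isArray(raw) ? raw[0] : raw; // multi-valued: first wins
  const first = (candidate ?? "").split(",")[0]?.trim() ?? "";
  if (first === "") return 0; // absent or empty means depth 0
  if (!/^\d+$/.test(first)) {
    throw new Error(`invalid forwarded depth: ${JSON.stringify(first)}`); // fail loudly
  }
  return Number(first);
}

// Assumed boundary: a depth at or above the limit counts as exceeded.
function isDepthExceeded(depth: number, max: number = MAX_DEPTH): boolean {
  return depth >= max;
}

// Hypothetical input shape for this sketch.
interface ForwardInput {
  authorization: string;
  inboundDepth: number;
  runId: string;
  turnId: string;
  parentTurnId?: string;
}

function buildForwardHeaders(input: ForwardInput): Record<string, string> {
  const headers: Record<string, string> = {
    "X-Tangle-Forwarded-Authorization": input.authorization, // forwarded verbatim
    "X-Tangle-Forwarded-Depth": String(input.inboundDepth + 1), // always +1, never reset
    "X-Tangle-RunId": input.runId, // unchanged through nesting
    "X-Tangle-TurnId": input.turnId,
  };
  if (input.parentTurnId !== undefined) {
    headers["X-Tangle-Parent-TurnId"] = input.parentTurnId; // only under recursion
  }
  return headers;
}

const outbound = buildForwardHeaders({
  authorization: "sk-tan-USER",
  inboundDepth: readDepth(["2", "9"]),
  runId: "run-42",
  turnId: "run-42.t0.planner",
});
// outbound["X-Tangle-Forwarded-Depth"] is "3"
```

The emitter never inspects or rewrites the authorization value; enforcement of the depth limit is left to the receiving gateway.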

Test coverage

40 new tests, 183/183 total.

  • Phase 1 (13): validation, alternation + round-robin defaults, all five halt reasons (max_turns, max_credits, predicate, abort, participant_error), event-stream ordering, recursion.
  • Phase 2 (18): turnId determinism + slugify edge cases, journal persist+resume + halted-replay + clash + halted-append-refusal, retry success + exhaustion + turn_retry events, deadline timeout + retryable classification, circuit-breaker open/cooldown/reset, header propagation + runId preservation + no-leak.
  • Phase 3 (9): readDepth parsing (empty → 0, integer, fail-loud non-integer, multi-valued first-wins), isDepthExceeded boundary, buildForwardHeaders increment + identity + run/turn + parent + speaker + omits absent optionals, runId immutability across recursion, depth math through nesting.

Version

0.17.2 → 0.18.0 (additive minor; no breaking changes).

Test plan

  • pnpm typecheck — clean
  • pnpm test — 183/183 (14 files)
  • biome check — clean
  • pnpm build (tsup ESM + DTS) — clean

What stays out (deliberately)

  • A real D1 / R2 / postgres journal adapter — interface is small (~30 lines); in-memory + file adapters cover tests + scratch.
  • agent-gateway middleware that enforces the depth limit — separate PR against tangle-network/agent-gateway. agent-runtime emits; agent-gateway enforces.
  • A cli-bridge adapter that honors inbound forwarded headers — separate PR; agent-runtime is ready to emit them now.

drewstone added a commit that referenced this pull request May 26, 2026
…ocol

Phase 2 — distributed-systems primitives on top of the Phase 1 multi-agent
primitive.

  turn-id          Deterministic turnId(runId, index, speaker) — stable across
                   retries; caching gateways and trace backends can dedupe.
  journal          ConversationJournal interface + InMemoryConversationJournal
                   + FileConversationJournal (JSONL on disk, fsync per write).
                   Reusing a runId against the same journal resumes from the
                   last committed turn — a driver process crash loses zero
                   acknowledged turns.
  call-policy      Per-turn deadline + retry-with-backoff + per-participant
                   circuit breaker. Retries replay the same logical turn
                   (same turnId); the retry loop lives in the outer generator
                   so deltas yield naturally (no cross-coroutine buffering).
  headers          X-Tangle-Forwarded-Authorization / -Depth / RunId / TurnId
                   / Parent-TurnId / Speaker — auto-stamped on every outbound
                   participant call. AgentBackendContext extended with
                   propagatedHeaders, runId, turnId, parentTurnId so backends
                   can opt in to reading them; createOpenAICompatibleBackend
                   already merges them into outbound HTTP.

Phase 3 — the cross-gateway protocol spec.

  docs/agent-bus-protocol.md   Normative spec: header names, depth-monotonicity
                               invariant, authorization-preservation invariant,
                               runId-immutability invariant, refusal granularity
                               (HTTP 413), idempotency advisory, worked example.
                               agent-runtime emits; agent-gateway enforces.

Bug fixed in flight: createConversationBackend was minting a fresh runId on
recursion, violating the protocol's runId-immutability invariant. Nested
conversations now inherit the parent's runId and stamp the enclosing turn as
parentTurnId.

Test surface: 31 new conversation tests (turn-id, journal incl. resume +
halted-replay + clash + halted-append-refusal, retries + retry exhaustion +
turn_retry events, deadlines + retryable classification, circuit breaker
open/cooldown/reset, header propagation + runId preservation + no-leak,
recursion-depth math, protocol parse/build).

All 183 tests pass; typecheck + biome clean; tsup build clean. Stack of
commits on feat/conversation; PR #64 grows to encompass Phase 1+2+3.
@drewstone drewstone changed the title from "feat(conversation): multi-agent conversation primitive" to "feat(conversation): multi-agent + distributed primitives + cross-gateway protocol" May 26, 2026
drewstone added 2 commits May 27, 2026 02:25
Adds defineConversation + runConversation + runConversationStream +
createConversationBackend — N participants driven turn-by-turn through their
own AgentExecutionBackends, with maxTurns / maxCreditsCents / haltOn policy
and per-turn stream events.

The recursion is the killer feature: createConversationBackend wraps a
Conversation as an AgentExecutionBackend, so a conversation IS an agent. A
swarm can be a participant in a higher-order conversation, and the whole
thing publishes behind a single agent-gateway endpoint with the same paid /
scoped / traced semantics as any other agent.

Backends are unchanged from runAgentTaskStream, so the same driver works
against in-process iterables, local cli-bridge, sandboxes, the router, or a
remote agent-gateway — location-transparent distributed agent driving across
machines and clouds.

13 new tests cover: validation (duplicate names, <2 participants,
turnOrder='alternate' with !=2, non-positive maxTurns), happy-path
alternation + round-robin defaults, all five halt reasons (max_turns,
max_credits, predicate, abort, participant_error), event-stream ordering,
and recursive composition (conversation-as-backend nested in another
conversation). 156/156 total tests pass; biome + typecheck clean.
@drewstone drewstone force-pushed the feat/conversation branch from 76f4338 to df871f1 Compare May 26, 2026 23:27

@tangletools tangletools left a comment


APPROVE. Reviewed all four conversation modules + the protocol spec + 40 new tests + the rebase resolution. The architecture is the right call — multi-agent belongs in agent-runtime (runConversation is the natural sibling of runAgentTask), not a sidecar tool, because the recursion (createConversationBackend exposing a conversation as an AgentExecutionBackend) only composes if the primitive lives inside the SDK.

Load-bearing invariants I checked

  1. Journal append-before-yield ordering (run-conversation.ts). Turn is appendTurn'd before turn_end is yielded. A subscriber that flushes UI state on turn_end and a journal that crashes mid-call cannot diverge: the turn is durably committed before any external observer sees it. ✓
  2. runId immutability across nesting (conversation-backend.ts). Caught and fixed in flight — nested runConversationStream now inherits context.runId instead of minting fresh. Without this, every recursive call would break trace correlation in violation of the protocol spec. Test phase3.test.ts covers it explicitly. ✓
  3. Depth monotonicity (headers.ts:buildForwardHeaders). inboundDepth + 1 always; no path resets. The runtime trusts the caller's inboundDepth (gateway enforces honesty on inbound), which is correct: every intermediate runtime adds +1 from whatever it received. ✓
  4. Authorization preservation (run-conversation.ts:89). forwardedAuthorization is read once at run start from the caller's headers and passed verbatim into every per-turn buildForwardHeaders call. The runtime never substitutes its own credentials. ✓
  5. Idempotent retries (call-policy.ts + run-conversation.ts). The turnId derives deterministically from (runId, index, speakerSlug) before the attempt loop opens; every attempt within the loop carries the same id. A caching gateway can safely dedupe by (runId, turnId). ✓
  6. Per-participant breaker isolation (run-conversation.ts:91-97). A Map keyed on participant name; A's failures cannot open B's circuit. Constructed once per run, not module-global. ✓
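
Invariant 1 and the resume behaviour can be illustrated with a synchronous generator. This is a sketch under stated assumptions: `MemoryJournal`, `runTurns`, and the event shapes are hypothetical, and the backend call is replaced by a placeholder string. The point is the ordering, where `appendTurn` runs before `turn_end` is yielded.

```typescript
interface CommittedTurn {
  turnId: string;
  speaker: string;
  text: string;
}

// Hypothetical in-memory journal keyed by runId.
class MemoryJournal {
  private readonly runs = new Map<string, CommittedTurn[]>();

  load(runId: string): CommittedTurn[] {
    return [...(this.runs.get(runId) ?? [])];
  }

  appendTurn(runId: string, turn: CommittedTurn): void {
    const turns = this.runs.get(runId) ?? [];
    turns.push(turn);
    this.runs.set(runId, turns);
  }
}

type ConversationEvent =
  | { type: "turn_start"; turnId: string }
  | { type: "turn_end"; turn: CommittedTurn };

function* runTurns(
  journal: MemoryJournal,
  runId: string,
  speakers: string[],
  maxTurns: number,
): Generator<ConversationEvent> {
  const startIndex = journal.load(runId).length; // resume after the last committed turn
  for (let index = startIndex; index < maxTurns; index += 1) {
    const speaker = speakers[index % speakers.length];
    if (speaker === undefined) throw new Error("at least one speaker is required");
    const turnId = `${runId}.t${index}.${speaker}`;
    yield { type: "turn_start", turnId };
    const turn: CommittedTurn = { turnId, speaker, text: `reply ${index}` }; // stand-in for a backend call
    journal.appendTurn(runId, turn); // commit first...
    yield { type: "turn_end", turn }; // ...then let observers see the turn
  }
}

// Simulate a driver that stops after two acknowledged turns, then restarts.
const journal = new MemoryJournal();
const speakers = ["alice", "bob"];
let seenEnds = 0;
for (const event of runTurns(journal, "run-7", speakers, 4)) {
  if (event.type === "turn_end") {
    seenEnds += 1;
    if (seenEnds === 2) break; // "crash" here
  }
}
const resumedIds: string[] = [];
for (const event of runTurns(journal, "run-7", speakers, 4)) {
  if (event.type === "turn_start") resumedIds.push(event.turnId);
}
// resumedIds is ["run-7.t2.alice", "run-7.t3.bob"]
```

Any turn an observer has seen end is already in the journal, so the second run picks up at index 2 instead of replaying acknowledged turns.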

Things I deliberately probed and accepted

  • Credit cap is between-turns, not mid-stream. Documented in the module header. A turn that overshoots completes; the cap halts the next turn. Pragmatic — mid-stream abort during the SSE drain would orphan token usage that the backend already burned but didn't get to report.
  • Per-attempt deadline aborts via AbortSignal.reason = DeadlineExceededError. Backends that respect signal (which the in-tree createOpenAICompatibleBackend does) tear down the HTTP request cleanly. Backends that don't will hang until the deadline + a for await iteration; the runner's outer try/catch still recovers. Acceptable.
  • Idempotency is advisory, not enforced. A gateway that doesn't dedupe charges N× for a retry — that's the caller's choice, called out in the protocol spec. Right call to not bake it into the contract.
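
The between-turns credit cap is simple to model. In the sketch below, `runWithCreditCap` and its result shape are hypothetical, per-turn costs are supplied up front instead of being reported by a backend, and treating `spent >= cap` as the halting condition is an assumption. It shows the accepted behaviour: an overshooting turn completes, and the cap stops the following turn.

```typescript
type HaltReason = "max_turns" | "max_credits";

interface CapResult {
  haltReason: HaltReason;
  turnsRun: number;
  spentCents: number;
}

// Hypothetical model: costs are known per turn; the real runner learns them after each turn.
function runWithCreditCap(turnCostsCents: number[], maxTurns: number, maxCreditsCents: number): CapResult {
  let spentCents = 0;
  let turnsRun = 0;
  while (turnsRun < maxTurns) {
    // The cap is checked only here, between turns, never mid-stream.
    if (spentCents >= maxCreditsCents) {
      return { haltReason: "max_credits", turnsRun, spentCents };
    }
    spentCents += turnCostsCents[turnsRun] ?? 0; // the turn runs to completion, even past the cap
    turnsRun += 1;
  }
  return { haltReason: "max_turns", turnsRun, spentCents };
}

const capped = runWithCreditCap([40, 70, 30], 5, 100);
// capped is { haltReason: "max_credits", turnsRun: 2, spentCents: 110 }
```

The second turn pushes spend to 110 against a cap of 100 and is still reported in full; the third turn never starts.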

Deliberately deferred (named in the PR description, agreed)

  • D1/R2/postgres journal adapter (in-tree adapters cover scratch + on-disk durability).
  • agent-gateway middleware enforcing the depth limit on inbound (agent-runtime emits; agent-gateway will enforce in a separate PR).
  • cli-bridge inbound-header propagation (forward X-Tangle-Forwarded-* from incoming router request into outbound backend calls).

Rollout watch-items

  • The 'alternate' turn-order default for two-party + auto round-robin for N is intuitive but worth documenting in the user-facing README before the SDK release. The header docstring covers it; a quickstart snippet would help.
  • FileConversationJournal does an fs.open(path, 'a') per turn. Fine at 10s of turns/sec; at 1000s of turns/sec a downstream user would want a batching adapter — outside this PR's scope, but worth a follow-up note when we publish a D1 adapter.
  • agent-bus-protocol.md is currently versioned v0. When agent-gateway lands and we have a second runtime implementing the contract, bumping to v1 + adding x-tangle-protocol-version header is the right move (already called out in the doc).

Verified locally

pnpm typecheck clean • pnpm test 354/354 (38 files) • biome check clean (one pre-existing warning in tests/mcp/in-process-executor.test.ts:116 from main, untouched by this PR) • pnpm build (tsup ESM + DTS) clean • rebase onto origin/main resolved 2 conflicts (package.json version → 0.26.0, index.ts re-exports merged alongside main's mcp + otel additions).

Approving.

@drewstone drewstone merged commit 354be3e into main May 26, 2026
1 check passed
@drewstone drewstone deleted the feat/conversation branch May 26, 2026 23:28
drewstone added a commit that referenced this pull request May 26, 2026
…ehavior (#65)

Investigation surfaced two doc inaccuracies the freshly-merged spec doc
inherited from my own assumption rather than measurement:

  1. Refusal status code is 429 + body code 'bridge_depth_exceeded' (live
     in tangle-router app/api/chat/route.ts:1390-1410), not the 413 the
     spec claimed. Updated header table + invariant #5 accordingly.

  2. The spec read as fully shipped end-to-end. Added an Implementation
     status table making the per-layer reality explicit:
       - agent-runtime emits all six headers (this is the work that
         shipped in #64).
       - tangle-router enforces depth + forwards auth (already live).
       - cli-bridge forwards authorization to sandbox backends (already
         live); does not enforce depth locally — inherits via router.
       - agent-gateway middleware: NOT YET. Deferred to a real consumer.

No code changes. The agent-runtime header builders and emitters are
already correct (they emit the headers; refusal is the gateway's job).
The doc was the only thing out of step.