diff --git a/docs/design/agent-workflows/open-issues.md b/docs/design/agent-workflows/open-issues.md
new file mode 100644
index 0000000000..5cd683202a
--- /dev/null
+++ b/docs/design/agent-workflows/open-issues.md
@@ -0,0 +1,93 @@
+# Open issues
+
+Deferred TODOs and open questions for the agent-workflows project. Each entry carries enough
+context and provenance to act on cold. See the `defer-todo` skill for the format.
+
+## Open issues
+
+### Supply secret values to tools during a standalone run
+
+**Status:** open
+**Added:** 2026-06-19
+**Commit:** 6a812efb95 (branch `gitbutler/workspace`)
+**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/)
+**Source:** sdk-local-tools design review session (answering reviewer comments on plan.md, Decision 3)
+
+**The problem.** An agent's `code` tool declares the secrets it needs by name, for example
+`secrets: ["GITHUB_TOKEN"]`. An MCP server declares an env-var-to-name map. The config stores
+only the name, never the value. At run time something must turn the name into a value and
+inject it as an env var, into the sandbox subprocess for a code tool or into the server
+process for MCP. On the Agenta server path that consumer is meant to read the `custom_secret`
+entries from the project vault by name, but it is not built yet. Custom secrets are
+storage-only this iteration by design, so today a declared secret resolves to nothing and the
+tool runs without it. For a standalone run the goal is different: do not depend on the Agenta
+vault at all. The trouble is the SDK has no secret resolution of any kind today. Resolution
+lives only in the service. So a standalone agent has no way to supply `GITHUB_TOKEN` to its
+code tool.
+
+**Why it is deferred.** The first slice is offline and narrow (built-in plus code tools with
+env secrets). Env alone closes the gap for that slice, so the wider question of where secret
+values come from does not block it. The vault path also depends on work another effort owns
+(see below), so it cannot land here yet.
+
+**What to decide or do.** Pick the local source of secret values for a standalone run. Three
+options, and they are not exclusive.
+
+1. Env, the offline default. Read the declared name straight from the process environment.
+   `GITHUB_TOKEN` comes from `os.environ`. Simple, offline, and enough for the first slice.
+2. A pluggable `SecretResolver` interface. The user implements it. Env is the built-in
+   default, but they can back it with a `.env` file, a secret manager, or their own vault. A
+   small interface for a lot of flexibility.
+3. The Agenta vault over HTTP. It reads `custom_secret` entries by name. This needs the
+   future runtime consumer endpoint to be built, and it is the connected-standalone path
+   only.
+
+The lean: ship option 1 as the default and option 2 as the interface it plugs into, both in
+the first slice. Treat option 3 as later work tied to the named-secrets effort building the
+consumer. See [./sdk-local-tools/plan.md](./sdk-local-tools/plan.md) (Decision 3 and Phase
+4), [./sdk-local-tools/research.md](./sdk-local-tools/research.md) (stage 3), and
+[../vault-named-secrets/](../vault-named-secrets/).
+
+### Batch the two vault round-trips on the agent invoke path
+
+**Status:** open
+**Added:** 2026-06-19
+**Commit:** 6a812efb95 (branch `gitbutler/workspace`)
+**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/)
+**Source:** xhigh code review of the sdk-local-tools first slice
+
+**The problem.** The service resolves a code tool's named secrets inside `resolve_tools` and an
+MCP server's named secrets inside `resolve_mcp_servers`. Each one builds its own
+`_VaultSecretResolver` and makes its own `POST /secrets/resolve`. When a config has both a code
+tool and an enabled MCP server that each declare secrets, a cold invoke makes two sequential
+vault round-trips where one batched call over the union of names would do. The two functions
+are also awaited one after another in `app.py`, not concurrently.
+
+**Why it is deferred.** MCP is flag-gated off this release, so the second round-trip does not
+happen on the default path yet. The cost only appears once MCP turns on. Fixing it now would be
+optimizing a path no one runs.
+
+**What to decide or do.** When MCP comes off the flag (sdk-local-tools Phase 5), resolve the
+union of code-tool and MCP secret names in one vault call, and consider running the independent
+resolve steps in `app.py` concurrently.
+
+### Give the resolved-tool shape one source of truth
+
+**Status:** open
+**Added:** 2026-06-19
+**Commit:** 6a812efb95 (branch `gitbutler/workspace`)
+**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/)
+**Source:** xhigh code review of the sdk-local-tools first slice
+
+**The problem.** `ResolvedTools` (the SDK resolver's return type) carries the same four fields,
+`builtin_tools` / `custom_tools` / `tool_callback` / `mcp_servers`, that `SessionConfig` already
+declares. The two shapes can drift: when a new per-tool wire field lands, a maintainer has to
+add it in both places, and missing one silently drops the field from either the standalone path
+or the service path.
+
+**Why it is deferred.** The duplication is small and the two types serve different layers today
+(a resolver result versus the full session bundle). It is a maintainability touch-point, not a
+bug.
+
+**What to decide or do.** Decide whether the resolver should return the `SessionConfig` tool
+fields directly (or a shared sub-model both reuse), so the wire tool shape has one definition.
diff --git a/docs/design/agent-workflows/sdk-local-tools/README.md b/docs/design/agent-workflows/sdk-local-tools/README.md
new file mode 100644
index 0000000000..cab7ed26d3
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/README.md
@@ -0,0 +1,40 @@
+# SDK local tools
+
+This folder plans one feature: letting a standalone Agenta Python SDK user run an agent's
+**tools** during a fully local run, with no Agenta backend service and no rivet sidecar.
+
+It complements the sibling effort in
+[`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md). That one
+moves the agent runtime into the SDK and builds `LocalBackend` (the engine that runs a
+harness on the user's own machine). This one builds the tool layer on top of that engine:
+resolving the agent's tool references into runnable specs, and supplying the secrets those
+tools need, all without calling Agenta.
+
+This workspace covers the original feature plan, the organization proposal, its implementation,
+and the completed review remediation. `status.md` is the source of truth.
+
+## Read in this order
+
+1. **[context.md](context.md)**: why this matters, the standalone promise it keeps, the
+   goal, and the non-goals. Start here.
+2. **[research.md](research.md)**: the verified current-state map. Where every piece lives
+   today (with `file:line`), what works, and what is missing, broken down per tool kind.
+3. **[organization-proposal.md](organization-proposal.md)**: the recommended package,
+   naming, exception, validation, and migration structure.
+4. **[status.md](status.md)**: the source of truth for progress and the concrete first
+   steps for the next agent.
+5. **[codebase-conventions.md](codebase-conventions.md)**: repository patterns learned during
+   the organization review, including strong precedents and legacy patterns not to copy.
+6. **[plan.md](plan.md)**: the earlier phased behavior plan. Its product decisions remain
+   useful context; its proposed file and symbol names are not final.
+
+## The one-sentence goal
+
+A standalone SDK user pulls an agent config from Agenta, builds a `LocalBackend`, and runs
+the agent locally **with its tools**: built-in, code, and (later) gateway, client, and MCP.
+
+## Prerequisite
+
+This work assumes `LocalBackend` exists. It does not yet; it is a stub that raises (see
+research.md). Read `../scratch/sdk-local-backend/status.md` for that effort's state. The
+phases here are sequenced so the first tool slice lands right after `LocalBackend` does.
diff --git a/docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md b/docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md
new file mode 100644
index 0000000000..bccb9c1d87
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md
@@ -0,0 +1,201 @@
+# Codebase conventions learned
+
+This is a fresh repository review. It records observed patterns and distinguishes the
+patterns worth following from legacy or inconsistent code that should not become precedent.
+
+## Organization
+
+- Growing backend domains are directories, not a sequence of suffixed files.
+  `api/oss/src/core/tools/` uses `dtos.py`, `interfaces.py`, `service.py`, `registry.py`,
+  `exceptions.py`, and provider subpackages.
+- The SDK also nests substantial subsystems. `sdks/python/agenta/sdk/engines/running/`
+  contains interfaces, registries, handlers, runners, errors, templates, and sandbox logic.
+- `sdks/python/agenta/sdk/evaluations/runtime/` is a particularly relevant shared-runtime
+  package. API core imports those SDK model classes directly from
+  `api/oss/src/core/evaluations/runtime/types.py` and defines only API-specific additions.
+- Small composition areas remain flat. `services/oss/src/agent/` has one file per concern:
+  `app.py`, `client.py`, `config.py`, `schemas.py`, `secrets.py`, `tools.py`, `tracing.py`.
+- Tests are organized by runner, test boundary, then domain. Unit tests do not require a
+  running backend; HTTP adapter behavior belongs in integration tests.
+- `__init__.py` is generally used as a curated import surface, while implementations remain
+  in role-specific modules.
+
+## File naming
+
+- Common role names in maintained API core domains are `dtos.py`, `interfaces.py`,
+  `service.py`, `types.py`, `utils.py`, and `exceptions.py`.
+- Semantic names are preferable when the role is clearer than a generic bucket:
+  `streaming.py`, `registry.py`, `delivery.py`, `context.py`, `mappings.py`.
+- Abbreviations such as `defs` are uncommon as file-role names and do not identify whether
+  values are configuration, DTOs, schemas, or runtime specifications.
+- `utils.py` exists widely but often becomes a mixed bucket. New code should use a semantic
+  filename when one is available.
+- A domain-qualified generic filename can be clear (`tools/registry.py`); the same generic
+  filename at a broad level (`agents/tool_resolution.py`) tends to accumulate unrelated
+  concerns.
+
+## Class naming
+
+- Pydantic classes are nouns that describe their lifecycle or API role:
+  `ToolConnectionCreate`, `ToolExecutionRequest`, `WorkflowRevisionData`,
+  `HarnessAgentConfig`.
+- Abstract contracts use both plain domain names in the SDK (`Backend`, `Harness`,
+  `Session`) and `*Interface` in API core (`ToolsDAOInterface`,
+  `ToolsGatewayInterface`). Consistency within one subsystem matters more than copying one
+  suffix globally.
+- Implementations identify their mechanism or provider:
+  `InProcessPiBackend`, `RivetBackend`, `ToolsGatewayRegistry`.
+- Domain exceptions generally end in `Error` and often inherit from a domain base, as in
+  `ToolsError`, `EmbedError`, and `UnsupportedHarnessError`.
+- The codebase usually capitalizes well-known acronyms in class names (`API`, `LLM`), though
+  there are mixed conventions for newer acronyms. A subsystem should choose one spelling
+  deliberately and keep it consistent.
+
+## Function and method naming
+
+- Functions use verb-first snake_case names: `resolve_json_pointer`,
+  `create_connection`, `list_integrations`, `make_oauth_state`.
+- Async methods generally use the same domain verb as sync behavior. Some older SDK modules
+  use `a*` prefixes such as `acreate`; this is legacy style and should not be copied into the
+  new agent runtime.
+- Boolean helpers read as predicates (`supports`, `is_active`, `is_valid`) or explicit
+  questions (`_mcp_enabled`).
+- Mapper functions are clearest when named for the conversion (`decode_oauth_state`,
+  `_to_harness_config`). Generic verbs such as `attach` should name the data being attached.
+- Private helpers use a leading underscore. A service importing another module's private or
+  implementation helper usually indicates the package boundary is wrong.
+
+## Function structure
+
+- New API code prefers keyword-only parameters using `*`.
+- Long signatures and calls use visual grouping with `#` separators in API core.
+- Services orchestrate dependencies; pure conversion belongs in focused helpers or mapping
+  modules.
+- Adapters own provider-specific behavior. Core orchestration depends on interfaces rather
+  than concrete HTTP or database implementations.
+- Constructors receive dependencies explicitly. `ToolsService` receives its DAO and adapter
+  registry; agent `Environment` receives a backend.
+- Return contracts are typed models when they cross layers. Positional tuples are used in
+  places but are less extensible and less self-documenting.
+- Mutating a result model after construction is not a dominant pattern for domain results;
+  constructing a complete return value is clearer.
+
+## Variables
+
+- Domain-specific names are preferred at boundaries: `project_id`, `provider_key`,
+  `integration_key`, `connection_create`.
+- Short names such as `p`, `out`, `defs`, `entry`, and `values` appear in implementation
+  code but are weaker choices when the value crosses several stages.
+- Names should preserve lifecycle: `tool_config`, `tool_spec`, `secret_names`,
+  `secret_values`, and `resolved_tools`.
+- Wire-specific camelCase is isolated to serialization or explicit wire dictionaries.
+  Python domain fields remain snake_case.
+
+## Models and DTOs
+
+- API core keeps domain contracts in `dtos.py` or `types.py`; FastAPI request and response
+  envelopes live in API-layer `models.py`.
+- Database entities do not leak through service or router contracts.
+- Discriminated Pydantic unions are already used for tool references and workflow data.
+- Request, response, config, and resolved/runtime values are usually distinguishable by
+  suffixes such as `Create`, `Request`, `Response`, `Config`, and `Data`.
+- Reusing `Dict[str, Any]` for a stable cross-language contract weakens validation and makes
+  drift harder to detect. Stable wire objects should have typed models and golden
+  serialization tests.
+- Canonical executable configuration should reject unexpected fields. Permissive legacy
+  handling belongs in a named compatibility adapter.
+- Mutable Pydantic defaults such as `items: List[...] = []` exist in older code. New models
+  should use `Field(default_factory=list)` and `Field(default_factory=dict)`.
+
+## Docstrings and comments
+
+- The explicit API guidance says comments should explain non-obvious reasons, invariants,
+  workarounds, or concurrency/security hazards, not narrate the code.
+- Good public docstrings state the contract and important behavior in one paragraph.
+- Module docstrings are useful for dependency boundaries and protocol ownership.
+- Detailed architecture history, phased decisions, and product rationale belong in
+  `docs/design/`, not in every source module.
+- Section separators are common in large modules, but needing many sections can also be a
+  signal that the file should become a package.
+- Comments around secret isolation, callback reachability, and compatibility aliases are
+  justified because they preserve security or backwards compatibility.
+
+## Exceptions
+
+- New API architecture defines domain exceptions in core and translates them at the API
+  boundary.
+- Domain bases allow broad and narrow handling (`ToolsError` and its subclasses).
+- Useful exceptions carry structured fields such as provider, connection, status, or
+  conflicting identifiers.
+- `HTTPException` should not leak into core services or SDK runtime logic.
+- Broad catches and best-effort behavior exist for cleanup and optional telemetry. They are
+  appropriate only when failure is intentionally non-fatal.
+- Broad catches around required configuration, tool resolution, or declared secrets can
+  hide correctness failures and need an explicit policy.
+- Raising generic `RuntimeError` while a typed domain error exists is inconsistent and loses
+  machine-readable context.
+
+## Dependency direction
+
+- API routing depends on core services; core services depend on interfaces; adapters
+  implement those interfaces.
+- The agent service already depends on SDK runtime contracts. The SDK must not import the
+  service.
+- The TypeScript runner consumes resolved runtime specs; it should not understand loose
+  playground compatibility shapes.
+- Compatibility parsing should happen once, before resolution.
+- HTTP gateway resolution and vault access are adapters, not properties of the tool models.
+- When the SDK owns neutral runtime models, API code can import the same classes rather than
+  maintaining a parallel union.
+
+## Testing
+
+- SDK unit tests should cover parsing, pure mapping, policy, and serialization without a
+  backend.
+- Service integration tests should cover HTTP status handling, authentication headers,
+  malformed responses, and timeouts.
+- Golden files are already used for the Python-to-TypeScript runner contract and are the
+  right mechanism for tool-spec drift.
+- Tests should be named for behavior, not internal helper names, when possible.
+- Security behavior needs direct tests: secret scoping, missing-secret handling, redaction,
+  and no ambient environment leakage.
+- Correlation and identity behavior also needs tests: duplicate names, built-in/custom
+  collisions, and provider responses returned in a different order.
+
+## Strong patterns to copy
+
+- `api/oss/src/core/tools/`: explicit domain roles and typed exceptions.
+- `sdks/python/agenta/sdk/agents/interfaces.py` plus `agents/adapters/`: ports separated from
+  implementations.
+- `sdks/python/agenta/sdk/engines/running/`: nested SDK subsystem with role-specific files.
+- `sdks/python/agenta/sdk/evaluations/runtime/`: SDK-owned neutral runtime contracts reused
+  directly by API core.
+- `api/oss/src/dbs/postgres/*/mappings.py`: conversions isolated from persistence and
+  services.
+- `agents/utils/wire.py` plus TypeScript protocol golden tests: one explicit cross-language
+  boundary.
+
+## Patterns to avoid treating as precedent
+
+- Very large multipurpose modules such as `agents/dtos.py` and API
+  `core/tools/service.py`.
+- Legacy SDK async naming with `a*` prefixes.
+- `except:` or broad `except Exception` followed by `print`.
+- Generic `utils.py` files that contain unrelated domain behavior.
+- Silent coercion or dropping as the sole parsing contract.
+- Mutable collection defaults in Pydantic models.
+- Comments that restate the following code.
+
+## Practical standard for this feature
+
+For local tools, the repository evidence supports:
+
+- one `agents/tools/` domain package.
+- lifecycle-specific names (`Config`, `Spec`, `ResolvedToolSet`).
+- explicit interfaces for injected secret and gateway dependencies.
+- typed domain errors.
+- strict default behavior with separately named compatibility behavior.
+- pure private mapping helpers.
+- one runner serialization boundary.
+- tests split by parsing, resolution, secrets, gateway HTTP, and wire contract.
+- a sibling MCP package only when MCP is an actual supported subsystem.
diff --git a/docs/design/agent-workflows/sdk-local-tools/context.md b/docs/design/agent-workflows/sdk-local-tools/context.md
new file mode 100644
index 0000000000..3cadc85881
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/context.md
@@ -0,0 +1,77 @@
+# Context
+
+## Why this work exists
+
+The agent runtime makes one promise. The
+[ports-and-adapters](../ports-and-adapters.md) page states it plainly: "Nothing in the SDK
+runtime calls the Agenta API, so the same code runs an agent standalone, with no Agenta
+backend at all." A developer should be able to pull an agent config from Agenta, then run
+that exact agent on a laptop, in a CI job, or inside another product, with no Agenta service
+in the loop.
+
+The runtime keeps that promise for everything except tools. A standalone user can build an
+`AgentConfig`, pick a harness, and (once `LocalBackend` lands) run a turn. The moment the
+agent declares a tool, the promise breaks. Tool resolution lives only in the service, behind
+HTTP calls to the Agenta API. So a standalone run either drops the tools silently or cannot
+run at all.
+
+This effort closes that gap. It gives the SDK a way to resolve an agent's tools and supply
+their secrets locally, so a standalone agent runs *with* its tools.
+
+## The target experience
+
+A standalone user should write roughly this:
+
+```python
+import agenta as ag
+
+# 1. Pull the agent config from Agenta (one network call to the public registry).
+params = ag.ConfigManager.get_from_registry(app_slug="my-agent")
+agent = ag.AgentConfig.from_params(params)
+
+# 2. Build a local engine. No Agenta service, no rivet sidecar.
+harness = ag.make_harness("pi", ag.Environment(ag.LocalBackend()))
+
+# 3. Run the agent locally, WITH its tools.
+result = await harness.prompt(session_config, messages)
+```
+
+Step 3 is impossible today, in two ways. `LocalBackend` is a stub that raises (the sibling
+effort owns that). And even with a working `LocalBackend`, nothing in the SDK turns the
+agent's tool references into runnable specs or supplies their secrets. This document plans
+the second part.
+
+## The standalone definition this hinges on
+
+"Standalone" needs a sharp edge, because the answer shapes the whole design. One open
+decision (see [plan.md](plan.md)) is whether a run that calls the Agenta public REST API to
+resolve a tool still counts as standalone. We frame the question here so the reader holds it
+throughout:
+
+- **Offline-standalone**: the run touches no network except the model provider. Every tool
+  resolves from data the user already holds. This is the strict reading.
+- **Connected-standalone**: the run uses no Agenta *service deployment* (no rivet sidecar,
+  no self-hosted agent service), but it may call the Agenta public API to resolve a tool or
+  fetch a secret. This is the loose reading.
+
+Both are useful. The design should let a user pick, not force one. We propose a two-resolver
+split that does exactly that, but the reviewer decides (plan.md, Decision 1).
+
+## Goal
+
+Let a standalone SDK user resolve an agent's tools and run them under `LocalBackend`, with a
+clear, documented answer for each tool kind: which kinds work fully offline, which need an
+opt-in network call, and which stay out of reach for now.
+
+## Non-goals
+
+- **Building `LocalBackend` itself.** The sibling effort
+  ([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md)) owns
+  the engine. This effort is the tool layer on top of it and assumes it exists.
+- **Moving gateway (Composio) execution off the server.** The provider key must stay
+  server-side by design. A standalone gateway tool, if supported at all, calls back to
+  Agenta; it does not run the provider locally.
+- **A new streaming or session model.** Sessions and streaming are covered in
+  [`../sessions.md`](../sessions.md) and the agent protocol RFC. This effort changes neither.
+- **Shipping every tool kind in the first slice.** The plan sequences kinds by value and
+  difficulty; the first slice is deliberately narrow.
diff --git a/docs/design/agent-workflows/sdk-local-tools/conventions-review.md b/docs/design/agent-workflows/sdk-local-tools/conventions-review.md
new file mode 100644
index 0000000000..6aabfa8e8a
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/conventions-review.md
@@ -0,0 +1,341 @@
+# Agent tools: conventions review
+
+This reviews the agent **tools** code (SDK + Python service) against our house conventions:
+the `api/AGENTS.md` architecture rules, the SDK agents package idiom, and the existing
+`api/oss/src/core/tools/` domain. Headline verdict: the tool code is in good shape and
+mostly already matches the layer it lives in. The naming and docstrings are consistent with
+the agents package. The two real gaps are both small: `ResolvedTools` is a `@dataclass` in a
+package where every other contract is a Pydantic model, and the service raises bare
+`RuntimeError` where a typed error reads better. Everything else is keep-as-is or a
+discuss-later.
+
+## Tool code map
+
+| File | Layer | Role |
+| --- | --- | --- |
+| `sdks/python/agenta/sdk/agents/tool_defs.py` | SDK | Neutral tool vocabulary (`*ToolDef`, `McpServer`) + loose-shape parsers (`parse_tool_def`). |
+| `sdks/python/agenta/sdk/agents/tool_resolution.py` | SDK | `LocalToolResolver`, `SecretResolver`/`EnvSecretResolver`, `ResolvedTools`, the per-kind spec builders. |
+| `sdks/python/agenta/sdk/agents/dtos.py` | SDK | Holds `ToolCallback`, `SessionConfig` (the resolution output target). |
+| `services/oss/src/agent/tools.py` | Service | Server-side orchestration: gateway over HTTP, delegates local kinds to the SDK. |
+| `services/oss/src/agent/secrets.py` | Service | Vault secret resolution (`resolve_named_secrets`, `resolve_harness_secrets`). |
+| `services/oss/src/agent/client.py` | Service | Shared backend base URL + caller credential. |
+| `services/oss/src/agent/app.py` | Service | The `/invoke` handler that calls `resolve_tools` / `resolve_mcp_servers`. |
+| `services/agent/src/tools/*.ts` | TS runner | The execution side (`code.ts`, `client.ts`, `mcp-server.ts`, ...). Out of focus. |
+
+## What I learned about the codebase
+
+These are the conventions I confirmed by reading, each with a citation. Useful as a
+standalone reference.
+
+- **`api/AGENTS.md` rules are scoped to `api/`.** The file says so in its first line:
+  "Scope: everything under `api/`" (`api/AGENTS.md:3`). The SDK (`sdks/python`) and the
+  service (`services/oss`) are different layers with their own idioms. Judge tool code
+  against the layer it lives in.
+- **API domains follow a fixed folder shape**: `apis/fastapi/<domain>` + `core/<domain>` +
+  `dbs/postgres/<domain>`, with named files `router.py` / `models.py` / `service.py` /
+  `interfaces.py` / `dtos.py` / `dao.py` / `mappings.py` (`api/AGENTS.md:58-78`). The
+  canonical copy-me example is `api/oss/src/core/workflows/` (`api/AGENTS.md:74-77`).
+- **The existing tools domain follows that shape**: `api/oss/src/core/tools/` has
+  `dtos.py`, `exceptions.py`, `interfaces.py`, `service.py`, `registry.py`, `utils.py`, and
+  `providers/` (confirmed by directory listing). It is the fuller layout the SDK tool code
+  is being compared against.
+- **API domain exceptions are typed and live in the core layer**, never `HTTPException` in
+  services (`api/AGENTS.md:209-217`). The tools domain has a base `ToolsError` plus typed
+  subclasses with structured context, for example `ActionNotFoundError`
+  (`api/oss/src/core/tools/exceptions.py:4-58`). Note: this domain uses `exceptions.py`,
+  while `api/AGENTS.md:210` names `core/{domain}/types.py` or `dtos.py`. So even inside
+  `api/` the exception-file name is not uniform.
+- **API services return typed Pydantic DTOs, not dicts or tuples** (`api/AGENTS.md:298-345`).
+  The tools service proves resolution output can be a model: `resolve_agent_tools` returns
+  `AgentToolsResolution` carrying `List[ResolvedAgentTool]`
+  (`api/oss/src/core/tools/service.py:489-519`, defined at
+  `api/oss/src/core/tools/dtos.py:277-298`).
+- **API signature style is keyword-only with `#`-grouped sections** (`api/AGENTS.md:188-206`).
+  Live example: `ToolsDAOInterface.create_connection(self, *, project_id, user_id, #,
+  connection_create)` (`api/oss/src/core/tools/interfaces.py:22-30`).
+- **In-code comments are kept minimal, the why not the what** (`api/AGENTS.md:21-22`).
+- **The SDK agents package has its own explicit hexagonal vocabulary**, documented in its
+  module docstring: `dtos.py` (contracts), `interfaces.py` (ports/ABCs), `adapters/`
+  (implementations), `utils/` (plumbing) (`sdks/python/agenta/sdk/agents/__init__.py:3-11`).
+  This is the idiom the tool code must match, not the API folder shape.
+- **SDK contracts are Pydantic `BaseModel`, deliberately.** The `dtos.py` docstring states
+  "These are Pydantic models (the SDK already depends on Pydantic)"
+  (`sdks/python/agenta/sdk/agents/dtos.py:6-7`); all 12 contracts in that file subclass
+  `BaseModel` (grep count). The tool *defs* follow this too (`tool_defs.py:43-101`,
+  `McpServer` at `:117`).
+- **SDK ports are ABCs.** `Sandbox` / `Session` / `Backend` all subclass `ABC` with
+  `@abstractmethod` (`sdks/python/agenta/sdk/agents/interfaces.py:41,58,94`). `Protocol`
+  exists in the SDK but is reserved for structural typing of third-party modules, for
+  example `_LitellmModule(Protocol)` (`sdks/python/agenta/sdk/utils/lazy.py:22`), and for
+  call-shape contracts like `InvokeFn(Protocol)`
+  (`sdks/python/agenta/sdk/decorators/running.py:54`). So an owned, implemented-by-us port
+  is an ABC; a Protocol is for "shape someone else satisfies".
+- **The agents package names its errors file `errors.py`, not `exceptions.py`**, and its
+  errors subclass `RuntimeError` with structured attributes:
+  `UnsupportedHarnessError(RuntimeError)` stores `.harness` and `.backend`
+  (`sdks/python/agenta/sdk/agents/errors.py:13-21`). This is the SDK's own version of the
+  API's typed-domain-exception rule.
+- **The snake/camel boundary is handled by `to_wire()` methods on DTOs.** `ContentBlock`,
+  `TraceContext`, and `ToolCallback` each own a `to_wire()` that maps snake_case fields to
+  camelCase wire keys (`dtos.py:122-142`, `:255-262`, `:273-274`). The `/run` request is
+  assembled in `utils/wire.py` (`request_to_wire`, `sdks/python/.../utils/wire.py:24-57`).
+  The TS side mirrors the names in `services/agent/src/protocol.ts`
+  (`utils/wire.py:3-5`).
+- **The agents `__init__.py` curates `__all__` with section comments** grouping DTOs, the UI
+  codec, tool definitions, resolution, ports, errors, and adapters
+  (`sdks/python/agenta/sdk/agents/__init__.py:73-128`).
+- **The agents package uses rich module + class docstrings.** Every module opens with a
+  multi-paragraph docstring explaining the layer and the why
+  (`dtos.py:1-9`, `interfaces.py:1-16`, `tool_defs.py:1-25`). This is louder than the
+  API's "minimal comments" rule, and it is the established norm *for this package*.
+- **The SDK uses one module logger per file**: `log = get_module_logger(__name__)`
+  (`tool_resolution.py:47`, `secrets.py:22`, `tools.py:46`).
+- **Tests split unit vs integration by directory and gate gateway/network tests behind
+  `pytest.mark.integration`.** Unit tool tests live in
+  `sdks/python/oss/tests/pytest/unit/agents/` (`test_tool_defs.py`,
+  `test_tool_resolution.py`) and service-side in
+  `services/oss/tests/pytest/unit/agent/test_tool_refs.py`; the gateway HTTP path lives in
+  `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py` with
+  `pytestmark = pytest.mark.integration` (`test_resolve_tools_http.py:19`). Test modules
+  open with a docstring naming what they lock (`test_tool_resolution.py:1-8`).
+
+## Proposals and parallels
+
+### 1. File organization and placement
+
+- **House pattern.** API splits a domain into many small files
+  (`api/AGENTS.md:58-78`). The SDK agents package splits by *role* not by feature: one
+  `dtos.py`, one `interfaces.py`, an `adapters/` package, a `utils/` package
+  (`__init__.py:3-11`). The `tool_defs.py` / `tool_resolution.py` pair sits as two
+  sibling modules at the package root, next to `dtos.py` and `streaming.py`.
+- **Current state.** `tool_defs.py` (vocabulary + parsing) and `tool_resolution.py`
+  (resolution + secrets + spec builders) are two modules. `tool_resolution.py` is doing
+  three jobs: secret sources (`SecretResolver`/`EnvSecretResolver`), free-function spec
+  builders, and the `LocalToolResolver`. It is 225 lines, cohesive, and reads cleanly.
+- **Proposal: Keep.** Two modules for ~460 lines is right for this package. A `tools/`
+  subpackage would be over-built next to a single-file `dtos.py` that holds 12 models. The
+  one thing worth flagging: `SecretResolver` is a generic seam (a `code` tool, an MCP
+  server, and tomorrow other consumers use it). If a second resolver-shaped abstraction
+  lands, lifting `SecretResolver`/`EnvSecretResolver` into the package's `interfaces.py`
+  (where ports already live) would match the idiom better than keeping it in
+  `tool_resolution.py`. Not now. **Worth discussing** only when a second consumer appears.
+
+### 2. File naming
+
+- **House pattern.** Dominant names are `dtos.py`, `service.py`, `interfaces.py`,
+  `utils.py`, `enums.py` (`api/AGENTS.md:62-72`). But the agents package already chose a
+  different file set on purpose: `dtos.py`, `interfaces.py`, `streaming.py`,
+  `tool_defs.py`, plus adapter subpackages such as `adapters/vercel/` for protocol-specific
+  edge code. Descriptive feature names are the norm *here* when the name marks a real seam.
+- **Current state.** `tool_defs.py` and `tool_resolution.py` follow the agents package's
+  own descriptive-name idiom, not the API's `dtos.py`/`service.py` idiom.
+- **Proposal: Keep.** Renaming to `dtos.py`/`service.py` would collide with the package's
+  existing `dtos.py` and would not be more informative. `tool_defs.py` and
+  `tool_resolution.py` say exactly what they hold. The names are good.
+
+### 3. Class naming
+
+- **House pattern.** API uses `*Service`, `*DAOInterface`, `*DTO` suffixes
+  (`ToolsService`, `ToolsDAOInterface`, `SecretResponseDTO`). The SDK agents package uses
+  bare role nouns for ports (`Backend`, `Harness`, `Sandbox`, `Session`,
+  `interfaces.py:41-198`) and bare nouns for DTOs (`Message`, `AgentResult`).
+- **Current state.** `LocalToolResolver`, `SecretResolver`, `EnvSecretResolver`,
+  `ResolvedTools`, and the `*ToolDef` family. These read as role nouns, matching the agents
+  package. `EnvSecretResolver` as a concrete adapter of `SecretResolver` mirrors
+  `InProcessPiBackend` adapting `Backend`.
+- **Proposal: Keep.** The names match the package's own port/adapter naming. `*ToolDef`
+  parallels the API's `Agent*Tool` discriminated union
+  (`api/oss/src/core/tools/dtos.py:253-274`) closely enough that a reader crossing the two
+  layers will not be surprised.
+
+### 4. Function naming and signature style
+
+- **House pattern.** API prefers keyword-only params with a leading `*` and `#`-grouped
+  sections (`api/AGENTS.md:188-206`; live at
+  `api/oss/src/core/tools/interfaces.py:22-30`). The SDK agents package follows this on its
+  *ports*: `Session.prompt(self, messages, *, on_event=None)`
+  (`interfaces.py:67-72`), `Backend.create_session(self, sandbox, config, *, harness, ...)`
+  (`interfaces.py:120-129`).
+- **Current state.** `LocalToolResolver.__init__(self, secret_resolver=None, *,
+  enable_mcp=False)` already uses keyword-only for the flag
+  (`tool_resolution.py:166-171`), matching the ports. But the module-level spec builders are
+  positional: `code_tool_spec(tool, env)`, `mcp_server_spec(server, values)`,
+  `attach_orthogonal(entry, tool)` (`tool_resolution.py:88,98,124`).
+- **Proposal: Small change (optional).** These free functions take two args each and are
+  internal builders called from one place. Positional is defensible and the call sites are
+  clear (`tool_resolution.py:208-209`, `services/oss/src/agent/tools.py:133`). If you want
+  strict parity with the package's keyword-only norm, make the second arg keyword-only
+  (`code_tool_spec(tool, *, env)`). Low value, low risk. I would **keep** unless you are
+  doing a parity sweep.
+
+### 5. Function structure
+
+- **House pattern.** Small, single-responsibility functions with early returns; the
+  split-then-build pattern (collect, then transform). The API resolver does exactly this
+  (`api/oss/src/core/tools/service.py:489-519`: split builtins vs custom, build each).
+- **Current state.** `LocalToolResolver.resolve_defs` splits defs by type with three
+  comprehensions, then builds (`tool_resolution.py:198-211`). `resolve_tools` in the service
+  does the same split (`tools.py:152-180`). The functions are short, each does one thing,
+  and they early-return on empty (`tools.py:153-154`). The parsers in `tool_defs.py` use
+  early returns throughout (`parse_tool_def`, `:182-208`).
+- **Proposal: Keep.** This is the cleanest dimension. The structure matches the house
+  split-then-build pattern precisely.
+
+### 6. Docstrings
+
+- **House pattern.** Two competing norms. `api/AGENTS.md:21-22` says keep in-code comments
+  minimal (why not what). The SDK agents package, by contrast, opens every module with a
+  multi-paragraph docstring (`dtos.py:1-9`, `interfaces.py:1-16`, `tool_defs.py:1-25`) and
+  gives most classes a why-focused docstring.
+- **Current state.** `tool_defs.py` and `tool_resolution.py` carry rich module docstrings
+  and per-class/per-function docstrings in the package's house voice. They explain *why*
+  (the three-axis split, why gateway can't resolve offline, why MCP is flag-gated), not
+  *what*.
+- **Proposal: Keep.** The docstrings match the agents package norm, which is the relevant
+  one. They are on the heavy side by `api/AGENTS.md` standards, but that rule does not own
+  this package, and the surrounding files set the heavier bar. Consistency wins. One nit:
+  `EnvSecretResolver.resolve` has a why-comment about the single read
+  (`tool_resolution.py:74`) that is genuinely a why, so it is fine.
+
+### 7. Variable naming
+
+- **House pattern.** snake_case Python locals; camelCase only at the wire boundary, and the
+  boundary is owned by `to_wire()` methods on DTOs (`dtos.py:122-142`,
+  `ToolCallback.to_wire`, `:273-274`).
+- **Current state.** Locals are snake_case throughout. The camelCase keys
+  (`needsApproval`, `inputSchema`, `callRef`, `mcpServers`) appear only inside the
+  dict-literal builders (`tool_resolution.py:100-106`, `services/oss/src/agent/tools.py:126-132`).
+  The boundary is clean and intentional: snake_case in, camelCase out.
+- **Proposal: Keep**, with one observation flagged in dimension 8: the builders emit
+  camelCase from a free function returning `Dict[str, Any]`, whereas the rest of the package
+  emits camelCase from a typed `to_wire()`. The variable naming itself is correct; the
+  *mechanism* is the discussion (see below).
+
+### 8. Typing and data contracts
+
+This is the most substantive finding.
+
+- **House pattern.** API services return typed Pydantic DTOs, not dicts or tuples
+  (`api/AGENTS.md:298-345`), and the tools domain already returns
+  `AgentToolsResolution` / `ResolvedAgentTool` for resolution output
+  (`api/oss/src/core/tools/dtos.py:277-298`). The SDK agents package makes contracts Pydantic
+  `BaseModel` on purpose (`dtos.py:6-7`), and wire serialization lives in a `to_wire()` on
+  the model (`ContentBlock.to_wire`, `dtos.py:122`).
+- **Current state — three sub-points:**
+  1. `ResolvedTools` is a `@dataclass` (`tool_resolution.py:149-156`). It is the **only**
+     `@dataclass` in the agents package; everything else there is a `BaseModel`. (The SDK
+     uses `@dataclass` only in `utils/types.py` and `evaluations/preview/utils.py`, not in a
+     contract layer.)
+  2. `ResolvedTools` duplicates four fields of `SessionConfig` verbatim: `builtin_tools`,
+     `custom_tools`, `tool_callback`, `mcp_servers` (`tool_resolution.py:153-156` vs
+     `dtos.py:512-517`). The resolver builds a `ResolvedTools`, then the service unpacks it
+     field-by-field onto `SessionConfig` (`app.py:91-103` via the `resolve_tools` tuple).
+  3. The wire specs are untyped `Dict[str, Any]` built by free functions
+     (`code_tool_spec` etc.), not a typed model with `to_wire()`.
+- **Proposals:**
+  - **`ResolvedTools` -> Pydantic `BaseModel`: Small change, do it.** It is a one-line swap
+    (`@dataclass` to `BaseModel`, `field(default_factory=...)` to `Field(default_factory=...)`)
+    and it removes the lone odd-one-out in a Pydantic package. Cheap, lowers surprise.
+  - **The `SessionConfig` field duplication: Worth discussing, lean keep.** `ResolvedTools`
+    is the resolver's *output contract*; `SessionConfig` is the run's *input bundle*. They
+    overlap by design because resolution feeds the session. The current service path returns
+    a bare tuple `(builtins, custom_tools, callback)` from `resolve_tools`
+    (`tools.py:142-180`), which is exactly the tuple anti-pattern `api/AGENTS.md:341-344`
+    warns against. The higher-leverage fix is to have the service `resolve_tools` return the
+    `ResolvedTools` model (once it is a model) instead of a 3-tuple, and have `app.py` spread
+    it onto `SessionConfig`. That kills the tuple, keeps the two contracts distinct, and does
+    not force a premature merge. Note this crosses the layer boundary: `resolve_tools` lives
+    in `services/oss` but would return an SDK type, which is fine and already the dependency
+    direction (`service -> SDK`).
+  - **Typed wire specs with `to_wire()`: Worth discussing, defer.** Turning each
+    `code_tool_spec`/`client_tool_spec`/`mcp_server_spec` dict into a small Pydantic model
+    with `to_wire()` would match `ContentBlock`/`ToolCallback`. But the wire shape is shared
+    three ways (SDK builder, service gateway path, TS `protocol.ts`) and is pinned by the
+    golden contract tests. The dict-literal builders are readable and the camelCase keys sit
+    right next to the snake_case source. I would defer this until the tool wire shape needs
+    to grow; the payoff today is mostly aesthetic.
+  - **`SecretResolver` ABC vs Protocol vs callable: Keep the ABC.** It is an owned port with
+    a single method, implemented by us (`EnvSecretResolver`, the service's
+    `_VaultSecretResolver`, the test `_DictSecrets`). That is exactly when the package uses
+    an ABC (`interfaces.py:41-129`), not a Protocol (which it reserves for third-party module
+    shapes, `utils/lazy.py:22`). A bare `Callable[[Sequence[str]], Awaitable[Dict[str,str]]]`
+    would lose the name and the docstring seam. The ABC is the right call.
+
+### 9. Exception handling
+
+- **House pattern.** API defines typed domain exceptions in the core layer and never raises
+  `HTTPException` from services (`api/AGENTS.md:209-217`); the tools domain has the full
+  `ToolsError` hierarchy (`api/oss/src/core/tools/exceptions.py`). The SDK agents package has
+  its own version: typed errors in `errors.py` subclassing `RuntimeError` with structured
+  attributes (`errors.py:13-21`). The anti-pattern is bare `Exception`/`ValueError`/error
+  dicts for domain errors (`api/AGENTS.md:291-296`).
+- **Current state.** The service's gateway path raises bare `RuntimeError` four times
+  (`services/oss/src/agent/tools.py:84,102,112,122`): missing API base, HTTP error, count
+  mismatch, incomplete spec. The SDK side raises nothing (it skips and warns,
+  `tool_resolution.py:184-188`). The tests assert on `RuntimeError` + a message substring
+  (`test_resolve_tools_http.py:43,89,97,111`).
+- **Proposal: Small change, worth doing.** This is the clearest convention miss. The
+  service already imports from the SDK and the SDK already has `errors.py`. Define a small
+  typed error there, for example `ToolResolutionError(RuntimeError)` carrying structured
+  context (the failing ref count, the HTTP status), and raise it in `tools.py` instead of
+  four bare `RuntimeError`s. It subclasses `RuntimeError`, so the existing
+  `pytest.raises(RuntimeError, ...)` tests still pass, and callers can catch narrowly. Keep
+  it in the SDK `errors.py` next to `UnsupportedHarnessError`, since the failure is about
+  tool resolution (an SDK concept) and the service is the caller. Caveat: the service is not
+  a FastAPI router with an `@intercept_exceptions` boundary the way `api/` is; it runs inside
+  `ag.workflow`. So this is about *type clarity and structured context*, not about HTTP
+  mapping. Do not over-engineer a full hierarchy; one typed error is enough today.
+
+### 10. Module exports / `__init__.py`
+
+- **House pattern.** The agents `__init__.py` curates `__all__` with `#` section comments
+  grouping DTOs, the UI codec, tool definitions, resolution, ports, errors, adapters
+  (`__init__.py:73-128`).
+- **Current state.** The new tool exports are grouped under two labelled sections: "Tool
+  definitions (the neutral tool vocabulary)" and "Local tool resolution (offline; secrets
+  pluggable, env by default)" (`__init__.py:97-111`). Every new name (`AgentToolDef`,
+  `LocalToolResolver`, `SecretResolver`, `EnvSecretResolver`, `ResolvedTools`, the parsers)
+  is exported.
+- **Proposal: Keep.** The exports are consistent with the file's own pattern, the section
+  comments are accurate, and the grouping reads well.
+
+### 11. Tests
+
+- **House pattern.** Unit vs integration split by directory; network/gateway behind
+  `pytest.mark.integration`; module docstrings naming what is locked.
+- **Current state.** Clean. Unit tool tests sit in
+  `sdks/python/oss/tests/pytest/unit/agents/test_tool_defs.py` and `test_tool_resolution.py`
+  and in `services/oss/tests/pytest/unit/agent/test_tool_refs.py`. The gateway HTTP path is
+  isolated in `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py` with
+  `pytestmark = pytest.mark.integration` (`:19`). Fixtures follow the package style: a small
+  in-test `SecretResolver` subclass `_DictSecrets` (`test_tool_resolution.py:21-28`), and an
+  `install_http` fixture for the integration mock. Each test module opens with a docstring
+  (`test_tool_resolution.py:1-8`, `test_tool_refs.py:1-8`).
+- **Proposal: Keep.** This dimension is exemplary. The offline/online split mirrors the
+  code's own offline/gateway split, the fixtures use the real ABCs, and the docstrings make
+  each file self-explaining. One tiny note: `test_tool_refs.py` imports the private
+  `_gateway_ref` (`test_tool_refs.py:15`); testing a private is fine here since it is the
+  unit under test, but flag it so nobody "fixes" it by exporting the symbol.
+
+## Prioritized recommendation
+
+What a CTO would push on, highest leverage and lowest risk first. Be honest: most of this
+code is already right.
+
+1. **Make `ResolvedTools` a Pydantic `BaseModel`.** One-line change. Removes the only
+   `@dataclass` in a Pydantic package and matches `api/AGENTS.md`'s typed-DTO rule. Do now.
+2. **Replace the bare `RuntimeError`s in `services/oss/src/agent/tools.py` with one typed
+   error** (`ToolResolutionError(RuntimeError)` in the SDK `errors.py`) carrying structured
+   context. Subclassing `RuntimeError` keeps the existing tests green. Do now.
+3. **Have the service `resolve_tools` return the `ResolvedTools` model instead of a
+   3-tuple.** This kills the tuple anti-pattern (`api/AGENTS.md:341-344`) and the
+   field-by-field unpack in `app.py`. Depends on item 1. Do next.
+4. **Defer: typed wire specs with `to_wire()`.** Real consistency win, but the wire shape is
+   shared three ways and pinned by golden tests; the payoff today is aesthetic. Revisit when
+   the tool wire shape grows.
+5. **Defer / optional: keyword-only on the free spec builders, and lifting `SecretResolver`
+   into `interfaces.py`.** Both are parity polish. Do them only as part of a deliberate
+   sweep or when a second `SecretResolver` consumer lands.
+
+Everything else (file placement, file names, class names, function structure, docstrings,
+variable naming, exports, tests) already matches the layer's conventions. Leave it alone.
diff --git a/docs/design/agent-workflows/sdk-local-tools/organization-proposal.md b/docs/design/agent-workflows/sdk-local-tools/organization-proposal.md
new file mode 100644
index 0000000000..1450e4fbbb
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/organization-proposal.md
@@ -0,0 +1,612 @@
+# Tools code organization proposal
+
+> Implemented on 2026-06-19. The package, naming, validation, service-adapter, MCP,
+> API-model-sharing, test-organization, and runner-file recommendations below are now reflected
+> in the working tree.
+
+## Recommendation
+
+Do not keep growing `agents/tool_defs.py` and `agents/tool_resolution.py`.
+
+Move the SDK-owned tool runtime into a small `agents/tools/` package, name declarative
+objects `*Config`, name runnable objects `*Spec`, keep compatibility parsing separate from
+resolution, and make unsupported tools and missing secrets explicit policy decisions rather
+than silent skips.
+
+Keep MCP out of that package while MCP remains out of release scope. Either remove the
+dormant MCP flow from this slice or place it in a sibling `agents/mcp/` package with its own
+capability and rollout decisions.
+
+This is an organization and naming proposal. It does not require changing the persisted
+discriminator values (`builtin`, `gateway`, `code`, `client`) or the runner wire keys in the
+first refactor.
+
+## Pre-refactor inventory
+
+### Python SDK
+
+The local-tools implementation currently lives in:
+
+- `sdks/python/agenta/sdk/agents/tool_defs.py`
+  - Pydantic tool-definition models.
+  - MCP server model.
+  - compatibility parsing for loose and legacy config shapes.
+- `sdks/python/agenta/sdk/agents/tool_resolution.py`
+  - secret-source interface and environment implementation.
+  - config-to-wire mapping.
+  - local resolution orchestration.
+  - gateway skip policy.
+  - MCP feature-gate policy.
+- `sdks/python/agenta/sdk/agents/dtos.py`
+  - raw `AgentConfig.tools` and `AgentConfig.mcp_servers`.
+  - resolved tool fields on `SessionConfig`.
+  - harness-facing tool and MCP delivery fields.
+- `sdks/python/agenta/sdk/agents/errors.py`
+  - `ToolResolutionError`, currently not used by the resolver or service gateway path.
+- `sdks/python/agenta/sdk/agents/adapters/harnesses.py`
+  - maps already-resolved tool specs into harness configuration.
+- `sdks/python/agenta/sdk/agents/utils/wire.py`
+  - serializes harness configuration to the TypeScript runner contract.
+- `sdks/python/agenta/sdk/utils/types.py`
+  - imports the tool models to generate the editable agent-config schema.
+- `sdks/python/agenta/sdk/agents/__init__.py`
+  - re-exports most tool models, parsers, resolvers, and secret interfaces.
+
+SDK tests currently live in:
+
+- `sdks/python/oss/tests/pytest/unit/agents/test_tool_defs.py`
+- `sdks/python/oss/tests/pytest/unit/agents/test_tool_resolution.py`
+- `sdks/python/oss/tests/pytest/unit/agents/test_wire_contract.py`
+
+### Python agent service
+
+Server-side composition currently lives in:
+
+- `services/oss/src/agent/tools.py`
+  - partitions gateway and locally derivable tools.
+  - implements the HTTP gateway resolver.
+  - adapts gateway responses to the runner wire shape.
+  - provides the vault-backed secret resolver.
+  - reads the MCP feature flag.
+- `services/oss/src/agent/secrets.py`
+  - resolves model-provider keys.
+  - resolves named tool secrets through the future vault consumer endpoint.
+- `services/oss/src/agent/app.py`
+  - calls the resolvers and fills `SessionConfig`.
+
+Service tests currently live in:
+
+- `services/oss/tests/pytest/unit/agent/test_tool_refs.py`
+- `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py`
+- `services/oss/tests/pytest/integration/agent/test_resolve_secrets_http.py`
+
+### TypeScript runner
+
+Runtime execution lives in:
+
+- `services/agent/src/protocol.ts`
+  - runner wire types, including `ResolvedToolSpec` and `McpServerConfig`.
+- `services/agent/src/tools/execute.ts`
+  - dispatches resolved tools by executor kind.
+- `services/agent/src/tools/code.ts`
+  - code-tool subprocess execution and environment isolation.
+- `services/agent/src/tools/client.ts`
+  - calls the Agenta callback endpoint; despite the filename, this is not a client-fulfilled
+    tool implementation.
+- `services/agent/src/tools/relay.ts`
+  - file relay for callbacks from isolated sandboxes.
+- `services/agent/src/tools/mcp-server.ts`
+  - exposes resolved tools through MCP.
+- `services/agent/src/tools/mcp-bridge.ts`
+  - MCP transport bridge.
+
+Runner tests live in `services/agent/test/`, notably `tool-dispatch.test.ts`,
+`tool-bridge.test.ts`, and `mcp-servers.test.ts`.
+
+### Platform API
+
+The existing platform tools domain is already organized as a package:
+
+- `api/oss/src/apis/fastapi/tools/` - HTTP models, router, boundary parsing.
+- `api/oss/src/core/tools/` - DTOs, interfaces, service, registry, exceptions, providers.
+- `api/oss/src/dbs/postgres/tools/` - database entities, mappings, DAO.
+
+This domain covers the platform catalog, provider connections, and remote execution. It is
+related to, but not identical to, the SDK runtime-tool concern.
+
+## Problems in the current SDK shape
+
+### The files are split by chronology, not responsibility
+
+`tool_defs.py` means “the first tool file” and `tool_resolution.py` means “everything added
+after parsing.” The latter currently owns interfaces, an implementation, pure mappers,
+orchestration, logging behavior, unsupported-tool policy, and MCP rollout policy.
+
+Those concerns change for different reasons and should not share one module.
+
+### `Def` does not explain the lifecycle stage
+
+The important distinction is:
+
+- configuration: persisted, declarative, may contain compatibility shapes.
+- runtime specification: validated and ready for the runner.
+
+`BuiltinToolDef` and `AgentToolDef` do not communicate that distinction. `*Config` and
+`*Spec` do.
+
+### “Local” is already inaccurate
+
+`LocalToolResolver` is used by the service with `_VaultSecretResolver`, which performs an
+HTTP-backed secret lookup. The class really resolves the locally derivable tool kinds; it
+does not guarantee that all dependencies are local.
+
+### Untyped dictionaries erase the contract
+
+`custom_tools: List[Dict[str, Any]]` and helper return values such as
+`code_tool_spec(...) -> Dict[str, Any]` make required fields, executor-specific fields, and
+snake_case-to-camelCase conversion implicit.
+
+The TypeScript side already has `ResolvedToolSpec`. Python should have the matching typed
+model and serialize aliases at the wire boundary.
+
+### The canonical tool configuration is duplicated
+
+The SDK defines `AgentToolDef`; API core separately defines `AgentToolReference`,
+`AgentBuiltinTool`, and `AgentComposioTool`. Their discriminators and supported variants
+already differ.
+
+The evaluations runtime demonstrates the preferred precedent:
+`api/oss/src/core/evaluations/runtime/types.py` imports the SDK runtime classes directly and
+defines only API-specific additions. The platform tools API should reuse the SDK-owned
+neutral tool config and keep only catalog, connection, and provider-execution DTOs in API
+core.
+
+### Parsing failures are silent
+
+`parse_tool_def` catches `ValidationError` and returns `None`. `parse_tool_defs` then drops
+the entry. This is useful for a narrow compatibility adapter, but unsafe as the default
+runtime behavior: a configured tool can disappear without telling the caller which entry
+was invalid.
+
+### Resolution policy is hidden inside mechanics
+
+Current policy choices are embedded in implementation:
+
+- gateway tools are logged and skipped offline.
+- missing named secrets are omitted.
+- MCP declarations become an empty list when a flag is off.
+- unsupported gateway providers are logged and skipped.
+
+These affect correctness and security. They need typed errors or explicit policy arguments.
+
+### MCP is a sibling concept but is folded into one resolver result
+
+The config itself treats `mcp_servers` as a sibling of `tools`, but `ResolvedTools` and
+`LocalToolResolver` own both. Feature gating also happens during resolution even though the
+actual constraint is a harness/backend execution capability.
+
+### Public and private surfaces are unclear
+
+The top-level `agenta.sdk.agents` package exports models, parsers, resolvers, and secret
+interfaces together. The service imports `attach_orthogonal` from the implementation module.
+That helper is an internal mapper, not an SDK concept.
+
+### Gateway correlation depends on position
+
+Gateway results are correlated back to requested tool configs with positional `zip`. Count
+validation prevents truncation, but it does not prove that the backend preserved ordering.
+The API contract should return a stable request reference or correlation ID and the adapter
+should match on it.
+
+### Tool-name collision behavior is unspecified
+
+There is no visible validation for duplicate custom-tool names or collisions between a
+built-in and a custom tool. The runner eventually registers tools by name, so the resolver
+must define whether collisions are invalid, first-wins, or last-wins. Silent runner-level
+overwriting is not an acceptable contract.
+
+### Source docstrings contain temporary design history
+
+The new modules contain release phases, plan decisions, and extensive architecture rationale.
+That material will become stale. Source docstrings should state ownership and runtime
+contracts; this design workspace should retain the history and tradeoffs.
+
+## Repository parallels to follow
+
+| Concern | Existing parallel | Lesson |
+| --- | --- | --- |
+| Domain package | `api/oss/src/core/tools/` | Split a growing domain by role instead of adding suffixed flat files. |
+| Contracts and adapters | `sdks/python/agenta/sdk/agents/interfaces.py` + `adapters/` | Keep abstract dependencies separate from concrete implementations. |
+| Nested subsystem | `sdks/python/agenta/sdk/engines/running/` | A domain inside the SDK can justify its own package and curated exports. |
+| SDK-owned shared runtime | `sdks/python/agenta/sdk/evaluations/runtime/` + `api/oss/src/core/evaluations/runtime/types.py` | Define neutral runtime models once in the SDK; API code imports the same classes and adds only API-specific contracts. |
+| Typed domain errors | `api/oss/src/core/tools/exceptions.py`, `agents/errors.py` | Raise domain errors with structured context; translate them at boundaries. |
+| Registry/dispatch | `core/tools/registry.py`, `engines/running/registry.py` | Use a registry only when multiple implementations exist; do not begin with conditionals spread across callers. |
+| Wire contract | `agents/utils/wire.py` + `services/agent/src/protocol.ts` | Keep Python names idiomatic and perform camelCase serialization at one boundary. |
+| Tests by boundary | `docs/designs/testing/testing.structure.specs.md` | Keep pure parsing/resolution tests in SDK unit tests and HTTP gateway behavior in service integration tests. |
+
+Patterns not worth copying include the 622-line `agents/dtos.py`, the 569-line API
+`core/tools/service.py`, broad `utils.py` modules, and legacy SDK modules that print HTTP
+errors or catch exceptions without a domain contract.
+
+## Proposed SDK package
+
+```text
+sdks/python/agenta/sdk/agents/tools/
+  __init__.py
+  models.py
+  interfaces.py
+  parsing.py
+  compat.py
+  resolver.py
+  wire.py
+  errors.py
+```
+
+### `models.py`
+
+Own declarative config and resolved runtime models:
+
+- `ToolConfigBase`
+- `BuiltinToolConfig`
+- `GatewayToolConfig`
+- `CodeToolConfig`
+- `ClientToolConfig`
+- `ToolConfig` - discriminated union.
+- `CallbackToolSpec`
+- `CodeToolSpec`
+- `ClientToolSpec`
+- `ToolSpec` - discriminated union for the runner-facing contract.
+- `ResolvedToolSet`
+
+Keep persisted discriminator values unchanged. Use Pydantic aliases for runner keys such as
+`inputSchema`, `callRef`, and `needsApproval`.
+
+Canonical models should use `ConfigDict(extra="forbid")`. Compatibility belongs before
+model validation, not in permissive canonical models.
+
+MCP types do not belong here while MCP is out of scope. If retained, place them under a
+sibling `agents/mcp/` package:
+
+```text
+sdks/python/agenta/sdk/agents/mcp/
+  __init__.py
+  models.py
+  interfaces.py
+  resolver.py
+  errors.py
+```
+
+### `parsing.py`
+
+Own strict conversion from already-supported mappings into canonical models:
+
+- `parse_tool_config`
+- `parse_tool_configs`
+
+Raise `ToolConfigError` with the item index and Pydantic cause. It should never silently
+drop executable configuration.
+
+### `compat.py`
+
+Own old playground and persisted payload shapes:
+
+- bare built-in names.
+- `{"name": ...}` built-ins.
+- the legacy `composio` discriminator.
+- OpenAI function-picker gateway slugs.
+
+Return canonical `ToolConfig` values plus structured diagnostics. Callers may select strict
+or best-effort compatibility behavior explicitly; best-effort must not be the only API.
+
+### `interfaces.py`
+
+Own narrow injected dependencies:
+
+```python
+class ToolSecretProvider(Protocol):
+    async def get_many(self, names: Sequence[str]) -> Mapping[str, str]: ...
+
+
+class GatewayToolResolver(Protocol):
+    async def resolve(
+        self,
+        tools: Sequence[GatewayToolConfig],
+    ) -> GatewayToolResolution: ...
+```
+
+`ToolSecretProvider` is intentionally scoped. A generic `SecretResolver` name collides with
+provider credentials, vault management, and other SDK secret concepts.
+
+### `resolver.py`
+
+Own one orchestration path:
+
+```python
+class ToolResolver:
+    def __init__(
+        self,
+        *,
+        secret_provider: ToolSecretProvider,
+        gateway_resolver: GatewayToolResolver | None = None,
+        missing_secret_policy: MissingSecretPolicy = MissingSecretPolicy.ERROR,
+    ) -> None: ...
+
+    async def resolve(
+        self,
+        tool_configs: Sequence[ToolConfig],
+    ) -> ResolvedToolSet: ...
+```
+
+The resolver should:
+
+1. partition configs by kind.
+2. collect declared secret names once.
+3. fetch secrets once.
+4. build typed specs with small private pure functions.
+5. invoke the injected gateway resolver when gateway tools exist.
+6. return a newly constructed immutable result.
+
+It should not parse raw payloads, read feature flags, silently skip unsupported kinds, or
+mutate the result after construction.
+
+The environment implementation can live here while it remains tiny:
+
+- `EnvironmentToolSecretProvider`
+
+Move it to `secrets.py` only if a second built-in implementation appears.
+
+### `wire.py`
+
+Own the one conversion from typed, snake_case Python models to the TypeScript runner
+contract. Harness adapters should consume typed specs and should not normalize arbitrary
+dictionaries.
+
+### `errors.py`
+
+Use one domain base with structured context:
+
+- `ToolError`
+- `ToolConfigError`
+- `ToolResolutionError`
+- `UnsupportedToolProviderError`
+- `GatewayToolResolutionError`
+- `MissingToolSecretError`
+
+HTTP status and response excerpts belong on the service adapter error or as structured cause
+context, not as generic `RuntimeError` strings in SDK orchestration.
+
+### `__init__.py`
+
+Export the intended public SDK:
+
+- config models.
+- `ToolResolver`.
+- resolver/provider protocols intended for user extension.
+- domain errors.
+
+Do not export private mappers or compatibility-only helpers from `agenta.sdk.agents`.
+
+## Naming proposal
+
+| Current | Proposed | Reason |
+| --- | --- | --- |
+| `AgentToolDef` | `ToolConfig` | It is persisted declarative configuration, not an arbitrary definition. |
+| `_ToolDefBase` | `ToolConfigBase` | Names the lifecycle stage. Keep private if users never instantiate it. |
+| `BuiltinToolDef` | `BuiltinToolConfig` | Distinguishes config from resolved spec. |
+| `GatewayToolDef` | `GatewayToolConfig` | Same distinction; keep the discriminator value stable. |
+| `CodeToolDef` | `CodeToolConfig` | Same distinction. |
+| `ClientToolDef` | `ClientToolConfig` | Same distinction. |
+| `McpServer` | `MCPServerConfig` | It is configuration, not a running server. Matches repository acronym style (`API`, `LLM`). |
+| `ResolvedTools` | `ResolvedToolSet` | Communicates an aggregate value, not a resolver action. |
+| `SecretResolver` | `ToolSecretProvider` | Narrows scope and describes an injected source. |
+| `EnvSecretResolver` | `EnvironmentToolSecretProvider` | Explicit source and scope. |
+| `LocalToolResolver` | `ToolResolver` | The resolver can use local or remote injected providers. |
+| `parse_tool_def(s)` | `parse_tool_config(s)` | Matches the renamed model and avoids abbreviation. |
+| `attach_orthogonal` | `_apply_tool_metadata` | “Orthogonal” describes a design argument, not the function's behavior. |
+| `code_tool_spec` | `_build_code_tool_spec` | Verb-first private mapper. |
+| `client_tool_spec` | `_build_client_tool_spec` | Verb-first private mapper. |
+| `mcp_server_spec` | `_build_mcp_server_spec` | Verb-first private mapper. |
+| `defs` | `tool_configs` | Avoid unexplained abbreviation. |
+| `entry` | `tool_spec` | Name the actual value. |
+| `values` | `secret_values` | Name the domain content. |
+| `custom_tools` internally | `tool_specs` | Keep `customTools` only as the legacy wire field name. |
+| `builtins` internally | `builtin_names` | Distinguishes names from built-in config objects. |
+
+Use `McpServerConfig` instead of `MCPServerConfig` only if cross-language spelling is
+prioritized over the Python codebase's existing acronym convention. Pick one spelling and
+apply it in Python docs, schema titles, and tests.
+
+## Function and file structure
+
+### Functions
+
+- Use keyword-only parameters when a function has multiple same-shaped values.
+- Keep parsing, resolution, mapping, and I/O in separate functions.
+- Keep mapper helpers private and pure.
+- Construct return objects once; avoid assigning fields after creation.
+- Use one clear noun per variable: `tool_config`, `tool_spec`, `secret_names`,
+  `secret_values`, `gateway_tools`.
+- Do not expose an internal `resolve_defs` method solely so the service can bypass another
+  public method. Inject the gateway implementation into the same resolver.
+- Validate duplicate resolved names and built-in/custom collisions before returning.
+
+### Docstrings and comments
+
+The current modules read partly like design documents. Follow the repository's API comment
+guidance:
+
+- module docstring: purpose and dependency boundary in a few lines.
+- public class/function docstring: contract, important policy, raised errors.
+- no docstring for obvious Pydantic data containers.
+- comments explain security constraints or compatibility reasons, not the next line of code.
+- keep architecture rationale in this design folder.
+
+### Exceptions
+
+- Core SDK code raises tool-domain exceptions.
+- The service gateway adapter wraps `httpx` errors as `GatewayToolResolutionError`.
+- HTTP route code translates domain errors to HTTP responses.
+- Runner execution failures remain tool-result failures when the model loop is expected to
+  continue.
+- Missing declared secrets should fail before tool execution by default. A best-effort mode
+  can exist for backward compatibility, but it must be explicit and tested.
+- Gateway results should correlate by stable ID/reference, not response position.
+
+## Service-side proposal
+
+Use a service-side adapter package. The current file already contains composition, gateway
+HTTP, vault adaptation, and rollout policy:
+
+```text
+services/oss/src/agent/
+  tools/
+    __init__.py  # curated service entry point
+    resolver.py  # build the SDK ToolResolver for one request
+    gateway.py   # Agenta HTTP GatewayToolResolver
+    secrets.py   # VaultToolSecretProvider
+  secrets.py     # model-provider/harness secrets only
+  client.py      # shared Agenta HTTP client configuration
+```
+
+Move `_VaultSecretResolver` to `tools/secrets.py` and rename it
+`VaultToolSecretProvider`.
+
+Return `ResolvedToolSet` from service resolution instead of a three-item tuple. This removes
+positional unpacking from `app.py` and makes future additions non-breaking.
+
+Feature flags should be read in the service composition/config layer. MCP capability checks
+belong in harness/backend selection or delivery, not in the generic tool resolver.
+
+## TypeScript proposal
+
+The TypeScript runner already has the clearest tool directory. Keep that structure, with two
+targeted naming fixes:
+
+- rename `tools/execute.ts` to `tools/dispatch.ts`; the module selects an executor, while
+  `code.ts` performs execution.
+- rename `tools/client.ts` to `tools/callback.ts` or `tools/agenta-callback.ts`; “client” is
+  confused with the `kind: "client"` tool that the browser fulfils.
+
+Keep wire types in `protocol.ts` until the protocol itself is split. Moving only tool types
+would create cross-file protocol ownership without enough benefit.
+
+## Test organization proposal
+
+Once the SDK source becomes a package, mirror the domain:
+
+```text
+sdks/python/oss/tests/pytest/unit/agents/tools/
+  test_models.py
+  test_parsing.py
+  test_resolution.py
+  test_secrets.py
+```
+
+Service tests should separate pure composition from network behavior:
+
+```text
+services/oss/tests/pytest/unit/agent/tools/
+  test_resolution.py
+  test_gateway_mapping.py
+
+services/oss/tests/pytest/integration/agent/tools/
+  test_gateway_http.py
+  test_vault_secrets_http.py
+```
+
+Tests should assert:
+
+- strict invalid-config errors include the failing index and cause.
+- compatibility parsing reports dropped entries.
+- canonical models reject unexpected fields.
+- declared missing secrets follow the selected policy.
+- unsupported gateway providers raise a typed error.
+- gateway count mismatches and incomplete responses retain structured context.
+- gateway responses are matched by stable correlation value rather than position.
+- duplicate tool names and built-in/custom collisions have a deterministic error.
+- no secret value appears in logs or exception messages.
+- Python typed specs serialize exactly to the TypeScript golden wire contract.
+
+## Migration plan
+
+### Phase 1: package-only refactor
+
+- Add `agents/tools/`.
+- Move code without changing persisted values or wire output.
+- Keep temporary imports from `tool_defs.py` and `tool_resolution.py` that emit deprecation
+  warnings only if these modules have shipped publicly.
+- Update internal imports to the new package.
+- Reuse SDK tool-config models from API core and remove the duplicate API agent-tool union
+  only after API compatibility tests are in place.
+
+### Phase 2: type the resolved contract
+
+- Add `ToolSpec` variants and `ResolvedToolSet`.
+- Serialize runner aliases in one place.
+- Replace `List[Dict[str, Any]]` and tuple returns.
+
+### Phase 3: make policy explicit
+
+- Add strict parsing and diagnostics.
+- Add missing-secret policy.
+- Raise typed errors for unsupported gateway configuration.
+- Add duplicate-name and collision validation.
+- Remove silent skip behavior from the default path.
+
+### Phase 4: simplify service composition
+
+- Inject `VaultToolSecretProvider` and the HTTP gateway resolver.
+- Remove `resolve_defs` and internal mapper imports.
+- Move feature-gate reads to composition.
+
+### Phase 5: isolate or defer MCP
+
+- Remove MCP from the tool resolver while it is disabled.
+- If implementation must remain, move it to `agents/mcp/` and give it independent capability
+  checks and tests.
+
+### Phase 6: optional runner rename
+
+- Rename TypeScript dispatch/callback files in a separate mechanical change.
+
+## Likely CTO review questions
+
+1. Which layer is the source of truth for tool configuration: SDK, API core, or runner
+   protocol?
+2. Why are API `AgentToolReference`, SDK `ToolConfig`, and TypeScript
+   `ResolvedToolSpec` separate, and where is each conversion tested?
+3. Why can an invalid configured tool disappear silently?
+4. Why does an offline resolver silently skip a gateway tool instead of failing the run?
+5. Why can a code tool execute when a declared secret is missing?
+6. Why is a class named `LocalToolResolver` used with an HTTP vault dependency?
+7. Why are runnable specs dictionaries instead of a discriminated model?
+8. Why is MCP feature policy inside resolution when MCP support is a runtime capability?
+9. Why is MCP grouped into `ResolvedTools` when the config calls it a sibling?
+10. Which names are public API, and what is the compatibility plan if they have shipped?
+11. Why does the service import an internal mapping helper from the SDK?
+12. Why does `ToolResolutionError` exist while production paths raise `RuntimeError`?
+13. Can secret values leak through logs, Pydantic reprs, exceptions, traces, or serialized
+    session configuration?
+14. Are gateway response ordering and one-to-one mapping guaranteed by a documented API
+    contract?
+15. What prevents Python and TypeScript tool specs from drifting?
+16. Is the first PR a behavior change or an organization change, and can reviewers verify
+    that distinction from tests?
+17. Why do SDK and API core define different versions of the same agent-tool config?
+18. How are duplicate names and built-in/custom collisions handled?
+19. How is gateway response ordering guaranteed, and why is position an acceptable key?
+20. Why is MCP implemented across config, resolution, wire, and runner while the feature is
+    declared out of scope?
+
+## Decision requested
+
+Approve the package and lifecycle naming before adding more tool kinds:
+
+- `agents/tools/`
+- `*Config` for persisted declarations.
+- `*Spec` for runnable contracts.
+- `ToolResolver` with injected providers.
+- typed errors and explicit policy.
+
+If this is accepted, the next implementation PR should be a package-only refactor with no
+wire or persisted-schema changes.
diff --git a/docs/design/agent-workflows/sdk-local-tools/plan.md b/docs/design/agent-workflows/sdk-local-tools/plan.md
new file mode 100644
index 0000000000..4ee5ddaf26
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/plan.md
@@ -0,0 +1,175 @@
+# Plan: settled decisions and phased direction
+
+The reviewer settled the four open decisions on 2026-06-19. This page records the answers,
+then the phases that follow from them. Where a decision changed the prior framing, the
+change is called out so the research and status pages stay consistent.
+
+## Settled decisions
+
+### Decision 1: Where does resolution live? Settled: split by tool kind.
+
+Resolution splits on the executor `type` (`sdks/python/agenta/sdk/agents/tool_defs.py`).
+Only one kind needs Agenta. The rest resolve locally.
+
+- **`gateway` (Composio) stays in the backend.** Resolving a gateway tool validates a
+  Composio connection and holds the provider key and OAuth, so it cannot move to the SDK.
+  The call routes back through `/tools/call` and the key never leaves the server. This is the
+  tool kind a user sometimes calls "the built-in Agenta tool," but in the taxonomy it is
+  `gateway`, not `builtin`.
+- **`builtin`, `code`, `client`, and MCP resolve locally.** A `builtin` is just a name. A
+  `code` tool's spec is built from local data. A `client` tool is name plus schema. None of
+  them needs Agenta to resolve. MCP is local too, but it ships later (Decision 2).
+
+So we build both shapes behind one interface:
+
+- An offline **`LocalToolResolver`** (the default) resolves `builtin` + `code` + `client`
+  (and MCP later) from local data, with no network call.
+- An opt-in **`AgentaToolResolver`** resolves `gateway` by posting refs to the Agenta public
+  API. A user who needs a gateway tool opts into this and accepts the network call.
+
+A user with only built-in and code tools stays fully offline. The goal the reviewer set is
+plain: you should be able to run everything except gateway without Agenta, using your own
+secrets.
+
+One caveat on `client` tools. They resolve locally, but a headless standalone run has no
+browser to fulfil them, and the in-process Pi engine skips them
+(`services/agent/src/engines/pi.ts:165`). They are locally resolvable, not locally runnable,
+unless the user supplies a fulfiller. We document this as a limitation, not a blocker.
+
+**Dependency direction is fixed regardless: `service -> SDK`, never the reverse.** The
+service already imports the SDK's tool vocabulary
+(`services/oss/src/agent/tools.py:23`). When spec-building logic moves into the SDK, the
+service consumes it. The sibling effort locks the same rule
+([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md),
+"Dependency direction").
+
+### Decision 2: How do code and MCP executors run locally? Settled: code reuses the Pi engine; MCP is out of scope, wired but flag-gated off.
+
+`LocalBackend`'s Pi path is the bundled in-process Pi engine, which already executes `code`
+and `callback` tools (`services/agent/src/engines/pi.ts:144`). So code execution needs no new
+executor under Pi. It needs the bundle the sibling effort ships, plus a correctly filled code
+spec from the resolver.
+
+MCP is out of scope for the first releases. We keep the functions and flows in place (parse,
+the `mcp_servers` field on `SessionConfig`, a resolver path) but keep them a no-op behind a
+feature flag that defaults to off, for both Pi and Claude. This matches the code today: the
+in-process Pi engine already hard-codes `mcpTools: false`
+(`services/agent/src/engines/pi.ts:58`), so Pi is already a no-op for MCP. The flag
+formalizes that and gives Claude the same off-by-default treatment. The business logic stays
+dormant behind the flag until a later phase turns it on.
+
+### Decision 3: How are secrets supplied locally? Settled: env by default, behind a pluggable `SecretResolver`. The vault is a later, connected path.
+
+First, the secret model, because the prior framing called part of this a bug and it is not.
+
+Secrets live in the project vault. One table, one CRUD stack
+(`POST/GET/PUT/DELETE /secrets` under `/vault/v1`), encrypted at rest. A secret's `kind`
+decides how it behaves at run time. Two kinds matter here:
+
+- **`provider_key`**: a provider-indexed LLM credential. `data` is
+  `{kind: "openai", provider: {key: "..."}}`, where the inner `kind` is a fixed provider
+  enum. It is consumed today by `get_user_llm_providers_secrets()`
+  (`api/oss/src/core/secrets/utils.py:54`), which maps each provider to a fixed env var
+  (`openai -> OPENAI_API_KEY`). The user never names these. The identity is the provider.
+- **`custom_secret`**: a free-text `name -> value` entry. `header.name` is the user-chosen
+  name (for example `GITHUB_TOKEN`); `data` is `{secret: {key: "..."}}`, with no provider
+  enum. It is storage-only this iteration by design. Nothing reads it, there is no env-var
+  mapping, and it is not injected into completions or the agent runtime.
+
+Same endpoint, same table, same encryption and CRUD. The only difference is the `kind` and
+what consumes the value.
+
+Now the problem this effort must solve. A `code` tool names the secrets it needs
+(`secrets: ["GITHUB_TOKEN"]`), and an MCP server names an env-var-to-secret-name map. The
+config stores only the name, never the value. At run time something must turn the name into a
+value and inject it as an env var into the sandbox subprocess (code) or the server process
+(MCP). On the Agenta server path that consumer would read the `custom_secret` rows by name,
+but it is not built yet, because `custom_secret` is storage-only this iteration. So
+`resolve_named_secrets` (`services/oss/src/agent/secrets.py:75`) anticipates a consumer that
+does not exist; until it does, declared custom secrets resolve to `{}` and the tool runs
+without those env vars. This is the expected current state given the storage-only design, not
+a bug to fix in this effort.
+
+For a standalone run we do not want to depend on the vault at all. The SDK has no secret
+resolution of any kind today. So a standalone agent has no way to supply `GITHUB_TOKEN` to
+its code tool. The settled answer:
+
+- **Env is the offline default.** A code tool's declared secret name reads from the process
+  environment. `GITHUB_TOKEN` comes from `os.environ`. This is what "use your own secrets and
+  run" means.
+- **A pluggable `SecretResolver` is the interface env plugs into.** The user can back it with
+  a `.env` file, a secret manager, or their own vault. Env is just the built-in default
+  implementation. We ship the interface in the first slice.
+- **The Agenta vault over HTTP is a later, connected path.** It reads `custom_secret` rows by
+  name and depends on the future consumer endpoint being built. That work is tracked as an
+  open issue ([../open-issues.md](../open-issues.md)) and is tied to the named-secrets effort
+  (`docs/design/vault-named-secrets/`).
+
+### Decision 4: What is the minimum first slice? Settled: `LocalBackend` (Pi) + built-in + code + env secrets, offline.
+
+A standalone agent that reads files, runs bash, and runs its own code tools, with zero Agenta
+calls. It excludes gateway (server-bound), client (needs a browser), and MCP (flag-gated
+off). This implies building the `LocalBackend` Pi path, which the sibling effort owns as its
+Phase 0.
+
+## Phased direction
+
+The phases assume the settled decisions above.
+
+### Phase 0 (prerequisite, owned elsewhere): `LocalBackend` runs a tool-free agent
+
+`LocalBackend` is a stub that raises (`sdks/python/agenta/sdk/agents/adapters/local.py:30`).
+The sibling effort ([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md))
+must ship at least the Pi path (the bundled JS runner) before any tool work runs end to end.
+
+### Phase 1: built-in tools, offline
+
+The smallest real step. Built-in tools are just names that run in-harness
+(`services/oss/src/agent/tools.py:183`). A standalone user lists built-in names on the
+`AgentConfig`, and the harness runs them. This is mostly a wiring check that the names flow
+onto `SessionConfig.builtin_tools` under `LocalBackend`.
+
+### Phase 2: a local tool resolver, code tools, env secrets
+
+Build the `LocalToolResolver`. It reuses the per-kind spec builders that
+`services/oss/src/agent/tools.py` already has (`_resolve_code` at `:127`, `_client_spec` at
+`:153`), moved into the SDK so the service and the standalone user call the same code, with
+the dependency pointing `service -> SDK`. Add a `SecretResolver` with an env default. At the
+end of this phase a standalone agent runs its code tools offline, because the in-process Pi
+engine already executes them (`services/agent/src/engines/pi.ts:169`).
+
+This is the first slice with real value. It is the phase to aim the first PR at.
+
+### Phase 3: opt-in Agenta-backed resolver for gateway tools
+
+Add `AgentaToolResolver`. It posts gateway refs to the Agenta public API and wires the
+`/tools/call` callback, exactly as `_resolve_gateway` (`services/oss/src/agent/tools.py:67`)
+does, but driven from the SDK. This is opt-in and network-gated, and it keeps the Composio key
+server-side. It is connected-standalone: no sidecar, but a real Agenta API call. Client tools
+surface here as a documented limitation (no browser to fulfil them in a headless local run).
+
+### Phase 4: land the vault consumer for named secrets
+
+Add the runtime consumer that resolves `custom_secret` rows by name, so the connected
+resolver can read the vault and the server path stops resolving declared secrets to `{}`.
+This is co-owned with the named-secrets effort (`docs/design/vault-named-secrets/`), whose
+current iteration is storage-only by design. Tracked in
+[../open-issues.md](../open-issues.md).
+
+### Phase 5: MCP, and Claude locally
+
+The last and least certain. Turning on local MCP needs an executor the in-process Pi engine
+does not have (`services/agent/src/engines/pi.ts:58`); it stays flag-gated off until then.
+Claude locally needs the `claude-agent-sdk` path the sibling effort owns, plus MCP-style tool
+delivery on top. Both are real work and neither blocks the value in phases 1 and 2.
+
+## Phase summary
+
+| Phase | Delivers | Network | Blocked on |
+| --- | --- | --- | --- |
+| 0 | tool-free local Pi run | none | sibling effort (`LocalBackend`) |
+| 1 | built-in tools | none | Phase 0 |
+| 2 | code tools + env secrets | none | Phase 1; `LocalToolResolver`, `SecretResolver` |
+| 3 | gateway tools (opt-in) | Agenta API | Phase 2; `AgentaToolResolver` |
+| 4 | vault named-secret consumer | Agenta API | API work; named-secrets coordination |
+| 5 | MCP + Claude locally | varies | Phases 2-4; new executor, MCP flag flipped on |
diff --git a/docs/design/agent-workflows/sdk-local-tools/research.md b/docs/design/agent-workflows/sdk-local-tools/research.md
new file mode 100644
index 0000000000..9704e3cb26
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/research.md
@@ -0,0 +1,170 @@
+# Research: the verified current state
+
+This is the map a prior review drew and this document re-verified against the code on
+2026-06-19. Every claim cites `file:line`. Read it as the ground truth the plan builds on.
+
+The headline: the tool **vocabulary** already lives in the SDK and parses standalone, but
+tool **resolution** and **secret resolution** live only in the service, behind HTTP calls to
+the Agenta API. A standalone run has the words but not the machinery.
+
+## The pipeline, end to end
+
+A configured tool travels through four stages before a harness can call it.
+
+1. **Parse**: turn a loose config entry into a typed tool def. *In the SDK already.*
+2. **Resolve**: turn a typed def into a wire-ready runnable spec (a name, a description, an
+   input schema, and either a `callRef`, a code snippet, or a client marker). *Service only.*
+3. **Supply secrets**: fetch the vault values a code tool or MCP server needs. *Service
+   only, and partly broken even there (see below).*
+4. **Execute**: actually run the call when the model invokes the tool. *Lives in the TS
+   runner; the in-process Pi engine already runs code and callback tools.*
+
+Stage 1 is portable today. Stages 2 and 3 are the gap. Stage 4 is closer than it looks.
+
+## Stage 1: parsing (already in the SDK, standalone-ready)
+
+The neutral tool vocabulary and its parser live in the published SDK, with no dependency on
+the Agenta API.
+
+- `sdks/python/agenta/sdk/agents/tool_defs.py:174`: `parse_tool_def` coerces one loose
+  config entry into a typed def. `parse_tool_defs` (`:211`) does a list;
+  `parse_mcp_servers` (`:223`) does MCP servers.
+- The tool union spans four executors, discriminated on `type`:
+  `BuiltinToolDef` (`:55`), `GatewayToolDef` (`:63`), `CodeToolDef` (`:79`),
+  `ClientToolDef` (`:93`), joined as `AgentToolDef` (`:104`).
+- Two orthogonal fields ride on every tool: `needs_approval` and `render`, both on
+  `_ToolDefBase` (`:43`).
+- `McpServer` (`:117`) is a sibling of `tools`, not a tool variant. It declares a transport
+  (`stdio` or `http`), a command, and an env-var-to-vault-secret-name map.
+
+A standalone user can already call `parse_tool_defs(agent.tools)` and get typed defs. The
+parser is permissive on input and strict on output, and it never touches the network.
+
+## Stage 2: resolution (service only)
+
+Resolution lives in the service, not the SDK.
+
+- `services/oss/src/agent/tools.py:164`: `resolve_tools` splits the parsed defs by executor
+  and resolves each kind:
+  - **gateway** (`:67`, `_resolve_gateway`): POSTs the refs to the Agenta API
+    `POST /tools/resolve` (`:89`), then sets a callback to `POST /tools/call` (`:120`). The
+    Composio key and connection auth stay server-side by design.
+  - **code** (`:127`, `_resolve_code`): builds the spec locally from the def, then resolves
+    the tool's declared secrets into a scoped `env` (`:134`).
+  - **client** (`:153`, `_client_spec`): builds a name-plus-schema spec locally, no callback.
+- `resolve_mcp_servers` (`:211`) resolves each declared server's secret env and returns
+  wire-ready server dicts.
+
+Read closely, only the **gateway** branch truly needs the network. The code and client
+branches build their specs from local data; the only network call inside them is the secret
+fetch (stage 3). That is the seam the plan exploits: most of `resolve_tools` is already
+offline-capable logic that simply lives in the wrong layer.
+
+The service feeds the resolved output onto the `SessionConfig` and never the SDK. The wiring
+is in `services/oss/src/agent/app.py`: `resolve_tools` and `resolve_mcp_servers` are
+awaited (`:91`-`:92`), then their output is placed on `SessionConfig` fields `builtin_tools`,
+`custom_tools`, `tool_callback` (`:99`-`:101`). The SDK consumes these; it never produces
+them.
+
+## Stage 3: secrets (service only, and the named-secret consumer is not built yet)
+
+Secrets live in one project vault. One table, one CRUD stack, encrypted at rest. A secret's
+`kind` decides how it behaves at run time. Two kinds matter here.
+
+- **`provider_key`**: a provider-indexed LLM credential. `data` is
+  `{kind: "openai", provider: {key: "..."}}`, the inner `kind` a fixed provider enum. The
+  service consumes it in `resolve_harness_secrets` (`services/oss/src/agent/secrets.py:38`),
+  which fetches the vault via `GET /secrets/` (`:54`) and maps each provider to its env var
+  (`openai -> OPENAI_API_KEY`). This path works. The same filter-and-map lives in the API at
+  `get_user_llm_providers_secrets` (`api/oss/src/core/secrets/utils.py:54`).
+- **`custom_secret`**: a free-text `name -> value` entry. `header.name` is the user-chosen
+  name (for example `GITHUB_TOKEN`); `data` is `{secret: {key: "..."}}`, with no provider
+  enum. It is storage-only this iteration by design. Nothing reads it, there is no env-var
+  mapping, and it is not injected into the agent runtime.
+
+Same endpoint, same table, same encryption and CRUD. The only difference is the `kind` and
+what consumes the value.
+
+A code tool and an MCP server name `custom_secret` entries. `resolve_named_secrets`
+(`services/oss/src/agent/secrets.py:75`) is what they would use: it POSTs the requested names
+to a consumer endpoint (`:97`) and reads back `{name: value}`. That consumer is not built yet.
+The vault router exposes only CRUD (`api/oss/src/apis/fastapi/vault/router.py:36`-`:74`:
+create, list, read, update, delete) and no resolve route. The named-secrets effort is
+storage-only this iteration and says so at `docs/design/vault-named-secrets/context.md:23`:
+"Do **not** inject these secrets into the agent runtime, sandbox, or any invocation."
+
+So `resolve_named_secrets` swallows the missing-endpoint response and returns `{}`
+(`secrets.py:101`-`:107`), logging a warning. A code tool or MCP server that declares a custom
+secret gets an empty env today, on every path including the server. This is the expected
+current state given the storage-only design, not a bug to fix in this effort. Building the
+consumer is later, coordinated work (the plan's Phase 4 and
+[open-issues.md](../open-issues.md)). For the standalone slice, secrets come from a local
+`SecretResolver` (env by default), so the offline path does not wait on the vault consumer at
+all.
+
+## Stage 4: execution (the TS runner, and the in-process Pi engine already does most of it)
+
+Execution lives in the TypeScript runner under `services/agent/src/tools/`:
+`code.ts` runs a code snippet, `mcp-server.ts` bridges resolved tools over MCP, `execute.ts`
+(`runResolvedTool`) is the shared dispatcher, `client.ts` and `relay.ts` round it out.
+
+The important and under-appreciated fact: the **in-process Pi engine already executes tools**.
+`services/agent/src/engines/pi.ts` builds Pi custom tools and branches on the executor
+`kind` (`pi.ts:144`):
+
+- `code`: runs the snippet in a sandbox subprocess with its scoped secret env (`pi.ts:169`).
+- `callback`: POSTs back to Agenta's `/tools/call` (`pi.ts:179`).
+- `client`: skipped in-process; there is no browser to answer (`pi.ts:165`).
+
+`code.ts`'s own header says it is "Shared by every delivery path that runs code locally:
+engines/pi.ts (in-process Pi), extensions/agenta.ts (Pi under rivet), tools/mcp-server.ts."
+This matters because `LocalBackend`'s Pi path is the **bundled in-process Pi engine**. So once
+`LocalBackend` ships that bundle, code-tool *execution* comes nearly for free. The work is to
+hand the engine a correctly resolved code spec with its env filled in, which is stages 2 and 3.
+
+What the in-process Pi engine does **not** do is MCP. `pi.ts:58` hard-codes
+`mcpTools: false`. MCP delivery and execution live only on the rivet path (`rivet.ts`, the
+capability gate at `:799`), which the standalone user does not have. MCP local support is
+therefore a later phase that needs its own executor, not a reuse of the Pi engine.
+
+## The delivery contract the SDK already speaks
+
+The resolved output rides on neutral types the SDK owns, so a local resolver has a clear
+target to produce.
+
+- `SessionConfig` (`sdks/python/agenta/sdk/agents/dtos.py:498`) carries `builtin_tools`,
+  `custom_tools`, `tool_callback`, and `mcp_servers` (`:512`-`:517`).
+- The harness adapters in `sdks/python/agenta/sdk/agents/adapters/harnesses.py` only *shape*
+  already-resolved specs; they never resolve. `_normalize_tool_specs` (`:65`) fills defaults
+  and passes the executor fields (`kind`, `runtime`, `code`, `env`, `callRef`) through to the
+  wire (`:54`, `_TOOL_SPEC_PASSTHROUGH`). `PiHarness` (`:93`) keeps built-ins and delivers
+  specs natively; `ClaudeHarness` (`:114`) drops built-ins and routes over MCP.
+
+So the local resolver's job is well defined: produce the same `builtin_tools` /
+`custom_tools` / `tool_callback` / `mcp_servers` shapes the service produces, from local data.
+Everything downstream already accepts them.
+
+## The current state, per tool kind
+
+| Kind | Resolve today | Execute under `LocalBackend` (Pi) | Standalone reach |
+| --- | --- | --- | --- |
+| **builtin** | trivial; just the name (`tools.py:183`) | yes, runs in-harness | **fully offline.** Closest to free. |
+| **code** | spec built locally; only the secret fetch is remote (`tools.py:127`) | yes, `pi.ts:169` already runs it | **offline if secrets come from env.** The spec builder must move to the SDK; secrets need a local source. |
+| **client** | spec built locally (`tools.py:153`) | skipped in-process (`pi.ts:165`) | needs a browser/fulfiller; **out of offline reach** without a UI. |
+| **gateway** | requires `POST /tools/resolve` + `/tools/call` (`tools.py:67`) | callback to Agenta (`pi.ts:179`) | **server-bound by design.** Connected-standalone only; the provider key stays server-side. |
+| **MCP** | spec built locally; secret env is remote (`tools.py:211`) | not supported in-process (`pi.ts:58`) | needs a local MCP executor; **later phase.** |
+
+## What this means for the plan
+
+Three findings shape the phasing:
+
+1. **Built-in and code tools are within reach of a first offline slice.** Built-in is just a
+   name. Code's spec builder is local logic in the wrong layer, and the in-process Pi engine
+   already executes code. The only true blocker for code is secrets, which env solves.
+2. **Named-secret consumption is not built, by design.** Custom secrets are storage-only this
+   iteration, so any path that relies on the vault for code or MCP secrets either waits on the
+   future consumer endpoint or supplies secrets another way (env, a pluggable resolver). The
+   offline slice takes the second route.
+3. **Gateway and MCP are genuinely harder.** Gateway must stay server-bound. MCP needs an
+   executor the in-process Pi engine does not have. Both belong in later phases, framed as
+   open decisions, not promised in the first slice.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/app-mcp-reassign.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/app-mcp-reassign.md
new file mode 100644
index 0000000000..e4ed98202b
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/app-mcp-reassign.md
@@ -0,0 +1,14 @@
+# Evidence – F-006
+
+`services/oss/src/agent/app.py:91-92`
+```python
+    resolved = await resolve_tools(agent_config.tools)
+    resolved.mcp_servers = await resolve_mcp_servers(agent_config.mcp_servers)
+```
+
+`resolve_tools` already returns mcp_servers empty by construction:
+
+`services/oss/src/agent/tools.py:147-158` (docstring + body) returns `ResolvedTools()` /
+`ResolvedTools` whose `mcp_servers` defaults to `[]`; the function never populates it. Relevance:
+the overwrite is redundant and relies on an implicit "resolve_tools never sets mcp_servers"
+contract; info-level readability/coupling note only.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/attach-orthogonal-mutation.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/attach-orthogonal-mutation.md
new file mode 100644
index 0000000000..77b299368c
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/attach-orthogonal-mutation.md
@@ -0,0 +1,21 @@
+# Evidence – F-005
+
+`sdks/python/agenta/sdk/agents/tool_resolution.py:89-96`
+```python
+def attach_orthogonal(entry: Dict[str, Any], tool: Any) -> Dict[str, Any]:
+    """Carry the orthogonal axes (approval, render) onto a wire spec when set."""
+    if getattr(tool, "needs_approval", False):
+        entry["needsApproval"] = True
+    render = getattr(tool, "render", None)
+    if render is not None:
+        entry["render"] = render
+    return entry
+```
+
+It is a public helper imported by the service:
+
+`services/oss/src/agent/tools.py:36` -> `from agenta.sdk.agents.tool_resolution import (... attach_orthogonal)`
+
+Mutates `entry` in place AND returns it. All current callers pass a fresh dict and use the
+return value, so no live bug. Relevance: latent footgun for a future caller that passes a shared
+dict; dual mutate-and-return contract is inconsistent with the package's other pure builders.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/description-default-inconsistency.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/description-default-inconsistency.md
new file mode 100644
index 0000000000..decca6d700
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/description-default-inconsistency.md
@@ -0,0 +1,29 @@
+# Evidence – F-004
+
+SDK builders default description to the tool name:
+
+`sdks/python/agenta/sdk/agents/tool_resolution.py:99-108`
+```python
+def code_tool_spec(tool: CodeToolDef, env: Dict[str, str]) -> Dict[str, Any]:
+    entry: Dict[str, Any] = {
+        "name": tool.name,
+        "description": tool.description or tool.name,
+        ...
+```
+
+Gateway remap copies the backend field, which may be None:
+
+`services/oss/src/agent/tools.py:131-133`
+```python
+        entry = {
+            "name": name,
+            "description": spec.get("description"),
+```
+
+Runner absorbs a None description downstream:
+
+`services/agent/src/engines/pi.ts:162` -> `description: spec.description ?? spec.name`
+`services/agent/src/tools/mcp-server.ts:67` -> `description: s.description ?? s.name`
+
+Relevance: same "no description" condition yields `name` for code/client but `None` for gateway
+on the wire; harmless at runtime, inconsistent on the contract.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-no-logging.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-no-logging.md
new file mode 100644
index 0000000000..b7e040c160
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-no-logging.md
@@ -0,0 +1,38 @@
+# Evidence – F-002
+
+`services/oss/src/agent/tools.py:97-107`
+```python
+    async with httpx.AsyncClient(timeout=TOOLS_TIMEOUT) as client:
+        response = await client.post(
+            f"{api_base}/tools/resolve",
+            json={"tools": refs},
+            headers=headers,
+        )
+    if response.status_code >= 400:
+        raise ToolResolutionError(
+            f"Tool resolution failed (HTTP {response.status_code}): {response.text[:500]}",
+            status=response.status_code,
+        )
+```
+
+No `try/except` around the POST: a transport error raises a raw `httpx` exception (not
+`ToolResolutionError`). No `log.*` on any branch in `_resolve_gateway`.
+
+Contrast the established pattern in the adjacent (out-of-scope) secret resolver:
+
+`services/oss/src/agent/secrets.py:94-111`
+```python
+    try:
+        async with httpx.AsyncClient(timeout=TOOLS_TIMEOUT) as client:
+            response = await client.post(...)
+        if response.status_code >= 400:
+            log.warning("agent: named-secret resolve HTTP %s for %s", response.status_code, names)
+            return {}
+    except Exception:  # pylint: disable=broad-except
+        log.warning("agent: named-secret resolve failed for %s", names, exc_info=True)
+        return {}
+```
+
+Relevance: the gateway path neither logs failures nor normalizes transport errors to the typed
+error, breaking the "fail-fast raises a typed `ToolResolutionError`" invariant for the
+transport-failure case.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-orthogonal-untested.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-orthogonal-untested.md
new file mode 100644
index 0000000000..3ec4b17039
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-orthogonal-untested.md
@@ -0,0 +1,28 @@
+# Evidence – F-001
+
+Gateway specs get the orthogonal axes via `attach_orthogonal` at the end of the gateway remap:
+
+`services/oss/src/agent/tools.py:131-138`
+```python
+        entry = {
+            "name": name,
+            "description": spec.get("description"),
+            "inputSchema": spec.get("input_schema"),
+            "callRef": call_ref,
+            "kind": "callback",
+        }
+        custom_tools.append(attach_orthogonal(entry, tool))
+```
+
+The integration fixture has no orthogonal fields, so this branch never runs with the axes set:
+
+`services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py:22`
+```python
+_GATEWAY = {"function": {"name": "tools__composio__github__GET_USER__c1"}}
+```
+
+A grep for `needs_approval|needsApproval|render` combined with `gateway|composio|callback` in
+the gateway test files returns nothing — only the local code/client path asserts carry-back
+(`test_tool_refs.py` / `test_tool_resolution.py`, the `pick` client tool). Relevance: the
+gateway branch of `attach_orthogonal` is unexercised, so an approval-drop regression on
+server-side tools ships green.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/handler-resolution-error.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/handler-resolution-error.md
new file mode 100644
index 0000000000..ba93a65d91
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/handler-resolution-error.md
@@ -0,0 +1,20 @@
+# Evidence – F-003
+
+`services/oss/src/agent/app.py:90-114`
+```python
+    msgs = to_messages(messages or (inputs or {}).get("messages") or [])
+    resolved = await resolve_tools(agent_config.tools)        # can raise ToolResolutionError
+    resolved.mcp_servers = await resolve_mcp_servers(agent_config.mcp_servers)
+
+    session_config = SessionConfig(...)
+    harness = make_harness(selection.harness, Environment(select_backend(selection)))
+
+    if stream:
+        return _agent_stream(harness, session_config, msgs)
+```
+
+Resolution runs before the `stream` branch, so a `ToolResolutionError` is raised by the `_agent`
+coroutine itself — for the stream path that is the coroutine the endpoint awaits to *get* the
+async generator, not a part yielded inside the stream. No test drives `_agent(stream=True)` with
+a raising `resolve_tools`. Relevance: failure-path surfacing on `/messages` is unspecified and
+unpinned; it may differ from the JSON path.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/findings.md b/docs/design/agent-workflows/sdk-local-tools/review/findings.md
new file mode 100644
index 0000000000..5859999acd
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/findings.md
@@ -0,0 +1,282 @@
+# findings.md – Confirmed Findings
+
+Sync metadata: scan-codebase, `depth=deep`,
+`path=docs/design/agent-workflows/sdk-local-tools/review/`, branch `gitbutler/workspace`.
+Severity scheme: critical / high / medium / low / info (harness `deliverables.md`). Each
+finding also carries `Confidence` and `Status` from the shared findings schema.
+
+## Summary
+
+Six findings were identified and all six are resolved by the tools package refactor.
+
+There are no critical findings. There are no high findings.
+
+## Resolved Findings
+
+### F-001 [RESOLVED] – Gateway orthogonal-axis carry-back (needsApproval / render) is untested
+
+| Field | Value |
+|---|---|
+| Severity | medium |
+| Confidence | high |
+| Status | resolved |
+| Origin | scan |
+| Lens | verification |
+| Category | Testing, Completeness |
+| Criteria | general G-13/G-16 (Testing); Completeness (2) |
+| Location | `services/oss/src/agent/tools.py:138`; `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py:47-85` |
+| Files | tools.py, test_resolve_tools_http.py, test_tool_refs.py |
+
+**Condition:** `_resolve_gateway` calls `attach_orthogonal(entry, tool)` on every resolved
+gateway spec (`tools.py:138`), so a gateway tool that declares `needs_approval` or `render`
+should carry `needsApproval` / `render` onto its `kind="callback"` wire spec. No test sets
+those fields on a gateway tool. The integration fixture `_GATEWAY`
+(`test_resolve_tools_http.py:22`) is a bare picker slug with no orthogonal fields, and
+`test_tool_refs.py` only exercises `_gateway_ref` (which does not apply the orthogonal axes —
+those land later in `_resolve_gateway`). The gateway branch of `attach_orthogonal` is therefore
+never executed under test.
+
+**Cause:** The test that locks orthogonal carry-back
+(`test_resolve_tools_splits_builtin_code_client_offline`) covers only the local code/client
+kinds. The gateway integration tests focus on the network path, count mismatch, and
+incomplete-spec handling, and skip the approval/render axes.
+
+**Consequence:** A regression that drops `needsApproval` on a gateway tool (for example a
+human-in-the-loop Composio action) would ship green. Approval is a safety gate; silently
+dropping it on the gateway kind is the worst kind to lose, because those tools execute
+server-side with the provider key.
+
+**Evidence:** evidence/gateway-orthogonal-untested.md
+
+**Suggested Fix:**
+
+- **Option A (preferred):** Extend `test_resolves_gateway_and_remaps_to_wire_shape` to pass a
+  gateway tool dict carrying `needs_approval: true` and a `render` dict, and assert the resolved
+  callback spec carries `needsApproval` and `render`. A typed gateway dict
+  (`{"type": "gateway", ...}`) is the cleanest vehicle.
+- **Option B:** Add a focused unit test in `test_tool_refs.py` that calls `_resolve_gateway`
+  with a stubbed HTTP response and a gateway def with the axes set, asserting carry-back by
+  position.
+
+**Resolution:** `test_gateway_metadata_and_description_fallback_are_preserved` and
+`test_gateway_specs_are_joined_by_call_ref_not_position` cover metadata carry-back.
+
+---
+
+### F-002 [RESOLVED] – Gateway HTTP/timeout failures are raised but not logged
+
+| Field | Value |
+|---|---|
+| Severity | medium |
+| Confidence | high |
+| Status | resolved |
+| Origin | scan |
+| Lens | verification |
+| Category | Observability |
+| Criteria | Observability (10); services SV-26 |
+| Location | `services/oss/src/agent/tools.py:97-120` |
+| Files | tools.py |
+
+**Condition:** `_resolve_gateway` makes an outbound HTTP call to `POST /tools/resolve` with no
+`try/except` around the `httpx` call and no log on any failure path. A network error (DNS,
+connection refused, timeout) raises a raw `httpx` exception, not a `ToolResolutionError`, and
+nothing is logged. The HTTP-status, count-mismatch, and incomplete-spec paths raise
+`ToolResolutionError` but emit no log line either.
+
+**Cause:** The module logs only the unsupported-provider skip (`tools.py:165`). The gateway call
+leans on the exception message reaching the caller, but the caller (`app.py:91`) does not catch
+or log it. The adjacent secret resolver (`secrets.py`) wraps every `httpx` call in `try/except`
+with a `log.warning` plus `exc_info`; the gateway path does not follow that established pattern.
+
+**Consequence:** When gateway resolution fails in production, the operator sees a generic
+framework error with no structured breadcrumb, and a raw `httpx.ConnectError` /
+`httpx.ReadTimeout` is not normalized to the typed `ToolResolutionError` the rest of the path
+promises. The invariant "gateway fail-fast paths raise a typed `ToolResolutionError`" does not
+hold for transport-level failures.
+
+**Evidence:** evidence/gateway-no-logging.md
+
+**Suggested Fix:**
+
+- **Option A (preferred):** Wrap the `httpx` POST in `try/except httpx.HTTPError` and re-raise as
+  `ToolResolutionError("Tool resolution request failed: ...")` so transport failures match the
+  typed-error invariant. Add `log.warning("agent: gateway tool resolution failed for %d
+  ref(s)", len(gateway_defs), exc_info=True)` on the failure paths, mirroring `secrets.py`.
+- **Option B:** At minimum add a `log.warning` on the HTTP-status, count-mismatch, and
+  incomplete-spec branches so each fail-fast leaves a breadcrumb, and leave transport-error
+  normalization for a follow-up.
+
+**Resolution:** `AgentaGatewayToolResolver` logs each failure branch and wraps
+`httpx.HTTPError` as `GatewayToolResolutionError`; integration tests verify both.
+
+---
+
+### F-003 [RESOLVED] – Handler does not handle ToolResolutionError; stream path may surface it differently
+
+| Field | Value |
+|---|---|
+| Severity | medium |
+| Confidence | medium |
+| Status | resolved |
+| Origin | scan |
+| Lens | verification |
+| Category | Completeness, Soundness |
+| Criteria | Completeness (2); services SV-2 |
+| Location | `services/oss/src/agent/app.py:91-92` |
+| Files | app.py |
+
+**Condition:** `_agent` calls `resolve_tools` (and `resolve_mcp_servers`) at lines 91-92, before
+it branches on `stream`. A `ToolResolutionError` raised here propagates straight out of `_agent`.
+For the JSON `/invoke` path this is the intended fail-fast: the workflow wrapper turns the raised
+exception into an error response. For the SSE `/messages` stream path, the exception is raised
+from the coroutine the endpoint awaits to obtain the async generator, so the failure surfaces as
+a coroutine raise, not as a stream `error` part. The two paths report the same root failure
+through different channels, and there is no test asserting the stream path on a resolution
+failure.
+
+**Cause:** Tool resolution is deliberately hoisted above the `stream` branch so both paths
+fail-fast before any backend setup. That is the right call for fail-fast, but it leaves the
+stream-path error-surfacing shape unspecified and untested.
+
+**Consequence:** A client on the streaming endpoint may receive a transport-level error instead
+of a well-formed UI Message Stream `error` part, depending on how the endpoint adapter handles a
+raising coroutine. This is a UX inconsistency on the failure path, not a data-loss bug.
+
+**Evidence:** evidence/handler-resolution-error.md
+
+**Suggested Fix:**
+
+- **Option A:** Confirm with the endpoint-adapter author how a raising `_agent` coroutine is
+  rendered on the `/messages` SSE path. If it is not a clean stream `error` part, catch
+  `ToolResolutionError` for the stream branch and emit an error part.
+- **Option B:** Add a handler-level test that drives `_agent(stream=True)` with a `resolve_tools`
+  stub that raises `ToolResolutionError`, asserting the documented surfacing shape.
+
+**Resolution:** the protocol requires pre-stream failures to remain JSON. The messages route
+now preserves batch error responses before applying streaming negotiation. SDK routing and
+service handler tests cover both layers.
+
+---
+
+### F-004 [RESOLVED] – description default differs between code tools and gateway tools
+
+| Field | Value |
+|---|---|
+| Severity | low |
+| Confidence | high |
+| Status | resolved |
+| Origin | scan |
+| Lens | verification |
+| Category | Consistency |
+| Criteria | Consistency (3); general G-7 |
+| Location | `sdks/python/agenta/sdk/agents/tool_resolution.py:103`; `services/oss/src/agent/tools.py:133` |
+| Files | tool_resolution.py, tools.py |
+
+**Condition:** `code_tool_spec` sets `"description": tool.description or tool.name`
+(`tool_resolution.py:103`) and `client_tool_spec` does the same (`:117`). The gateway path sets
+`"description": spec.get("description")` (`tools.py:133`), which can be `None`. So a code/client
+tool with no description ships its name; a gateway tool with no backend description ships `None`.
+
+**Cause:** The SDK builders default to the name; the gateway remap copies the backend field
+verbatim.
+
+**Consequence:** Low. The TS runner already defaults `description ?? name` when building Pi/MCP
+tools (`services/agent/src/engines/pi.ts:162`, `mcp-server.ts:67`), so a `None` description is
+absorbed downstream. The only visible effect is a slightly different wire payload for the same
+"no description" condition across kinds, which contradicts the "identical wire specs" framing for
+the description field specifically.
+
+**Evidence:** evidence/description-default-inconsistency.md
+
+**Suggested Fix:**
+
+- **Option A:** In `_resolve_gateway`, default the description to the resolved name:
+  `"description": spec.get("description") or name`, matching the SDK builders.
+- **Option B:** Leave as-is and document that `description` may be `None` on the gateway wire
+  spec, relying on the runner default.
+
+**Resolution:** gateway specs now use `description or name`.
+
+---
+
+### F-005 [RESOLVED] – attach_orthogonal mutates its argument in place and also returns it
+
+| Field | Value |
+|---|---|
+| Severity | low |
+| Confidence | high |
+| Status | resolved |
+| Origin | scan |
+| Lens | verification |
+| Category | Soundness, Consistency |
+| Criteria | Soundness (4); sdk SK-3 |
+| Location | `sdks/python/agenta/sdk/agents/tool_resolution.py:89-96` |
+| Files | tool_resolution.py |
+
+**Condition:** `attach_orthogonal(entry, tool)` mutates `entry` in place (sets
+`entry["needsApproval"]` / `entry["render"]`) and returns the same dict. It is a public helper
+(imported by the service at `tools.py:36`). The mutate-and-return pattern is easy to misuse: a
+caller that reuses the input `entry` after the call sees it mutated, and the return value invites
+the false belief that the input is untouched.
+
+**Cause:** Shaped for the call sites that always do `return attach_orthogonal(entry, tool)` with
+a fresh `entry`, where in-place mutation is harmless.
+
+**Consequence:** Low. All three current call sites pass a freshly-built `entry` and use the
+return value, so there is no live bug. The risk is latent and the dual contract is slightly
+inconsistent with a package whose other builders return fresh dicts.
+
+**Evidence:** evidence/attach-orthogonal-mutation.md
+
+**Suggested Fix:**
+
+- **Option A:** Make it pure — `return {**entry, **extras}`. Removes the latent footgun at no
+  real cost.
+- **Option B:** Keep in-place mutation but drop the return value and name it
+  `_attach_orthogonal_inplace`, signalling the contract.
+
+**Resolution:** canonical spec construction uses immutable Pydantic models and
+`model_copy`; the legacy helper returns a copied dictionary.
+
+---
+
+### F-006 [RESOLVED] – app.py redundantly re-assigns resolved.mcp_servers
+
+| Field | Value |
+|---|---|
+| Severity | info |
+| Confidence | high |
+| Status | resolved |
+| Origin | scan |
+| Lens | verification |
+| Category | Consistency, Complexity |
+| Criteria | Complexity (5); general G-9 |
+| Location | `services/oss/src/agent/app.py:91-92` |
+| Files | app.py, tools.py |
+
+**Condition:** `_agent` does `resolved = await resolve_tools(...)` then immediately
+`resolved.mcp_servers = await resolve_mcp_servers(...)`. `resolve_tools` already returns
+`ResolvedTools` with `mcp_servers=[]` (documented to leave MCP empty), so the handler overwrites
+a field that is empty by construction, relying on an implicit contract.
+
+**Cause:** MCP resolution is intentionally a separate call (different flag gate), so the handler
+stitches the two results together.
+
+**Consequence:** None functionally. Minor readability cost and a coupling that would silently
+double-resolve if `resolve_tools` ever started populating `mcp_servers`.
+
+**Evidence:** evidence/app-mcp-reassign.md
+
+**Suggested Fix:**
+
+- **Option A:** Provide a single `resolve_all(tools, mcp_servers)` in `tools.py` that returns a
+  fully-populated `ResolvedTools`, keeping the two flag gates internal.
+- **Option B:** Leave as-is; add a one-line comment that `resolve_tools` never sets
+  `mcp_servers`, making the contract explicit at the call site.
+
+**Resolution:** `resolve_agent_resources` returns a complete `ResolvedAgentResources` value and
+the handler performs no post-resolution mutation.
+
+## Open Findings
+
+None.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/metadata.json b/docs/design/agent-workflows/sdk-local-tools/review/metadata.json
new file mode 100644
index 0000000000..e648a06cb3
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/metadata.json
@@ -0,0 +1,40 @@
+{
+  "review_id": "sdk-local-tools-tool-resolution-2026-06-19",
+  "created_at": "2026-06-19T00:00:00Z",
+  "updated_at": "2026-06-19T00:00:00Z",
+  "scope_ref": "scope.md",
+  "activity": "scan-codebase",
+  "depth": "deep",
+  "path": "docs/design/agent-workflows/sdk-local-tools/review/",
+  "branch": "gitbutler/workspace",
+  "rubrics": ["sdk", "services", "general", "security", "architecture"],
+  "reviewed_files": [
+    "sdks/python/agenta/sdk/agents/tool_resolution.py",
+    "sdks/python/agenta/sdk/agents/errors.py",
+    "sdks/python/agenta/sdk/agents/__init__.py",
+    "services/oss/src/agent/tools.py",
+    "services/oss/src/agent/app.py",
+    "sdks/python/oss/tests/pytest/unit/agents/test_tool_resolution.py",
+    "services/oss/tests/pytest/unit/agent/test_tool_refs.py",
+    "services/oss/tests/pytest/unit/agent/test_invoke_handler.py",
+    "services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py"
+  ],
+  "files_in_scope": 9,
+  "files_reviewed": 9,
+  "coverage_pct": 100,
+  "finding_counts": {
+    "critical": 0,
+    "high": 0,
+    "medium": 3,
+    "low": 2,
+    "info": 1
+  },
+  "open_risks": 1,
+  "open_questions": 2,
+  "tests": {
+    "sdk_agents_unit": "118 passed",
+    "service_agent_unit_integration": "33 passed"
+  },
+  "verdict": "pass with conditions",
+  "complete": true
+}
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/plan.md b/docs/design/agent-workflows/sdk-local-tools/review/plan.md
new file mode 100644
index 0000000000..609a0d4ea0
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/plan.md
@@ -0,0 +1,45 @@
+# plan.md – Review Plan
+
+## Strategy
+
+Findings-driven scan (`scan-codebase`, `depth=deep`) over the sdk-local-tools tool-resolution
+change set. Rubrics applied: `sdk.md` (the SDK resolver + its public surface), `services.md`
+(the server-side orchestration in `tools.py` / `app.py`), `general.md` as the baseline pass,
+with `security.md` and `architecture.md` at baseline depth (the change touches secrets and a
+service->SDK boundary). All 10 universal criteria applied.
+
+The change set is ~420 lines of meaningful logic across 5 implementation files plus 4 test
+files. It sits just under the 400-line split threshold, so it is reviewed as one pass grouped
+by risk.
+
+## Pass order (highest risk first)
+
+| # | Pass | Files | Rubric mapping | Budget |
+|---|---|---|---|---|
+| 1 | Gateway network path + fail-fast | `services/oss/src/agent/tools.py` (`_resolve_gateway`, `resolve_tools`) | services SV-2/6/10, security S-14/15, general G-1/2/3 | done |
+| 2 | Local resolver + secret scoping | `tool_resolution.py` (`LocalToolResolver`, `SecretResolver`, spec builders) | sdk SK-3/22/26/27, security S-14, general G-2/3/5/10 | done |
+| 3 | Typed error contract | `errors.py` (`ToolResolutionError`) | sdk SK-7/17, general G-7 | done |
+| 4 | Public surface + exports | `__init__.py` | sdk SK-7/15 | done |
+| 5 | Handler integration | `app.py` (`_agent`, `_agent_stream`) | services SV-2, general G-3/5 | done |
+| 6 | Test coverage adequacy | the 4 test files | general G-13..G-17, sdk SK-22 | done |
+
+## Prerequisite automated checks
+
+- Run service agent unit + integration suites: `cd services && uv run python -m pytest
+  oss/tests/pytest/unit/agent oss/tests/pytest/integration/agent -n0 -q` — ran, 33 passed.
+- Run SDK agents unit suite: `cd sdks/python && uv run python -m pytest
+  oss/tests/pytest/unit/agents -n0 -q` — ran, 118 passed.
+- Secret-logging scan over the changed + adjacent files — ran, no secret values logged.
+
+## Invariants to verify (from the design)
+
+1. Dependency direction is `service -> SDK` only; the SDK never imports the service.
+2. builtin/code/client resolve locally; only gateway hits the network.
+3. The service produces identical wire specs (kind/runtime/code/env/callRef/needsApproval/render).
+4. MCP is a no-op unless `AGENTA_AGENT_ENABLE_MCP` is set.
+5. Secret env is scoped per tool; unresolved secrets are omitted, not errored.
+6. Gateway fail-fast paths raise a typed `ToolResolutionError`.
+
+## Deferrals
+
+None. The change set fit one pass within budget.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/progress.md b/docs/design/agent-workflows/sdk-local-tools/review/progress.md
new file mode 100644
index 0000000000..8319d405e0
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/progress.md
@@ -0,0 +1,37 @@
+# progress.md – Review Progress
+
+| Pass / file group | Status | Notes |
+|---|---|---|
+| Read harness (instructions, criteria, guidelines, deliverables, rubrics, templates) | done | sdk/services/general/security/architecture rubrics loaded |
+| Read scan-codebase skill + findings schema + lifecycle | done | depth=deep, path given |
+| `services/oss/src/agent/tools.py` | done | gateway path, fail-fast, local delegation, MCP gating |
+| `sdks/python/agenta/sdk/agents/tool_resolution.py` | done | resolver, secret scoping, spec builders |
+| `sdks/python/agenta/sdk/agents/errors.py` | done | `ToolResolutionError` typed context |
+| `sdks/python/agenta/sdk/agents/__init__.py` | done | exports complete and consistent |
+| `services/oss/src/agent/app.py` | done | handler integration, fail-fast propagation |
+| `test_tool_resolution.py` (SDK) | done | offline split, secret omit, MCP gate |
+| `test_tool_refs.py` (service) | done | service split, MCP wire, gateway ref shape |
+| `test_invoke_handler.py` (service) | done | cross-harness identity through handler |
+| `test_resolve_tools_http.py` (integration) | done | gateway HTTP mock, count/incomplete fail-fast |
+| Context: dtos.py, tool_defs.py, secrets.py, client.py, protocol.ts | done | wire shape parity confirmed |
+| Run service unit + integration suites | done | 33 passed |
+| Run SDK agents unit suite | done | 118 passed |
+| Secret-logging scan | done | names only, no values |
+| Invariant verification (6) | done | all hold; one positional-ordering coupling noted |
+| Promote findings | done | 6 findings, 1 risk |
+| Synthesis (summary, scorecard) | done | verdict: pass with conditions |
+
+Review complete. No items deferred. No items out of scope skipped within the named set.
+
+## Remediation
+
+| Item | Status |
+|---|---|
+| F-001 through F-006 | resolved |
+| R-001 | mitigated |
+| Q-001 and Q-002 | resolved |
+| SDK agent/routing tests | 146 passed |
+| Service agent tests | 34 passed |
+| API unit tests | 859 passed |
+| TypeScript tool tests | 3 passed |
+| Extension bundle | built |
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/questions.md b/docs/design/agent-workflows/sdk-local-tools/review/questions.md
new file mode 100644
index 0000000000..3e6d34609f
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/questions.md
@@ -0,0 +1,6 @@
+# questions.md – Resolved Questions
+
+| ID | Question | Context / Evidence | Provisional suggestions | Status |
+|---|---|---|---|---|
+| Q-001 | How is a `ToolResolutionError` raised by the `_agent` coroutine surfaced on the `/messages` SSE path versus the JSON `/invoke` path? | Failures before stream creation are protocol-level JSON errors. Failures after stream creation use SSE error parts. The routing layer now preserves the former. | Covered by routing and handler tests. | resolved |
+| Q-002 | Does `/tools/resolve` guarantee response ordering matches request order, and can it echo a ref identity for key-based matching? | Ordering is no longer trusted. The existing resolved `call_ref` is normalized and used as the correlation key. | Covered by a reversed-response integration test. | resolved |
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/risks.md b/docs/design/agent-workflows/sdk-local-tools/review/risks.md
new file mode 100644
index 0000000000..7be024966b
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/risks.md
@@ -0,0 +1,9 @@
+# risks.md – Systemic Risks
+
+| ID | Category | Description | Likelihood | Impact | Evidence | Mitigation | Status |
+|---|---|---|---|---|---|---|---|
+| R-001 | Architecture | Gateway resolution formerly carried approval/render metadata by response position. | Low | Medium | `services/oss/src/agent/tools/gateway.py` | Resolved specs are now matched to configs by normalized `call_ref`; a reversed-response integration test pins the behavior. | mitigated |
+
+## Notes
+
+- No open architecture risk remains from this review.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/scope.md b/docs/design/agent-workflows/sdk-local-tools/review/scope.md
new file mode 100644
index 0000000000..da85db0056
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/scope.md
@@ -0,0 +1,82 @@
+# scope.md – Review Scope
+
+## Objectives
+
+- Verify the sdk-local-tools tool-resolution change set preserves its design invariants:
+  dependency direction `service -> SDK`, local resolution of builtin/code/client, network only
+  for gateway, identical wire specs across paths, MCP gated off by default, per-tool scoped
+  secret env with omit-on-missing, and typed `ToolResolutionError` on gateway fail-fast paths.
+- Surface correctness, soundness, completeness, consistency, security, and testing findings in
+  the changed files, judged against the surrounding code they integrate with.
+- Confirm the two gaps flagged by the prior conventions review are resolved (Pydantic
+  `ResolvedTools`, typed error) and that the tests lock the behaviour they claim to.
+
+## Codebase / branch
+
+| Field | Value |
+|---|---|
+| Repository | agenta (monorepo) |
+| Branch / commit | `gitbutler/workspace` (working tree; scope files are new vs `main`) |
+| Review type | combination: sdk + services + security (baseline) + architecture (baseline) |
+
+## Inclusions
+
+Implementation:
+
+- `sdks/python/agenta/sdk/agents/tool_resolution.py`
+- `sdks/python/agenta/sdk/agents/errors.py`
+- `sdks/python/agenta/sdk/agents/__init__.py`
+- `services/oss/src/agent/tools.py`
+- `services/oss/src/agent/app.py`
+
+Tests:
+
+- `sdks/python/oss/tests/pytest/unit/agents/test_tool_resolution.py`
+- `services/oss/tests/pytest/unit/agent/test_tool_refs.py`
+- `services/oss/tests/pytest/unit/agent/test_invoke_handler.py`
+- `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py`
+
+## Exclusions
+
+Read for integration judgement only, not reviewed as scope:
+
+- `sdks/python/agenta/sdk/agents/dtos.py` (`SessionConfig`, `ToolCallback`), `tool_defs.py`,
+  `adapters/harnesses.py`
+- `services/oss/src/agent/secrets.py`, `services/oss/src/agent/client.py`
+- `services/agent/src/protocol.ts` and the TS runner tool builders (wire-shape consumers)
+- All other unrelated workspace changes on the branch.
+
+## Constraints and assumptions
+
+- Scope files are new on this branch; there is no `main` baseline to diff against, so
+  "identical wire specs" is judged as cross-path identity (local vs server-side), which the
+  tests assert, plus parity against the TS `ResolvedToolSpec` contract.
+- Review is verification-oriented (scan-codebase, `depth=deep`). Runtime validation is limited
+  to the two named pytest targets; no live LLM, runner, or backend was exercised.
+
+## Stakeholders
+
+| Role | Name / Team |
+|---|---|
+| Requestor | mahmoud@agenta.ai |
+| Author(s) | agent-workflows contributors |
+| Reviewer(s) | `agent` (scan-codebase, code-review harness) |
+| Decision owner | mahmoud@agenta.ai |
+
+## Timeline
+
+| Milestone | Target date |
+|---|---|
+| Review start | 2026-06-19 |
+| Final deliverables | 2026-06-19 |
+
+## Success criteria
+
+- Every in-scope file read in full and judged against the 10 universal criteria.
+- All design invariants explicitly confirmed or flagged.
+- Findings carry severity, confidence, real `file:line`, evidence, and ≥1 remediation.
+- Both named test suites run and their pass/fail recorded.
+
+## Approval
+
+Requestor reviews `summary.md` + `scorecard.md` and closes the review.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/scorecard.md b/docs/design/agent-workflows/sdk-local-tools/review/scorecard.md
new file mode 100644
index 0000000000..5afe196e7b
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/scorecard.md
@@ -0,0 +1,30 @@
+# scorecard.md – Review Scorecard
+
+## Metrics
+
+| Metric | Value | Interpretation |
+|---|---|---|
+| Critical findings | 0 | No release-blocking defects. |
+| High findings | 0 | No significant functional/security defects. |
+| Medium findings | 3 | Two gateway-path gaps (test, observability) + one unspecified stream-failure surface. |
+| Low findings | 2 | Cross-kind description default; mutate-and-return helper. |
+| Info findings | 1 | Redundant `mcp_servers` re-assign. |
+| Open risks | 1 | R-001 positional gateway carry-back (architecture/soundness). |
+| Open questions | 2 | Q-001 stream error surface; Q-002 resolve ordering contract. |
+| Files reviewed | 9 | 5 implementation + 4 tests. |
+| Files in scope | 9 | |
+| Review coverage | 100% | Every in-scope file read in full. |
+| Tests run | 151 passed | SDK agents unit 118 + service unit/integration 33. |
+| **Overall verdict** | **PASS WITH CONDITIONS** | Invariants hold; address the three mediums + R-001 before rollout. |
+
+## Verdict criteria
+
+| Verdict | Condition |
+|---|---|
+| Pass | No critical findings; <= 2 high findings with remediation plans in progress |
+| Pass with conditions | No critical findings; > 2 high findings OR open questions blocking release |
+| Fail | >= 1 critical finding OR scope not fully reviewed |
+
+Applied: zero critical and zero high, but two open questions (Q-001, Q-002) gate the disposition
+of F-003 and R-001, and three medium findings warrant fixes before rollout. Verdict: pass with
+conditions.
diff --git a/docs/design/agent-workflows/sdk-local-tools/review/summary.md b/docs/design/agent-workflows/sdk-local-tools/review/summary.md
new file mode 100644
index 0000000000..8f3eb763bd
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/review/summary.md
@@ -0,0 +1,115 @@
+# summary.md – Review Summary
+
+## Overview
+
+**Reviewed:** the sdk-local-tools tool-resolution change set (see `scope.md`): the new SDK
+`LocalToolResolver` / `SecretResolver` / `ResolvedTools` / spec builders
+(`tool_resolution.py`), the typed `ToolResolutionError` (`errors.py`), the SDK exports
+(`__init__.py`), the refactored server-side orchestration (`services/oss/src/agent/tools.py`),
+the handler that consumes it (`app.py`), and the four accompanying test files.
+**Goals:** verify the design invariants hold; surface correctness, soundness, completeness,
+consistency, security, and testing findings; confirm the prior conventions review's two gaps are
+closed.
+**Constraints:** verification-oriented scan (`depth=deep`); runtime validation limited to the
+two named pytest targets; no live LLM/runner/backend.
+**Review type(s):** sdk + services + general baseline, with security and architecture at
+baseline depth.
+**Date range:** 2026-06-19 – 2026-06-19.
+
+---
+
+## Health verdict
+
+> **PASS — CONDITIONS RESOLVED**
+
+All review findings and the architecture risk were addressed in the subsequent organization
+refactor. The canonical implementation now lives in `agenta.sdk.agents.tools`, MCP is a sibling
+subsystem, the service uses explicit gateway/vault adapters, and the runner file names describe
+dispatch and callback responsibilities.
+
+Validation after the fixes:
+
+- SDK agent/routing tests: 146 passed.
+- Service agent unit/integration tests: 34 passed.
+- API unit tests: 859 passed.
+- TypeScript tool dispatch/bridge/MCP tests: 3 passed.
+- TypeScript extension bundle: built successfully.
+
+---
+
+## Key findings
+
+### Critical and high severity
+
+- None. There are no critical findings and no high findings.
+
+### Medium severity
+
+- **F-001 resolved:** gateway approval/render metadata is tested, including a reordered
+  multi-tool response.
+- **F-002 resolved:** HTTP failures are logged and normalized to
+  `GatewayToolResolutionError`.
+- **F-003 resolved:** failures before stream creation remain JSON error envelopes with their
+  original HTTP status; routing and handler tests pin this contract.
+
+### Low / info severity
+
+- **F-004 resolved:** gateway descriptions default to the resolved name.
+- **F-005 resolved:** canonical spec construction is immutable; the unshipped compatibility
+  helper and former flat modules were removed.
+- **F-006 resolved:** `resolve_agent_resources` returns tools and MCP resources in one result.
+
+### Positive observations
+
+- All six invariants verified: service->SDK dependency direction (the SDK never imports the
+  service), local resolution for builtin/code/client with only gateway on the network, wire-spec
+  parity with the TS `ResolvedToolSpec` contract (kind/runtime/code/env/callRef/needsApproval/
+  render), MCP gated off by `AGENTA_AGENT_ENABLE_MCP`, per-tool scoped secret env with
+  omit-on-missing, and typed `ToolResolutionError` on the gateway count/incomplete fail-fast
+  paths.
+- The two prior conventions-review gaps are closed: `ResolvedToolSet` is a Pydantic model, and
+  the service raises a typed `ToolResolutionError(RuntimeError)` (so existing `except
+  RuntimeError` callers keep working) carrying structured `status` / `ref_count` / `spec_count`.
+- Secrets are scoped per tool and never logged (only secret *names* appear in warnings); the
+  union-resolve-then-filter approach correctly prevents cross-tool env leakage.
+- Test design is honest: the cross-harness handler test asserts both identical bodies *and*
+  divergent per-harness configs at the backend boundary, avoiding the tautology the comment calls
+  out; the integration suite covers the gateway count-mismatch and incomplete-spec fail-fast.
+- Both suites pass: SDK agents unit 118 passed; service agent unit + integration 33 passed.
+
+---
+
+## Key risks
+
+- **R-001 mitigated:** gateway responses are joined to configs by normalized `call_ref`, not
+  response position.
+
+---
+
+## Open questions
+
+- **Q-001 answered:** pre-stream failures use JSON errors; only failures after streaming begins
+  become SSE error parts.
+- **Q-002 answered:** ordering is no longer trusted; the existing `call_ref` is the correlation
+  identity.
+
+---
+
+## Coverage and metrics
+
+| Metric | Value |
+|---|---|
+| Files in scope | 9 (5 implementation + 4 tests) |
+| Files reviewed | 9 |
+| Coverage | 100% |
+| Critical findings | 0 |
+| High findings | 0 |
+| Medium findings | 3 |
+| Low findings | 2 |
+| Info findings | 1 |
+
+---
+
+## Recommended next steps
+
+No review remediation remains.
diff --git a/docs/design/agent-workflows/sdk-local-tools/status.md b/docs/design/agent-workflows/sdk-local-tools/status.md
new file mode 100644
index 0000000000..5628574334
--- /dev/null
+++ b/docs/design/agent-workflows/sdk-local-tools/status.md
@@ -0,0 +1,104 @@
+# Status
+
+The source of truth for this effort. Update this page as the work moves.
+
+## Where this stands (2026-06-19)
+
+**Stage: organization refactor implemented and review findings resolved.** The SDK tool
+runtime now uses `agenta.sdk.agents.tools`; MCP is a sibling `agenta.sdk.agents.mcp`
+subsystem; the service owns HTTP/vault adapters under `services/oss/src/agent/tools/`.
+Persisted discriminator values remain unchanged. The former flat `tool_defs.py` and
+`tool_resolution.py` modules were removed because this API had not shipped; legacy persisted
+configuration shapes are handled only at the explicit compatibility boundary in
+`agents/tools/compat.py`.
+
+The prerequisite is also not done: `LocalBackend` is a stub that raises
+(`sdks/python/agenta/sdk/agents/adapters/local.py:30`). The sibling effort owns it
+([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md)).
+
+## Organization review (2026-06-19)
+
+- Canonical declarations use `*Config`; runnable values use `*Spec`.
+- Strict parsing and legacy compatibility are separate.
+- Missing secrets, unsupported providers, duplicate names, and invalid gateway responses use
+  typed errors.
+- API core reuses SDK-owned neutral tool config classes.
+- Gateway metadata correlation uses `call_ref`, not response position.
+- Pre-stream `/messages` failures preserve JSON error responses.
+- TypeScript files are named `callback.ts` and `dispatch.ts`.
+
+## Behavior decisions retained (2026-06-19)
+
+The earlier review settled these product and dependency decisions. They remain useful, but
+class names such as `LocalToolResolver` and `SecretResolver` below are historical placeholders;
+the organization proposal recommends better public names.
+
+1. **Where resolution lives.** Both, behind one interface: an offline `LocalToolResolver`
+   (built-in, code, client, MCP later) and an opt-in `AgentaToolResolver` for gateway.
+   Gateway is the only kind that stays in the backend. Dependency direction `service -> SDK`.
+2. **Code and MCP executors.** Code reuses the bundled in-process Pi engine. MCP is out of
+   scope for the first releases: wired but a no-op behind a feature flag that defaults to off,
+   for both Pi and Claude.
+3. **Secrets.** Env by default, behind a pluggable `SecretResolver`. The vault is a later,
+   connected path that reads `custom_secret` rows by name once the consumer is built
+   ([open-issues.md](../open-issues.md)). Custom secrets are storage-only today by design.
+4. **First slice.** `LocalBackend` (Pi) + built-in + code + env secrets, offline.
+
+## Validation
+
+- SDK agent/routing tests: 146 passed.
+- Service agent tests: 34 passed.
+- API unit tests: 859 passed.
+- TypeScript tool tests: 3 passed.
+- TypeScript extension bundle: built successfully.
+
+## Remaining prerequisite
+
+`LocalBackend` is still a separate prerequisite. End-to-end standalone execution remains
+blocked on that sibling effort, not on tool organization or resolution.
+
+## Key files (the map for whoever continues)
+
+**SDK (where the new work lands):**
+- `sdks/python/agenta/sdk/agents/tools/`: tool configuration, runtime specs, parsing,
+  compatibility conversion, resolution, wire serialization, and errors.
+- `sdks/python/agenta/sdk/agents/mcp/`: the sibling MCP configuration and resolution domain.
+- `sdks/python/agenta/sdk/agents/dtos.py`: `SessionConfig` and its tool fields (the delivery
+  contract, `:498`).
+- `sdks/python/agenta/sdk/agents/adapters/harnesses.py`: the harness adapters that shape
+  already-resolved specs (`:65`).
+- `sdks/python/agenta/sdk/agents/adapters/local.py`: `LocalBackend`, the stub
+  (`:24`, prerequisite).
+
+**Service (the reference resolver to share from, never to depend back on):**
+- `services/oss/src/agent/tools/resolver.py`: service composition of SDK tool and MCP
+  resolvers.
+- `services/oss/src/agent/tools/gateway.py`: Agenta gateway HTTP adapter.
+- `services/oss/src/agent/tools/secrets.py`: named-secret HTTP adapter for tools and MCP.
+- `services/oss/src/agent/secrets.py`: provider-key resolution for harness authentication.
+- `services/oss/src/agent/app.py`: where the service builds `SessionConfig` (`:91`).
+
+**Runner (execution; already handles code/callback in-process):**
+- `services/agent/src/engines/pi.ts`: the in-process Pi engine, `mcpTools: false` (`:58`),
+  branches on tool `kind` (`:144`).
+- `services/agent/src/tools/code.ts`, `dispatch.ts`, `callback.ts`, `mcp-server.ts`: the
+  executors and callback adapter.
+
+**API (the future named-secret consumer):**
+- `api/oss/src/apis/fastapi/vault/router.py`: CRUD only, no name-based resolve route yet
+  (lines 36 to 74).
+- `docs/design/vault-named-secrets/context.md`: the named-secrets effort,
+  runtime-consumption out of scope this iteration (`:23`).
+
+## Known traps
+
+- **The vault named-secret consumer is not built.** Declared code/MCP custom secrets resolve
+  to `{}` today, server path included. This is the storage-only design this iteration, not a
+  bug (research.md, stage 3). The offline slice supplies secrets from env via `SecretResolver`
+  instead of the vault.
+- **The in-process Pi engine has no MCP.** Do not assume bundling the Pi runner brings MCP;
+  it does not (`services/agent/src/engines/pi.ts:58`).
+- **Client tools need a browser.** A headless local run cannot fulfil them; the in-process Pi
+  engine skips them (`services/agent/src/engines/pi.ts:165`).
+- **Dependency direction.** The SDK must never call the service. Share spec-building logic by
+  moving it into the SDK, not by importing the service from the SDK.
diff --git a/sdks/python/agenta/sdk/agents/__init__.py b/sdks/python/agenta/sdk/agents/__init__.py
index 9df80c3fd9..19d0ac77b0 100644
--- a/sdks/python/agenta/sdk/agents/__init__.py
+++ b/sdks/python/agenta/sdk/agents/__init__.py
@@ -88,7 +88,11 @@
     parse_tool_config,
     parse_tool_configs,
 )
-from .ui_messages import from_ui_messages, to_ui_message, ui_message_stream
+from .adapters.vercel import (
+    from_ui_messages,
+    to_ui_message,
+    ui_message_stream,
+)
 
 __all__ = [
     # DTOs
@@ -107,7 +111,7 @@
     "AgentEvent",
     "AgentResult",
     "AgentRun",
-    # UI message codec (the /messages egress adapter)
+    # Former flat Vercel adapter names (compatibility; new code uses adapters.vercel)
     "from_ui_messages",
     "to_ui_message",
     "ui_message_stream",
diff --git a/sdks/python/agenta/sdk/agents/adapters/__init__.py b/sdks/python/agenta/sdk/agents/adapters/__init__.py
index e8002be7af..30e555d82b 100644
--- a/sdks/python/agenta/sdk/agents/adapters/__init__.py
+++ b/sdks/python/agenta/sdk/agents/adapters/__init__.py
@@ -3,6 +3,7 @@
 - Backend adapters: ``RivetBackend`` (rivet over ACP), ``InProcessPiBackend`` (in-process Pi,
   the reference backend), ``LocalBackend`` (standalone SDK runs; not yet implemented).
 - Harness adapters: ``PiHarness``, ``ClaudeHarness``, ``AgentaHarness`` (+ ``make_harness``).
+- HTTP/browser protocol adapters live in subpackages, e.g. ``adapters.vercel``.
 
 Shared plumbing for the runner-backed adapters lives in ``agents/utils``.
 """
diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/__init__.py b/sdks/python/agenta/sdk/agents/adapters/vercel/__init__.py
new file mode 100644
index 0000000000..8778b98fc4
--- /dev/null
+++ b/sdks/python/agenta/sdk/agents/adapters/vercel/__init__.py
@@ -0,0 +1,35 @@
+"""Vercel AI SDK adapters for the agent runtime.
+
+The neutral agent runtime speaks ``Message``, ``AgentEvent``, and ``AgentRun``. This package
+is the browser protocol adapter: Vercel ``UIMessage`` request bodies, UI Message Stream parts,
+SSE framing, and the ``/messages`` route helpers.
+"""
+
+from .messages import (
+    from_ui_messages,
+    message_to_vercel_ui_message,
+    to_ui_message,
+    vercel_ui_messages_to_messages,
+)
+from .routing import (
+    inject_stream_session_id,
+    register_agent_message_routes,
+    resolve_session_id,
+)
+from .sse import VERCEL_UI_MESSAGE_STREAM_HEADERS, vercel_sse_stream
+from .stream import agent_run_to_vercel_parts, ui_message_stream
+
+__all__ = [
+    "vercel_ui_messages_to_messages",
+    "message_to_vercel_ui_message",
+    "agent_run_to_vercel_parts",
+    "VERCEL_UI_MESSAGE_STREAM_HEADERS",
+    "vercel_sse_stream",
+    "resolve_session_id",
+    "inject_stream_session_id",
+    "register_agent_message_routes",
+    # Former flat-module names.
+    "from_ui_messages",
+    "to_ui_message",
+    "ui_message_stream",
+]
diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/messages.py b/sdks/python/agenta/sdk/agents/adapters/vercel/messages.py
new file mode 100644
index 0000000000..7f718b9032
--- /dev/null
+++ b/sdks/python/agenta/sdk/agents/adapters/vercel/messages.py
@@ -0,0 +1,219 @@
+"""Vercel ``UIMessage`` conversion at the agent HTTP edge.
+
+This adapter translates between the Vercel AI SDK ``UIMessage`` parts shape and the
+neutral agent runtime ``Message`` / ``ContentBlock`` types. The neutral DTOs stay the port;
+Vercel-specific part names live here.
+"""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Optional
+
+from ...dtos import AgentResult, ContentBlock, Message
+
+TOOL_APPROVAL_REQUEST = "tool-approval-request"
+TOOL_APPROVAL_RESPONSE = "tool-approval-response"
+TOOL_OUTPUT_AVAILABLE = "tool-output-available"
+
+
+def vercel_ui_messages_to_messages(raw: Optional[List[Any]]) -> List[Message]:
+    """Coerce inbound Vercel ``UIMessage`` objects into neutral messages."""
+    messages: List[Message] = []
+    for item in raw or []:
+        message = _ui_message_to_message(item)
+        if message is not None:
+            messages.append(message)
+    return messages
+
+
+def _ui_message_to_message(raw: Any) -> Optional[Message]:
+    if isinstance(raw, Message):
+        return raw
+    if not isinstance(raw, dict) or "role" not in raw:
+        return None
+    role = str(raw["role"])
+
+    parts = raw.get("parts")
+    if parts is None:
+        return Message.from_raw(raw)
+
+    blocks: List[ContentBlock] = []
+    for part in parts or []:
+        blocks.extend(_part_to_blocks(part))
+
+    if not blocks:
+        return Message(role=role, content="")
+    if all(block.type == "text" for block in blocks):
+        return Message(role=role, content="".join(block.text or "" for block in blocks))
+    return Message(role=role, content=blocks)
+
+
+def _part_to_blocks(part: Any) -> List[ContentBlock]:
+    if not isinstance(part, dict):
+        return []
+    ptype = str(part.get("type", ""))
+
+    if ptype == "text":
+        text = part.get("text")
+        return [ContentBlock(type="text", text=text)] if text is not None else []
+
+    if ptype == "file":
+        media = part.get("mediaType") or part.get("mimeType")
+        kind = (
+            "image"
+            if isinstance(media, str) and media.startswith("image/")
+            else "resource"
+        )
+        return [
+            ContentBlock(
+                type=kind,
+                uri=part.get("url") or part.get("uri"),
+                data=part.get("data"),
+                mime_type=media,
+            )
+        ]
+
+    if ptype == TOOL_APPROVAL_REQUEST:
+        return []
+
+    if ptype == TOOL_APPROVAL_RESPONSE:
+        return _approval_response_blocks(part)
+
+    if (
+        ptype == TOOL_OUTPUT_AVAILABLE
+        or ptype == "dynamic-tool"
+        or ptype.startswith("tool-")
+    ):
+        return _tool_part_blocks(part, ptype)
+
+    return []
+
+
+def _tool_part_blocks(part: Dict[str, Any], ptype: str) -> List[ContentBlock]:
+    """A Vercel tool part -> neutral tool-call/result content blocks."""
+    tool_call_id = part.get("toolCallId") or part.get("tool_call_id")
+    tool_name = part.get("toolName") or part.get("tool_name")
+    if (
+        tool_name is None
+        and ptype.startswith("tool-")
+        and ptype != TOOL_OUTPUT_AVAILABLE
+    ):
+        tool_name = ptype[len("tool-") :]
+
+    blocks: List[ContentBlock] = []
+    if ptype != TOOL_OUTPUT_AVAILABLE or "input" in part:
+        blocks.append(
+            ContentBlock(
+                type="tool_call",
+                tool_call_id=tool_call_id,
+                tool_name=tool_name,
+                input=part.get("input"),
+            )
+        )
+
+    state = part.get("state")
+    error_text = part.get("errorText")
+    if error_text is not None or state == "output-error":
+        blocks.append(
+            ContentBlock(
+                type="tool_result",
+                tool_call_id=tool_call_id,
+                tool_name=tool_name,
+                output=error_text if error_text is not None else part.get("output"),
+                is_error=True,
+            )
+        )
+    elif "output" in part or state == "output-available":
+        blocks.append(
+            ContentBlock(
+                type="tool_result",
+                tool_call_id=tool_call_id,
+                tool_name=tool_name,
+                output=part.get("output"),
+                is_error=False,
+            )
+        )
+    return blocks
+
+
+def _approval_response_blocks(part: Dict[str, Any]) -> List[ContentBlock]:
+    """A cross-turn approval reply -> a tool-result block keyed by toolCallId."""
+    tool_call_id = (
+        part.get("toolCallId") or part.get("tool_call_id") or part.get("approvalId")
+    )
+    output = part.get("output")
+    if output is None:
+        approved = part.get("approved")
+        output = {"approved": approved} if approved is not None else part.get("reason")
+    return [ContentBlock(type="tool_result", tool_call_id=tool_call_id, output=output)]
+
+
+def message_to_vercel_ui_message(
+    source: Any,
+    *,
+    message_id: str = "msg-1",
+) -> Dict[str, Any]:
+    """Render an ``AgentResult`` or neutral ``Message`` as one Vercel ``UIMessage``."""
+    if isinstance(source, AgentResult):
+        return {
+            "id": message_id,
+            "role": "assistant",
+            "parts": [{"type": "text", "text": source.output or ""}],
+        }
+    if isinstance(source, Message):
+        return {
+            "id": message_id,
+            "role": source.role,
+            "parts": _content_to_parts(source.content),
+        }
+    raise TypeError(
+        "message_to_vercel_ui_message expects an AgentResult or Message, "
+        f"got {type(source).__name__!r}"
+    )
+
+
+def _content_to_parts(content: Any) -> List[Dict[str, Any]]:
+    if isinstance(content, str):
+        return [{"type": "text", "text": content}] if content else []
+    parts: List[Dict[str, Any]] = []
+    for block in content or []:
+        parts.extend(_block_to_parts(block))
+    return parts
+
+
+def _block_to_parts(block: ContentBlock) -> List[Dict[str, Any]]:
+    if block.type == "text":
+        return [{"type": "text", "text": block.text or ""}]
+    if block.type in ("image", "resource"):
+        part: Dict[str, Any] = {"type": "file"}
+        if block.uri is not None:
+            part["url"] = block.uri
+        if block.mime_type is not None:
+            part["mediaType"] = block.mime_type
+        if block.data is not None:
+            part["data"] = block.data
+        return [part]
+    if block.type == "tool_call":
+        return [
+            {
+                "type": f"tool-{block.tool_name or 'tool'}",
+                "toolCallId": block.tool_call_id,
+                "state": "input-available",
+                "input": block.input,
+            }
+        ]
+    if block.type == "tool_result":
+        return [
+            {
+                "type": f"tool-{block.tool_name or 'tool'}",
+                "toolCallId": block.tool_call_id,
+                "state": "output-error" if block.is_error else "output-available",
+                "output": block.output,
+            }
+        ]
+    return []
+
+
+# Back-compat aliases for the former flat module API.
+from_ui_messages = vercel_ui_messages_to_messages
+to_ui_message = message_to_vercel_ui_message
diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/routing.py b/sdks/python/agenta/sdk/agents/adapters/vercel/routing.py
new file mode 100644
index 0000000000..99e16cef58
--- /dev/null
+++ b/sdks/python/agenta/sdk/agents/adapters/vercel/routing.py
@@ -0,0 +1,175 @@
+"""FastAPI route wiring for the agent ``/messages`` Vercel adapter."""
+
+from __future__ import annotations
+
+import re
+from typing import Any, Callable, Collection, Optional
+from uuid import uuid4
+
+from fastapi import Request
+from fastapi.responses import JSONResponse, Response
+
+from agenta.sdk.contexts.tracing import tracing_context_manager
+from agenta.sdk.models.workflows import (
+    LoadSessionRequest,
+    WorkflowBatchResponse,
+    WorkflowInvokeRequest,
+    WorkflowRequestData,
+    WorkflowStreamingResponse,
+)
+
+from .messages import vercel_ui_messages_to_messages
+
+# An opaque, project-scoped session id (RFC §4.1): bounded length, restricted charset.
+_SESSION_ID_RE = re.compile(r"^[A-Za-z0-9._:-]{1,128}$")
+
+
+def resolve_session_id(session_id: Optional[str]) -> Optional[str]:
+    """Mint a new id when absent, echo a valid one, or return ``None`` when invalid."""
+    if session_id is None:
+        return "sess_" + uuid4().hex
+    return session_id if _SESSION_ID_RE.match(session_id) else None
+
+
+def inject_stream_session_id(
+    response: WorkflowStreamingResponse,
+    session_id: str,
+) -> None:
+    """Stamp ``messageMetadata.sessionId`` onto the first Vercel ``start`` part."""
+    original = response.generator
+
+    async def generator():
+        stamped = False
+        async for part in original():
+            if not stamped and isinstance(part, dict) and part.get("type") == "start":
+                part.setdefault("messageMetadata", {})["sessionId"] = session_id
+                stamped = True
+            yield part
+
+    response.generator = generator
+
+
+def make_messages_endpoint(
+    *,
+    wf: Any,
+    get_request_tracing_context: Callable[[Request], Any],
+    parse_accept: Callable[[Request], Optional[str]],
+    stream_media_types: Collection[str],
+    make_json_response: Callable[[WorkflowBatchResponse], Response],
+    make_not_acceptable_response: Callable[[str, Any], Response],
+    make_stream_response: Callable[[WorkflowStreamingResponse, str], Response],
+    handle_failure: Callable[[Exception], Any],
+):
+    """Build the ``POST /messages`` endpoint for one routed agent workflow."""
+
+    async def messages_endpoint(req: Request, request: WorkflowInvokeRequest):
+        credentials = req.state.auth.get("credentials")
+
+        session_id = resolve_session_id(request.session_id)
+        if session_id is None:
+            return JSONResponse(
+                status_code=400,
+                content={"detail": "session_id violates the allowed charset/length"},
+            )
+
+        try:
+            request.session_id = session_id
+            if request.data is None:
+                request.data = WorkflowRequestData()
+
+            request.data.messages = [
+                message.to_wire()
+                for message in vercel_ui_messages_to_messages(request.data.messages)
+            ]
+
+            requested = parse_accept(req)
+            want_stream = requested in stream_media_types
+            request.data.stream = want_stream
+
+            with tracing_context_manager(get_request_tracing_context(req)):
+                response = await wf.invoke(
+                    request=request,
+                    secrets=None,
+                    credentials=credentials,
+                )
+
+            if isinstance(response, (WorkflowBatchResponse, WorkflowStreamingResponse)):
+                response.session_id = session_id
+
+            if (
+                isinstance(response, WorkflowBatchResponse)
+                and response.status
+                and response.status.code is not None
+                and response.status.code >= 400
+            ):
+                return make_json_response(response)
+
+            if want_stream:
+                if not isinstance(response, WorkflowStreamingResponse):
+                    return make_not_acceptable_response(str(requested), response)
+                inject_stream_session_id(response, session_id)
+                return make_stream_response(response, "vercel")
+
+            if not isinstance(response, WorkflowBatchResponse):
+                return make_not_acceptable_response(
+                    requested or "application/json", response
+                )
+            return make_json_response(response)
+
+        except Exception as exception:
+            return await handle_failure(exception)
+
+    return messages_endpoint
+
+
+def make_load_session_endpoint():
+    """Build the v1 ``POST /load-session`` stub endpoint."""
+
+    async def load_session_endpoint(req: Request, request: LoadSessionRequest):
+        return JSONResponse(
+            content={"session_id": request.session_id, "messages": []},
+        )
+
+    return load_session_endpoint
+
+
+def register_agent_message_routes(
+    target: Any,
+    prefix: str,
+    *,
+    wf: Any,
+    invoke_responses: dict,
+    get_request_tracing_context: Callable[[Request], Any],
+    parse_accept: Callable[[Request], Optional[str]],
+    stream_media_types: Collection[str],
+    make_json_response: Callable[[WorkflowBatchResponse], Response],
+    make_not_acceptable_response: Callable[[str, Any], Response],
+    make_stream_response: Callable[[WorkflowStreamingResponse, str], Response],
+    handle_failure: Callable[[Exception], Any],
+) -> None:
+    """Register ``/messages`` and ``/load-session`` on a FastAPI app/router target."""
+    target.add_api_route(
+        prefix + "/messages",
+        make_messages_endpoint(
+            wf=wf,
+            get_request_tracing_context=get_request_tracing_context,
+            parse_accept=parse_accept,
+            stream_media_types=stream_media_types,
+            make_json_response=make_json_response,
+            make_not_acceptable_response=make_not_acceptable_response,
+            make_stream_response=make_stream_response,
+            handle_failure=handle_failure,
+        ),
+        methods=["POST"],
+        responses=invoke_responses,
+    )
+    target.add_api_route(
+        prefix + "/load-session",
+        make_load_session_endpoint(),
+        methods=["POST"],
+    )
+
+
+# Back-compat aliases for the former routing-private helper names.
+_resolve_session_id = resolve_session_id
+_inject_stream_session_id = inject_stream_session_id
diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/sse.py b/sdks/python/agenta/sdk/agents/adapters/vercel/sse.py
new file mode 100644
index 0000000000..942d8e03d6
--- /dev/null
+++ b/sdks/python/agenta/sdk/agents/adapters/vercel/sse.py
@@ -0,0 +1,30 @@
+"""SSE framing for the Vercel AI SDK UI Message Stream."""
+
+from __future__ import annotations
+
+from json import dumps
+from typing import Any, AsyncGenerator
+
+# Headers the Vercel AI SDK client and intermediaries require for a UI Message Stream.
+# ``x-accel-buffering: no`` stops a proxy from re-buffering the SSE so parts flush live.
+VERCEL_UI_MESSAGE_STREAM_HEADERS = {
+    "x-vercel-ai-ui-message-stream": "v1",
+    "cache-control": "no-cache",
+    "x-accel-buffering": "no",
+}
+
+
+def vercel_sse_stream(aiter: AsyncGenerator[Any, None]):
+    """Frame Vercel UI Message Stream parts as SSE and append ``[DONE]``."""
+
+    async def gen():
+        async for chunk in aiter:
+            yield "data: " + dumps(chunk, ensure_ascii=False) + "\n\n"
+        yield "data: [DONE]\n\n"
+
+    return gen()
+
+
+# Back-compat aliases for the former routing-private names.
+_VERCEL_UI_MESSAGE_STREAM_HEADERS = VERCEL_UI_MESSAGE_STREAM_HEADERS
+_vercel_sse_stream = vercel_sse_stream
diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/stream.py b/sdks/python/agenta/sdk/agents/adapters/vercel/stream.py
new file mode 100644
index 0000000000..6d0e1526b2
--- /dev/null
+++ b/sdks/python/agenta/sdk/agents/adapters/vercel/stream.py
@@ -0,0 +1,216 @@
+"""Encode neutral agent run events as Vercel UI Message Stream parts."""
+
+from __future__ import annotations
+
+from typing import Any, AsyncIterator, Dict, Optional
+
+from ...dtos import AgentResult
+from ...streaming import AgentRun
+from .messages import TOOL_APPROVAL_REQUEST
+
+
+async def agent_run_to_vercel_parts(
+    run: AgentRun,
+    *,
+    session_id: Optional[str] = None,
+    message_id: str = "msg-1",
+    trace_id: Optional[str] = None,
+) -> AsyncIterator[Dict[str, Any]]:
+    """Project a live ``AgentRun`` into Vercel UI Message Stream part dictionaries."""
+    start: Dict[str, Any] = {"type": "start", "messageId": message_id}
+    if session_id is not None:
+        start["messageMetadata"] = {"sessionId": session_id}
+    yield start
+    yield {"type": "start-step"}
+
+    text_seq = 0
+    reasoning_seq = 0
+    usage: Optional[Dict[str, Any]] = None
+    stop_reason: Optional[str] = None
+
+    try:
+        async for event in run:
+            etype = event.type
+            data = event.data
+
+            if etype == "message":
+                text_seq += 1
+                tid = f"text-{text_seq}"
+                yield {"type": "text-start", "id": tid}
+                yield {"type": "text-delta", "id": tid, "delta": data.get("text", "")}
+                yield {"type": "text-end", "id": tid}
+            elif etype == "message_start":
+                yield {"type": "text-start", "id": data.get("id")}
+            elif etype == "message_delta":
+                yield {
+                    "type": "text-delta",
+                    "id": data.get("id"),
+                    "delta": data.get("delta", ""),
+                }
+            elif etype == "message_end":
+                yield {"type": "text-end", "id": data.get("id")}
+            elif etype == "thought":
+                reasoning_seq += 1
+                rid = f"reasoning-{reasoning_seq}"
+                yield {"type": "reasoning-start", "id": rid}
+                yield {
+                    "type": "reasoning-delta",
+                    "id": rid,
+                    "delta": data.get("text", ""),
+                }
+                yield {"type": "reasoning-end", "id": rid}
+            elif etype == "reasoning_start":
+                yield {"type": "reasoning-start", "id": data.get("id")}
+            elif etype == "reasoning_delta":
+                yield {
+                    "type": "reasoning-delta",
+                    "id": data.get("id"),
+                    "delta": data.get("delta", ""),
+                }
+            elif etype == "reasoning_end":
+                yield {"type": "reasoning-end", "id": data.get("id")}
+            elif etype == "tool_call":
+                tool_call_id = data.get("id")
+                tool_name = data.get("name")
+                yield {
+                    "type": "tool-input-start",
+                    "toolCallId": tool_call_id,
+                    "toolName": tool_name,
+                }
+                available: Dict[str, Any] = {
+                    "type": "tool-input-available",
+                    "toolCallId": tool_call_id,
+                    "toolName": tool_name,
+                    "input": data.get("input"),
+                }
+                if data.get("render") is not None:
+                    available["render"] = data["render"]
+                yield available
+            elif etype == "tool_result":
+                tool_call_id = data.get("id")
+                if data.get("denied"):
+                    yield {
+                        "type": "tool-output-denied",
+                        "toolCallId": tool_call_id,
+                    }
+                elif data.get("isError"):
+                    yield {
+                        "type": "tool-output-error",
+                        "toolCallId": tool_call_id,
+                        "errorText": _as_text(data.get("output")),
+                    }
+                else:
+                    structured = data.get("data")
+                    out = structured if structured is not None else data.get("output")
+                    available = {
+                        "type": "tool-output-available",
+                        "toolCallId": tool_call_id,
+                        "output": out,
+                    }
+                    if data.get("render") is not None:
+                        available["render"] = data["render"]
+                    yield available
+            elif etype == "interaction_request":
+                yield _interaction_part(data)
+            elif etype == "data":
+                part: Dict[str, Any] = {
+                    "type": f"data-{data.get('name', 'data')}",
+                    "data": data.get("data"),
+                }
+                if data.get("transient"):
+                    part["transient"] = True
+                yield part
+            elif etype == "file":
+                yield {
+                    "type": "file",
+                    "url": data.get("url"),
+                    "mediaType": data.get("mediaType"),
+                }
+            elif etype == "usage":
+                usage = _usage_metadata(data)
+            elif etype == "error":
+                yield {"type": "error", "errorText": data.get("message", "")}
+            elif etype == "done":
+                stop_reason = data.get("stopReason")
+    except Exception as exc:
+        yield {"type": "error", "errorText": str(exc)}
+        return
+
+    if usage is None or trace_id is None:
+        result = _safe_result(run)
+        if result is not None:
+            if usage is None:
+                usage = _usage_metadata(result.usage or {})
+                if stop_reason is None:
+                    stop_reason = result.stop_reason
+            if trace_id is None:
+                trace_id = result.trace_id
+
+    yield {"type": "finish-step"}
+    finish: Dict[str, Any] = {"type": "finish"}
+    if stop_reason is not None:
+        finish["finishReason"] = stop_reason
+    metadata: Dict[str, Any] = {}
+    if usage:
+        metadata["usage"] = usage
+    if trace_id is not None:
+        metadata["traceId"] = trace_id
+    if metadata:
+        finish["messageMetadata"] = metadata
+    yield finish
+
+
+def _interaction_part(data: Dict[str, Any]) -> Dict[str, Any]:
+    """Project a neutral ``interaction_request`` event to a Vercel stream part."""
+    kind = data.get("kind")
+    payload = data.get("payload") or {}
+    if kind == "permission":
+        return {
+            "type": TOOL_APPROVAL_REQUEST,
+            "approvalId": data.get("id"),
+            "toolCallId": _approval_tool_call_id(payload),
+            "availableReplies": payload.get("availableReplies"),
+            "toolCall": payload.get("toolCall"),
+        }
+    if kind == "input":
+        return {"type": "data-input-request", "id": data.get("id"), "data": payload}
+    return {
+        "type": "data-interaction",
+        "id": data.get("id"),
+        "data": {"kind": kind, "payload": payload},
+    }
+
+
+def _approval_tool_call_id(payload: Dict[str, Any]) -> Optional[Any]:
+    tool_call_id = payload.get("toolCallId")
+    if tool_call_id is not None:
+        return tool_call_id
+    tool_call = payload.get("toolCall")
+    if isinstance(tool_call, dict):
+        return tool_call.get("id") or tool_call.get("toolCallId")
+    return None
+
+
+def _usage_metadata(data: Dict[str, Any]) -> Dict[str, Any]:
+    return {
+        key: data[key]
+        for key in ("input", "output", "total", "cost")
+        if data.get(key) is not None
+    }
+
+
+def _as_text(value: Any) -> str:
+    if value is None:
+        return ""
+    return value if isinstance(value, str) else str(value)
+
+
+def _safe_result(run: AgentRun) -> Optional[AgentResult]:
+    try:
+        return run.result()
+    except Exception:
+        return None
+
+
+# Back-compat alias for the former flat module API.
+ui_message_stream = agent_run_to_vercel_parts
diff --git a/sdks/python/agenta/sdk/agents/ui_messages.py b/sdks/python/agenta/sdk/agents/ui_messages.py
index 41517245aa..1e19af0348 100644
--- a/sdks/python/agenta/sdk/agents/ui_messages.py
+++ b/sdks/python/agenta/sdk/agents/ui_messages.py
@@ -1,491 +1,24 @@
-"""UI message codec: translate between the Vercel AI SDK ``UIMessage`` wire shape and the
-neutral agent runtime types (``Message`` / ``AgentEvent`` / ``AgentResult``).
+"""Compatibility imports for the Vercel UI Message adapter.
 
-This is the ``/messages`` egress adapter, the parts-aware sibling of
-:func:`agenta.sdk.agents.to_messages` (which only understands the ``/invoke`` ``{role,
-content}`` shape). The neutral types in ``dtos.py`` stay the port; this module is one more
-adapter behind that seam, so the ``/run`` runner wire (``utils/wire.py``, ``{role, content}``)
-is unchanged — the Vercel shape lives only at the HTTP edge.
-
-Three directions:
-
-- :func:`from_ui_messages` — inbound ``UIMessage[]`` -> ``List[Message]``. Text and file parts
-  fold into content blocks; tool and approval parts are PRESERVED as structured ``tool_call`` /
-  ``tool_result`` :class:`~agenta.sdk.agents.dtos.ContentBlock`s (never flattened to text), so a
-  cross-turn human-in-the-loop reply replays as a real tool turn and the model resumes from the
-  result. The runner's message transcript renders these blocks into the cold replay.
-- :func:`to_ui_message` — outbound ``AgentResult`` / ``Message`` -> one ``UIMessage`` dict, for
-  the ``load-session`` history.
-- :func:`ui_message_stream` — the streaming edge: a live
-  :class:`~agenta.sdk.agents.streaming.AgentRun` -> Vercel UI Message Stream parts
-  (``start`` ... ``finish``). The SSE framing and the terminal ``data: [DONE]`` are added by the
-  routing layer (``_vercel_sse_stream``); this generator yields the part dicts only.
+New code should import from :mod:`agenta.sdk.agents.adapters.vercel`.
 """
 
 from __future__ import annotations
 
-from typing import Any, AsyncIterator, Dict, List, Optional
-
-from .dtos import AgentResult, ContentBlock, Message
-from .streaming import AgentRun
-
-# Inbound UIMessage part type names handled specially (the rest of ``tool-*`` is a tool call).
-_TOOL_APPROVAL_REQUEST = "tool-approval-request"
-_TOOL_APPROVAL_RESPONSE = "tool-approval-response"
-_TOOL_OUTPUT_AVAILABLE = "tool-output-available"
-
-
-# ---------------------------------------------------------------------------
-# Inbound: UIMessage[] -> List[Message]
-# ---------------------------------------------------------------------------
-
-
-def from_ui_messages(raw: Optional[List[Any]]) -> List[Message]:
-    """Coerce inbound Vercel ``UIMessage`` objects into neutral :class:`Message` objects.
-
-    The parts-aware sibling of :func:`agenta.sdk.agents.to_messages`. Tool and approval parts
-    are preserved as structured tool-call / tool-result content blocks (never dropped), so a
-    cross-turn human-in-the-loop reply resumes the pending interaction on the next turn.
-    """
-    messages: List[Message] = []
-    for item in raw or []:
-        message = _ui_message_to_message(item)
-        if message is not None:
-            messages.append(message)
-    return messages
-
-
-def _ui_message_to_message(raw: Any) -> Optional[Message]:
-    if isinstance(raw, Message):
-        return raw
-    if not isinstance(raw, dict) or "role" not in raw:
-        return None
-    role = str(raw["role"])
-
-    parts = raw.get("parts")
-    if parts is None:
-        # Not a parts-based UIMessage — fall back to the {role, content} shape so a mixed
-        # history still parses.
-        return Message.from_raw(raw)
-
-    blocks: List[ContentBlock] = []
-    for part in parts or []:
-        blocks.extend(_part_to_blocks(part))
-
-    if not blocks:
-        return Message(role=role, content="")
-    # Collapse an all-text message to a plain string (the shape the runner replays); keep the
-    # block list when any structured (file / tool) content is present.
-    if all(block.type == "text" for block in blocks):
-        return Message(role=role, content="".join(block.text or "" for block in blocks))
-    return Message(role=role, content=blocks)
-
-
-def _part_to_blocks(part: Any) -> List[ContentBlock]:
-    if not isinstance(part, dict):
-        return []
-    ptype = str(part.get("type", ""))
-
-    if ptype == "text":
-        text = part.get("text")
-        return [ContentBlock(type="text", text=text)] if text is not None else []
-
-    if ptype == "file":
-        media = part.get("mediaType") or part.get("mimeType")
-        kind = (
-            "image"
-            if isinstance(media, str) and media.startswith("image/")
-            else "resource"
-        )
-        return [
-            ContentBlock(
-                type=kind,
-                uri=part.get("url") or part.get("uri"),
-                data=part.get("data"),
-                mime_type=media,
-            )
-        ]
-
-    if ptype == _TOOL_APPROVAL_REQUEST:
-        # The server's own request, echoed back in history; regenerated on replay, not input.
-        return []
-
-    if ptype == _TOOL_APPROVAL_RESPONSE:
-        return _approval_response_blocks(part)
-
-    if (
-        ptype == _TOOL_OUTPUT_AVAILABLE
-        or ptype == "dynamic-tool"
-        or ptype.startswith("tool-")
-    ):
-        return _tool_part_blocks(part, ptype)
-
-    # reasoning / step-start / data-* parts are the assistant's own output or transient UI;
-    # they are not model input on replay, so they are dropped.
-    return []
-
-
-def _tool_part_blocks(part: Dict[str, Any], ptype: str) -> List[ContentBlock]:
-    """A Vercel tool part -> a ``tool_call`` block plus, when resolved, a ``tool_result``.
-
-    Field names match what the runner's transcript renders: ``toolCallId`` / ``toolName`` /
-    ``input`` / ``output`` / ``isError`` (via :meth:`ContentBlock.to_wire`).
-    """
-    tool_call_id = part.get("toolCallId") or part.get("tool_call_id")
-    tool_name = part.get("toolName") or part.get("tool_name")
-    if (
-        tool_name is None
-        and ptype.startswith("tool-")
-        and ptype != _TOOL_OUTPUT_AVAILABLE
-    ):
-        tool_name = ptype[len("tool-") :]
-
-    blocks: List[ContentBlock] = []
-
-    # The call itself (a bare tool-output-available part carries only a result).
-    if ptype != _TOOL_OUTPUT_AVAILABLE or "input" in part:
-        blocks.append(
-            ContentBlock(
-                type="tool_call",
-                tool_call_id=tool_call_id,
-                tool_name=tool_name,
-                input=part.get("input"),
-            )
-        )
-
-    state = part.get("state")
-    error_text = part.get("errorText")
-    if error_text is not None or state == "output-error":
-        blocks.append(
-            ContentBlock(
-                type="tool_result",
-                tool_call_id=tool_call_id,
-                tool_name=tool_name,
-                output=error_text if error_text is not None else part.get("output"),
-                is_error=True,
-            )
-        )
-    elif "output" in part or state == "output-available":
-        blocks.append(
-            ContentBlock(
-                type="tool_result",
-                tool_call_id=tool_call_id,
-                tool_name=tool_name,
-                output=part.get("output"),
-                is_error=False,
-            )
-        )
-    return blocks
-
-
-def _approval_response_blocks(part: Dict[str, Any]) -> List[ContentBlock]:
-    """A cross-turn ``tool-approval-response`` reply -> a ``tool_result`` keyed by toolCallId,
-    so the runtime matches the pending interaction and resumes (the resolve step is joint)."""
-    tool_call_id = (
-        part.get("toolCallId") or part.get("tool_call_id") or part.get("approvalId")
-    )
-    output = part.get("output")
-    if output is None:
-        approved = part.get("approved")
-        output = {"approved": approved} if approved is not None else part.get("reason")
-    return [ContentBlock(type="tool_result", tool_call_id=tool_call_id, output=output)]
-
-
-# ---------------------------------------------------------------------------
-# Outbound (batch): AgentResult / Message -> one UIMessage dict
-# ---------------------------------------------------------------------------
-
-
-def to_ui_message(source: Any, *, message_id: str = "msg-1") -> Dict[str, Any]:
-    """Render an :class:`AgentResult` or :class:`Message` as one Vercel ``UIMessage`` dict
-    (the shape ``load-session`` returns and ``useChat`` takes as its initial messages)."""
-    if isinstance(source, AgentResult):
-        return {
-            "id": message_id,
-            "role": "assistant",
-            "parts": [{"type": "text", "text": source.output or ""}],
-        }
-    if isinstance(source, Message):
-        return {
-            "id": message_id,
-            "role": source.role,
-            "parts": _content_to_parts(source.content),
-        }
-    raise TypeError(
-        f"to_ui_message expects an AgentResult or Message, got {type(source).__name__!r}"
-    )
-
-
-def _content_to_parts(content: Any) -> List[Dict[str, Any]]:
-    if isinstance(content, str):
-        return [{"type": "text", "text": content}] if content else []
-    parts: List[Dict[str, Any]] = []
-    for block in content or []:
-        parts.extend(_block_to_parts(block))
-    return parts
-
-
-def _block_to_parts(block: ContentBlock) -> List[Dict[str, Any]]:
-    if block.type == "text":
-        return [{"type": "text", "text": block.text or ""}]
-    if block.type in ("image", "resource"):
-        part: Dict[str, Any] = {"type": "file"}
-        if block.uri is not None:
-            part["url"] = block.uri
-        if block.mime_type is not None:
-            part["mediaType"] = block.mime_type
-        if block.data is not None:
-            part["data"] = block.data
-        return [part]
-    if block.type == "tool_call":
-        return [
-            {
-                "type": f"tool-{block.tool_name or 'tool'}",
-                "toolCallId": block.tool_call_id,
-                "state": "input-available",
-                "input": block.input,
-            }
-        ]
-    if block.type == "tool_result":
-        return [
-            {
-                "type": f"tool-{block.tool_name or 'tool'}",
-                "toolCallId": block.tool_call_id,
-                "state": "output-error" if block.is_error else "output-available",
-                "output": block.output,
-            }
-        ]
-    return []
-
-
-# ---------------------------------------------------------------------------
-# Streaming edge: a live AgentRun -> Vercel UI Message Stream parts
-# ---------------------------------------------------------------------------
-
-
-async def ui_message_stream(
-    run: AgentRun,
-    *,
-    session_id: Optional[str] = None,
-    message_id: str = "msg-1",
-    trace_id: Optional[str] = None,
-) -> AsyncIterator[Dict[str, Any]]:
-    """Encode a live :class:`AgentRun` as Vercel UI Message Stream part dicts.
-
-    Consumes the run's live ``AgentEvent`` stream and yields parts as they arrive: ``start``
-    (carrying ``messageMetadata.sessionId``) first, then the body, then ``finish`` (carrying
-    ``messageMetadata.traceId`` so the client can open the run's OTel trace, RFC §6.1). The SSE
-    framing and the terminal ``data: [DONE]`` are added by the routing layer
-    (``_vercel_sse_stream``). On a terminal run failure the run raises while iterating; that is
-    surfaced as an ``error`` part (RFC §8.2) and the stream ends without a ``finish``.
-    """
-    start: Dict[str, Any] = {"type": "start", "messageId": message_id}
-    if session_id is not None:
-        start["messageMetadata"] = {"sessionId": session_id}
-    yield start
-    yield {"type": "start-step"}
-
-    text_seq = 0
-    reasoning_seq = 0
-    usage: Optional[Dict[str, Any]] = None
-    stop_reason: Optional[str] = None
-
-    try:
-        async for event in run:
-            etype = event.type
-            data = event.data
-
-            if etype == "message":
-                text_seq += 1
-                tid = f"text-{text_seq}"
-                yield {"type": "text-start", "id": tid}
-                yield {"type": "text-delta", "id": tid, "delta": data.get("text", "")}
-                yield {"type": "text-end", "id": tid}
-            elif etype == "message_start":
-                yield {"type": "text-start", "id": data.get("id")}
-            elif etype == "message_delta":
-                yield {
-                    "type": "text-delta",
-                    "id": data.get("id"),
-                    "delta": data.get("delta", ""),
-                }
-            elif etype == "message_end":
-                yield {"type": "text-end", "id": data.get("id")}
-            elif etype == "thought":
-                reasoning_seq += 1
-                rid = f"reasoning-{reasoning_seq}"
-                yield {"type": "reasoning-start", "id": rid}
-                yield {
-                    "type": "reasoning-delta",
-                    "id": rid,
-                    "delta": data.get("text", ""),
-                }
-                yield {"type": "reasoning-end", "id": rid}
-            elif etype == "reasoning_start":
-                yield {"type": "reasoning-start", "id": data.get("id")}
-            elif etype == "reasoning_delta":
-                yield {
-                    "type": "reasoning-delta",
-                    "id": data.get("id"),
-                    "delta": data.get("delta", ""),
-                }
-            elif etype == "reasoning_end":
-                yield {"type": "reasoning-end", "id": data.get("id")}
-            elif etype == "tool_call":
-                tool_call_id = data.get("id")
-                tool_name = data.get("name")
-                yield {
-                    "type": "tool-input-start",
-                    "toolCallId": tool_call_id,
-                    "toolName": tool_name,
-                }
-                available: Dict[str, Any] = {
-                    "type": "tool-input-available",
-                    "toolCallId": tool_call_id,
-                    "toolName": tool_name,
-                    "input": data.get("input"),
-                }
-                if data.get("render") is not None:
-                    available["render"] = data["render"]
-                yield available
-            elif etype == "tool_result":
-                tool_call_id = data.get("id")
-                if data.get("denied"):
-                    # A human denied the tool, so it never ran (RFC: emit tool-output-denied
-                    # instead of tool-output-available; the FE renders the output-denied state).
-                    yield {
-                        "type": "tool-output-denied",
-                        "toolCallId": tool_call_id,
-                    }
-                elif data.get("isError"):
-                    yield {
-                        "type": "tool-output-error",
-                        "toolCallId": tool_call_id,
-                        "errorText": _as_text(data.get("output")),
-                    }
-                else:
-                    # Prefer the structured object (generative UI); fall back to the text form.
-                    structured = data.get("data")
-                    out = structured if structured is not None else data.get("output")
-                    available = {
-                        "type": "tool-output-available",
-                        "toolCallId": tool_call_id,
-                        "output": out,
-                    }
-                    if data.get("render") is not None:
-                        available["render"] = data["render"]
-                    yield available
-            elif etype == "interaction_request":
-                yield _interaction_part(data)
-            elif etype == "data":
-                part: Dict[str, Any] = {
-                    "type": f"data-{data.get('name', 'data')}",
-                    "data": data.get("data"),
-                }
-                if data.get("transient"):
-                    part["transient"] = True
-                yield part
-            elif etype == "file":
-                yield {
-                    "type": "file",
-                    "url": data.get("url"),
-                    "mediaType": data.get("mediaType"),
-                }
-            elif etype == "usage":
-                usage = _usage_metadata(data)
-            elif etype == "error":
-                yield {"type": "error", "errorText": data.get("message", "")}
-            elif etype == "done":
-                stop_reason = data.get("stopReason")
-            # unknown event types are ignored
-    except Exception as exc:  # AgentRun raises on a terminal ok=false result
-        yield {"type": "error", "errorText": str(exc)}
-        return
-
-    # Pull usage and the trace id from the terminal result when not already known, the same
-    # fallback both lean on (RFC §6.1: the finish trace id matches the JSON response's).
-    if usage is None or trace_id is None:
-        result = _safe_result(run)
-        if result is not None:
-            if usage is None:
-                usage = _usage_metadata(result.usage or {})
-                if stop_reason is None:
-                    stop_reason = result.stop_reason
-            if trace_id is None:
-                trace_id = result.trace_id
-
-    yield {"type": "finish-step"}
-    finish: Dict[str, Any] = {"type": "finish"}
-    if stop_reason is not None:
-        finish["finishReason"] = stop_reason
-    # usage and traceId coexist under messageMetadata; the client reads message.metadata.traceId.
-    metadata: Dict[str, Any] = {}
-    if usage:
-        metadata["usage"] = usage
-    if trace_id is not None:
-        metadata["traceId"] = trace_id
-    if metadata:
-        finish["messageMetadata"] = metadata
-    yield finish
-
-
-def _interaction_part(data: Dict[str, Any]) -> Dict[str, Any]:
-    """Project an ``interaction_request`` event to a Vercel part. Permission -> an approval
-    request; input -> a forward-looking input part; any other kind (e.g. ``client_tool``) ->
-    a generic interaction part so it is surfaced, not dropped (the resolve step is joint)."""
-    kind = data.get("kind")
-    payload = data.get("payload") or {}
-    if kind == "permission":
-        return {
-            "type": _TOOL_APPROVAL_REQUEST,
-            "approvalId": data.get("id"),
-            # REQUIRED alongside approvalId (RFC / AI SDK chunk): the gated tool's call id, so
-            # the FE binds the approval to its existing tool part and the inbound
-            # tool-approval-response (keyed by toolCallId) correlates back for the cross-turn
-            # resume. Prefer the top-level toolCallId the runner emits; fall back to the nested
-            # ACP toolCall detail (id / toolCallId).
-            "toolCallId": _approval_tool_call_id(payload),
-            "availableReplies": payload.get("availableReplies"),
-            "toolCall": payload.get("toolCall"),
-        }
-    if kind == "input":
-        return {"type": "data-input-request", "id": data.get("id"), "data": payload}
-    return {
-        "type": "data-interaction",
-        "id": data.get("id"),
-        "data": {"kind": kind, "payload": payload},
-    }
-
-
-def _approval_tool_call_id(payload: Dict[str, Any]) -> Optional[Any]:
-    """The gated tool's call id for a ``tool-approval-request``. The runner stamps a top-level
-    ``toolCallId`` on the permission payload; if it is absent, dig it out of the nested ACP
-    ``toolCall`` detail (``id`` / ``toolCallId``)."""
-    tool_call_id = payload.get("toolCallId")
-    if tool_call_id is not None:
-        return tool_call_id
-    tool_call = payload.get("toolCall")
-    if isinstance(tool_call, dict):
-        return tool_call.get("id") or tool_call.get("toolCallId")
-    return None
-
-
-def _usage_metadata(data: Dict[str, Any]) -> Dict[str, Any]:
-    return {
-        key: data[key]
-        for key in ("input", "output", "total", "cost")
-        if data.get(key) is not None
-    }
-
-
-def _as_text(value: Any) -> str:
-    if value is None:
-        return ""
-    return value if isinstance(value, str) else str(value)
-
-
-def _safe_result(run: AgentRun) -> Optional[AgentResult]:
-    try:
-        return run.result()
-    except Exception:  # result not available (stream not fully consumed / failed)
-        return None
+from .adapters.vercel import (
+    agent_run_to_vercel_parts,
+    from_ui_messages,
+    message_to_vercel_ui_message,
+    to_ui_message,
+    ui_message_stream,
+    vercel_ui_messages_to_messages,
+)
+
+__all__ = [
+    "vercel_ui_messages_to_messages",
+    "message_to_vercel_ui_message",
+    "agent_run_to_vercel_parts",
+    "from_ui_messages",
+    "to_ui_message",
+    "ui_message_stream",
+]
diff --git a/sdks/python/agenta/sdk/decorators/routing.py b/sdks/python/agenta/sdk/decorators/routing.py
index 4a57846d6e..8b2f1d6136 100644
--- a/sdks/python/agenta/sdk/decorators/routing.py
+++ b/sdks/python/agenta/sdk/decorators/routing.py
@@ -20,6 +20,15 @@
     WorkflowBaseResponse,
     WorkflowServiceResponseData,
 )
+from agenta.sdk.agents.adapters.vercel.routing import (
+    inject_stream_session_id,
+    register_agent_message_routes,
+    resolve_session_id,
+)
+from agenta.sdk.agents.adapters.vercel.sse import (
+    VERCEL_UI_MESSAGE_STREAM_HEADERS as _VERCEL_UI_MESSAGE_STREAM_HEADERS,
+    vercel_sse_stream as _vercel_sse_stream,
+)
 from agenta.sdk.middlewares.routing.cors import CORSMiddleware
 from agenta.sdk.middlewares.routing.auth import AuthMiddleware
 from agenta.sdk.middlewares.routing.otel import OTelMiddleware
@@ -29,12 +38,25 @@
 from agenta.sdk.engines.running.errors import ErrorStatus
 
 
+def _resolve_session_id(session_id: Optional[str]) -> Optional[str]:
+    """Compatibility wrapper for the Vercel adapter helper."""
+    return resolve_session_id(session_id)
+
+
+def _inject_stream_session_id(
+    response: WorkflowStreamingResponse,
+    session_id: str,
+) -> None:
+    """Compatibility wrapper for the Vercel adapter helper."""
+    inject_stream_session_id(response, session_id)
+
+
 # ---------------------------------------------------------------------------
 # Reserved path segments — may not appear anywhere in a route path.
 # These names are used by the per-route namespace triple itself.
 # ---------------------------------------------------------------------------
 
-_RESERVED_PATHS = {"invoke", "inspect"}
+_RESERVED_PATHS = {"invoke", "inspect", "messages", "load-session"}
 
 
 def _validate_path(path: str) -> None:
@@ -195,15 +217,27 @@ def _make_stream_response(
 ) -> StreamingResponse:
     aiter = response.iterator()
 
-    if wire_format == "sse":
-        media_type = "text/event-stream"
-        res = StreamingResponse(_sse_stream(aiter), media_type=media_type)
+    if wire_format == "vercel":
+        # The Vercel UI Message Stream: SSE framing terminated by `data: [DONE]`, plus the
+        # headers the AI SDK client and proxies require. Endpoint-selected (the agent
+        # `/messages` route passes "vercel"), not derived from Accept — a Vercel UI message
+        # stream and a plain SSE stream share the `text/event-stream` media type, so the
+        # choice cannot come from the Accept header alone.
+        res = StreamingResponse(
+            _vercel_sse_stream(aiter), media_type="text/event-stream"
+        )
+        for key, value in _VERCEL_UI_MESSAGE_STREAM_HEADERS.items():
+            res.headers.setdefault(key, value)
+    elif wire_format == "sse":
+        res = StreamingResponse(_sse_stream(aiter), media_type="text/event-stream")
     elif wire_format == "ndjson":
-        media_type = "application/x-ndjson"
-        res = StreamingResponse(_ndjson_stream(aiter), media_type=media_type)
+        res = StreamingResponse(
+            _ndjson_stream(aiter), media_type="application/x-ndjson"
+        )
     else:
-        media_type = "application/x-ndjson"
-        res = StreamingResponse(_ndjson_stream(aiter), media_type=media_type)
+        res = StreamingResponse(
+            _ndjson_stream(aiter), media_type="application/x-ndjson"
+        )
 
     return _set_common_headers(res, response)  # type: ignore
 
@@ -451,6 +485,10 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest):
             except Exception as exception:
                 return await handle_inspect_failure(exception)
 
+        # Agent-only endpoints are Vercel/browser-protocol adapters. Keep their request
+        # folding, session id handling, and UI Message Stream details out of the generic
+        # workflow route decorator.
+
         invoke_responses: dict = {
             200: {
                 "description": "Negotiated response — format determined by Accept header",
@@ -489,6 +527,25 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest):
             },
         }
 
+        agent_enabled = bool(self.flags and self.flags.get("is_agent"))
+
+        def _add_agent_routes(target: Any, prefix: str) -> None:
+            """Register the agent-only /messages + /load-session routes on a target
+            (sub-app / router / mount root), mirroring how /invoke + /inspect are added."""
+            register_agent_message_routes(
+                target,
+                prefix,
+                wf=wf,
+                invoke_responses=invoke_responses,
+                get_request_tracing_context=_get_request_tracing_context,
+                parse_accept=_parse_accept,
+                stream_media_types=STREAM_MEDIA_TYPES,
+                make_json_response=_make_json_response,
+                make_not_acceptable_response=_make_not_acceptable_response,
+                make_stream_response=_make_stream_response,
+                handle_failure=handle_invoke_failure,
+            )
+
         # ------------------------------------------------------------------
         # Legacy path: router= was provided.
         # Registers prefixed routes on the APIRouter without isolation.
@@ -506,6 +563,8 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest):
                 methods=["POST"],
                 response_model=WorkflowInvokeRequest,
             )
+            if agent_enabled:
+                _add_agent_routes(self.router_fallback, self.path)
             return foo
 
         # ------------------------------------------------------------------
@@ -528,6 +587,8 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest):
                 methods=["POST"],
                 response_model=WorkflowInvokeRequest,
             )
+            if agent_enabled:
+                _add_agent_routes(self.mount_root, "")
 
             return foo
 
@@ -545,6 +606,8 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest):
             methods=["POST"],
             response_model=WorkflowInvokeRequest,
         )
+        if agent_enabled:
+            _add_agent_routes(sub_app, "")
 
         self.mount_root.mount(self.path, sub_app)
 
diff --git a/sdks/python/agenta/sdk/models/workflows.py b/sdks/python/agenta/sdk/models/workflows.py
index a9437342df..0cb751e9dc 100644
--- a/sdks/python/agenta/sdk/models/workflows.py
+++ b/sdks/python/agenta/sdk/models/workflows.py
@@ -79,6 +79,7 @@ class WorkflowFlags(BaseModel):
     # interface-derived
     ## schema
     is_chat: bool = False
+    is_agent: bool = False
     ## hook
     has_url: bool = False
     ## code
@@ -106,6 +107,7 @@ class WorkflowQueryFlags(BaseModel):
     # interface-derived
     ## schema
     is_chat: Optional[bool] = None
+    is_agent: Optional[bool] = None
     ## hook
     has_url: Optional[bool] = None
     ## code
@@ -209,6 +211,15 @@ class WorkflowRequestData(BaseModel):
     #
     testcase: Optional[dict] = None
     inputs: Optional[dict] = None
+    # The agent ``/messages`` egress lifts the conversation out of ``inputs`` to this
+    # first-class member, in the Vercel ``UIMessage`` shape; ``/invoke`` ignores it.
+    messages: Optional[list] = None
+    # Transport mode for the agent ``/messages`` route: the endpoint sets this from the Accept
+    # negotiation so the shared agent handler streams (returns an async generator) instead of
+    # returning a batch dict. A sibling of ``messages`` / ``inputs`` / ``parameters`` on purpose
+    # — it must not live in ``parameters``, where it would leak into agent config / revision
+    # state / trace inputs. ``/invoke`` leaves it unset (batch).
+    stream: Optional[bool] = None
     #
     trace: Optional[dict] = None
     outputs: Optional[Any] = None
@@ -233,6 +244,10 @@ class WorkflowBaseRequest(Metadata):
     secrets: Optional[Dict[str, Any]] = None
     credentials: Optional[str] = None
 
+    # The agent ``/messages`` session this turn belongs to (opaque, project-scoped). Optional;
+    # absent on ``/invoke`` and on the first turn of a server-minted session.
+    session_id: Optional[str] = None
+
     @model_validator(mode="before")
     def _coerce_nested_models(cls, values: Dict[str, Any]) -> Dict[str, Any]:
         if "references" in values and isinstance(values["references"], dict):
@@ -291,6 +306,10 @@ class WorkflowBaseResponse(TraceID, SpanID):
 
     status: Optional[WorkflowServiceStatus] = WorkflowServiceStatus()
 
+    # The resolved agent session id (minted or echoed) on the ``/messages`` response, alongside
+    # ``trace_id`` / ``span_id``. ``None`` for plain ``/invoke`` responses.
+    session_id: Optional[str] = None
+
 
 # back-compat alias
 WorkflowServiceBaseResponse = WorkflowBaseResponse
@@ -324,6 +343,20 @@ async def iterator(self):
 ]
 
 
+class LoadSessionRequest(BaseModel):
+    """``POST /load-session`` body. The session id is required (RFC §7.1)."""
+
+    session_id: str
+
+
+class LoadSessionResponse(BaseModel):
+    """``POST /load-session`` response: a session's history as Vercel ``UIMessage`` objects,
+    the shape ``useChat`` takes as its initial ``messages``."""
+
+    session_id: str
+    messages: List[Dict[str, Any]] = Field(default_factory=list)
+
+
 # aliases ----------------------------------------------------------------------
 
 
diff --git a/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py b/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py
index aaf9afec10..f7cce7d31c 100644
--- a/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py
+++ b/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py
@@ -1,12 +1,13 @@
-"""Tests for the UI message codec (``agenta.sdk.agents.ui_messages``), the ``/messages``
-egress adapter between the Vercel ``UIMessage`` shape and the neutral runtime types.
+"""Tests for the Vercel UI message adapter, the ``/messages`` egress adapter between the
+Vercel ``UIMessage`` shape and the neutral runtime types.
 
 Three directions:
 
-- ``from_ui_messages`` — inbound parts -> ``Message``; tool/approval parts are preserved as
-  structured ``tool_call`` / ``tool_result`` content blocks (the HITL reply channel).
-- ``to_ui_message`` — outbound ``AgentResult`` / ``Message`` -> one ``UIMessage`` dict.
-- ``ui_message_stream`` — a live ``AgentRun`` -> Vercel UI Message Stream parts.
+- ``vercel_ui_messages_to_messages`` — inbound parts -> ``Message``; tool/approval parts are
+  preserved as structured ``tool_call`` / ``tool_result`` content blocks.
+- ``message_to_vercel_ui_message`` — outbound ``AgentResult`` / ``Message`` -> one
+  ``UIMessage`` dict.
+- ``agent_run_to_vercel_parts`` — a live ``AgentRun`` -> Vercel UI Message Stream parts.
 
 The stream tests fabricate an ``AgentRun`` from a fixed record list (the same trick
 ``test_streaming.py`` uses), so they are pure and need no backend.
@@ -17,10 +18,10 @@
 from typing import Any, Dict, List
 
 from agenta.sdk.agents import AgentRun, AgentResult, Message
-from agenta.sdk.agents.ui_messages import (
-    from_ui_messages,
-    to_ui_message,
-    ui_message_stream,
+from agenta.sdk.agents.adapters.vercel import (
+    agent_run_to_vercel_parts,
+    message_to_vercel_ui_message,
+    vercel_ui_messages_to_messages,
 )
 
 
@@ -37,17 +38,17 @@ def _run(events: List[Dict[str, Any]], result: Dict[str, Any]) -> AgentRun:
 
 
 async def _collect(run: AgentRun, **kwargs) -> List[Dict[str, Any]]:
-    return [part async for part in ui_message_stream(run, **kwargs)]
+    return [part async for part in agent_run_to_vercel_parts(run, **kwargs)]
 
 
 # ---------------------------------------------------------------------------
-# from_ui_messages
+# vercel_ui_messages_to_messages
 # ---------------------------------------------------------------------------
 
 
 class TestFromUIMessages:
     def test_all_text_message_collapses_to_string(self):
-        msgs = from_ui_messages(
+        msgs = vercel_ui_messages_to_messages(
             [{"id": "m1", "role": "user", "parts": [{"type": "text", "text": "hi"}]}]
         )
         assert len(msgs) == 1
@@ -55,7 +56,7 @@ def test_all_text_message_collapses_to_string(self):
         assert msgs[0].content == "hi"
 
     def test_file_part_becomes_image_or_resource_block(self):
-        msgs = from_ui_messages(
+        msgs = vercel_ui_messages_to_messages(
             [
                 {
                     "id": "m1",
@@ -75,7 +76,7 @@ def test_file_part_becomes_image_or_resource_block(self):
     def test_tool_part_is_preserved_as_structured_blocks(self):
         # A resolved tool part -> a tool_call block plus a tool_result block, keyed by
         # toolCallId, with the field names the runner transcript renders.
-        msgs = from_ui_messages(
+        msgs = vercel_ui_messages_to_messages(
             [
                 {
                     "id": "m2",
@@ -110,7 +111,7 @@ def test_tool_part_is_preserved_as_structured_blocks(self):
         ]
 
     def test_tool_error_part_sets_is_error(self):
-        msgs = from_ui_messages(
+        msgs = vercel_ui_messages_to_messages(
             [
                 {
                     "id": "m2",
@@ -134,7 +135,7 @@ def test_tool_error_part_sets_is_error(self):
 
     def test_approval_response_becomes_tool_result_keyed_by_call_id(self):
         # The cross-turn HITL reply: a tool_result keyed by toolCallId so the runtime resumes.
-        msgs = from_ui_messages(
+        msgs = vercel_ui_messages_to_messages(
             [
                 {
                     "id": "m3",
@@ -156,7 +157,7 @@ def test_approval_response_becomes_tool_result_keyed_by_call_id(self):
 
     def test_approval_request_part_is_dropped_on_replay(self):
         # The server's own request, echoed back; regenerated on replay, not model input.
-        msgs = from_ui_messages(
+        msgs = vercel_ui_messages_to_messages(
             [
                 {
                     "id": "m4",
@@ -172,18 +173,18 @@ def test_approval_request_part_is_dropped_on_replay(self):
 
     def test_plain_role_content_message_still_parses(self):
         # A non-parts {role, content} message in a mixed history falls back cleanly.
-        msgs = from_ui_messages([{"role": "user", "content": "hello"}])
+        msgs = vercel_ui_messages_to_messages([{"role": "user", "content": "hello"}])
         assert msgs[0].content == "hello"
 
 
 # ---------------------------------------------------------------------------
-# to_ui_message
+# message_to_vercel_ui_message
 # ---------------------------------------------------------------------------
 
 
 class TestToUIMessage:
     def test_agent_result_becomes_assistant_text_message(self):
-        ui = to_ui_message(AgentResult(output="Paris."), message_id="m9")
+        ui = message_to_vercel_ui_message(AgentResult(output="Paris."), message_id="m9")
         assert ui == {
             "id": "m9",
             "role": "assistant",
@@ -204,14 +205,14 @@ def test_message_with_tool_blocks_round_trips_to_parts(self):
                 ),
             ],
         )
-        ui = to_ui_message(msg)
+        ui = message_to_vercel_ui_message(msg)
         assert ui["role"] == "assistant"
         assert ui["parts"][0]["type"] == "tool-getWeather"
         assert ui["parts"][0]["toolCallId"] == "c1"
 
 
 # ---------------------------------------------------------------------------
-# ui_message_stream
+# agent_run_to_vercel_parts
 # ---------------------------------------------------------------------------
 
 
@@ -421,7 +422,7 @@ async def test_terminal_failure_emits_error_part_and_no_finish(self):
             {"kind": "result", "result": {"ok": False, "error": "kaboom"}},
         ]
         run = AgentRun(_from_list(records))
-        parts = [part async for part in ui_message_stream(run, session_id="s1")]
+        parts = [part async for part in agent_run_to_vercel_parts(run, session_id="s1")]
         types = [p["type"] for p in parts]
         assert types[0] == "start"
         assert "finish" not in types
diff --git a/sdks/python/oss/tests/pytest/utils/test_messages_endpoint.py b/sdks/python/oss/tests/pytest/utils/test_messages_endpoint.py
new file mode 100644
index 0000000000..e2c6cdfe71
--- /dev/null
+++ b/sdks/python/oss/tests/pytest/utils/test_messages_endpoint.py
@@ -0,0 +1,226 @@
+"""Tests for the agent ``/messages`` + ``/load-session`` endpoints.
+
+Two layers:
+
+- Direct unit tests of the two pure Vercel routing helpers (``resolve_session_id``,
+  ``inject_stream_session_id``).
+- HTTP tests over a Starlette ``TestClient`` driving the real ``route(flags={"is_agent":
+  True})`` wiring with a fake agent handler (no harness/runner). Registering on a bare
+  ``FastAPI`` app keeps the auth middleware out; a stand-in sets ``request.state.auth``. The
+  offline tracing mock (mirroring ``test_negotiation_integration``) lets ``wf.invoke`` run
+  without ``ag.init()``.
+"""
+
+from unittest.mock import MagicMock, patch
+
+import pytest
+from fastapi import FastAPI
+from fastapi.testclient import TestClient
+
+from agenta.sdk.agents.adapters.vercel.routing import (
+    inject_stream_session_id,
+    resolve_session_id,
+)
+from agenta.sdk.decorators.routing import route
+from agenta.sdk.models.workflows import (
+    WorkflowBatchResponse,
+    WorkflowServiceStatus,
+    WorkflowStreamingResponse,
+)
+
+
+# ---------------------------------------------------------------------------
+# Pure helpers
+# ---------------------------------------------------------------------------
+
+
+def test_resolve_session_id_mints_echoes_and_validates():
+    assert resolve_session_id("sess_ok") == "sess_ok"
+    assert resolve_session_id(None).startswith("sess_")
+    assert resolve_session_id("bad id!") is None  # space + '!' are out of charset
+    assert resolve_session_id("x" * 200) is None  # over the length bound
+
+
+@pytest.mark.asyncio
+async def test_inject_stream_session_id_stamps_first_start_part():
+    async def base():
+        yield {"type": "start", "messageId": "m1"}
+        yield {"type": "text-delta", "id": "t1", "delta": "x"}
+
+    resp = WorkflowStreamingResponse(generator=base)
+    inject_stream_session_id(resp, "sess_z")
+
+    parts = [p async for p in resp.iterator()]
+    assert parts[0]["messageMetadata"]["sessionId"] == "sess_z"
+    assert parts[1] == {"type": "text-delta", "id": "t1", "delta": "x"}
+
+
+# ---------------------------------------------------------------------------
+# HTTP wiring
+# ---------------------------------------------------------------------------
+
+
+_UI_MESSAGE = {"role": "user", "parts": [{"type": "text", "text": "hello"}]}
+
+
+def _build_client() -> TestClient:
+    app = FastAPI()
+
+    # Stand in for AuthMiddleware (omitted by using a bare app): the endpoints read
+    # ``request.state.auth``. No credentials needed — the fake handler runs locally.
+    @app.middleware("http")
+    async def _fake_auth(request, call_next):
+        request.state.auth = {}
+        return await call_next(request)
+
+    @route("/", app=app, flags={"is_agent": True})
+    async def agent(messages=None, inputs=None, parameters=None, stream=None):
+        if stream:
+
+            async def gen():
+                yield {"type": "start", "messageId": "m1"}
+                yield {"type": "text-start", "id": "t1"}
+                yield {"type": "text-delta", "id": "t1", "delta": "hi"}
+                yield {"type": "text-end", "id": "t1"}
+                yield {"type": "finish"}
+
+            return gen()
+        return {"role": "assistant", "content": "hi", "echoed": messages}
+
+    return TestClient(app)
+
+
+def _build_failing_client() -> TestClient:
+    app = FastAPI()
+
+    @app.middleware("http")
+    async def _fake_auth(request, call_next):
+        request.state.auth = {}
+        return await call_next(request)
+
+    @route("/", app=app, flags={"is_agent": True})
+    async def failing_agent(messages=None, inputs=None, parameters=None, stream=None):
+        return WorkflowBatchResponse(
+            status=WorkflowServiceStatus(
+                code=500,
+                message="tool resolution failed before stream",
+                type="https://agenta.ai/docs/errors#v1:sdk:tool-resolution-error",
+            )
+        )
+
+    return TestClient(app)
+
+
+@pytest.fixture()
+def client():
+    """A TestClient with the offline tracing mock active so ``wf.invoke`` runs without
+    ``ag.init()`` (same approach as ``test_negotiation_integration``)."""
+    with (
+        patch("agenta.sdk.decorators.tracing.ag") as mock_ag,
+        patch("agenta.sdk.decorators.running.ag") as mock_run_ag,
+    ):
+        mock_span = MagicMock()
+        mock_span.is_recording.return_value = False
+        mock_span.get_span_context.return_value = MagicMock(trace_id=0, span_id=0)
+        mock_ag.tracing = MagicMock()
+        mock_ag.tracing.get_current_span.return_value = mock_span
+        mock_ag.tracing.redact = None
+        mock_tracer = MagicMock()
+        mock_tracer.start_as_current_span.return_value.__enter__ = MagicMock(
+            return_value=mock_span
+        )
+        mock_tracer.start_as_current_span.return_value.__exit__ = MagicMock(
+            return_value=None
+        )
+        mock_ag.tracer = mock_tracer
+        mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE = MagicMock()
+        mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.api_key = None
+        yield _build_client()
+
+
+def test_messages_json_mints_session_and_folds_conversation(client):
+    res = client.post("/messages", json={"data": {"messages": [_UI_MESSAGE]}})
+    assert res.status_code == 200
+    body = res.json()
+    assert body["session_id"].startswith("sess_")
+    assert body["data"]["outputs"]["content"] == "hi"
+    # The Vercel UIMessage was folded to a neutral {role, content} message for the handler.
+    assert body["data"]["outputs"]["echoed"] == [{"role": "user", "content": "hello"}]
+
+
+def test_messages_echoes_supplied_session_id(client):
+    res = client.post(
+        "/messages",
+        json={"session_id": "sess_keep", "data": {"messages": [_UI_MESSAGE]}},
+    )
+    assert res.status_code == 200
+    assert res.json()["session_id"] == "sess_keep"
+
+
+def test_messages_sse_streams_with_done_and_session_in_start(client):
+    res = client.post(
+        "/messages",
+        headers={"accept": "text/event-stream"},
+        json={"session_id": "sess_abc", "data": {"messages": [_UI_MESSAGE]}},
+    )
+    assert res.status_code == 200
+    assert res.headers["x-vercel-ai-ui-message-stream"] == "v1"
+    text = res.text
+    assert '"sessionId": "sess_abc"' in text  # stamped onto the start part
+    assert '"type": "text-delta"' in text
+    assert "data: [DONE]" in text
+
+
+def test_messages_sse_preserves_json_error_before_stream():
+    with (
+        patch("agenta.sdk.decorators.tracing.ag") as mock_ag,
+        patch("agenta.sdk.decorators.running.ag") as mock_run_ag,
+    ):
+        mock_span = MagicMock()
+        mock_span.is_recording.return_value = False
+        mock_span.get_span_context.return_value = MagicMock(trace_id=0, span_id=0)
+        mock_ag.tracing = MagicMock()
+        mock_ag.tracing.get_current_span.return_value = mock_span
+        mock_ag.tracing.redact = None
+        mock_tracer = MagicMock()
+        mock_tracer.start_as_current_span.return_value.__enter__ = MagicMock(
+            return_value=mock_span
+        )
+        mock_tracer.start_as_current_span.return_value.__exit__ = MagicMock(
+            return_value=None
+        )
+        mock_ag.tracer = mock_tracer
+        mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE = MagicMock()
+        mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.api_key = None
+        client = _build_failing_client()
+
+        response = client.post(
+            "/messages",
+            headers={"accept": "text/event-stream"},
+            json={
+                "session_id": "sess_error",
+                "data": {"messages": [_UI_MESSAGE]},
+            },
+        )
+
+    assert response.status_code == 500
+    assert response.headers["content-type"].startswith("application/json")
+    assert "x-vercel-ai-ui-message-stream" not in response.headers
+    body = response.json()
+    assert body["status"]["code"] == 500
+    assert "tool resolution failed before stream" in body["status"]["message"]
+    assert body["session_id"] == "sess_error"
+    assert "[DONE]" not in response.text
+
+
+def test_messages_rejects_invalid_session_id(client):
+    res = client.post(
+        "/messages", json={"session_id": "bad id!", "data": {"messages": []}}
+    )
+    assert res.status_code == 400
+
+
+def test_load_session_returns_stub_history(client):
+    res = client.post("/load-session", json={"session_id": "sess_abc"})
+    assert res.status_code == 200
+    assert res.json() == {"session_id": "sess_abc", "messages": []}
diff --git a/sdks/python/oss/tests/pytest/utils/test_routing.py b/sdks/python/oss/tests/pytest/utils/test_routing.py
index 0bed6e7922..a1851e5907 100644
--- a/sdks/python/oss/tests/pytest/utils/test_routing.py
+++ b/sdks/python/oss/tests/pytest/utils/test_routing.py
@@ -7,6 +7,8 @@
 3. router= param   — issues DeprecationWarning, falls back to prefixed registration
 """
 
+import asyncio
+import json
 import warnings
 
 import pytest
@@ -15,10 +17,13 @@
 
 from agenta.sdk.decorators.routing import (
     _RESERVED_PATHS,
+    _make_stream_response,
     _validate_path,
     create_app,
     route,
 )
+from agenta.sdk.agents.adapters.vercel.sse import vercel_sse_stream
+from agenta.sdk.models.workflows import WorkflowStreamingResponse
 
 
 # ---------------------------------------------------------------------------
@@ -127,6 +132,49 @@ async def foo():
         assert "/foo/invoke" not in parent_schema.get("paths", {})
 
 
+# ---------------------------------------------------------------------------
+# 2b. Agent-only endpoints (/messages + /load-session), gated on is_agent
+# ---------------------------------------------------------------------------
+
+
+class TestAgentEndpoints:
+    def test_is_agent_sub_app_has_messages_and_load_session(self):
+        app = create_app()
+
+        @route("/chat", app=app, flags={"is_agent": True})
+        async def chat():
+            return {"role": "assistant", "content": "hi"}
+
+        schema = _mounts(app)["/chat"].app.openapi()
+        assert "/messages" in schema["paths"]
+        assert "/load-session" in schema["paths"]
+        assert "/invoke" in schema["paths"]  # the base routes are still present
+
+    def test_non_agent_route_has_no_agent_endpoints(self):
+        app = create_app()
+
+        @route("/qa", app=app)
+        async def qa():
+            return "answer"
+
+        schema = _mounts(app)["/qa"].app.openapi()
+        assert "/messages" not in schema["paths"]
+        assert "/load-session" not in schema["paths"]
+
+    def test_root_agent_route_registers_on_mount_root(self):
+        # The agent app uses route("/", app=app, flags={"is_agent": True}); the endpoints
+        # land on the app itself, not a mounted sub-app.
+        app = create_app()
+
+        @route("/", app=app, flags={"is_agent": True})
+        async def agent():
+            return {"role": "assistant", "content": "hi"}
+
+        schema = app.openapi()
+        assert "/messages" in schema["paths"]
+        assert "/load-session" in schema["paths"]
+
+
 # ---------------------------------------------------------------------------
 # 3. router= deprecation warning
 # ---------------------------------------------------------------------------
@@ -179,3 +227,76 @@ async def noisy():
         mounts_after = set(_mount_paths(default_app))
         # No new mounts should have appeared on default_app
         assert mounts_after == mounts_before
+
+
+# ---------------------------------------------------------------------------
+# 4. Reserved agent paths (/messages, /load-session)
+# ---------------------------------------------------------------------------
+
+
+class TestReservedAgentPaths:
+    def test_agent_endpoint_names_are_reserved(self):
+        assert {"messages", "load-session"} <= _RESERVED_PATHS
+
+    @pytest.mark.parametrize("reserved", ["messages", "load-session"])
+    def test_route_rejects_reserved_agent_path(self, reserved):
+        with pytest.raises(ValueError, match=reserved):
+            route(f"/{reserved}")
+
+
+# ---------------------------------------------------------------------------
+# 5. Vercel UI Message Stream framing
+# ---------------------------------------------------------------------------
+
+
+async def _collect(aiter):
+    return [chunk async for chunk in aiter]
+
+
+def _sse_payload(chunk: str) -> str:
+    """The JSON body of one `data: <json>\\n\\n` SSE event."""
+    assert chunk.startswith("data: ") and chunk.endswith("\n\n")
+    return chunk[len("data: ") : -2]
+
+
+class TestVercelUIMessageStream:
+    def test_framing_wraps_each_part_and_appends_done(self):
+        async def parts():
+            yield {"type": "start", "messageMetadata": {"sessionId": "sess_1"}}
+            yield {"type": "text-delta", "id": "t1", "delta": "hi"}
+            yield {"type": "finish"}
+
+        chunks = asyncio.run(_collect(vercel_sse_stream(parts())))
+
+        # one SSE event per part, plus the terminal [DONE]
+        assert len(chunks) == 4
+        assert json.loads(_sse_payload(chunks[0])) == {
+            "type": "start",
+            "messageMetadata": {"sessionId": "sess_1"},
+        }
+        assert json.loads(_sse_payload(chunks[1])) == {
+            "type": "text-delta",
+            "id": "t1",
+            "delta": "hi",
+        }
+        assert chunks[-1] == "data: [DONE]\n\n"
+
+    def test_done_is_emitted_for_an_empty_stream(self):
+        async def parts():
+            return
+            yield  # pragma: no cover — makes this an async generator
+
+        chunks = asyncio.run(_collect(vercel_sse_stream(parts())))
+        assert chunks == ["data: [DONE]\n\n"]
+
+    def test_make_stream_response_vercel_sets_headers_and_media_type(self):
+        async def parts():
+            yield {"type": "start"}
+
+        response = WorkflowStreamingResponse(generator=lambda: parts())
+        res = _make_stream_response(response, "vercel")
+
+        assert res.media_type == "text/event-stream"
+        assert res.headers["x-vercel-ai-ui-message-stream"] == "v1"
+        assert res.headers["cache-control"] == "no-cache"
+        assert res.headers["x-accel-buffering"] == "no"
diff --git a/services/agent/README.md b/services/agent/README.md
index 8c8e8f949d..4a96328479 100644
--- a/services/agent/README.md
+++ b/services/agent/README.md
@@ -84,7 +84,7 @@ live workflow span.
 Tools are resolved in the Python backend and arrive on the request as `customTools` plus a
 `toolCallback`. Delivery is capability-routed: the Pi extension registers them natively;
 other harnesses get them over MCP through `tools/mcp-bridge.ts` + `tools/mcp-server.ts`.
-Either way each call POSTs back to Agenta's `/tools/call` (`tools/client.ts`), so the
+Either way each call POSTs back to Agenta's `/tools/call` (`tools/callback.ts`), so the
 provider key and connection auth stay server-side.
 
 ## The extension bundle
diff --git a/services/agent/src/engines/pi.ts b/services/agent/src/engines/pi.ts
index 91b63e0a24..cd1ce93505 100644
--- a/services/agent/src/engines/pi.ts
+++ b/services/agent/src/engines/pi.ts
@@ -43,7 +43,8 @@ import {
   type ToolCallbackContext,
   resolvePromptText,
 } from "../protocol.ts";
-import { EMPTY_OBJECT_SCHEMA, callAgentaTool } from "../tools/client.ts";
+import { EMPTY_OBJECT_SCHEMA } from "../tools/callback.ts";
+import { runResolvedTool } from "../tools/dispatch.ts";
 
 /** What the in-process Pi engine supports. Static (no daemon to probe, unlike rivet). */
 const PI_CAPABILITIES: HarnessCapabilities = {
@@ -140,43 +141,63 @@ function lastStopReason(messages: any[]): string | undefined {
 }
 
 /**
- * Turn resolved tool specs into Pi customTools. Each tool's `execute` does one POST back
- * through Agenta's /tools/call, so Pi runs the loop while the Composio key and connection
- * auth stay server-side. A failed call throws, which Pi turns into a tool-error result
- * (the loop continues) rather than a run failure.
+ * Turn resolved tool specs into Pi customTools, branching on the executor `kind`:
+ *  - `callback` (default): `execute` POSTs back through Agenta's /tools/call, so the Composio
+ *    key and connection auth stay server-side.
+ *  - `code`: `execute` runs the snippet in a sandbox subprocess with its scoped secret env.
+ *  - `client`: browser-fulfilled, so skipped on the in-process path (no browser to answer).
+ *
+ * A failed `execute` throws, which Pi turns into a tool-error result (the loop continues)
+ * rather than a run failure. Pi accepts a plain JSON Schema for `parameters` (non-TypeBox path).
  */
 export function buildCustomTools(
   specs: ResolvedToolSpec[],
   callback: ToolCallbackContext | undefined,
 ): any[] {
-  if (specs.length === 0) return [];
-  if (!callback?.endpoint) {
-    log(`skipping ${specs.length} custom tool(s): missing toolCallback endpoint`);
-    return [];
+  const tools: any[] = [];
+  for (const spec of specs) {
+    const base = {
+      name: spec.name,
+      label: spec.name,
+      description: spec.description ?? spec.name,
+      parameters: (spec.inputSchema as any) ?? EMPTY_OBJECT_SCHEMA,
+    };
+    if (spec.kind === "client") {
+      log(`skipping client tool '${spec.name}' (browser-fulfilled; not available in-process)`);
+      continue;
+    }
+    if (spec.kind === "code") {
+      tools.push({
+        ...base,
+        async execute(toolCallId: string, params: unknown, signal?: AbortSignal) {
+          const text = await runResolvedTool(spec, params, { toolCallId, signal });
+          return { content: [{ type: "text", text }], details: { kind: "code" } };
+        },
+      });
+      continue;
+    }
+    // callback (default): route back to Agenta's /tools/call.
+    if (!callback?.endpoint) {
+      log(`skipping callback tool '${spec.name}': missing toolCallback endpoint`);
+      continue;
+    }
+    tools.push({
+      ...base,
+      async execute(toolCallId: string, params: unknown, signal?: AbortSignal) {
+        const text = await runResolvedTool(spec, params, {
+          toolCallId,
+          endpoint: callback.endpoint,
+          authorization: callback.authorization,
+          signal,
+        });
+        return {
+          content: [{ type: "text", text }],
+          details: { callRef: spec.callRef },
+        };
+      },
+    });
   }
-
-  return specs.map((spec) => ({
-    name: spec.name,
-    label: spec.name,
-    description: spec.description ?? spec.name,
-    // Pi accepts a plain JSON Schema for `parameters` (its validator has a non-TypeBox
-    // path); the schema is resolved live from the provider catalog.
-    parameters: (spec.inputSchema as any) ?? EMPTY_OBJECT_SCHEMA,
-    async execute(toolCallId: string, params: unknown, signal?: AbortSignal) {
-      const text = await callAgentaTool(
-        callback.endpoint,
-        callback.authorization,
-        spec.callRef,
-        toolCallId,
-        params,
-        signal,
-      );
-      return {
-        content: [{ type: "text", text }],
-        details: { callRef: spec.callRef },
-      };
-    },
-  }));
+  return tools;
 }
 
 export async function runPi(
@@ -261,101 +282,114 @@ export async function runPi(
       log(`custom tools: ${customTools.map((t) => t.name).join(", ")}`);
     }
 
-    const { session } = await createAgentSession({
-      cwd,
-      model,
-      authStorage,
-      modelRegistry,
-      tools: toolAllowlist,
-      customTools,
-      sessionManager: SessionManager.inMemory(cwd),
-      settingsManager: SettingsManager.inMemory(),
-      resourceLoader: loader,
-    });
+    // Created before the prompt so a throw mid-run still flushes the partial trace and
+    // disposes the session (the inner finally below). Mirrors the rivet engine's pattern.
+    let session: Awaited<ReturnType<typeof createAgentSession>>["session"] | undefined;
+    try {
+      ({ session } = await createAgentSession({
+        cwd,
+        model,
+        authStorage,
+        modelRegistry,
+        tools: toolAllowlist,
+        customTools,
+        sessionManager: SessionManager.inMemory(cwd),
+        settingsManager: SettingsManager.inMemory(),
+        resourceLoader: loader,
+      }));
 
-    // Hand the session id + model to the extension so spans carry them.
-    otel.config.sessionId = session.sessionId;
-    otel.config.provider = model.provider;
-    otel.config.requestModel = model.id;
+      // Hand the session id + model to the extension so spans carry them.
+      otel.config.sessionId = session.sessionId;
+      otel.config.provider = model.provider;
+      otel.config.requestModel = model.id;
 
-    // Accumulate streamed text as the primary output channel. On the streaming path, flush
-    // each Pi `text_delta` as a `message_delta` live (Pi deltas are already pure, so they
-    // emit verbatim); the block opens on the first delta and closes after the run.
-    let streamed = "";
-    let piTextId: string | undefined;
-    session.subscribe((event: any) => {
-      if (
-        event.type === "message_update" &&
-        event.assistantMessageEvent?.type === "text_delta"
-      ) {
-        const delta = event.assistantMessageEvent.delta ?? "";
-        if (!delta) return;
-        streamed += delta;
-        if (emit) {
-          if (piTextId === undefined) {
-            piTextId = "msg-0";
-            emit({ type: "message_start", id: piTextId });
+      // Accumulate streamed text as the primary output channel. On the streaming path, flush
+      // each Pi `text_delta` as a `message_delta` live (Pi deltas are already pure, so they
+      // emit verbatim); the block opens on the first delta and closes after the run.
+      let streamed = "";
+      let piTextId: string | undefined;
+      session.subscribe((event: any) => {
+        if (
+          event.type === "message_update" &&
+          event.assistantMessageEvent?.type === "text_delta"
+        ) {
+          const delta = event.assistantMessageEvent.delta ?? "";
+          if (!delta) return;
+          streamed += delta;
+          if (emit) {
+            if (piTextId === undefined) {
+              piTextId = "msg-0";
+              emit({ type: "message_start", id: piTextId });
+            }
+            emit({ type: "message_delta", id: piTextId, delta });
           }
-          emit({ type: "message_delta", id: piTextId, delta });
         }
-      }
-    });
+      });
 
-    await session.prompt(prompt);
+      await session.prompt(prompt);
 
-    const output = streamed.trim() || extractAssistantText(session.messages);
-    const sessionId = session.sessionId;
-    const stopReason = lastStopReason(session.messages);
-    const usage = otel.usage();
-    session.dispose();
+      const output = streamed.trim() || extractAssistantText(session.messages);
+      const sessionId = session.sessionId;
+      const stopReason = lastStopReason(session.messages);
+      const usage = otel.usage();
 
-    // Ship this run's trace before the result is returned (and before the CLI process
-    // exits): invoke_agent has a remote parent, so the per-trace flush is what exports it.
-    await otel.flush();
+      // Ship this run's trace before the result is returned (and before the CLI process
+      // exits): invoke_agent has a remote parent, so the per-trace flush is what exports it.
+      await otel.flush();
 
-    // The structured stream is thinner here than on the rivet path: Pi's in-process tool
-    // events feed the trace spans, while the result-level event log carries the final
-    // message, usage, and stop reason (enough for the platform without double-plumbing).
-    //
-    // On the streaming path the events were flushed live via `emit`, so the result log stays
-    // empty; here we only close the open text block (or synthesize one when the text never
-    // streamed) and flush the tail usage/done events.
-    const events: AgentEvent[] = [];
-    const emitOrLog = (event: AgentEvent): void => {
-      if (emit) emit(event);
-      else events.push(event);
-    };
-    if (emit) {
-      if (piTextId !== undefined) {
-        emit({ type: "message_end", id: piTextId });
+      // The structured stream is thinner here than on the rivet path: Pi's in-process tool
+      // events feed the trace spans, while the result-level event log carries the final
+      // message, usage, and stop reason (enough for the platform without double-plumbing).
+      //
+      // On the streaming path the events were flushed live via `emit`, so the result log stays
+      // empty; here we only close the open text block (or synthesize one when the text never
+      // streamed) and flush the tail usage/done events.
+      const events: AgentEvent[] = [];
+      const emitOrLog = (event: AgentEvent): void => {
+        if (emit) emit(event);
+        else events.push(event);
+      };
+      if (emit) {
+        if (piTextId !== undefined) {
+          emit({ type: "message_end", id: piTextId });
+        } else if (output) {
+          emit({ type: "message_start", id: "msg-0" });
+          emit({ type: "message_delta", id: "msg-0", delta: output });
+          emit({ type: "message_end", id: "msg-0" });
+        }
       } else if (output) {
-        emit({ type: "message_start", id: "msg-0" });
-        emit({ type: "message_delta", id: "msg-0", delta: output });
-        emit({ type: "message_end", id: "msg-0" });
+        events.push({ type: "message", text: output });
       }
-    } else if (output) {
-      events.push({ type: "message", text: output });
-    }
-    if (usage.total > 0) emitOrLog({ type: "usage", ...usage });
-    emitOrLog({ type: "done", stopReason });
+      if (usage.total > 0) emitOrLog({ type: "usage", ...usage });
+      emitOrLog({ type: "done", stopReason });
 
-    const messages: ChatMessage[] = output
-      ? [{ role: "assistant", content: output }]
-      : [];
+      const messages: ChatMessage[] = output
+        ? [{ role: "assistant", content: output }]
+        : [];
 
-    return {
-      ok: true,
-      output,
-      messages,
-      events,
-      usage,
-      stopReason,
-      // `streamingDeltas` is only honest when a live sink carried the deltas end-to-end.
-      capabilities: { ...PI_CAPABILITIES, streamingDeltas: !!emit },
-      sessionId,
-      model: `${model.provider}/${model.id}`,
-      traceId: otel.config.traceId,
-    };
+      return {
+        ok: true,
+        output,
+        messages,
+        events,
+        usage,
+        stopReason,
+        // `streamingDeltas` is only honest when a live sink carried the deltas end-to-end.
+        capabilities: { ...PI_CAPABILITIES, streamingDeltas: !!emit },
+        sessionId,
+        model: `${model.provider}/${model.id}`,
+        traceId: otel.config.traceId,
+      };
+    } catch (err) {
+      // Flush the partial trace before the error propagates so a failed run is still
+      // observable (the happy-path flush above never ran). Best-effort: never mask `err`.
+      await otel.flush().catch(() => {});
+      throw err;
+    } finally {
+      // Pi keeps the in-memory session alive until disposed; release it on every exit
+      // (success or throw). Guarded for the case where createAgentSession itself threw.
+      session?.dispose();
+    }
   } finally {
     try {
       rmSync(cwd, { recursive: true, force: true });
diff --git a/services/agent/src/engines/rivet.ts b/services/agent/src/engines/rivet.ts
index 62f978a5c8..6437d7502c 100644
--- a/services/agent/src/engines/rivet.ts
+++ b/services/agent/src/engines/rivet.ts
@@ -45,14 +45,22 @@ import { local } from "sandbox-agent/local";
 import { daytona } from "sandbox-agent/daytona";
 
 import { createRivetOtel } from "../tracing/otel.ts";
-import { buildToolMcpServers } from "../tools/mcp-bridge.ts";
+import { buildToolMcpServers, type McpServerStdio } from "../tools/mcp-bridge.ts";
 import { startToolRelay } from "../tools/relay.ts";
+import {
+  PolicyResponder,
+  decisionToReply,
+  policyFromRequest,
+  type Responder,
+} from "../responder.ts";
 import {
   type AgentRunRequest,
   type AgentRunResult,
   type ChatMessage,
+  type ContentBlock,
   type EmitEvent,
   type HarnessCapabilities,
+  type McpServerConfig,
   type ResolvedToolSpec,
   type ToolCallbackContext,
   messageText,
@@ -252,19 +260,60 @@ function priorMessages(request: AgentRunRequest): ChatMessage[] {
   return lastMatch === -1 ? messages : messages.filter((_, i) => i !== lastMatch);
 }
 
+function safeJson(value: unknown): string {
+  if (value === undefined || value === null) return "";
+  try {
+    return typeof value === "string" ? value : JSON.stringify(value);
+  } catch {
+    return String(value);
+  }
+}
+
+/**
+ * Render one message for the replayed transcript, INCLUDING resolved tool turns. Under the
+ * cold model the harness rebuilds context from this text, and ACP prompt content blocks
+ * cannot carry tool calls/results — so a resolved interaction (an approved tool that ran, a
+ * client-fulfilled tool) is encoded here as text, letting the model resume from the result
+ * instead of re-asking. This is the cross-turn HITL continuation substrate: the `/messages`
+ * egress folds inbound UIMessage tool/approval parts into `tool_call` / `tool_result` content
+ * blocks, and they survive into the replay here. Plain string / text blocks pass through;
+ * image/resource blocks are summarized.
+ */
+export function messageTranscript(content: string | ContentBlock[] | undefined): string {
+  if (!content) return "";
+  if (typeof content === "string") return content;
+  const parts: string[] = [];
+  for (const block of content) {
+    if (!block) continue;
+    if (block.type === "text" && typeof block.text === "string") {
+      parts.push(block.text);
+    } else if (block.type === "tool_call") {
+      parts.push(`[called ${block.toolName ?? "tool"}(${safeJson(block.input)})]`);
+    } else if (block.type === "tool_result") {
+      const body = safeJson(block.output);
+      parts.push(`[${block.toolName ?? "tool"} ${block.isError ? "error" : "returned"}: ${body}]`);
+    } else if (block.type === "image") {
+      parts.push("[image]");
+    } else if (block.type === "resource") {
+      parts.push(block.uri ? `[resource: ${block.uri}]` : "[resource]");
+    }
+  }
+  return parts.filter(Boolean).join("\n");
+}
+
 /**
  * The text sent over ACP for this turn. Each invoke is a cold sandbox, so prior turns
  * are replayed as transcript context ahead of the latest user message — this is the
  * "persisted message history replayed" model, with the client/playground holding the
  * history. Capped by AGENTA_AGENT_HISTORY_MAX_CHARS so replay tokens stay bounded.
  */
-function buildTurnText(request: AgentRunRequest): string {
+export function buildTurnText(request: AgentRunRequest): string {
   const latest = resolvePrompt(request);
-  const history = priorMessages(request).filter((m) => messageText(m.content));
+  const history = priorMessages(request).filter((m) => messageTranscript(m.content));
   if (history.length === 0) return latest;
 
   const maxChars = Number(process.env.AGENTA_AGENT_HISTORY_MAX_CHARS ?? 24000);
-  let transcript = history.map((m) => `${m.role}: ${messageText(m.content)}`).join("\n");
+  let transcript = history.map((m) => `${m.role}: ${messageTranscript(m.content)}`).join("\n");
   if (transcript.length > maxChars) transcript = transcript.slice(-maxChars);
   return (
     `Conversation so far:\n${transcript}\n\n` +
@@ -272,6 +321,33 @@ function buildTurnText(request: AgentRunRequest): string {
   );
 }
 
+/**
+ * Convert user-declared MCP servers (already resolved server-side, secrets injected into
+ * `env`) into ACP stdio entries. Only `stdio` is delivered over ACP today; `http`/remote
+ * carries no auth on the wire by design and is skipped. The per-server `tools` allowlist is
+ * NOT enforced over ACP in v1 — the harness lists all of a server's tools — so it is dropped
+ * with a log rather than silently implying a filter that does not happen.
+ */
+export function toAcpMcpServers(servers: McpServerConfig[] | undefined): McpServerStdio[] {
+  const out: McpServerStdio[] = [];
+  for (const s of servers ?? []) {
+    if ((s.transport ?? "stdio") !== "stdio" || !s.command) {
+      log(`skipping non-stdio MCP server '${s?.name ?? "?"}' (remote transport deferred)`);
+      continue;
+    }
+    if (s.tools && s.tools.length > 0) {
+      log(`MCP server '${s.name}': per-server tool allowlist not enforced over ACP (v1)`);
+    }
+    out.push({
+      name: s.name,
+      command: s.command,
+      args: s.args ?? [],
+      env: Object.entries(s.env ?? {}).map(([name, value]) => ({ name, value: String(value) })),
+    });
+  }
+  return out;
+}
+
 /**
  * Pick the harness-specific model id for a requested name. Harnesses expose their own
  * ids (Pi: "openai-codex/gpt-5.5"; Claude: its own). Match exact, then by the id after
@@ -715,12 +791,25 @@ export async function runRivet(
     // forward MCP, Claude/Codex do).
     const capabilities = await probeCapabilities(sandbox, harness);
     const toolSpecs = (request.customTools as ResolvedToolSpec[]) ?? [];
+    const userMcpCount = request.mcpServers?.length ?? 0;
+    // MCP delivery is gated on `mcpTools`: pi-acp does not forward MCP, Claude/Codex do. The
+    // synthesized `agenta-tools` server (gateway/code tools) and the user-declared servers
+    // ride the same gate.
     const mcpServers =
       !isPi && capabilities.mcpTools
-        ? buildToolMcpServers(toolSpecs, request.toolCallback as ToolCallbackContext | undefined)
+        ? [
+            ...buildToolMcpServers(
+              toolSpecs,
+              request.toolCallback as ToolCallbackContext | undefined,
+            ),
+            ...toAcpMcpServers(request.mcpServers),
+          ]
         : [];
-    if (!isPi && toolSpecs.length > 0 && !capabilities.mcpTools) {
-      log(`harness '${harness}' lacks MCP tool support; ${toolSpecs.length} tool(s) not delivered`);
+    if (!isPi && (toolSpecs.length > 0 || userMcpCount > 0) && !capabilities.mcpTools) {
+      log(
+        `harness '${harness}' lacks MCP support; ${toolSpecs.length} tool(s) and ` +
+          `${userMcpCount} user MCP server(s) not delivered`,
+      );
     }
 
     const session = await sandbox.createSession({
@@ -759,19 +848,37 @@ export async function runRivet(
       if (update) run.handleUpdate(update);
     });
 
-    // Auto-approve permission requests so a permission-gating harness (e.g. Claude
-    // Code) does not block on tool use. Tools are backend-resolved and trusted; the run
-    // is headless so there is no human to prompt. The per-run `permissionPolicy` (or the
-    // AGENTA_RIVET_DENY_PERMISSIONS env) flips this to reject.
-    const denyPermissions =
-      request.permissionPolicy === "deny" ||
-      process.env.AGENTA_RIVET_DENY_PERMISSIONS === "true";
+    // Permission gating, behind the Responder seam. Pi never gates; a permission-gating
+    // harness (e.g. Claude) raises a request, which we (a) surface as an `interaction_request`
+    // event so the egress can project it (Vercel `tool-approval-request`) and the trace can
+    // record it, and (b) resolve via the responder. The headless `PolicyResponder` keeps the
+    // prior behavior: auto-allow trusted backend tools, or deny per `permissionPolicy` /
+    // AGENTA_RIVET_DENY_PERMISSIONS. A cross-turn responder (true HITL) slots in here later
+    // without touching the harness. Tools are backend-resolved and trusted; the run is headless.
+    const responder: Responder = new PolicyResponder(policyFromRequest(request.permissionPolicy));
     session.onPermissionRequest((req: any) => {
-      const replies: string[] = req?.availableReplies ?? [];
-      const reply = denyPermissions
-        ? "reject"
-        : replies.find((r) => r === "always") ?? replies.find((r) => r === "once") ?? "once";
-      if (req?.id) session.respondPermission(req.id, reply as any).catch(() => {});
+      const id = String(req?.id ?? "");
+      const availableReplies: string[] = req?.availableReplies ?? [];
+      run.emitEvent({
+        type: "interaction_request",
+        id, // ACP permission id -> Vercel approvalId
+        kind: "permission",
+        payload: {
+          // toolCallId of the gated tool, so the cross-turn approval reply correlates back to
+          // its tool call (and the #6 resume finds it). `toolCall` is the ACP ToolCallUpdate.
+          toolCallId: req?.toolCall?.toolCallId,
+          toolCall: req?.toolCall,
+          availableReplies,
+          options: req?.options,
+        },
+      });
+      void responder
+        .onPermission({ id, availableReplies, raw: req })
+        .then((decision) => {
+          if (!req?.id) return;
+          return session.respondPermission(req.id, decisionToReply(decision, availableReplies) as any);
+        })
+        .catch(() => {});
     });
 
     if (useToolRelay) {
diff --git a/services/agent/src/extensions/agenta.ts b/services/agent/src/extensions/agenta.ts
index e410b9d46e..5e50b47ceb 100644
--- a/services/agent/src/extensions/agenta.ts
+++ b/services/agent/src/extensions/agenta.ts
@@ -24,71 +24,19 @@
  * it (local, the docker sidecar, a Daytona snapshot). Default export is the Pi
  * ExtensionFactory.
  */
-import { existsSync, mkdirSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
+import { writeFileSync } from "node:fs";
 
 import type { ExtensionAPI } from "@earendil-works/pi-coding-agent";
 
 import { createAgentaOtel } from "../tracing/otel.ts";
 import type { ResolvedToolSpec } from "../protocol.ts";
-import { EMPTY_OBJECT_SCHEMA, callAgentaTool } from "../tools/client.ts";
-import {
-  RELAY_POLL_MS,
-  RELAY_REQ_SUFFIX,
-  RELAY_RES_SUFFIX,
-  RELAY_TIMEOUT_MS,
-  sanitizeRelayId,
-  sleep,
-  type RelayResponse,
-} from "../tools/relay.ts";
+import { EMPTY_OBJECT_SCHEMA } from "../tools/callback.ts";
+import { runResolvedTool } from "../tools/dispatch.ts";
 
 function log(message: string): void {
   process.stderr.write(`[agenta-pi-ext] ${message}\n`);
 }
 
-/**
- * Daytona tool call: the in-sandbox process can't reach Agenta, so write the request to a
- * file the runner watches and poll for the response it writes back (see tools/relay.ts).
- */
-async function relayToolCall(
-  dir: string,
-  callRef: string,
-  toolCallId: string,
-  params: unknown,
-  signal?: AbortSignal,
-): Promise<string> {
-  const id = sanitizeRelayId(toolCallId);
-  const reqPath = `${dir}/${id}${RELAY_REQ_SUFFIX}`;
-  const resPath = `${dir}/${id}${RELAY_RES_SUFFIX}`;
-  try {
-    mkdirSync(dir, { recursive: true });
-  } catch {
-    // The runner also creates it; a race here is harmless.
-  }
-  writeFileSync(reqPath, JSON.stringify({ callRef, toolCallId, args: params ?? {} }), "utf-8");
-
-  const deadline = Date.now() + RELAY_TIMEOUT_MS;
-  while (Date.now() < deadline) {
-    if (signal?.aborted) throw new Error("aborted");
-    if (existsSync(resPath)) {
-      const res = JSON.parse(readFileSync(resPath, "utf-8")) as RelayResponse;
-      try {
-        unlinkSync(reqPath);
-      } catch {
-        /* best-effort cleanup */
-      }
-      try {
-        unlinkSync(resPath);
-      } catch {
-        /* best-effort cleanup */
-      }
-      if (res.ok) return res.text ?? "";
-      throw new Error(res.error || `tool relay failed for ${callRef}`);
-    }
-    await sleep(RELAY_POLL_MS);
-  }
-  throw new Error(`tool relay timed out for ${callRef}`);
-}
-
 /** Register the resolved tools (from env) as Pi tools that call back to Agenta. */
 function registerTools(pi: ExtensionAPI): void {
   const raw = process.env.AGENTA_TOOL_SPECS;
@@ -107,7 +55,13 @@ function registerTools(pi: ExtensionAPI): void {
   // the runner via files in this dir. Unset for local runs (direct /tools/call).
   const relayDir = process.env.AGENTA_TOOL_RELAY_DIR;
 
+  let registered = 0;
   for (const spec of specs) {
+    // `client` tools are browser-fulfilled; there is no browser in the sandbox to answer.
+    if (spec.kind === "client") {
+      log(`skipping client tool '${spec.name}' (browser-fulfilled)`);
+      continue;
+    }
     pi.registerTool({
       name: spec.name,
       label: spec.name,
@@ -115,14 +69,24 @@ function registerTools(pi: ExtensionAPI): void {
       // Pi accepts plain JSON Schema here (non-TypeBox validation path).
       parameters: (spec.inputSchema as any) ?? EMPTY_OBJECT_SCHEMA,
       async execute(toolCallId: string, params: unknown, signal?: AbortSignal) {
-        const text = relayDir
-          ? await relayToolCall(relayDir, spec.callRef, toolCallId, params, signal)
-          : await callAgentaTool(endpoint, authorization, spec.callRef, toolCallId, params, signal);
-        return { content: [{ type: "text", text }], details: { callRef: spec.callRef } };
+        // `code` runs the snippet locally in the sandbox (no relay, even on Daytona); the
+        // callback path relays through the runner on Daytona, else POSTs to /tools/call.
+        const text = await runResolvedTool(spec, params, {
+          toolCallId,
+          endpoint,
+          authorization,
+          relayDir,
+          signal,
+        });
+        return {
+          content: [{ type: "text", text }],
+          details: spec.kind === "code" ? { kind: "code" } : { callRef: spec.callRef },
+        };
       },
     } as any);
+    registered += 1;
   }
-  log(`registered ${specs.length} tool(s) -> ${relayDir ? `relay ${relayDir}` : endpoint}`);
+  log(`registered ${registered} tool(s) -> ${relayDir ? `relay ${relayDir}` : endpoint}`);
 }
 
 /** The Pi ExtensionFactory: tools + (env-driven) tracing + usage writeback. */
diff --git a/services/agent/src/protocol.ts b/services/agent/src/protocol.ts
index 98a33ad876..086e0654b1 100644
--- a/services/agent/src/protocol.ts
+++ b/services/agent/src/protocol.ts
@@ -11,11 +11,20 @@
 
 /** One piece of a message. `text` is all the playground sends today; the rest is plumbed. */
 export interface ContentBlock {
-  type: "text" | "image" | "resource" | string;
+  type: "text" | "image" | "resource" | "tool_call" | "tool_result" | string;
   text?: string;
   data?: string;
   mimeType?: string;
   uri?: string;
+  // Tool-turn carriers, used for structured-message continuation (cross-turn HITL): a
+  // resolved tool call replays as a `tool_call` block plus a `tool_result` block so the
+  // model resumes from the result instead of re-asking. The `/messages` egress folds the
+  // inbound UIMessage tool/approval parts into these (it must not drop them).
+  toolCallId?: string;
+  toolName?: string;
+  input?: unknown;
+  output?: unknown;
+  isError?: boolean;
 }
 
 export interface ChatMessage {
@@ -38,15 +47,32 @@ export interface TraceContext {
 }
 
 /**
- * A runnable tool the backend already resolved from the agent config: name + description +
- * JSON-Schema params for the model, plus the `callRef` slug the execution bridge sends back
- * to Agenta's /tools/call. The Composio key and connection auth stay server-side.
+ * A runnable tool the backend already resolved from the agent config.
+ *
+ * Three orthogonal axes:
+ *  - `kind` (executor): how the runner fulfils a call. `callback` POSTs back through Agenta's
+ *    /tools/call (gateway tools; the Composio key stays server-side); `code` runs `code` in a
+ *    sandbox subprocess with `env` (resolved secrets, scoped to the subprocess); `client` is
+ *    fulfilled by the browser across a turn boundary. Absent = `callback` (back-compat).
+ *  - `needsApproval`: gate the call on a human yes/no (mechanics owned by the run-event layer).
+ *  - `render`: a generative-UI hint (see `RenderHint`).
+ *
+ * `callRef` is set for `callback` tools (the slug the bridge sends back to /tools/call);
+ * `runtime`/`code`/`env` for `code` tools. The Composio key and connection auth stay
+ * server-side.
  */
 export interface ResolvedToolSpec {
   name: string;
   description?: string;
   inputSchema?: Record<string, unknown> | null;
-  callRef: string;
+  /** Set for `callback` (gateway) tools only; absent for `code` / `client`. */
+  callRef?: string;
+  kind?: "callback" | "code" | "client";
+  runtime?: "python" | "node";
+  code?: string;
+  env?: Record<string, string>;
+  needsApproval?: boolean;
+  render?: RenderHint;
 }
 
 /** Where and how to route a tool call back through Agenta. */
@@ -55,6 +81,21 @@ export interface ToolCallbackContext {
   authorization?: string;
 }
 
+/**
+ * A user-declared MCP server attached to the run. `stdio` launches `command`/`args` with
+ * `env` (secret env already resolved server-side); `tools` is an optional allowlist (empty =
+ * all). Remote (`http`) carries no auth on the wire by design.
+ */
+export interface McpServerConfig {
+  name: string;
+  transport?: "stdio" | "http";
+  command?: string;
+  args?: string[];
+  env?: Record<string, string>;
+  url?: string;
+  tools?: string[];
+}
+
 /**
  * What a harness can do, probed from the runtime (rivet `AgentCapabilities`). The runner
  * branches on these flags instead of the harness name, and returns them in the result.
@@ -82,6 +123,18 @@ export interface HarnessCapabilities {
  * are emitted live on the streaming path; a consumer that sees the delta family for a block
  * never also sees a coalesced `message` for it (see `createRivetOtel.finish`).
  */
+/**
+ * A generative-UI hint stamped onto a tool's events so the frontend can render it. The
+ * tool-definition plan adds the matching `render?` field to `ResolvedToolSpec`; the runner
+ * copies it onto `tool_call` / `tool_result` so the egress can project it without a spec
+ * lookup. `component` is a prebuilt client component (no code execution); `source` ships
+ * code rendered in a sandbox; `spec` is a declarative UI tree (data, not code).
+ */
+export type RenderHint =
+  | { kind: "component"; component: string }
+  | { kind: "source"; runtime: "react" | "html"; source: string | string[] }
+  | { kind: "spec"; schema: string };
+
 export type AgentEvent =
   | { type: "message"; text: string }
   | { type: "thought"; text: string }
@@ -91,8 +144,29 @@ export type AgentEvent =
   | { type: "reasoning_start"; id: string }
   | { type: "reasoning_delta"; id: string; delta: string }
   | { type: "reasoning_end"; id: string }
-  | { type: "tool_call"; id?: string; name?: string; input?: unknown }
-  | { type: "tool_result"; id?: string; output?: string; isError?: boolean }
+  | { type: "tool_call"; id?: string; name?: string; input?: unknown; render?: RenderHint }
+  | {
+      type: "tool_result";
+      id?: string;
+      output?: string;
+      /** Structured output (object), used for generative UI; `output` stays the text form. */
+      data?: unknown;
+      isError?: boolean;
+      render?: RenderHint;
+    }
+  // A human-in-the-loop request the harness raised (ACP reverse-RPC). The egress projects
+  // it to a Vercel `tool-approval-request` (permission) or an input/data part (elicitation);
+  // the reply returns cross-turn in the next `/messages` message history, matched by `id`.
+  | {
+      type: "interaction_request";
+      id: string;
+      kind: "permission" | "input" | "client_tool";
+      payload?: unknown;
+    }
+  // One-way generative-UI payloads (not tied to a tool result). `data` -> Vercel `data-<name>`,
+  // `file` -> Vercel `file`.
+  | { type: "data"; name: string; data: unknown; transient?: boolean }
+  | { type: "file"; url: string; mediaType: string }
   | { type: "usage"; input?: number; output?: number; total?: number; cost?: number }
   | { type: "error"; message: string }
   | { type: "done"; stopReason?: string };
@@ -149,6 +223,8 @@ export interface AgentRunRequest {
   skills?: string[];
   /** Resolved runnable tools (WP-7). */
   customTools?: ResolvedToolSpec[];
+  /** User-declared MCP servers, resolved (secret env injected). Omitted when there are none. */
+  mcpServers?: McpServerConfig[];
   /** Where customTools route their calls back to. Required when customTools is set. */
   toolCallback?: ToolCallbackContext;
   /** How a permission-gating harness handles tool-use prompts: "auto" (default) | "deny". */
diff --git a/services/agent/src/responder.ts b/services/agent/src/responder.ts
new file mode 100644
index 0000000000..6af4132841
--- /dev/null
+++ b/services/agent/src/responder.ts
@@ -0,0 +1,77 @@
+/**
+ * The interaction responder seam.
+ *
+ * A harness (the ACP "Agent") does not only emit tool calls. It also raises typed
+ * reverse-RPC interaction requests that something must answer: permission gates today,
+ * elicitation (input) and client-side tools later. Today the rivet runner answered the
+ * permission gate inline with a hardcoded auto-approve. This module lifts that decision
+ * behind a `Responder` interface so it is pluggable:
+ *
+ *   - `PolicyResponder` is the headless answer (a fixed `auto` / `deny` policy, no human).
+ *     It reproduces the previous behavior exactly and is what `/invoke` uses.
+ *   - A cross-turn responder (the `/messages` HITL path) slots in here later: it surfaces the
+ *     request to the browser, ends the turn, and resolves on the next turn's reply. The
+ *     harness adapter does not change when the responder does.
+ *
+ * Resolution is modeled as `allow` / `deny`; the adapter maps that onto the harness's
+ * available ACP replies via `decisionToReply`.
+ */
+
+export type PermissionPolicy = "auto" | "deny";
+
+export type PermissionDecision = "allow" | "deny";
+
+/** A permission gate raised by the harness, normalized from the ACP request. */
+export interface PermissionRequest {
+  /** The ACP permission id; reused as the `interaction_request` event id for reply matching. */
+  id: string;
+  /** Replies the harness offers (e.g. "always" | "once" | "reject"). */
+  availableReplies: string[];
+  /** The original ACP request, for responders that want the tool-call detail. */
+  raw?: unknown;
+}
+
+/**
+ * Answers interaction requests the harness raises. Permission is the only kind wired today;
+ * `input` (elicitation) and `client_tool` are forward-looking and will extend this interface
+ * alongside the cross-turn responder.
+ */
+export interface Responder {
+  onPermission(request: PermissionRequest): Promise<PermissionDecision>;
+}
+
+/** Headless responder: a fixed policy, no human in the loop. */
+export class PolicyResponder implements Responder {
+  constructor(private readonly policy: PermissionPolicy) {}
+
+  async onPermission(_request: PermissionRequest): Promise<PermissionDecision> {
+    return this.policy === "deny" ? "deny" : "allow";
+  }
+}
+
+/**
+ * Resolve the permission policy with the same precedence as before: an explicit per-run
+ * `permissionPolicy: "deny"` or the `AGENTA_RIVET_DENY_PERMISSIONS` env flips to deny; the
+ * default is auto-allow, because backend-resolved tools are trusted and the run is headless.
+ */
+export function policyFromRequest(permissionPolicy?: string): PermissionPolicy {
+  if (permissionPolicy === "deny" || process.env.AGENTA_RIVET_DENY_PERMISSIONS === "true") {
+    return "deny";
+  }
+  return "auto";
+}
+
+/** Map an allow/deny decision onto the harness's available ACP replies. */
+export function decisionToReply(
+  decision: PermissionDecision,
+  availableReplies: string[],
+): string {
+  if (decision === "deny") {
+    return availableReplies.find((r) => r === "reject") ?? "reject";
+  }
+  return (
+    availableReplies.find((r) => r === "always") ??
+    availableReplies.find((r) => r === "once") ??
+    "once"
+  );
+}
diff --git a/services/agent/src/tools/client.ts b/services/agent/src/tools/callback.ts
similarity index 98%
rename from services/agent/src/tools/client.ts
rename to services/agent/src/tools/callback.ts
index db6a71538f..0f0bae533c 100644
--- a/services/agent/src/tools/client.ts
+++ b/services/agent/src/tools/callback.ts
@@ -1,5 +1,5 @@
 /**
- * Shared Agenta /tools/call client.
+ * Shared Agenta /tools/call callback transport.
  *
  * One implementation of the tool round-trip used by every delivery path:
  *  - engines/pi.ts buildCustomTools (in-process Pi customTools)
diff --git a/services/agent/src/tools/code.ts b/services/agent/src/tools/code.ts
new file mode 100644
index 0000000000..da8115c94a
--- /dev/null
+++ b/services/agent/src/tools/code.ts
@@ -0,0 +1,197 @@
+/**
+ * Code-tool executor: run a resolved `code` tool's snippet in the agent sandbox.
+ *
+ * A code tool ships a snippet (`code`) + a runtime (`python` | `node`) + a scoped `env` (the
+ * tool's declared vault secrets, resolved server-side). Unlike a `callback` tool, it never
+ * touches Agenta's /tools/call — it runs locally where the harness runs, which is exactly why
+ * its secrets are injected here as subprocess env (and nowhere else).
+ *
+ * Entry convention (same for both runtimes): the snippet defines a top-level `main`. A bare
+ * `def main(**inputs)` / `function main(inputs)` is found automatically; an explicit export
+ * (`module.exports.main` / `exports.main` / `module.exports = fn` in Node) is also accepted.
+ * Python calls `main(**inputs)` (keyword args from the tool input object); Node calls
+ * `main(inputs)` (the input object) and may return a promise. The return value is
+ * JSON-serialized and handed to the model as the tool result.
+ *
+ * Shared by every delivery path that runs code locally: engines/pi.ts (in-process Pi),
+ * extensions/agenta.ts (Pi under rivet), tools/mcp-server.ts (the MCP bridge for other
+ * harnesses).
+ */
+import { spawn } from "node:child_process";
+import { mkdtempSync, rmSync, writeFileSync } from "node:fs";
+import { tmpdir } from "node:os";
+import { join } from "node:path";
+
+/** Per-call budget for a code tool. Surfaced as a tool error on timeout. */
+export const CODE_TOOL_TIMEOUT_MS = Number(
+  process.env.AGENTA_AGENT_CODE_TOOL_TIMEOUT_MS ?? 30000,
+);
+
+// argv[1] is the snippet path (python `-c`/node `-e` put the first extra arg at argv[1]);
+// the tool input arrives as JSON on stdin. Both bootstraps evaluate the snippet in a fresh
+// scope and pick up a top-level `main` (a bare `def main`/`function main`), falling back to an
+// explicit export. Either way the contract is: define a callable `main`.
+const PY_BOOTSTRAP = `import sys, json
+_path = sys.argv[1]
+with open(_path) as _f:
+    _src = _f.read()
+_ns = {}
+exec(compile(_src, _path, "exec"), _ns)
+if not callable(_ns.get("main")):
+    sys.stderr.write("code tool must define a callable main(**inputs)")
+    sys.exit(1)
+_args = json.loads(sys.stdin.read() or "{}")
+_out = _ns["main"](**_args)
+sys.stdout.write(json.dumps(_out))
+`;
+
+// `require(path)` would only see CommonJS exports, so a bare top-level `function main` (which
+// exports nothing under CommonJS) would be invisible. Instead read the source and evaluate it
+// in a scope that captures a top-level `main`, while still honoring an explicit
+// `module.exports.main` / `exports.main` / `module.exports = fn`.
+const NODE_BOOTSTRAP = `const fs = require("fs");
+const path = process.argv[1];
+const src = fs.readFileSync(path, "utf8");
+const mod = { exports: {} };
+const factory = new Function(
+  "exports",
+  "require",
+  "module",
+  "__filename",
+  "__dirname",
+  src +
+    "\\n;return (typeof main !== 'undefined' ? main : (module.exports && (module.exports.main || module.exports.default)) || module.exports);",
+);
+const fn = factory(mod.exports, require, mod, path, require("path").dirname(path));
+if (typeof fn !== "function") {
+  process.stderr.write("code tool must define or export a callable main(inputs) function");
+  process.exit(1);
+}
+const args = JSON.parse(fs.readFileSync(0, "utf8") || "{}");
+Promise.resolve(fn(args))
+  .then((out) => process.stdout.write(JSON.stringify(out === undefined ? null : out)))
+  .catch((err) => { process.stderr.write(String((err && err.stack) || err)); process.exit(1); });
+`;
+
+export type CodeRuntime = "python" | "node";
+
+// The minimal set of host env vars a python3/node runtime needs to start. Deliberately
+// excludes everything secret-bearing or sidecar-specific: no AGENTA_*, no *_API_KEY /
+// *_TOKEN, no COMPOSIO_* / DAYTONA_*, no provider keys that the in-process Pi path writes
+// into process.env before a run. Only the tool's declared scoped `env` is layered on top.
+const BASE_ENV_ALLOWLIST = [
+  "PATH",
+  "HOME",
+  "LANG",
+  "LC_ALL",
+  "LC_CTYPE",
+  "TZ",
+  "TMPDIR",
+  "TEMP",
+  "TMP",
+  "NODE_PATH",
+  // Windows essentials, copied only when present.
+  "SystemRoot",
+  "ComSpec",
+];
+
+/** Build the child env from a minimal allowlist (copied only when set) plus scoped secrets. */
+function buildChildEnv(
+  env: Record<string, string> | undefined,
+): Record<string, string> {
+  const base: Record<string, string> = {};
+  for (const key of BASE_ENV_ALLOWLIST) {
+    const value = process.env[key];
+    if (value !== undefined) base[key] = value;
+  }
+  return { ...base, ...(env ?? {}) };
+}
+
+/**
+ * Run a code tool's snippet and return its JSON-serialized output as text. Throws on a
+ * non-zero exit, a timeout, or an abort; callers turn the throw into a tool-error result so
+ * the model loop continues.
+ */
+export async function runCodeTool(
+  runtime: CodeRuntime | undefined,
+  code: string,
+  env: Record<string, string> | undefined,
+  args: unknown,
+  signal?: AbortSignal,
+): Promise<string> {
+  const dir = mkdtempSync(join(tmpdir(), "agenta-code-"));
+  try {
+    const isNode = runtime === "node";
+    const scriptPath = join(dir, isNode ? "tool.js" : "tool.py");
+    writeFileSync(scriptPath, code ?? "", "utf8");
+
+    const command = isNode ? "node" : "python3";
+    const childArgs = isNode
+      ? ["-e", NODE_BOOTSTRAP, scriptPath]
+      : ["-c", PY_BOOTSTRAP, scriptPath];
+
+    return await new Promise<string>((resolve, reject) => {
+      const child = spawn(command, childArgs, {
+        // The child inherits ONLY a minimal startup allowlist (PATH, HOME, locale/temp, and
+        // Windows essentials when present) plus the tool's declared scoped secrets. It does
+        // NOT inherit the sidecar's process.env, so provider keys (OPENAI_API_KEY, etc.) that
+        // the in-process Pi path writes into process.env, AGENTA_* config, and other secret-
+        // bearing vars never reach an author-supplied snippet. Nothing is written to the
+        // agent-visible filesystem beyond the temp dir.
+        env: buildChildEnv(env),
+        stdio: ["pipe", "pipe", "pipe"],
+      });
+
+      let stdout = "";
+      let stderr = "";
+      let settled = false;
+      const finish = (fn: () => void) => {
+        if (settled) return;
+        settled = true;
+        clearTimeout(timer);
+        if (signal) signal.removeEventListener("abort", onAbort);
+        fn();
+      };
+
+      const timer = setTimeout(() => {
+        child.kill("SIGKILL");
+        finish(() =>
+          reject(new Error(`code tool timed out after ${CODE_TOOL_TIMEOUT_MS}ms`)),
+        );
+      }, CODE_TOOL_TIMEOUT_MS);
+
+      const onAbort = () => {
+        child.kill("SIGKILL");
+        finish(() => reject(new Error("aborted")));
+      };
+      if (signal) {
+        if (signal.aborted) onAbort();
+        else signal.addEventListener("abort", onAbort, { once: true });
+      }
+
+      child.stdout.on("data", (d) => (stdout += d));
+      child.stderr.on("data", (d) => (stderr += d));
+      child.on("error", (err) => finish(() => reject(err)));
+      child.on("close", (exitCode) =>
+        finish(() => {
+          if (exitCode === 0) resolve(stdout.trim());
+          else
+            reject(
+              new Error(
+                `code tool exited ${exitCode}: ${stderr.slice(0, 500) || "(no stderr)"}`,
+              ),
+            );
+        }),
+      );
+
+      child.stdin.write(JSON.stringify(args ?? {}));
+      child.stdin.end();
+    });
+  } finally {
+    try {
+      rmSync(dir, { recursive: true, force: true });
+    } catch {
+      // best-effort cleanup of the throwaway snippet dir
+    }
+  }
+}
diff --git a/services/agent/src/tools/dispatch.ts b/services/agent/src/tools/dispatch.ts
new file mode 100644
index 0000000000..f948501265
--- /dev/null
+++ b/services/agent/src/tools/dispatch.ts
@@ -0,0 +1,129 @@
+/**
+ * Shared tool dispatch: execute one backend-resolved tool, branching on its executor `kind`.
+ *
+ * The same "branch on spec.kind to run a resolved tool" logic was duplicated across every
+ * delivery path (engines/pi.ts in-process Pi, extensions/agenta.ts Pi-under-rivet,
+ * tools/mcp-server.ts the MCP bridge). This module owns that dispatch ONCE so a change to how
+ * a kind is executed is a one-line edit, not three. Each call site still keeps its OWN
+ * result-wrapping shape (Pi customTool details, the MCP `content` envelope) and its OWN
+ * advertise/skip behavior for `client` tools — only the execution itself is shared.
+ *
+ * The three executor kinds (see `ResolvedToolSpec`):
+ *  - `code`: run the snippet in a sandbox subprocess with its scoped secret `env`.
+ *  - `client`: browser-fulfilled across a turn boundary; never executed in-sandbox (throws).
+ *  - `callback` (default): POST back through Agenta's /tools/call so the Composio key and
+ *    connection auth stay server-side. On Daytona the in-sandbox process can't reach Agenta,
+ *    so the call is relayed through the runner via files (see tools/relay.ts) when `relayDir`
+ *    is set; otherwise it POSTs directly.
+ *
+ * `relayToolCall` lives here (not in extensions/agenta.ts) so this module is the single
+ * dispatch home with no import cycle back into a call site.
+ */
+import { existsSync, mkdirSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
+
+import type { ResolvedToolSpec } from "../protocol.ts";
+import { callAgentaTool } from "./callback.ts";
+import { runCodeTool } from "./code.ts";
+import {
+  RELAY_POLL_MS,
+  RELAY_REQ_SUFFIX,
+  RELAY_RES_SUFFIX,
+  RELAY_TIMEOUT_MS,
+  sanitizeRelayId,
+  sleep,
+  type RelayResponse,
+} from "./relay.ts";
+
+/** Options for executing a resolved tool. `endpoint`/`authorization`/`relayDir` only matter for callbacks. */
+export interface RunResolvedToolOpts {
+  /** Stable id for this tool call (used as the /tools/call id and the relay filename). */
+  toolCallId: string;
+  /** /tools/call endpoint for `callback` tools. */
+  endpoint?: string;
+  /** Authorization header for the callback. */
+  authorization?: string;
+  /** Daytona relay dir: when set, callback calls are relayed through the runner via files. */
+  relayDir?: string;
+  /** Caller cancellation, combined with the per-tool timeout. */
+  signal?: AbortSignal;
+}
+
+/**
+ * Daytona tool call: the in-sandbox process can't reach Agenta, so write the request to a
+ * file the runner watches and poll for the response it writes back (see tools/relay.ts).
+ */
+export async function relayToolCall(
+  dir: string,
+  callRef: string,
+  toolCallId: string,
+  params: unknown,
+  signal?: AbortSignal,
+): Promise<string> {
+  const id = sanitizeRelayId(toolCallId);
+  const reqPath = `${dir}/${id}${RELAY_REQ_SUFFIX}`;
+  const resPath = `${dir}/${id}${RELAY_RES_SUFFIX}`;
+  try {
+    mkdirSync(dir, { recursive: true });
+  } catch {
+    // The runner also creates it; a race here is harmless.
+  }
+  writeFileSync(reqPath, JSON.stringify({ callRef, toolCallId, args: params ?? {} }), "utf-8");
+
+  const deadline = Date.now() + RELAY_TIMEOUT_MS;
+  while (Date.now() < deadline) {
+    if (signal?.aborted) throw new Error("aborted");
+    if (existsSync(resPath)) {
+      const res = JSON.parse(readFileSync(resPath, "utf-8")) as RelayResponse;
+      try {
+        unlinkSync(reqPath);
+      } catch {
+        /* best-effort cleanup */
+      }
+      try {
+        unlinkSync(resPath);
+      } catch {
+        /* best-effort cleanup */
+      }
+      if (res.ok) return res.text ?? "";
+      throw new Error(res.error || `tool relay failed for ${callRef}`);
+    }
+    await sleep(RELAY_POLL_MS);
+  }
+  throw new Error(`tool relay timed out for ${callRef}`);
+}
+
+/**
+ * Execute one resolved tool and return its result text. Throws on failure; every call site
+ * turns the throw into a tool-error result so the model loop continues rather than crashing.
+ *
+ *  - `code`   → run the snippet locally (scoped secret env), no callback/relay.
+ *  - `client` → throw: browser-fulfilled, never executed in-sandbox.
+ *  - default/`callback` → relay through the runner when `opts.relayDir` is set (Daytona),
+ *    else POST directly to `opts.endpoint`.
+ */
+export async function runResolvedTool(
+  spec: ResolvedToolSpec,
+  params: unknown,
+  opts: RunResolvedToolOpts,
+): Promise<string> {
+  if (spec.kind === "code") {
+    return runCodeTool(spec.runtime, spec.code ?? "", spec.env, params, opts.signal);
+  }
+  if (spec.kind === "client") {
+    throw new Error(
+      `client tool '${spec.name}' is browser-fulfilled and cannot be executed in-sandbox`,
+    );
+  }
+  // callback (default): route back to Agenta's /tools/call (directly or via the Daytona relay).
+  if (opts.relayDir) {
+    return relayToolCall(opts.relayDir, spec.callRef ?? "", opts.toolCallId, params, opts.signal);
+  }
+  return callAgentaTool(
+    opts.endpoint ?? "",
+    opts.authorization,
+    spec.callRef ?? "",
+    opts.toolCallId,
+    params,
+    opts.signal,
+  );
+}
diff --git a/services/agent/src/tools/mcp-bridge.ts b/services/agent/src/tools/mcp-bridge.ts
index b83e71fe6a..eaf5683a4d 100644
--- a/services/agent/src/tools/mcp-bridge.ts
+++ b/services/agent/src/tools/mcp-bridge.ts
@@ -49,26 +49,53 @@ export interface McpServerStdio {
 
 /**
  * Build the ACP `mcpServers` list that exposes the resolved tools to the harness.
- * Empty when there are no tools or no callback (the no-tools path stays untouched).
+ *
+ * Attachment is decided per tool kind, not on the callback endpoint alone (see protocol.ts
+ * `ResolvedToolSpec.kind`; absent kind means `callback` for back-compat):
+ *  - `client` tools are browser-fulfilled and not advertised by this server (mcp-server.ts
+ *    filters them from tools/list), so they never justify attaching the bridge on their own.
+ *  - "Executable here" = non-client (`code` and `callback`). With zero executable specs we
+ *    return [] (the no-tools path stays untouched).
+ *  - `code` tools run locally in mcp-server.ts (runCodeTool) and need NO callback endpoint, so
+ *    we attach `agenta-tools` whenever there is at least one executable spec.
+ *  - Only `callback` tools require `callback.endpoint`. If callback tools are present but the
+ *    endpoint is missing, we do NOT drop the whole server (that would silently lose the `code`
+ *    tools too): we still attach it and warn, naming the callback tools whose `tools/call` will
+ *    fail. The endpoint/auth env entries are pushed only when the endpoint actually exists.
  */
 export function buildToolMcpServers(
   specs: ResolvedToolSpec[],
   callback: ToolCallbackContext | undefined,
 ): McpServerStdio[] {
   if (!specs || specs.length === 0) return [];
-  if (!callback?.endpoint) {
+
+  // Absent kind defaults to `callback` (back-compat); `client` is the only non-executable kind.
+  const executable = specs.filter((s) => (s.kind ?? "callback") !== "client");
+  if (executable.length === 0) return [];
+
+  // The callback subset is the only thing that needs the endpoint to function.
+  const callbackSpecs = executable.filter((s) => (s.kind ?? "callback") === "callback");
+  const hasEndpoint = Boolean(callback?.endpoint);
+
+  if (callbackSpecs.length > 0 && !hasEndpoint) {
+    const names = callbackSpecs.map((s) => s.name).join(", ");
     process.stderr.write(
-      `[tool-bridge] skipping ${specs.length} tool(s): missing toolCallback endpoint\n`,
+      `[tool-bridge] missing toolCallback endpoint: ${callbackSpecs.length} callback tool(s) ` +
+        `will fail (${names}); still attaching server for the other tool(s)\n`,
     );
-    return [];
   }
 
+  // Pass every executable spec; mcp-server.ts dispatches per kind (code runs locally, callback
+  // routes to the endpoint).
   const env: EnvVariable[] = [
-    { name: "AGENTA_TOOL_SPECS", value: JSON.stringify(specs) },
-    { name: "AGENTA_TOOL_CALLBACK_ENDPOINT", value: callback.endpoint },
+    { name: "AGENTA_TOOL_SPECS", value: JSON.stringify(executable) },
   ];
-  if (callback.authorization) {
-    env.push({ name: "AGENTA_TOOL_CALLBACK_AUTH", value: callback.authorization });
+  // Only carry the callback env when there is an endpoint to call back to.
+  if (hasEndpoint) {
+    env.push({ name: "AGENTA_TOOL_CALLBACK_ENDPOINT", value: callback!.endpoint });
+    if (callback!.authorization) {
+      env.push({ name: "AGENTA_TOOL_CALLBACK_AUTH", value: callback!.authorization });
+    }
   }
 
   const { command, args } = bridgeLauncher();
diff --git a/services/agent/src/tools/mcp-server.ts b/services/agent/src/tools/mcp-server.ts
index 09512bfa36..98a240c50e 100644
--- a/services/agent/src/tools/mcp-server.ts
+++ b/services/agent/src/tools/mcp-server.ts
@@ -16,8 +16,11 @@
  * initialize, tools/list, tools/call; ignores notifications. stdout carries protocol
  * messages only; logs go to stderr.
  */
+import { randomUUID } from "node:crypto";
+
 import type { ResolvedToolSpec } from "../protocol.ts";
-import { EMPTY_OBJECT_SCHEMA, callAgentaTool } from "./client.ts";
+import { EMPTY_OBJECT_SCHEMA } from "./callback.ts";
+import { runResolvedTool } from "./dispatch.ts";
 
 const SPECS: ResolvedToolSpec[] = JSON.parse(process.env.AGENTA_TOOL_SPECS ?? "[]");
 const ENDPOINT = process.env.AGENTA_TOOL_CALLBACK_ENDPOINT ?? "";
@@ -58,7 +61,8 @@ async function handle(message: any): Promise<unknown | undefined> {
       jsonrpc: "2.0",
       id,
       result: {
-        tools: SPECS.map((s) => ({
+        // `client` tools are browser-fulfilled, so this server does not advertise them.
+        tools: SPECS.filter((s) => s.kind !== "client").map((s) => ({
           name: s.name,
           description: s.description ?? s.name,
           inputSchema: (s.inputSchema as Record<string, unknown>) ?? EMPTY_OBJECT_SCHEMA,
@@ -74,13 +78,14 @@ async function handle(message: any): Promise<unknown | undefined> {
       return { jsonrpc: "2.0", id, error: { code: -32602, message: `unknown tool: ${name}` } };
     }
     try {
-      const text = await callAgentaTool(
-        ENDPOINT,
-        AUTH,
-        spec.callRef,
-        `tool-${Date.now()}`,
-        params?.arguments,
-      );
+      // `code` runs the snippet locally (scoped secret env); everything else routes back to
+      // Agenta's /tools/call. A unique id per call so two parallel calls in the same
+      // millisecond don't collide (Date.now() would).
+      const text = await runResolvedTool(spec, params?.arguments, {
+        toolCallId: randomUUID(),
+        endpoint: ENDPOINT,
+        authorization: AUTH,
+      });
       return { jsonrpc: "2.0", id, result: { content: [{ type: "text", text }] } };
     } catch (err) {
       // Surface as an MCP tool error (isError) so the model can recover, not a crash.
diff --git a/services/agent/src/tools/relay.ts b/services/agent/src/tools/relay.ts
index c182e4e7f9..952ff8893a 100644
--- a/services/agent/src/tools/relay.ts
+++ b/services/agent/src/tools/relay.ts
@@ -16,7 +16,7 @@
  * Local runs keep the direct path (the in-process / local-daemon extension reaches Agenta);
  * the relay is only wired when AGENTA_TOOL_RELAY_DIR is set (Daytona + Pi + tools).
  */
-import { callAgentaTool } from "./client.ts";
+import { callAgentaTool } from "./callback.ts";
 import type { ToolCallbackContext } from "../protocol.ts";
 
 export const RELAY_REQ_SUFFIX = ".req.json";
diff --git a/services/agent/src/tracing/otel.ts b/services/agent/src/tracing/otel.ts
index 6f3ddb2848..2815564a1d 100644
--- a/services/agent/src/tracing/otel.ts
+++ b/services/agent/src/tracing/otel.ts
@@ -684,6 +684,12 @@ export interface RivetOtel {
   start(input: { prompt?: string; messages?: any[]; sessionId?: string }): void;
   /** Feed one ACP `session/update` payload (the `update` object). */
   handleUpdate(update: any): void;
+  /**
+   * Record an event the ACP stream does not carry (e.g. an `interaction_request` raised via
+   * the permission callback). Routes through the same choke point as stream events, so it
+   * lands in both the live sink and the batch `events()` log in build order.
+   */
+  emitEvent(event: AgentEvent): void;
   /** End all open spans. Returns the accumulated assistant text. */
   finish(): string;
   /** Flush this run's trace to Agenta (invoke_agent has a remote parent). */
@@ -981,6 +987,7 @@ export function createRivetOtel(init: RivetOtelInit): RivetOtel {
   return {
     start,
     handleUpdate,
+    emitEvent: record,
     finish,
     flush: () => flushTrace(runTraceId),
     traceId: () => runTraceId,
diff --git a/services/agent/test/code-tool.test.ts b/services/agent/test/code-tool.test.ts
new file mode 100644
index 0000000000..0711f57b41
--- /dev/null
+++ b/services/agent/test/code-tool.test.ts
@@ -0,0 +1,92 @@
+/**
+ * Unit test for the code-tool executor (runCodeTool).
+ *
+ * Exercises both runtimes end-to-end through real subprocesses: a python tool, node tools
+ * written as a bare top-level `function main` (the F2 regression) and as an explicit
+ * `module.exports.main`, an async node `main`, the F3 env-isolation guarantee (provider keys
+ * do NOT leak in; declared scoped secrets DO), and the non-zero-exit reject path.
+ *
+ * Run: pnpm exec tsx test/code-tool.test.ts
+ */
+import assert from "node:assert/strict";
+
+import { runCodeTool } from "../src/tools/code.ts";
+
+// --- Python: bare `def main(**kw)` ------------------------------------------
+{
+  const code = 'def main(**kw):\n    return {"sum": kw.get("a", 0) + kw.get("b", 0)}\n';
+  const out = await runCodeTool("python", code, undefined, { a: 2, b: 3 });
+  assert.deepEqual(JSON.parse(out), { sum: 5 }, "python bare main returns the right JSON");
+}
+
+// --- Node: bare top-level `function main` (F2 regression) -------------------
+{
+  const code = "function main(inputs) { return { got: inputs }; }";
+  const out = await runCodeTool("node", code, undefined, { hello: "world" });
+  assert.deepEqual(
+    JSON.parse(out),
+    { got: { hello: "world" } },
+    "node bare function main executes and echoes the input",
+  );
+}
+
+// --- Node: explicit `module.exports.main` -----------------------------------
+{
+  const code = "module.exports.main = function (inputs) { return { via: 'exports', got: inputs }; };";
+  const out = await runCodeTool("node", code, undefined, { x: 1 });
+  assert.deepEqual(
+    JSON.parse(out),
+    { via: "exports", got: { x: 1 } },
+    "node module.exports.main works",
+  );
+}
+
+// --- Node: async `main` returning a Promise ---------------------------------
+{
+  const code =
+    "async function main(inputs) { await new Promise((r) => setTimeout(r, 5)); return { doubled: inputs.n * 2 }; }";
+  const out = await runCodeTool("node", code, undefined, { n: 21 });
+  assert.deepEqual(JSON.parse(out), { doubled: 42 }, "node async main resolves");
+}
+
+// --- F3: provider keys do NOT leak; scoped secrets DO -----------------------
+{
+  const hadKey = "OPENAI_API_KEY" in process.env;
+  const prevKey = process.env.OPENAI_API_KEY;
+  process.env.OPENAI_API_KEY = "leak-me-xyz";
+  try {
+    // The provider key sits in process.env but must not reach the snippet.
+    const leakCode = "function main() { return { key: process.env.OPENAI_API_KEY ?? 'absent' }; }";
+    const leakOut = await runCodeTool("node", leakCode, undefined, {});
+    assert.deepEqual(
+      JSON.parse(leakOut),
+      { key: "absent" },
+      "F3: OPENAI_API_KEY did NOT leak into the snippet env",
+    );
+
+    // A secret declared on the tool (passed via the scoped `env` arg) must be visible.
+    const scopedCode =
+      "function main() { return { secret: process.env.MY_TOOL_SECRET ?? 'absent' }; }";
+    const scopedOut = await runCodeTool("node", scopedCode, { MY_TOOL_SECRET: "ok" }, {});
+    assert.deepEqual(
+      JSON.parse(scopedOut),
+      { secret: "ok" },
+      "F3: scoped MY_TOOL_SECRET IS visible to the snippet",
+    );
+  } finally {
+    if (hadKey) process.env.OPENAI_API_KEY = prevKey;
+    else delete process.env.OPENAI_API_KEY;
+  }
+}
+
+// --- Non-zero exit / throw rejects ------------------------------------------
+{
+  const code = "function main() { throw new Error('boom'); }";
+  await assert.rejects(
+    () => runCodeTool("node", code, undefined, {}),
+    /boom|exited/,
+    "a throwing snippet rejects",
+  );
+}
+
+console.log("code-tool.test.ts: all assertions passed");
diff --git a/services/agent/test/continuation.test.ts b/services/agent/test/continuation.test.ts
new file mode 100644
index 0000000000..0d4af7948f
--- /dev/null
+++ b/services/agent/test/continuation.test.ts
@@ -0,0 +1,55 @@
+/**
+ * Unit tests for the cross-turn HITL continuation substrate.
+ *
+ * Under the cold model the harness rebuilds context from the replayed transcript, and ACP
+ * prompt content blocks cannot carry tool calls/results. So a resolved interaction (an
+ * approved tool that ran, a client-fulfilled tool) must survive into the replay as text.
+ * `messageTranscript` encodes tool turns; `buildTurnText` keeps them in the replayed history.
+ *
+ * Run: pnpm exec tsx test/continuation.test.ts
+ */
+import assert from "node:assert/strict";
+
+import { messageTranscript, buildTurnText } from "../src/engines/rivet.ts";
+import type { AgentRunRequest, ContentBlock } from "../src/protocol.ts";
+
+// --- messageTranscript -------------------------------------------------------
+assert.equal(messageTranscript("hello"), "hello");
+assert.equal(messageTranscript([{ type: "text", text: "a" }, { type: "text", text: "b" }]), "a\nb");
+assert.equal(
+  messageTranscript([{ type: "tool_call", toolName: "getWeather", input: { city: "Paris" } }]),
+  '[called getWeather({"city":"Paris"})]',
+);
+assert.equal(
+  messageTranscript([{ type: "tool_result", toolName: "getWeather", output: { temp: 24 } }]),
+  '[getWeather returned: {"temp":24}]',
+);
+assert.equal(
+  messageTranscript([{ type: "tool_result", toolName: "send", output: "boom", isError: true }]),
+  "[send error: boom]",
+);
+
+// --- buildTurnText keeps a resolved tool turn in the replay ------------------
+{
+  const req: AgentRunRequest = {
+    messages: [
+      { role: "user", content: "weather in Paris?" },
+      {
+        role: "assistant",
+        content: [{ type: "tool_call", toolName: "getWeather", input: { city: "Paris" } } as ContentBlock],
+      },
+      {
+        role: "tool",
+        content: [{ type: "tool_result", toolName: "getWeather", output: { temp: 24 } } as ContentBlock],
+      },
+      { role: "user", content: "and tomorrow?" },
+    ],
+  };
+  const text = buildTurnText(req);
+  assert.ok(text.includes("called getWeather"), "tool call survives replay");
+  assert.ok(text.includes("getWeather returned"), "tool result survives replay");
+  assert.ok(text.includes("and tomorrow?"), "latest user prompt is the live turn");
+  assert.ok(text.startsWith("Conversation so far:"), "transcript header present");
+}
+
+console.log("continuation.test.ts: all assertions passed");
diff --git a/services/agent/test/mcp-servers.test.ts b/services/agent/test/mcp-servers.test.ts
new file mode 100644
index 0000000000..97e821429f
--- /dev/null
+++ b/services/agent/test/mcp-servers.test.ts
@@ -0,0 +1,58 @@
+/**
+ * Unit tests for the user-declared MCP server conversion (Agent B's Slice 4, wired in rivet).
+ *
+ * Agent B's `resolve_mcp_servers` emits the McpServerConfig wire shape
+ * ({name,transport,command,args,env,url?,tools?}, env as a Record), pinned in the Python
+ * test_wire_contract. This covers the TS half: converting that to the ACP stdio entry the
+ * session consumes (env as a {name,value} list), skipping remote/http, and not enforcing the
+ * per-server tools allowlist over ACP in v1.
+ *
+ * Run: pnpm exec tsx test/mcp-servers.test.ts
+ */
+import assert from "node:assert/strict";
+
+import { toAcpMcpServers } from "../src/engines/rivet.ts";
+import type { McpServerConfig } from "../src/protocol.ts";
+
+assert.deepEqual(toAcpMcpServers(undefined), [], "undefined -> []");
+assert.deepEqual(toAcpMcpServers([]), [], "[] -> []");
+
+// stdio server: env Record -> ACP {name,value} list; defaults applied.
+{
+  const servers: McpServerConfig[] = [
+    {
+      name: "github",
+      transport: "stdio",
+      command: "npx",
+      args: ["-y", "@modelcontextprotocol/server-github"],
+      env: { GITHUB_PERSONAL_ACCESS_TOKEN: "ghp_x", LOG_LEVEL: "info" },
+      tools: ["create_issue"], // allowlist not enforced over ACP v1 (logged), server still delivered
+    },
+  ];
+  const out = toAcpMcpServers(servers);
+  assert.equal(out.length, 1);
+  assert.equal(out[0].name, "github");
+  assert.equal(out[0].command, "npx");
+  assert.deepEqual(out[0].args, ["-y", "@modelcontextprotocol/server-github"]);
+  assert.deepEqual(out[0].env, [
+    { name: "GITHUB_PERSONAL_ACCESS_TOKEN", value: "ghp_x" },
+    { name: "LOG_LEVEL", value: "info" },
+  ]);
+}
+
+// remote/http is skipped (no auth on the wire by design); stdio without command is skipped.
+{
+  const out = toAcpMcpServers([
+    { name: "remote", transport: "http", url: "https://example.com/mcp" },
+    { name: "broken", transport: "stdio" }, // no command
+  ]);
+  assert.deepEqual(out, [], "http + command-less stdio both skipped");
+}
+
+// missing env / args default to empty.
+{
+  const out = toAcpMcpServers([{ name: "fs", transport: "stdio", command: "mcp-fs" }]);
+  assert.deepEqual(out, [{ name: "fs", command: "mcp-fs", args: [], env: [] }]);
+}
+
+console.log("mcp-servers.test.ts: all assertions passed");
diff --git a/services/agent/test/responder.test.ts b/services/agent/test/responder.test.ts
new file mode 100644
index 0000000000..e06ae43e00
--- /dev/null
+++ b/services/agent/test/responder.test.ts
@@ -0,0 +1,84 @@
+/**
+ * Unit tests for the interaction responder seam and the otel `emitEvent` hook.
+ *
+ * Covers the behavior parity of the responder (it replaces the old inline auto-approve in
+ * rivet.ts) and that an out-of-stream event (an `interaction_request`) routed through
+ * `emitEvent` lands in both the live sink and the batch `events()` log. No harness, no
+ * network.
+ *
+ * Run: pnpm exec tsx test/responder.test.ts
+ */
+import assert from "node:assert/strict";
+
+import { createRivetOtel } from "../src/tracing/otel.ts";
+import type { AgentEvent } from "../src/protocol.ts";
+import {
+  PolicyResponder,
+  decisionToReply,
+  policyFromRequest,
+} from "../src/responder.ts";
+
+// --- policyFromRequest -------------------------------------------------------
+{
+  delete process.env.AGENTA_RIVET_DENY_PERMISSIONS;
+  assert.equal(policyFromRequest(undefined), "auto");
+  assert.equal(policyFromRequest("auto"), "auto");
+  assert.equal(policyFromRequest("deny"), "deny");
+
+  process.env.AGENTA_RIVET_DENY_PERMISSIONS = "true";
+  assert.equal(policyFromRequest(undefined), "deny", "env forces deny");
+  assert.equal(policyFromRequest("auto"), "deny", "env overrides auto");
+  delete process.env.AGENTA_RIVET_DENY_PERMISSIONS;
+}
+
+// --- decisionToReply (parity with the old inline mapping) --------------------
+{
+  assert.equal(decisionToReply("allow", ["always", "once", "reject"]), "always");
+  assert.equal(decisionToReply("allow", ["once", "reject"]), "once");
+  assert.equal(decisionToReply("allow", []), "once", "allow falls back to once");
+  assert.equal(decisionToReply("deny", ["always", "once", "reject"]), "reject");
+  assert.equal(decisionToReply("deny", []), "reject", "deny falls back to reject");
+}
+
+// --- PolicyResponder ---------------------------------------------------------
+{
+  const auto = new PolicyResponder("auto");
+  const deny = new PolicyResponder("deny");
+  const req = { id: "p1", availableReplies: ["once", "reject"] };
+  assert.equal(await auto.onPermission(req), "allow");
+  assert.equal(await deny.onPermission(req), "deny");
+}
+
+// --- emitEvent: streaming path (sink + batch) --------------------------------
+{
+  const emitted: AgentEvent[] = [];
+  const run = createRivetOtel({ harness: "claude", model: "anthropic/x", emit: (e) => emitted.push(e) });
+  run.start({ prompt: "hi" });
+  const interaction: AgentEvent = {
+    type: "interaction_request",
+    id: "p1",
+    kind: "permission",
+    payload: { availableReplies: ["once", "reject"] },
+  };
+  run.emitEvent(interaction);
+
+  const live = emitted.find((e) => e.type === "interaction_request");
+  assert.ok(live, "interaction_request flushed to the live sink");
+  assert.equal((live as any).id, "p1");
+  assert.ok(
+    run.events().some((e) => e.type === "interaction_request"),
+    "interaction_request also recorded in the batch log",
+  );
+}
+
+// --- emitEvent: one-shot path (batch only) -----------------------------------
+{
+  const run = createRivetOtel({ harness: "claude", model: "anthropic/x" });
+  run.start({ prompt: "hi" });
+  run.emitEvent({ type: "data", name: "weather", data: { temp: 24 } });
+  const ev = run.events().find((e) => e.type === "data");
+  assert.ok(ev, "data event recorded with no live sink");
+  assert.equal((ev as any).name, "weather");
+}
+
+console.log("responder.test.ts: all assertions passed");
diff --git a/services/agent/test/tool-bridge.test.ts b/services/agent/test/tool-bridge.test.ts
new file mode 100644
index 0000000000..1838177918
--- /dev/null
+++ b/services/agent/test/tool-bridge.test.ts
@@ -0,0 +1,151 @@
+/**
+ * Unit tests for buildToolMcpServers (the tool MCP bridge attachment decision).
+ *
+ * Regression cover for F4: attachment must be decided per tool kind, not on the callback
+ * endpoint alone. A `code` tool runs locally in mcp-server.ts and needs no endpoint, so a run
+ * whose tools are all `code` must still attach the `agenta-tools` server. Only `callback`-kind
+ * tools require AGENTA_TOOL_CALLBACK_ENDPOINT; missing it must degrade those tools, not drop the
+ * whole server. `client` tools are browser-fulfilled and never justify attaching the bridge.
+ *
+ * Run: pnpm exec tsx test/tool-bridge.test.ts
+ */
+import assert from "node:assert/strict";
+
+import { buildToolMcpServers } from "../src/tools/mcp-bridge.ts";
+import type { ResolvedToolSpec, ToolCallbackContext } from "../src/protocol.ts";
+
+/** Look up an env var value by name in the ACP {name,value} list (undefined if absent). */
+function envValue(
+  env: { name: string; value: string }[],
+  name: string,
+): string | undefined {
+  return env.find((e) => e.name === name)?.value;
+}
+
+// code-only specs + NO callback -> one server, with AGENTA_TOOL_SPECS but no endpoint (F4).
+{
+  const specs: ResolvedToolSpec[] = [
+    { name: "adder", kind: "code", runtime: "python", code: "def main(**k): return 1" },
+  ];
+  const out = buildToolMcpServers(specs, undefined);
+  assert.equal(out.length, 1, "code-only run still attaches the server");
+  assert.equal(out[0].name, "agenta-tools");
+  assert.ok(
+    envValue(out[0].env, "AGENTA_TOOL_SPECS") !== undefined,
+    "AGENTA_TOOL_SPECS is set",
+  );
+  assert.equal(
+    envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"),
+    undefined,
+    "no endpoint env for code-only run",
+  );
+  // The full executable spec list round-trips through AGENTA_TOOL_SPECS.
+  assert.deepEqual(JSON.parse(envValue(out[0].env, "AGENTA_TOOL_SPECS")!), specs);
+}
+
+// callback specs + a callback with endpoint -> one server carrying endpoint (+ auth).
+{
+  const specs: ResolvedToolSpec[] = [
+    { name: "search", kind: "callback", callRef: "composio.search" },
+  ];
+  const callback: ToolCallbackContext = {
+    endpoint: "https://agenta.example/tools/call",
+    authorization: "Bearer tok",
+  };
+  const out = buildToolMcpServers(specs, callback);
+  assert.equal(out.length, 1);
+  assert.equal(
+    envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"),
+    "https://agenta.example/tools/call",
+    "endpoint env set for callback tools",
+  );
+  assert.equal(
+    envValue(out[0].env, "AGENTA_TOOL_CALLBACK_AUTH"),
+    "Bearer tok",
+    "auth env set when provided",
+  );
+}
+
+// callback spec + endpoint but no authorization -> no AUTH env entry.
+{
+  const specs: ResolvedToolSpec[] = [
+    { name: "search", kind: "callback", callRef: "composio.search" },
+  ];
+  const out = buildToolMcpServers(specs, { endpoint: "https://agenta.example/tools/call" });
+  assert.equal(out.length, 1);
+  assert.equal(
+    envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"),
+    "https://agenta.example/tools/call",
+  );
+  assert.equal(
+    envValue(out[0].env, "AGENTA_TOOL_CALLBACK_AUTH"),
+    undefined,
+    "no AUTH env when authorization absent",
+  );
+}
+
+// absent kind defaults to callback (back-compat): endpoint still wired when present.
+{
+  const specs: ResolvedToolSpec[] = [{ name: "legacy", callRef: "composio.legacy" }];
+  const out = buildToolMcpServers(specs, { endpoint: "https://agenta.example/tools/call" });
+  assert.equal(out.length, 1, "back-compat (no kind) attaches as a callback tool");
+  assert.equal(
+    envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"),
+    "https://agenta.example/tools/call",
+  );
+}
+
+// mixed code+callback specs + NO endpoint -> still one server (so code works), endpoint omitted.
+{
+  const specs: ResolvedToolSpec[] = [
+    { name: "adder", kind: "code", runtime: "python", code: "def main(**k): return 1" },
+    { name: "search", kind: "callback", callRef: "composio.search" },
+  ];
+  const out = buildToolMcpServers(specs, undefined);
+  assert.notDeepEqual(out, [], "mixed run with no endpoint must not return []");
+  assert.equal(out.length, 1, "still attaches the server so the code tool works");
+  assert.equal(
+    envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"),
+    undefined,
+    "endpoint env omitted when missing",
+  );
+  // Both executable specs are still passed to the bridge.
+  assert.deepEqual(JSON.parse(envValue(out[0].env, "AGENTA_TOOL_SPECS")!), specs);
+}
+
+// empty specs -> [].
+assert.deepEqual(buildToolMcpServers([], undefined), [], "empty specs -> []");
+
+// client-only specs -> [] (no executable tools; the bridge does not advertise client tools).
+{
+  const specs: ResolvedToolSpec[] = [{ name: "confirm", kind: "client" }];
+  assert.deepEqual(
+    buildToolMcpServers(specs, undefined),
+    [],
+    "client-only -> [] (nothing executable here)",
+  );
+  // Even with an endpoint, client-only stays empty.
+  assert.deepEqual(
+    buildToolMcpServers(specs, { endpoint: "https://agenta.example/tools/call" }),
+    [],
+    "client-only -> [] even with an endpoint",
+  );
+}
+
+// client tools alongside an executable one are dropped from AGENTA_TOOL_SPECS, server attaches.
+{
+  const specs: ResolvedToolSpec[] = [
+    { name: "confirm", kind: "client" },
+    { name: "adder", kind: "code", runtime: "python", code: "def main(**k): return 1" },
+  ];
+  const out = buildToolMcpServers(specs, undefined);
+  assert.equal(out.length, 1, "executable spec attaches the server");
+  const passed: ResolvedToolSpec[] = JSON.parse(envValue(out[0].env, "AGENTA_TOOL_SPECS")!);
+  assert.deepEqual(
+    passed.map((s) => s.name),
+    ["adder"],
+    "client spec excluded from the executable list passed to the bridge",
+  );
+}
+
+console.log("tool-bridge.test.ts: all assertions passed");
diff --git a/services/agent/test/tool-dispatch.test.ts b/services/agent/test/tool-dispatch.test.ts
new file mode 100644
index 0000000000..8ec779d396
--- /dev/null
+++ b/services/agent/test/tool-dispatch.test.ts
@@ -0,0 +1,85 @@
+/**
+ * Unit tests for the shared tool-dispatch module (tools/dispatch.ts) and its routing.
+ *
+ * The kind-dispatch ("branch on spec.kind to execute a resolved tool") used to be duplicated
+ * across engines/pi.ts, extensions/agenta.ts, and tools/mcp-server.ts. It now lives once in
+ * `runResolvedTool`. These tests cover both the routing into that function and the call-site
+ * advertising behavior that stays per-site:
+ *  - buildCustomTools (pi.ts) skips `client` specs, builds a tool per `code`/`callback` spec,
+ *    and skips a `callback` spec with no callback endpoint.
+ *  - runResolvedTool runs a real `code` snippet end-to-end (python) and throws for `client`.
+ *
+ * No network and no harness: the `code` path shells out to python3 (available locally); the
+ * `callback`/relay paths are not exercised here (they need a live /tools/call or a relay dir).
+ *
+ * Run: pnpm exec tsx test/tool-dispatch.test.ts
+ */
+import assert from "node:assert/strict";
+
+import { buildCustomTools } from "../src/engines/pi.ts";
+import { runResolvedTool } from "../src/tools/dispatch.ts";
+import type { ResolvedToolSpec, ToolCallbackContext } from "../src/protocol.ts";
+
+const callback: ToolCallbackContext = { endpoint: "https://agenta.test/tools/call" };
+
+const clientSpec: ResolvedToolSpec = { name: "client_tool", kind: "client" };
+const codeSpec: ResolvedToolSpec = {
+  name: "code_tool",
+  kind: "code",
+  runtime: "python",
+  code: 'def main(**kw):\n    return {"echo": kw}\n',
+};
+const callbackSpec: ResolvedToolSpec = {
+  name: "callback_tool",
+  kind: "callback",
+  callRef: "composio.SOME_ACTION",
+};
+
+// --- buildCustomTools routing -----------------------------------------------
+{
+  const tools = buildCustomTools([clientSpec, codeSpec, callbackSpec], callback);
+  const names = tools.map((t) => t.name);
+
+  // `client` is browser-fulfilled, so it is never registered in-process.
+  assert.ok(!names.includes("client_tool"), "client spec is skipped");
+  // `code` and `callback` each produce exactly one tool with the spec's name.
+  assert.ok(names.includes("code_tool"), "code spec produces a tool");
+  assert.ok(names.includes("callback_tool"), "callback spec produces a tool");
+  assert.equal(tools.length, 2, "only the two executable specs produce tools");
+}
+
+// A `callback` spec with no callback endpoint is skipped (logged), but a sibling `code`
+// spec still registers (code never needs the endpoint).
+{
+  const tools = buildCustomTools([codeSpec, callbackSpec], undefined);
+  const names = tools.map((t) => t.name);
+  assert.ok(names.includes("code_tool"), "code spec still registers without an endpoint");
+  assert.ok(
+    !names.includes("callback_tool"),
+    "callback spec is skipped when no callback endpoint",
+  );
+  assert.equal(tools.length, 1, "only the code spec registers without an endpoint");
+}
+
+// --- runResolvedTool: code executes; client throws --------------------------
+{
+  const text = await runResolvedTool(codeSpec, { greeting: "hi", n: 3 }, {
+    toolCallId: "call-1",
+  });
+  const parsed = JSON.parse(text);
+  assert.deepEqual(
+    parsed,
+    { echo: { greeting: "hi", n: 3 } },
+    "code tool runs the snippet and returns its JSON output containing the input",
+  );
+}
+
+{
+  await assert.rejects(
+    () => runResolvedTool(clientSpec, {}, { toolCallId: "call-2" }),
+    /browser-fulfilled/,
+    "client tool throws (never executed in-sandbox)",
+  );
+}
+
+console.log("tool-dispatch.test.ts: all assertions passed");
diff --git a/services/oss/src/agent/app.py b/services/oss/src/agent/app.py
index fe3e164788..6409065d55 100644
--- a/services/oss/src/agent/app.py
+++ b/services/oss/src/agent/app.py
@@ -27,8 +27,8 @@
     SessionConfig,
     make_harness,
     to_messages,
-    ui_message_stream,
 )
+from agenta.sdk.agents.adapters.vercel import agent_run_to_vercel_parts
 
 from oss.src.agent.config import load_config, wrapper_dir
 from oss.src.agent.schemas import AGENT_SCHEMAS
@@ -113,7 +113,7 @@ async def _agent(
     # generator (the normalizer turns it into a streaming response). `/invoke` and the
     # `/messages` JSON path leave it unset and take the batch path below.
     if stream:
-        return _agent_stream(harness, session_config, msgs)
+        return _agent_vercel_stream(harness, session_config, msgs)
 
     await harness.setup()
     try:
@@ -125,7 +125,7 @@ async def _agent(
     return {"role": "assistant", "content": result.output}
 
 
-async def _agent_stream(harness, session_config, msgs):
+async def _agent_vercel_stream(harness, session_config, msgs):
     """Run one streaming turn and yield Vercel UI Message Stream parts.
 
     Owns the environment lifecycle (``setup`` / ``cleanup``); the per-turn session is torn
@@ -135,7 +135,7 @@ async def _agent_stream(harness, session_config, msgs):
     await harness.setup()
     try:
         run = await harness.stream(session_config, msgs)
-        async for part in ui_message_stream(run):
+        async for part in agent_run_to_vercel_parts(run):
             yield part
         try:
             record_usage(run.result().usage)