diff --git a/docs/design/agent-workflows/open-issues.md b/docs/design/agent-workflows/open-issues.md new file mode 100644 index 0000000000..5cd683202a --- /dev/null +++ b/docs/design/agent-workflows/open-issues.md @@ -0,0 +1,93 @@ +# Open issues + +Deferred TODOs and open questions for the agent-workflows project. Each entry carries enough +context and provenance to act on cold. See the `defer-todo` skill for the format. + +## Open issues + +### Supply secret values to tools during a standalone run + +**Status:** open +**Added:** 2026-06-19 +**Commit:** 6a812efb95 (branch `gitbutler/workspace`) +**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/) +**Source:** sdk-local-tools design review session (answering reviewer comments on plan.md, Decision 3) + +**The problem.** An agent's `code` tool declares the secrets it needs by name, for example +`secrets: ["GITHUB_TOKEN"]`. An MCP server declares an env-var-to-name map. The config stores +only the name, never the value. At run time something must turn the name into a value and +inject it as an env var, into the sandbox subprocess for a code tool or into the server +process for MCP. On the Agenta server path that consumer is meant to read the `custom_secret` +entries from the project vault by name, but it is not built yet. Custom secrets are +storage-only this iteration by design, so today a declared secret resolves to nothing and the +tool runs without it. For a standalone run the goal is different: do not depend on the Agenta +vault at all. The trouble is the SDK has no secret resolution of any kind today. Resolution +lives only in the service. So a standalone agent has no way to supply `GITHUB_TOKEN` to its +code tool. + +**Why it is deferred.** The first slice is offline and narrow (built-in plus code tools with +env secrets). Env alone closes the gap for that slice, so the wider question of where secret +values come from does not block it. The vault path also depends on work another effort owns +(see below), so it cannot land here yet. + +**What to decide or do.** Pick the local source of secret values for a standalone run. Three +options, and they are not exclusive. + +1. Env, the offline default. Read the declared name straight from the process environment. + `GITHUB_TOKEN` comes from `os.environ`. Simple, offline, and enough for the first slice. +2. A pluggable `SecretResolver` interface. The user implements it. Env is the built-in + default, but they can back it with a `.env` file, a secret manager, or their own vault. A + small interface for a lot of flexibility. +3. The Agenta vault over HTTP. It reads `custom_secret` entries by name. This needs the + future runtime consumer endpoint to be built, and it is the connected-standalone path + only. + +The lean: ship option 1 as the default and option 2 as the interface it plugs into, both in +the first slice. Treat option 3 as later work tied to the named-secrets effort building the +consumer. See [./sdk-local-tools/plan.md](./sdk-local-tools/plan.md) (Decision 3 and Phase +4), [./sdk-local-tools/research.md](./sdk-local-tools/research.md) (stage 3), and +[../vault-named-secrets/](../vault-named-secrets/). + +### Batch the two vault round-trips on the agent invoke path + +**Status:** open +**Added:** 2026-06-19 +**Commit:** 6a812efb95 (branch `gitbutler/workspace`) +**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/) +**Source:** xhigh code review of the sdk-local-tools first slice + +**The problem.** The service resolves a code tool's named secrets inside `resolve_tools` and an +MCP server's named secrets inside `resolve_mcp_servers`. Each one builds its own +`_VaultSecretResolver` and makes its own `POST /secrets/resolve`. When a config has both a code +tool and an enabled MCP server that each declare secrets, a cold invoke makes two sequential +vault round-trips where one batched call over the union of names would do. The two functions +are also awaited one after another in `app.py`, not concurrently. + +**Why it is deferred.** MCP is flag-gated off this release, so the second round-trip does not +happen on the default path yet. The cost only appears once MCP turns on. Fixing it now would be +optimizing a path no one runs. + +**What to decide or do.** When MCP comes off the flag (sdk-local-tools Phase 5), resolve the +union of code-tool and MCP secret names in one vault call, and consider running the independent +resolve steps in `app.py` concurrently. + +### Give the resolved-tool shape one source of truth + +**Status:** open +**Added:** 2026-06-19 +**Commit:** 6a812efb95 (branch `gitbutler/workspace`) +**Project:** [agent-workflows/sdk-local-tools](./sdk-local-tools/) +**Source:** xhigh code review of the sdk-local-tools first slice + +**The problem.** `ResolvedTools` (the SDK resolver's return type) carries the same four fields, +`builtin_tools` / `custom_tools` / `tool_callback` / `mcp_servers`, that `SessionConfig` already +declares. The two shapes can drift: when a new per-tool wire field lands, a maintainer has to +add it in both places, and missing one silently drops the field from either the standalone path +or the service path. + +**Why it is deferred.** The duplication is small and the two types serve different layers today +(a resolver result versus the full session bundle). It is a maintainability touch-point, not a +bug. + +**What to decide or do.** Decide whether the resolver should return the `SessionConfig` tool +fields directly (or a shared sub-model both reuse), so the wire tool shape has one definition. diff --git a/docs/design/agent-workflows/sdk-local-tools/README.md b/docs/design/agent-workflows/sdk-local-tools/README.md new file mode 100644 index 0000000000..cab7ed26d3 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/README.md @@ -0,0 +1,40 @@ +# SDK local tools + +This folder plans one feature: letting a standalone Agenta Python SDK user run an agent's +**tools** during a fully local run, with no Agenta backend service and no rivet sidecar. + +It complements the sibling effort in +[`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md). That one +moves the agent runtime into the SDK and builds `LocalBackend` (the engine that runs a +harness on the user's own machine). This one builds the tool layer on top of that engine: +resolving the agent's tool references into runnable specs, and supplying the secrets those +tools need, all without calling Agenta. + +This workspace covers the original feature plan, the organization proposal, its implementation, +and the completed review remediation. `status.md` is the source of truth. + +## Read in this order + +1. **[context.md](context.md)**: why this matters, the standalone promise it keeps, the + goal, and the non-goals. Start here. +2. **[research.md](research.md)**: the verified current-state map. Where every piece lives + today (with `file:line`), what works, and what is missing, broken down per tool kind. +3. **[organization-proposal.md](organization-proposal.md)**: the recommended package, + naming, exception, validation, and migration structure. +4. **[status.md](status.md)**: the source of truth for progress and the concrete first + steps for the next agent. +5. **[codebase-conventions.md](codebase-conventions.md)**: repository patterns learned during + the organization review, including strong precedents and legacy patterns not to copy. +6. **[plan.md](plan.md)**: the earlier phased behavior plan. Its product decisions remain + useful context; its proposed file and symbol names are not final. + +## The one-sentence goal + +A standalone SDK user pulls an agent config from Agenta, builds a `LocalBackend`, and runs +the agent locally **with its tools**: built-in, code, and (later) gateway, client, and MCP. + +## Prerequisite + +This work assumes `LocalBackend` exists. It does not yet; it is a stub that raises (see +research.md). Read `../scratch/sdk-local-backend/status.md` for that effort's state. The +phases here are sequenced so the first tool slice lands right after `LocalBackend` does. diff --git a/docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md b/docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md new file mode 100644 index 0000000000..bccb9c1d87 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/codebase-conventions.md @@ -0,0 +1,201 @@ +# Codebase conventions learned + +This is a fresh repository review. It records observed patterns and distinguishes the +patterns worth following from legacy or inconsistent code that should not become precedent. + +## Organization + +- Growing backend domains are directories, not a sequence of suffixed files. + `api/oss/src/core/tools/` uses `dtos.py`, `interfaces.py`, `service.py`, `registry.py`, + `exceptions.py`, and provider subpackages. +- The SDK also nests substantial subsystems. `sdks/python/agenta/sdk/engines/running/` + contains interfaces, registries, handlers, runners, errors, templates, and sandbox logic. +- `sdks/python/agenta/sdk/evaluations/runtime/` is a particularly relevant shared-runtime + package. API core imports those SDK model classes directly from + `api/oss/src/core/evaluations/runtime/types.py` and defines only API-specific additions. +- Small composition areas remain flat. `services/oss/src/agent/` has one file per concern: + `app.py`, `client.py`, `config.py`, `schemas.py`, `secrets.py`, `tools.py`, `tracing.py`. +- Tests are organized by runner, test boundary, then domain. Unit tests do not require a + running backend; HTTP adapter behavior belongs in integration tests. +- `__init__.py` is generally used as a curated import surface, while implementations remain + in role-specific modules. + +## File naming + +- Common role names in maintained API core domains are `dtos.py`, `interfaces.py`, + `service.py`, `types.py`, `utils.py`, and `exceptions.py`. +- Semantic names are preferable when the role is clearer than a generic bucket: + `streaming.py`, `registry.py`, `delivery.py`, `context.py`, `mappings.py`. +- Abbreviations such as `defs` are uncommon as file-role names and do not identify whether + values are configuration, DTOs, schemas, or runtime specifications. +- `utils.py` exists widely but often becomes a mixed bucket. New code should use a semantic + filename when one is available. +- A domain-qualified generic filename can be clear (`tools/registry.py`); the same generic + filename at a broad level (`agents/tool_resolution.py`) tends to accumulate unrelated + concerns. + +## Class naming + +- Pydantic classes are nouns that describe their lifecycle or API role: + `ToolConnectionCreate`, `ToolExecutionRequest`, `WorkflowRevisionData`, + `HarnessAgentConfig`. +- Abstract contracts use both plain domain names in the SDK (`Backend`, `Harness`, + `Session`) and `*Interface` in API core (`ToolsDAOInterface`, + `ToolsGatewayInterface`). Consistency within one subsystem matters more than copying one + suffix globally. +- Implementations identify their mechanism or provider: + `InProcessPiBackend`, `RivetBackend`, `ToolsGatewayRegistry`. +- Domain exceptions generally end in `Error` and often inherit from a domain base, as in + `ToolsError`, `EmbedError`, and `UnsupportedHarnessError`. +- The codebase usually capitalizes well-known acronyms in class names (`API`, `LLM`), though + there are mixed conventions for newer acronyms. A subsystem should choose one spelling + deliberately and keep it consistent. + +## Function and method naming + +- Functions use verb-first snake_case names: `resolve_json_pointer`, + `create_connection`, `list_integrations`, `make_oauth_state`. +- Async methods generally use the same domain verb as sync behavior. Some older SDK modules + use `a*` prefixes such as `acreate`; this is legacy style and should not be copied into the + new agent runtime. +- Boolean helpers read as predicates (`supports`, `is_active`, `is_valid`) or explicit + questions (`_mcp_enabled`). +- Mapper functions are clearest when named for the conversion (`decode_oauth_state`, + `_to_harness_config`). Generic verbs such as `attach` should name the data being attached. +- Private helpers use a leading underscore. A service importing another module's private or + implementation helper usually indicates the package boundary is wrong. + +## Function structure + +- New API code prefers keyword-only parameters using `*`. +- Long signatures and calls use visual grouping with `#` separators in API core. +- Services orchestrate dependencies; pure conversion belongs in focused helpers or mapping + modules. +- Adapters own provider-specific behavior. Core orchestration depends on interfaces rather + than concrete HTTP or database implementations. +- Constructors receive dependencies explicitly. `ToolsService` receives its DAO and adapter + registry; agent `Environment` receives a backend. +- Return contracts are typed models when they cross layers. Positional tuples are used in + places but are less extensible and less self-documenting. +- Mutating a result model after construction is not a dominant pattern for domain results; + constructing a complete return value is clearer. + +## Variables + +- Domain-specific names are preferred at boundaries: `project_id`, `provider_key`, + `integration_key`, `connection_create`. +- Short names such as `p`, `out`, `defs`, `entry`, and `values` appear in implementation + code but are weaker choices when the value crosses several stages. +- Names should preserve lifecycle: `tool_config`, `tool_spec`, `secret_names`, + `secret_values`, and `resolved_tools`. +- Wire-specific camelCase is isolated to serialization or explicit wire dictionaries. + Python domain fields remain snake_case. + +## Models and DTOs + +- API core keeps domain contracts in `dtos.py` or `types.py`; FastAPI request and response + envelopes live in API-layer `models.py`. +- Database entities do not leak through service or router contracts. +- Discriminated Pydantic unions are already used for tool references and workflow data. +- Request, response, config, and resolved/runtime values are usually distinguishable by + suffixes such as `Create`, `Request`, `Response`, `Config`, and `Data`. +- Reusing `Dict[str, Any]` for a stable cross-language contract weakens validation and makes + drift harder to detect. Stable wire objects should have typed models and golden + serialization tests. +- Canonical executable configuration should reject unexpected fields. Permissive legacy + handling belongs in a named compatibility adapter. +- Mutable Pydantic defaults such as `items: List[...] = []` exist in older code. New models + should use `Field(default_factory=list)` and `Field(default_factory=dict)`. + +## Docstrings and comments + +- The explicit API guidance says comments should explain non-obvious reasons, invariants, + workarounds, or concurrency/security hazards, not narrate the code. +- Good public docstrings state the contract and important behavior in one paragraph. +- Module docstrings are useful for dependency boundaries and protocol ownership. +- Detailed architecture history, phased decisions, and product rationale belong in + `docs/design/`, not in every source module. +- Section separators are common in large modules, but needing many sections can also be a + signal that the file should become a package. +- Comments around secret isolation, callback reachability, and compatibility aliases are + justified because they preserve security or backwards compatibility. + +## Exceptions + +- New API architecture defines domain exceptions in core and translates them at the API + boundary. +- Domain bases allow broad and narrow handling (`ToolsError` and its subclasses). +- Useful exceptions carry structured fields such as provider, connection, status, or + conflicting identifiers. +- `HTTPException` should not leak into core services or SDK runtime logic. +- Broad catches and best-effort behavior exist for cleanup and optional telemetry. They are + appropriate only when failure is intentionally non-fatal. +- Broad catches around required configuration, tool resolution, or declared secrets can + hide correctness failures and need an explicit policy. +- Raising generic `RuntimeError` while a typed domain error exists is inconsistent and loses + machine-readable context. + +## Dependency direction + +- API routing depends on core services; core services depend on interfaces; adapters + implement those interfaces. +- The agent service already depends on SDK runtime contracts. The SDK must not import the + service. +- The TypeScript runner consumes resolved runtime specs; it should not understand loose + playground compatibility shapes. +- Compatibility parsing should happen once, before resolution. +- HTTP gateway resolution and vault access are adapters, not properties of the tool models. +- When the SDK owns neutral runtime models, API code can import the same classes rather than + maintaining a parallel union. + +## Testing + +- SDK unit tests should cover parsing, pure mapping, policy, and serialization without a + backend. +- Service integration tests should cover HTTP status handling, authentication headers, + malformed responses, and timeouts. +- Golden files are already used for the Python-to-TypeScript runner contract and are the + right mechanism for tool-spec drift. +- Tests should be named for behavior, not internal helper names, when possible. +- Security behavior needs direct tests: secret scoping, missing-secret handling, redaction, + and no ambient environment leakage. +- Correlation and identity behavior also needs tests: duplicate names, built-in/custom + collisions, and provider responses returned in a different order. + +## Strong patterns to copy + +- `api/oss/src/core/tools/`: explicit domain roles and typed exceptions. +- `sdks/python/agenta/sdk/agents/interfaces.py` plus `agents/adapters/`: ports separated from + implementations. +- `sdks/python/agenta/sdk/engines/running/`: nested SDK subsystem with role-specific files. +- `sdks/python/agenta/sdk/evaluations/runtime/`: SDK-owned neutral runtime contracts reused + directly by API core. +- `api/oss/src/dbs/postgres/*/mappings.py`: conversions isolated from persistence and + services. +- `agents/utils/wire.py` plus TypeScript protocol golden tests: one explicit cross-language + boundary. + +## Patterns to avoid treating as precedent + +- Very large multipurpose modules such as `agents/dtos.py` and API + `core/tools/service.py`. +- Legacy SDK async naming with `a*` prefixes. +- `except:` or broad `except Exception` followed by `print`. +- Generic `utils.py` files that contain unrelated domain behavior. +- Silent coercion or dropping as the sole parsing contract. +- Mutable collection defaults in Pydantic models. +- Comments that restate the following code. + +## Practical standard for this feature + +For local tools, the repository evidence supports: + +- one `agents/tools/` domain package. +- lifecycle-specific names (`Config`, `Spec`, `ResolvedToolSet`). +- explicit interfaces for injected secret and gateway dependencies. +- typed domain errors. +- strict default behavior with separately named compatibility behavior. +- pure private mapping helpers. +- one runner serialization boundary. +- tests split by parsing, resolution, secrets, gateway HTTP, and wire contract. +- a sibling MCP package only when MCP is an actual supported subsystem. diff --git a/docs/design/agent-workflows/sdk-local-tools/context.md b/docs/design/agent-workflows/sdk-local-tools/context.md new file mode 100644 index 0000000000..3cadc85881 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/context.md @@ -0,0 +1,77 @@ +# Context + +## Why this work exists + +The agent runtime makes one promise. The +[ports-and-adapters](../ports-and-adapters.md) page states it plainly: "Nothing in the SDK +runtime calls the Agenta API, so the same code runs an agent standalone, with no Agenta +backend at all." A developer should be able to pull an agent config from Agenta, then run +that exact agent on a laptop, in a CI job, or inside another product, with no Agenta service +in the loop. + +The runtime keeps that promise for everything except tools. A standalone user can build an +`AgentConfig`, pick a harness, and (once `LocalBackend` lands) run a turn. The moment the +agent declares a tool, the promise breaks. Tool resolution lives only in the service, behind +HTTP calls to the Agenta API. So a standalone run either drops the tools silently or cannot +run at all. + +This effort closes that gap. It gives the SDK a way to resolve an agent's tools and supply +their secrets locally, so a standalone agent runs *with* its tools. + +## The target experience + +A standalone user should write roughly this: + +```python +import agenta as ag + +# 1. Pull the agent config from Agenta (one network call to the public registry). +params = ag.ConfigManager.get_from_registry(app_slug="my-agent") +agent = ag.AgentConfig.from_params(params) + +# 2. Build a local engine. No Agenta service, no rivet sidecar. +harness = ag.make_harness("pi", ag.Environment(ag.LocalBackend())) + +# 3. Run the agent locally, WITH its tools. +result = await harness.prompt(session_config, messages) +``` + +Step 3 is impossible today, in two ways. `LocalBackend` is a stub that raises (the sibling +effort owns that). And even with a working `LocalBackend`, nothing in the SDK turns the +agent's tool references into runnable specs or supplies their secrets. This document plans +the second part. + +## The standalone definition this hinges on + +"Standalone" needs a sharp edge, because the answer shapes the whole design. One open +decision (see [plan.md](plan.md)) is whether a run that calls the Agenta public REST API to +resolve a tool still counts as standalone. We frame the question here so the reader holds it +throughout: + +- **Offline-standalone**: the run touches no network except the model provider. Every tool + resolves from data the user already holds. This is the strict reading. +- **Connected-standalone**: the run uses no Agenta *service deployment* (no rivet sidecar, + no self-hosted agent service), but it may call the Agenta public API to resolve a tool or + fetch a secret. This is the loose reading. + +Both are useful. The design should let a user pick, not force one. We propose a two-resolver +split that does exactly that, but the reviewer decides (plan.md, Decision 1). + +## Goal + +Let a standalone SDK user resolve an agent's tools and run them under `LocalBackend`, with a +clear, documented answer for each tool kind: which kinds work fully offline, which need an +opt-in network call, and which stay out of reach for now. + +## Non-goals + +- **Building `LocalBackend` itself.** The sibling effort + ([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md)) owns + the engine. This effort is the tool layer on top of it and assumes it exists. +- **Moving gateway (Composio) execution off the server.** The provider key must stay + server-side by design. A standalone gateway tool, if supported at all, calls back to + Agenta; it does not run the provider locally. +- **A new streaming or session model.** Sessions and streaming are covered in + [`../sessions.md`](../sessions.md) and the agent protocol RFC. This effort changes neither. +- **Shipping every tool kind in the first slice.** The plan sequences kinds by value and + difficulty; the first slice is deliberately narrow. diff --git a/docs/design/agent-workflows/sdk-local-tools/conventions-review.md b/docs/design/agent-workflows/sdk-local-tools/conventions-review.md new file mode 100644 index 0000000000..6aabfa8e8a --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/conventions-review.md @@ -0,0 +1,341 @@ +# Agent tools: conventions review + +This reviews the agent **tools** code (SDK + Python service) against our house conventions: +the `api/AGENTS.md` architecture rules, the SDK agents package idiom, and the existing +`api/oss/src/core/tools/` domain. Headline verdict: the tool code is in good shape and +mostly already matches the layer it lives in. The naming and docstrings are consistent with +the agents package. The two real gaps are both small: `ResolvedTools` is a `@dataclass` in a +package where every other contract is a Pydantic model, and the service raises bare +`RuntimeError` where a typed error reads better. Everything else is keep-as-is or a +discuss-later. + +## Tool code map + +| File | Layer | Role | +| --- | --- | --- | +| `sdks/python/agenta/sdk/agents/tool_defs.py` | SDK | Neutral tool vocabulary (`*ToolDef`, `McpServer`) + loose-shape parsers (`parse_tool_def`). | +| `sdks/python/agenta/sdk/agents/tool_resolution.py` | SDK | `LocalToolResolver`, `SecretResolver`/`EnvSecretResolver`, `ResolvedTools`, the per-kind spec builders. | +| `sdks/python/agenta/sdk/agents/dtos.py` | SDK | Holds `ToolCallback`, `SessionConfig` (the resolution output target). | +| `services/oss/src/agent/tools.py` | Service | Server-side orchestration: gateway over HTTP, delegates local kinds to the SDK. | +| `services/oss/src/agent/secrets.py` | Service | Vault secret resolution (`resolve_named_secrets`, `resolve_harness_secrets`). | +| `services/oss/src/agent/client.py` | Service | Shared backend base URL + caller credential. | +| `services/oss/src/agent/app.py` | Service | The `/invoke` handler that calls `resolve_tools` / `resolve_mcp_servers`. | +| `services/agent/src/tools/*.ts` | TS runner | The execution side (`code.ts`, `client.ts`, `mcp-server.ts`, ...). Out of focus. | + +## What I learned about the codebase + +These are the conventions I confirmed by reading, each with a citation. Useful as a +standalone reference. + +- **`api/AGENTS.md` rules are scoped to `api/`.** The file says so in its first line: + "Scope: everything under `api/`" (`api/AGENTS.md:3`). The SDK (`sdks/python`) and the + service (`services/oss`) are different layers with their own idioms. Judge tool code + against the layer it lives in. +- **API domains follow a fixed folder shape**: `apis/fastapi/` + `core/` + + `dbs/postgres/`, with named files `router.py` / `models.py` / `service.py` / + `interfaces.py` / `dtos.py` / `dao.py` / `mappings.py` (`api/AGENTS.md:58-78`). The + canonical copy-me example is `api/oss/src/core/workflows/` (`api/AGENTS.md:74-77`). +- **The existing tools domain follows that shape**: `api/oss/src/core/tools/` has + `dtos.py`, `exceptions.py`, `interfaces.py`, `service.py`, `registry.py`, `utils.py`, and + `providers/` (confirmed by directory listing). It is the fuller layout the SDK tool code + is being compared against. +- **API domain exceptions are typed and live in the core layer**, never `HTTPException` in + services (`api/AGENTS.md:209-217`). The tools domain has a base `ToolsError` plus typed + subclasses with structured context, for example `ActionNotFoundError` + (`api/oss/src/core/tools/exceptions.py:4-58`). Note: this domain uses `exceptions.py`, + while `api/AGENTS.md:210` names `core/{domain}/types.py` or `dtos.py`. So even inside + `api/` the exception-file name is not uniform. +- **API services return typed Pydantic DTOs, not dicts or tuples** (`api/AGENTS.md:298-345`). + The tools service proves resolution output can be a model: `resolve_agent_tools` returns + `AgentToolsResolution` carrying `List[ResolvedAgentTool]` + (`api/oss/src/core/tools/service.py:489-519`, defined at + `api/oss/src/core/tools/dtos.py:277-298`). +- **API signature style is keyword-only with `#`-grouped sections** (`api/AGENTS.md:188-206`). + Live example: `ToolsDAOInterface.create_connection(self, *, project_id, user_id, #, + connection_create)` (`api/oss/src/core/tools/interfaces.py:22-30`). +- **In-code comments are kept minimal, the why not the what** (`api/AGENTS.md:21-22`). +- **The SDK agents package has its own explicit hexagonal vocabulary**, documented in its + module docstring: `dtos.py` (contracts), `interfaces.py` (ports/ABCs), `adapters/` + (implementations), `utils/` (plumbing) (`sdks/python/agenta/sdk/agents/__init__.py:3-11`). + This is the idiom the tool code must match, not the API folder shape. +- **SDK contracts are Pydantic `BaseModel`, deliberately.** The `dtos.py` docstring states + "These are Pydantic models (the SDK already depends on Pydantic)" + (`sdks/python/agenta/sdk/agents/dtos.py:6-7`); all 12 contracts in that file subclass + `BaseModel` (grep count). The tool *defs* follow this too (`tool_defs.py:43-101`, + `McpServer` at `:117`). +- **SDK ports are ABCs.** `Sandbox` / `Session` / `Backend` all subclass `ABC` with + `@abstractmethod` (`sdks/python/agenta/sdk/agents/interfaces.py:41,58,94`). `Protocol` + exists in the SDK but is reserved for structural typing of third-party modules, for + example `_LitellmModule(Protocol)` (`sdks/python/agenta/sdk/utils/lazy.py:22`), and for + call-shape contracts like `InvokeFn(Protocol)` + (`sdks/python/agenta/sdk/decorators/running.py:54`). So an owned, implemented-by-us port + is an ABC; a Protocol is for "shape someone else satisfies". +- **The agents package names its errors file `errors.py`, not `exceptions.py`**, and its + errors subclass `RuntimeError` with structured attributes: + `UnsupportedHarnessError(RuntimeError)` stores `.harness` and `.backend` + (`sdks/python/agenta/sdk/agents/errors.py:13-21`). This is the SDK's own version of the + API's typed-domain-exception rule. +- **The snake/camel boundary is handled by `to_wire()` methods on DTOs.** `ContentBlock`, + `TraceContext`, and `ToolCallback` each own a `to_wire()` that maps snake_case fields to + camelCase wire keys (`dtos.py:122-142`, `:255-262`, `:273-274`). The `/run` request is + assembled in `utils/wire.py` (`request_to_wire`, `sdks/python/.../utils/wire.py:24-57`). + The TS side mirrors the names in `services/agent/src/protocol.ts` + (`utils/wire.py:3-5`). +- **The agents `__init__.py` curates `__all__` with section comments** grouping DTOs, the UI + codec, tool definitions, resolution, ports, errors, and adapters + (`sdks/python/agenta/sdk/agents/__init__.py:73-128`). +- **The agents package uses rich module + class docstrings.** Every module opens with a + multi-paragraph docstring explaining the layer and the why + (`dtos.py:1-9`, `interfaces.py:1-16`, `tool_defs.py:1-25`). This is louder than the + API's "minimal comments" rule, and it is the established norm *for this package*. +- **The SDK uses one module logger per file**: `log = get_module_logger(__name__)` + (`tool_resolution.py:47`, `secrets.py:22`, `tools.py:46`). +- **Tests split unit vs integration by directory and gate gateway/network tests behind + `pytest.mark.integration`.** Unit tool tests live in + `sdks/python/oss/tests/pytest/unit/agents/` (`test_tool_defs.py`, + `test_tool_resolution.py`) and service-side in + `services/oss/tests/pytest/unit/agent/test_tool_refs.py`; the gateway HTTP path lives in + `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py` with + `pytestmark = pytest.mark.integration` (`test_resolve_tools_http.py:19`). Test modules + open with a docstring naming what they lock (`test_tool_resolution.py:1-8`). + +## Proposals and parallels + +### 1. File organization and placement + +- **House pattern.** API splits a domain into many small files + (`api/AGENTS.md:58-78`). The SDK agents package splits by *role* not by feature: one + `dtos.py`, one `interfaces.py`, an `adapters/` package, a `utils/` package + (`__init__.py:3-11`). The `tool_defs.py` / `tool_resolution.py` pair sits as two + sibling modules at the package root, next to `dtos.py` and `streaming.py`. +- **Current state.** `tool_defs.py` (vocabulary + parsing) and `tool_resolution.py` + (resolution + secrets + spec builders) are two modules. `tool_resolution.py` is doing + three jobs: secret sources (`SecretResolver`/`EnvSecretResolver`), free-function spec + builders, and the `LocalToolResolver`. It is 225 lines, cohesive, and reads cleanly. +- **Proposal: Keep.** Two modules for ~460 lines is right for this package. A `tools/` + subpackage would be over-built next to a single-file `dtos.py` that holds 12 models. The + one thing worth flagging: `SecretResolver` is a generic seam (a `code` tool, an MCP + server, and tomorrow other consumers use it). If a second resolver-shaped abstraction + lands, lifting `SecretResolver`/`EnvSecretResolver` into the package's `interfaces.py` + (where ports already live) would match the idiom better than keeping it in + `tool_resolution.py`. Not now. **Worth discussing** only when a second consumer appears. + +### 2. File naming + +- **House pattern.** Dominant names are `dtos.py`, `service.py`, `interfaces.py`, + `utils.py`, `enums.py` (`api/AGENTS.md:62-72`). But the agents package already chose a + different file set on purpose: `dtos.py`, `interfaces.py`, `streaming.py`, + `tool_defs.py`, plus adapter subpackages such as `adapters/vercel/` for protocol-specific + edge code. Descriptive feature names are the norm *here* when the name marks a real seam. +- **Current state.** `tool_defs.py` and `tool_resolution.py` follow the agents package's + own descriptive-name idiom, not the API's `dtos.py`/`service.py` idiom. +- **Proposal: Keep.** Renaming to `dtos.py`/`service.py` would collide with the package's + existing `dtos.py` and would not be more informative. `tool_defs.py` and + `tool_resolution.py` say exactly what they hold. The names are good. + +### 3. Class naming + +- **House pattern.** API uses `*Service`, `*DAOInterface`, `*DTO` suffixes + (`ToolsService`, `ToolsDAOInterface`, `SecretResponseDTO`). The SDK agents package uses + bare role nouns for ports (`Backend`, `Harness`, `Sandbox`, `Session`, + `interfaces.py:41-198`) and bare nouns for DTOs (`Message`, `AgentResult`). +- **Current state.** `LocalToolResolver`, `SecretResolver`, `EnvSecretResolver`, + `ResolvedTools`, and the `*ToolDef` family. These read as role nouns, matching the agents + package. `EnvSecretResolver` as a concrete adapter of `SecretResolver` mirrors + `InProcessPiBackend` adapting `Backend`. +- **Proposal: Keep.** The names match the package's own port/adapter naming. `*ToolDef` + parallels the API's `Agent*Tool` discriminated union + (`api/oss/src/core/tools/dtos.py:253-274`) closely enough that a reader crossing the two + layers will not be surprised. + +### 4. Function naming and signature style + +- **House pattern.** API prefers keyword-only params with a leading `*` and `#`-grouped + sections (`api/AGENTS.md:188-206`; live at + `api/oss/src/core/tools/interfaces.py:22-30`). The SDK agents package follows this on its + *ports*: `Session.prompt(self, messages, *, on_event=None)` + (`interfaces.py:67-72`), `Backend.create_session(self, sandbox, config, *, harness, ...)` + (`interfaces.py:120-129`). +- **Current state.** `LocalToolResolver.__init__(self, secret_resolver=None, *, + enable_mcp=False)` already uses keyword-only for the flag + (`tool_resolution.py:166-171`), matching the ports. But the module-level spec builders are + positional: `code_tool_spec(tool, env)`, `mcp_server_spec(server, values)`, + `attach_orthogonal(entry, tool)` (`tool_resolution.py:88,98,124`). +- **Proposal: Small change (optional).** These free functions take two args each and are + internal builders called from one place. Positional is defensible and the call sites are + clear (`tool_resolution.py:208-209`, `services/oss/src/agent/tools.py:133`). If you want + strict parity with the package's keyword-only norm, make the second arg keyword-only + (`code_tool_spec(tool, *, env)`). Low value, low risk. I would **keep** unless you are + doing a parity sweep. + +### 5. Function structure + +- **House pattern.** Small, single-responsibility functions with early returns; the + split-then-build pattern (collect, then transform). The API resolver does exactly this + (`api/oss/src/core/tools/service.py:489-519`: split builtins vs custom, build each). +- **Current state.** `LocalToolResolver.resolve_defs` splits defs by type with three + comprehensions, then builds (`tool_resolution.py:198-211`). `resolve_tools` in the service + does the same split (`tools.py:152-180`). The functions are short, each does one thing, + and they early-return on empty (`tools.py:153-154`). The parsers in `tool_defs.py` use + early returns throughout (`parse_tool_def`, `:182-208`). +- **Proposal: Keep.** This is the cleanest dimension. The structure matches the house + split-then-build pattern precisely. + +### 6. Docstrings + +- **House pattern.** Two competing norms. `api/AGENTS.md:21-22` says keep in-code comments + minimal (why not what). The SDK agents package, by contrast, opens every module with a + multi-paragraph docstring (`dtos.py:1-9`, `interfaces.py:1-16`, `tool_defs.py:1-25`) and + gives most classes a why-focused docstring. +- **Current state.** `tool_defs.py` and `tool_resolution.py` carry rich module docstrings + and per-class/per-function docstrings in the package's house voice. They explain *why* + (the three-axis split, why gateway can't resolve offline, why MCP is flag-gated), not + *what*. +- **Proposal: Keep.** The docstrings match the agents package norm, which is the relevant + one. They are on the heavy side by `api/AGENTS.md` standards, but that rule does not own + this package, and the surrounding files set the heavier bar. Consistency wins. One nit: + `EnvSecretResolver.resolve` has a why-comment about the single read + (`tool_resolution.py:74`) that is genuinely a why, so it is fine. + +### 7. Variable naming + +- **House pattern.** snake_case Python locals; camelCase only at the wire boundary, and the + boundary is owned by `to_wire()` methods on DTOs (`dtos.py:122-142`, + `ToolCallback.to_wire`, `:273-274`). +- **Current state.** Locals are snake_case throughout. The camelCase keys + (`needsApproval`, `inputSchema`, `callRef`, `mcpServers`) appear only inside the + dict-literal builders (`tool_resolution.py:100-106`, `services/oss/src/agent/tools.py:126-132`). + The boundary is clean and intentional: snake_case in, camelCase out. +- **Proposal: Keep**, with one observation flagged in dimension 8: the builders emit + camelCase from a free function returning `Dict[str, Any]`, whereas the rest of the package + emits camelCase from a typed `to_wire()`. The variable naming itself is correct; the + *mechanism* is the discussion (see below). + +### 8. Typing and data contracts + +This is the most substantive finding. + +- **House pattern.** API services return typed Pydantic DTOs, not dicts or tuples + (`api/AGENTS.md:298-345`), and the tools domain already returns + `AgentToolsResolution` / `ResolvedAgentTool` for resolution output + (`api/oss/src/core/tools/dtos.py:277-298`). The SDK agents package makes contracts Pydantic + `BaseModel` on purpose (`dtos.py:6-7`), and wire serialization lives in a `to_wire()` on + the model (`ContentBlock.to_wire`, `dtos.py:122`). +- **Current state — three sub-points:** + 1. `ResolvedTools` is a `@dataclass` (`tool_resolution.py:149-156`). It is the **only** + `@dataclass` in the agents package; everything else there is a `BaseModel`. (The SDK + uses `@dataclass` only in `utils/types.py` and `evaluations/preview/utils.py`, not in a + contract layer.) + 2. `ResolvedTools` duplicates four fields of `SessionConfig` verbatim: `builtin_tools`, + `custom_tools`, `tool_callback`, `mcp_servers` (`tool_resolution.py:153-156` vs + `dtos.py:512-517`). The resolver builds a `ResolvedTools`, then the service unpacks it + field-by-field onto `SessionConfig` (`app.py:91-103` via the `resolve_tools` tuple). + 3. The wire specs are untyped `Dict[str, Any]` built by free functions + (`code_tool_spec` etc.), not a typed model with `to_wire()`. +- **Proposals:** + - **`ResolvedTools` -> Pydantic `BaseModel`: Small change, do it.** It is a one-line swap + (`@dataclass` to `BaseModel`, `field(default_factory=...)` to `Field(default_factory=...)`) + and it removes the lone odd-one-out in a Pydantic package. Cheap, lowers surprise. + - **The `SessionConfig` field duplication: Worth discussing, lean keep.** `ResolvedTools` + is the resolver's *output contract*; `SessionConfig` is the run's *input bundle*. They + overlap by design because resolution feeds the session. The current service path returns + a bare tuple `(builtins, custom_tools, callback)` from `resolve_tools` + (`tools.py:142-180`), which is exactly the tuple anti-pattern `api/AGENTS.md:341-344` + warns against. The higher-leverage fix is to have the service `resolve_tools` return the + `ResolvedTools` model (once it is a model) instead of a 3-tuple, and have `app.py` spread + it onto `SessionConfig`. That kills the tuple, keeps the two contracts distinct, and does + not force a premature merge. Note this crosses the layer boundary: `resolve_tools` lives + in `services/oss` but would return an SDK type, which is fine and already the dependency + direction (`service -> SDK`). + - **Typed wire specs with `to_wire()`: Worth discussing, defer.** Turning each + `code_tool_spec`/`client_tool_spec`/`mcp_server_spec` dict into a small Pydantic model + with `to_wire()` would match `ContentBlock`/`ToolCallback`. But the wire shape is shared + three ways (SDK builder, service gateway path, TS `protocol.ts`) and is pinned by the + golden contract tests. The dict-literal builders are readable and the camelCase keys sit + right next to the snake_case source. I would defer this until the tool wire shape needs + to grow; the payoff today is mostly aesthetic. + - **`SecretResolver` ABC vs Protocol vs callable: Keep the ABC.** It is an owned port with + a single method, implemented by us (`EnvSecretResolver`, the service's + `_VaultSecretResolver`, the test `_DictSecrets`). That is exactly when the package uses + an ABC (`interfaces.py:41-129`), not a Protocol (which it reserves for third-party module + shapes, `utils/lazy.py:22`). A bare `Callable[[Sequence[str]], Awaitable[Dict[str,str]]]` + would lose the name and the docstring seam. The ABC is the right call. + +### 9. Exception handling + +- **House pattern.** API defines typed domain exceptions in the core layer and never raises + `HTTPException` from services (`api/AGENTS.md:209-217`); the tools domain has the full + `ToolsError` hierarchy (`api/oss/src/core/tools/exceptions.py`). The SDK agents package has + its own version: typed errors in `errors.py` subclassing `RuntimeError` with structured + attributes (`errors.py:13-21`). The anti-pattern is bare `Exception`/`ValueError`/error + dicts for domain errors (`api/AGENTS.md:291-296`). +- **Current state.** The service's gateway path raises bare `RuntimeError` four times + (`services/oss/src/agent/tools.py:84,102,112,122`): missing API base, HTTP error, count + mismatch, incomplete spec. The SDK side raises nothing (it skips and warns, + `tool_resolution.py:184-188`). The tests assert on `RuntimeError` + a message substring + (`test_resolve_tools_http.py:43,89,97,111`). +- **Proposal: Small change, worth doing.** This is the clearest convention miss. The + service already imports from the SDK and the SDK already has `errors.py`. Define a small + typed error there, for example `ToolResolutionError(RuntimeError)` carrying structured + context (the failing ref count, the HTTP status), and raise it in `tools.py` instead of + four bare `RuntimeError`s. It subclasses `RuntimeError`, so the existing + `pytest.raises(RuntimeError, ...)` tests still pass, and callers can catch narrowly. Keep + it in the SDK `errors.py` next to `UnsupportedHarnessError`, since the failure is about + tool resolution (an SDK concept) and the service is the caller. Caveat: the service is not + a FastAPI router with an `@intercept_exceptions` boundary the way `api/` is; it runs inside + `ag.workflow`. So this is about *type clarity and structured context*, not about HTTP + mapping. Do not over-engineer a full hierarchy; one typed error is enough today. + +### 10. Module exports / `__init__.py` + +- **House pattern.** The agents `__init__.py` curates `__all__` with `#` section comments + grouping DTOs, the UI codec, tool definitions, resolution, ports, errors, adapters + (`__init__.py:73-128`). +- **Current state.** The new tool exports are grouped under two labelled sections: "Tool + definitions (the neutral tool vocabulary)" and "Local tool resolution (offline; secrets + pluggable, env by default)" (`__init__.py:97-111`). Every new name (`AgentToolDef`, + `LocalToolResolver`, `SecretResolver`, `EnvSecretResolver`, `ResolvedTools`, the parsers) + is exported. +- **Proposal: Keep.** The exports are consistent with the file's own pattern, the section + comments are accurate, and the grouping reads well. + +### 11. Tests + +- **House pattern.** Unit vs integration split by directory; network/gateway behind + `pytest.mark.integration`; module docstrings naming what is locked. +- **Current state.** Clean. Unit tool tests sit in + `sdks/python/oss/tests/pytest/unit/agents/test_tool_defs.py` and `test_tool_resolution.py` + and in `services/oss/tests/pytest/unit/agent/test_tool_refs.py`. The gateway HTTP path is + isolated in `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py` with + `pytestmark = pytest.mark.integration` (`:19`). Fixtures follow the package style: a small + in-test `SecretResolver` subclass `_DictSecrets` (`test_tool_resolution.py:21-28`), and an + `install_http` fixture for the integration mock. Each test module opens with a docstring + (`test_tool_resolution.py:1-8`, `test_tool_refs.py:1-8`). +- **Proposal: Keep.** This dimension is exemplary. The offline/online split mirrors the + code's own offline/gateway split, the fixtures use the real ABCs, and the docstrings make + each file self-explaining. One tiny note: `test_tool_refs.py` imports the private + `_gateway_ref` (`test_tool_refs.py:15`); testing a private is fine here since it is the + unit under test, but flag it so nobody "fixes" it by exporting the symbol. + +## Prioritized recommendation + +What a CTO would push on, highest leverage and lowest risk first. Be honest: most of this +code is already right. + +1. **Make `ResolvedTools` a Pydantic `BaseModel`.** One-line change. Removes the only + `@dataclass` in a Pydantic package and matches `api/AGENTS.md`'s typed-DTO rule. Do now. +2. **Replace the bare `RuntimeError`s in `services/oss/src/agent/tools.py` with one typed + error** (`ToolResolutionError(RuntimeError)` in the SDK `errors.py`) carrying structured + context. Subclassing `RuntimeError` keeps the existing tests green. Do now. +3. **Have the service `resolve_tools` return the `ResolvedTools` model instead of a + 3-tuple.** This kills the tuple anti-pattern (`api/AGENTS.md:341-344`) and the + field-by-field unpack in `app.py`. Depends on item 1. Do next. +4. **Defer: typed wire specs with `to_wire()`.** Real consistency win, but the wire shape is + shared three ways and pinned by golden tests; the payoff today is aesthetic. Revisit when + the tool wire shape grows. +5. **Defer / optional: keyword-only on the free spec builders, and lifting `SecretResolver` + into `interfaces.py`.** Both are parity polish. Do them only as part of a deliberate + sweep or when a second `SecretResolver` consumer lands. + +Everything else (file placement, file names, class names, function structure, docstrings, +variable naming, exports, tests) already matches the layer's conventions. Leave it alone. diff --git a/docs/design/agent-workflows/sdk-local-tools/organization-proposal.md b/docs/design/agent-workflows/sdk-local-tools/organization-proposal.md new file mode 100644 index 0000000000..1450e4fbbb --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/organization-proposal.md @@ -0,0 +1,612 @@ +# Tools code organization proposal + +> Implemented on 2026-06-19. The package, naming, validation, service-adapter, MCP, +> API-model-sharing, test-organization, and runner-file recommendations below are now reflected +> in the working tree. + +## Recommendation + +Do not keep growing `agents/tool_defs.py` and `agents/tool_resolution.py`. + +Move the SDK-owned tool runtime into a small `agents/tools/` package, name declarative +objects `*Config`, name runnable objects `*Spec`, keep compatibility parsing separate from +resolution, and make unsupported tools and missing secrets explicit policy decisions rather +than silent skips. + +Keep MCP out of that package while MCP remains out of release scope. Either remove the +dormant MCP flow from this slice or place it in a sibling `agents/mcp/` package with its own +capability and rollout decisions. + +This is an organization and naming proposal. It does not require changing the persisted +discriminator values (`builtin`, `gateway`, `code`, `client`) or the runner wire keys in the +first refactor. + +## Pre-refactor inventory + +### Python SDK + +The local-tools implementation currently lives in: + +- `sdks/python/agenta/sdk/agents/tool_defs.py` + - Pydantic tool-definition models. + - MCP server model. + - compatibility parsing for loose and legacy config shapes. +- `sdks/python/agenta/sdk/agents/tool_resolution.py` + - secret-source interface and environment implementation. + - config-to-wire mapping. + - local resolution orchestration. + - gateway skip policy. + - MCP feature-gate policy. +- `sdks/python/agenta/sdk/agents/dtos.py` + - raw `AgentConfig.tools` and `AgentConfig.mcp_servers`. + - resolved tool fields on `SessionConfig`. + - harness-facing tool and MCP delivery fields. +- `sdks/python/agenta/sdk/agents/errors.py` + - `ToolResolutionError`, currently not used by the resolver or service gateway path. +- `sdks/python/agenta/sdk/agents/adapters/harnesses.py` + - maps already-resolved tool specs into harness configuration. +- `sdks/python/agenta/sdk/agents/utils/wire.py` + - serializes harness configuration to the TypeScript runner contract. +- `sdks/python/agenta/sdk/utils/types.py` + - imports the tool models to generate the editable agent-config schema. +- `sdks/python/agenta/sdk/agents/__init__.py` + - re-exports most tool models, parsers, resolvers, and secret interfaces. + +SDK tests currently live in: + +- `sdks/python/oss/tests/pytest/unit/agents/test_tool_defs.py` +- `sdks/python/oss/tests/pytest/unit/agents/test_tool_resolution.py` +- `sdks/python/oss/tests/pytest/unit/agents/test_wire_contract.py` + +### Python agent service + +Server-side composition currently lives in: + +- `services/oss/src/agent/tools.py` + - partitions gateway and locally derivable tools. + - implements the HTTP gateway resolver. + - adapts gateway responses to the runner wire shape. + - provides the vault-backed secret resolver. + - reads the MCP feature flag. +- `services/oss/src/agent/secrets.py` + - resolves model-provider keys. + - resolves named tool secrets through the future vault consumer endpoint. +- `services/oss/src/agent/app.py` + - calls the resolvers and fills `SessionConfig`. + +Service tests currently live in: + +- `services/oss/tests/pytest/unit/agent/test_tool_refs.py` +- `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py` +- `services/oss/tests/pytest/integration/agent/test_resolve_secrets_http.py` + +### TypeScript runner + +Runtime execution lives in: + +- `services/agent/src/protocol.ts` + - runner wire types, including `ResolvedToolSpec` and `McpServerConfig`. +- `services/agent/src/tools/execute.ts` + - dispatches resolved tools by executor kind. +- `services/agent/src/tools/code.ts` + - code-tool subprocess execution and environment isolation. +- `services/agent/src/tools/client.ts` + - calls the Agenta callback endpoint; despite the filename, this is not a client-fulfilled + tool implementation. +- `services/agent/src/tools/relay.ts` + - file relay for callbacks from isolated sandboxes. +- `services/agent/src/tools/mcp-server.ts` + - exposes resolved tools through MCP. +- `services/agent/src/tools/mcp-bridge.ts` + - MCP transport bridge. + +Runner tests live in `services/agent/test/`, notably `tool-dispatch.test.ts`, +`tool-bridge.test.ts`, and `mcp-servers.test.ts`. + +### Platform API + +The existing platform tools domain is already organized as a package: + +- `api/oss/src/apis/fastapi/tools/` - HTTP models, router, boundary parsing. +- `api/oss/src/core/tools/` - DTOs, interfaces, service, registry, exceptions, providers. +- `api/oss/src/dbs/postgres/tools/` - database entities, mappings, DAO. + +This domain covers the platform catalog, provider connections, and remote execution. It is +related to, but not identical to, the SDK runtime-tool concern. + +## Problems in the current SDK shape + +### The files are split by chronology, not responsibility + +`tool_defs.py` means “the first tool file” and `tool_resolution.py` means “everything added +after parsing.” The latter currently owns interfaces, an implementation, pure mappers, +orchestration, logging behavior, unsupported-tool policy, and MCP rollout policy. + +Those concerns change for different reasons and should not share one module. + +### `Def` does not explain the lifecycle stage + +The important distinction is: + +- configuration: persisted, declarative, may contain compatibility shapes. +- runtime specification: validated and ready for the runner. + +`BuiltinToolDef` and `AgentToolDef` do not communicate that distinction. `*Config` and +`*Spec` do. + +### “Local” is already inaccurate + +`LocalToolResolver` is used by the service with `_VaultSecretResolver`, which performs an +HTTP-backed secret lookup. The class really resolves the locally derivable tool kinds; it +does not guarantee that all dependencies are local. + +### Untyped dictionaries erase the contract + +`custom_tools: List[Dict[str, Any]]` and helper return values such as +`code_tool_spec(...) -> Dict[str, Any]` make required fields, executor-specific fields, and +snake_case-to-camelCase conversion implicit. + +The TypeScript side already has `ResolvedToolSpec`. Python should have the matching typed +model and serialize aliases at the wire boundary. + +### The canonical tool configuration is duplicated + +The SDK defines `AgentToolDef`; API core separately defines `AgentToolReference`, +`AgentBuiltinTool`, and `AgentComposioTool`. Their discriminators and supported variants +already differ. + +The evaluations runtime demonstrates the preferred precedent: +`api/oss/src/core/evaluations/runtime/types.py` imports the SDK runtime classes directly and +defines only API-specific additions. The platform tools API should reuse the SDK-owned +neutral tool config and keep only catalog, connection, and provider-execution DTOs in API +core. + +### Parsing failures are silent + +`parse_tool_def` catches `ValidationError` and returns `None`. `parse_tool_defs` then drops +the entry. This is useful for a narrow compatibility adapter, but unsafe as the default +runtime behavior: a configured tool can disappear without telling the caller which entry +was invalid. + +### Resolution policy is hidden inside mechanics + +Current policy choices are embedded in implementation: + +- gateway tools are logged and skipped offline. +- missing named secrets are omitted. +- MCP declarations become an empty list when a flag is off. +- unsupported gateway providers are logged and skipped. + +These affect correctness and security. They need typed errors or explicit policy arguments. + +### MCP is a sibling concept but is folded into one resolver result + +The config itself treats `mcp_servers` as a sibling of `tools`, but `ResolvedTools` and +`LocalToolResolver` own both. Feature gating also happens during resolution even though the +actual constraint is a harness/backend execution capability. + +### Public and private surfaces are unclear + +The top-level `agenta.sdk.agents` package exports models, parsers, resolvers, and secret +interfaces together. The service imports `attach_orthogonal` from the implementation module. +That helper is an internal mapper, not an SDK concept. + +### Gateway correlation depends on position + +Gateway results are correlated back to requested tool configs with positional `zip`. Count +validation prevents truncation, but it does not prove that the backend preserved ordering. +The API contract should return a stable request reference or correlation ID and the adapter +should match on it. + +### Tool-name collision behavior is unspecified + +There is no visible validation for duplicate custom-tool names or collisions between a +built-in and a custom tool. The runner eventually registers tools by name, so the resolver +must define whether collisions are invalid, first-wins, or last-wins. Silent runner-level +overwriting is not an acceptable contract. + +### Source docstrings contain temporary design history + +The new modules contain release phases, plan decisions, and extensive architecture rationale. +That material will become stale. Source docstrings should state ownership and runtime +contracts; this design workspace should retain the history and tradeoffs. + +## Repository parallels to follow + +| Concern | Existing parallel | Lesson | +| --- | --- | --- | +| Domain package | `api/oss/src/core/tools/` | Split a growing domain by role instead of adding suffixed flat files. | +| Contracts and adapters | `sdks/python/agenta/sdk/agents/interfaces.py` + `adapters/` | Keep abstract dependencies separate from concrete implementations. | +| Nested subsystem | `sdks/python/agenta/sdk/engines/running/` | A domain inside the SDK can justify its own package and curated exports. | +| SDK-owned shared runtime | `sdks/python/agenta/sdk/evaluations/runtime/` + `api/oss/src/core/evaluations/runtime/types.py` | Define neutral runtime models once in the SDK; API code imports the same classes and adds only API-specific contracts. | +| Typed domain errors | `api/oss/src/core/tools/exceptions.py`, `agents/errors.py` | Raise domain errors with structured context; translate them at boundaries. | +| Registry/dispatch | `core/tools/registry.py`, `engines/running/registry.py` | Use a registry only when multiple implementations exist; do not begin with conditionals spread across callers. | +| Wire contract | `agents/utils/wire.py` + `services/agent/src/protocol.ts` | Keep Python names idiomatic and perform camelCase serialization at one boundary. | +| Tests by boundary | `docs/designs/testing/testing.structure.specs.md` | Keep pure parsing/resolution tests in SDK unit tests and HTTP gateway behavior in service integration tests. | + +Patterns not worth copying include the 622-line `agents/dtos.py`, the 569-line API +`core/tools/service.py`, broad `utils.py` modules, and legacy SDK modules that print HTTP +errors or catch exceptions without a domain contract. + +## Proposed SDK package + +```text +sdks/python/agenta/sdk/agents/tools/ + __init__.py + models.py + interfaces.py + parsing.py + compat.py + resolver.py + wire.py + errors.py +``` + +### `models.py` + +Own declarative config and resolved runtime models: + +- `ToolConfigBase` +- `BuiltinToolConfig` +- `GatewayToolConfig` +- `CodeToolConfig` +- `ClientToolConfig` +- `ToolConfig` - discriminated union. +- `CallbackToolSpec` +- `CodeToolSpec` +- `ClientToolSpec` +- `ToolSpec` - discriminated union for the runner-facing contract. +- `ResolvedToolSet` + +Keep persisted discriminator values unchanged. Use Pydantic aliases for runner keys such as +`inputSchema`, `callRef`, and `needsApproval`. + +Canonical models should use `ConfigDict(extra="forbid")`. Compatibility belongs before +model validation, not in permissive canonical models. + +MCP types do not belong here while MCP is out of scope. If retained, place them under a +sibling `agents/mcp/` package: + +```text +sdks/python/agenta/sdk/agents/mcp/ + __init__.py + models.py + interfaces.py + resolver.py + errors.py +``` + +### `parsing.py` + +Own strict conversion from already-supported mappings into canonical models: + +- `parse_tool_config` +- `parse_tool_configs` + +Raise `ToolConfigError` with the item index and Pydantic cause. It should never silently +drop executable configuration. + +### `compat.py` + +Own old playground and persisted payload shapes: + +- bare built-in names. +- `{"name": ...}` built-ins. +- the legacy `composio` discriminator. +- OpenAI function-picker gateway slugs. + +Return canonical `ToolConfig` values plus structured diagnostics. Callers may select strict +or best-effort compatibility behavior explicitly; best-effort must not be the only API. + +### `interfaces.py` + +Own narrow injected dependencies: + +```python +class ToolSecretProvider(Protocol): + async def get_many(self, names: Sequence[str]) -> Mapping[str, str]: ... + + +class GatewayToolResolver(Protocol): + async def resolve( + self, + tools: Sequence[GatewayToolConfig], + ) -> GatewayToolResolution: ... +``` + +`ToolSecretProvider` is intentionally scoped. A generic `SecretResolver` name collides with +provider credentials, vault management, and other SDK secret concepts. + +### `resolver.py` + +Own one orchestration path: + +```python +class ToolResolver: + def __init__( + self, + *, + secret_provider: ToolSecretProvider, + gateway_resolver: GatewayToolResolver | None = None, + missing_secret_policy: MissingSecretPolicy = MissingSecretPolicy.ERROR, + ) -> None: ... + + async def resolve( + self, + tool_configs: Sequence[ToolConfig], + ) -> ResolvedToolSet: ... +``` + +The resolver should: + +1. partition configs by kind. +2. collect declared secret names once. +3. fetch secrets once. +4. build typed specs with small private pure functions. +5. invoke the injected gateway resolver when gateway tools exist. +6. return a newly constructed immutable result. + +It should not parse raw payloads, read feature flags, silently skip unsupported kinds, or +mutate the result after construction. + +The environment implementation can live here while it remains tiny: + +- `EnvironmentToolSecretProvider` + +Move it to `secrets.py` only if a second built-in implementation appears. + +### `wire.py` + +Own the one conversion from typed, snake_case Python models to the TypeScript runner +contract. Harness adapters should consume typed specs and should not normalize arbitrary +dictionaries. + +### `errors.py` + +Use one domain base with structured context: + +- `ToolError` +- `ToolConfigError` +- `ToolResolutionError` +- `UnsupportedToolProviderError` +- `GatewayToolResolutionError` +- `MissingToolSecretError` + +HTTP status and response excerpts belong on the service adapter error or as structured cause +context, not as generic `RuntimeError` strings in SDK orchestration. + +### `__init__.py` + +Export the intended public SDK: + +- config models. +- `ToolResolver`. +- resolver/provider protocols intended for user extension. +- domain errors. + +Do not export private mappers or compatibility-only helpers from `agenta.sdk.agents`. + +## Naming proposal + +| Current | Proposed | Reason | +| --- | --- | --- | +| `AgentToolDef` | `ToolConfig` | It is persisted declarative configuration, not an arbitrary definition. | +| `_ToolDefBase` | `ToolConfigBase` | Names the lifecycle stage. Keep private if users never instantiate it. | +| `BuiltinToolDef` | `BuiltinToolConfig` | Distinguishes config from resolved spec. | +| `GatewayToolDef` | `GatewayToolConfig` | Same distinction; keep the discriminator value stable. | +| `CodeToolDef` | `CodeToolConfig` | Same distinction. | +| `ClientToolDef` | `ClientToolConfig` | Same distinction. | +| `McpServer` | `MCPServerConfig` | It is configuration, not a running server. Matches repository acronym style (`API`, `LLM`). | +| `ResolvedTools` | `ResolvedToolSet` | Communicates an aggregate value, not a resolver action. | +| `SecretResolver` | `ToolSecretProvider` | Narrows scope and describes an injected source. | +| `EnvSecretResolver` | `EnvironmentToolSecretProvider` | Explicit source and scope. | +| `LocalToolResolver` | `ToolResolver` | The resolver can use local or remote injected providers. | +| `parse_tool_def(s)` | `parse_tool_config(s)` | Matches the renamed model and avoids abbreviation. | +| `attach_orthogonal` | `_apply_tool_metadata` | “Orthogonal” describes a design argument, not the function's behavior. | +| `code_tool_spec` | `_build_code_tool_spec` | Verb-first private mapper. | +| `client_tool_spec` | `_build_client_tool_spec` | Verb-first private mapper. | +| `mcp_server_spec` | `_build_mcp_server_spec` | Verb-first private mapper. | +| `defs` | `tool_configs` | Avoid unexplained abbreviation. | +| `entry` | `tool_spec` | Name the actual value. | +| `values` | `secret_values` | Name the domain content. | +| `custom_tools` internally | `tool_specs` | Keep `customTools` only as the legacy wire field name. | +| `builtins` internally | `builtin_names` | Distinguishes names from built-in config objects. | + +Use `McpServerConfig` instead of `MCPServerConfig` only if cross-language spelling is +prioritized over the Python codebase's existing acronym convention. Pick one spelling and +apply it in Python docs, schema titles, and tests. + +## Function and file structure + +### Functions + +- Use keyword-only parameters when a function has multiple same-shaped values. +- Keep parsing, resolution, mapping, and I/O in separate functions. +- Keep mapper helpers private and pure. +- Construct return objects once; avoid assigning fields after creation. +- Use one clear noun per variable: `tool_config`, `tool_spec`, `secret_names`, + `secret_values`, `gateway_tools`. +- Do not expose an internal `resolve_defs` method solely so the service can bypass another + public method. Inject the gateway implementation into the same resolver. +- Validate duplicate resolved names and built-in/custom collisions before returning. + +### Docstrings and comments + +The current modules read partly like design documents. Follow the repository's API comment +guidance: + +- module docstring: purpose and dependency boundary in a few lines. +- public class/function docstring: contract, important policy, raised errors. +- no docstring for obvious Pydantic data containers. +- comments explain security constraints or compatibility reasons, not the next line of code. +- keep architecture rationale in this design folder. + +### Exceptions + +- Core SDK code raises tool-domain exceptions. +- The service gateway adapter wraps `httpx` errors as `GatewayToolResolutionError`. +- HTTP route code translates domain errors to HTTP responses. +- Runner execution failures remain tool-result failures when the model loop is expected to + continue. +- Missing declared secrets should fail before tool execution by default. A best-effort mode + can exist for backward compatibility, but it must be explicit and tested. +- Gateway results should correlate by stable ID/reference, not response position. + +## Service-side proposal + +Use a service-side adapter package. The current file already contains composition, gateway +HTTP, vault adaptation, and rollout policy: + +```text +services/oss/src/agent/ + tools/ + __init__.py # curated service entry point + resolver.py # build the SDK ToolResolver for one request + gateway.py # Agenta HTTP GatewayToolResolver + secrets.py # VaultToolSecretProvider + secrets.py # model-provider/harness secrets only + client.py # shared Agenta HTTP client configuration +``` + +Move `_VaultSecretResolver` to `tools/secrets.py` and rename it +`VaultToolSecretProvider`. + +Return `ResolvedToolSet` from service resolution instead of a three-item tuple. This removes +positional unpacking from `app.py` and makes future additions non-breaking. + +Feature flags should be read in the service composition/config layer. MCP capability checks +belong in harness/backend selection or delivery, not in the generic tool resolver. + +## TypeScript proposal + +The TypeScript runner already has the clearest tool directory. Keep that structure, with two +targeted naming fixes: + +- rename `tools/execute.ts` to `tools/dispatch.ts`; the module selects an executor, while + `code.ts` performs execution. +- rename `tools/client.ts` to `tools/callback.ts` or `tools/agenta-callback.ts`; “client” is + confused with the `kind: "client"` tool that the browser fulfils. + +Keep wire types in `protocol.ts` until the protocol itself is split. Moving only tool types +would create cross-file protocol ownership without enough benefit. + +## Test organization proposal + +Once the SDK source becomes a package, mirror the domain: + +```text +sdks/python/oss/tests/pytest/unit/agents/tools/ + test_models.py + test_parsing.py + test_resolution.py + test_secrets.py +``` + +Service tests should separate pure composition from network behavior: + +```text +services/oss/tests/pytest/unit/agent/tools/ + test_resolution.py + test_gateway_mapping.py + +services/oss/tests/pytest/integration/agent/tools/ + test_gateway_http.py + test_vault_secrets_http.py +``` + +Tests should assert: + +- strict invalid-config errors include the failing index and cause. +- compatibility parsing reports dropped entries. +- canonical models reject unexpected fields. +- declared missing secrets follow the selected policy. +- unsupported gateway providers raise a typed error. +- gateway count mismatches and incomplete responses retain structured context. +- gateway responses are matched by stable correlation value rather than position. +- duplicate tool names and built-in/custom collisions have a deterministic error. +- no secret value appears in logs or exception messages. +- Python typed specs serialize exactly to the TypeScript golden wire contract. + +## Migration plan + +### Phase 1: package-only refactor + +- Add `agents/tools/`. +- Move code without changing persisted values or wire output. +- Keep temporary imports from `tool_defs.py` and `tool_resolution.py` that emit deprecation + warnings only if these modules have shipped publicly. +- Update internal imports to the new package. +- Reuse SDK tool-config models from API core and remove the duplicate API agent-tool union + only after API compatibility tests are in place. + +### Phase 2: type the resolved contract + +- Add `ToolSpec` variants and `ResolvedToolSet`. +- Serialize runner aliases in one place. +- Replace `List[Dict[str, Any]]` and tuple returns. + +### Phase 3: make policy explicit + +- Add strict parsing and diagnostics. +- Add missing-secret policy. +- Raise typed errors for unsupported gateway configuration. +- Add duplicate-name and collision validation. +- Remove silent skip behavior from the default path. + +### Phase 4: simplify service composition + +- Inject `VaultToolSecretProvider` and the HTTP gateway resolver. +- Remove `resolve_defs` and internal mapper imports. +- Move feature-gate reads to composition. + +### Phase 5: isolate or defer MCP + +- Remove MCP from the tool resolver while it is disabled. +- If implementation must remain, move it to `agents/mcp/` and give it independent capability + checks and tests. + +### Phase 6: optional runner rename + +- Rename TypeScript dispatch/callback files in a separate mechanical change. + +## Likely CTO review questions + +1. Which layer is the source of truth for tool configuration: SDK, API core, or runner + protocol? +2. Why are API `AgentToolReference`, SDK `ToolConfig`, and TypeScript + `ResolvedToolSpec` separate, and where is each conversion tested? +3. Why can an invalid configured tool disappear silently? +4. Why does an offline resolver silently skip a gateway tool instead of failing the run? +5. Why can a code tool execute when a declared secret is missing? +6. Why is a class named `LocalToolResolver` used with an HTTP vault dependency? +7. Why are runnable specs dictionaries instead of a discriminated model? +8. Why is MCP feature policy inside resolution when MCP support is a runtime capability? +9. Why is MCP grouped into `ResolvedTools` when the config calls it a sibling? +10. Which names are public API, and what is the compatibility plan if they have shipped? +11. Why does the service import an internal mapping helper from the SDK? +12. Why does `ToolResolutionError` exist while production paths raise `RuntimeError`? +13. Can secret values leak through logs, Pydantic reprs, exceptions, traces, or serialized + session configuration? +14. Are gateway response ordering and one-to-one mapping guaranteed by a documented API + contract? +15. What prevents Python and TypeScript tool specs from drifting? +16. Is the first PR a behavior change or an organization change, and can reviewers verify + that distinction from tests? +17. Why do SDK and API core define different versions of the same agent-tool config? +18. How are duplicate names and built-in/custom collisions handled? +19. How is gateway response ordering guaranteed, and why is position an acceptable key? +20. Why is MCP implemented across config, resolution, wire, and runner while the feature is + declared out of scope? + +## Decision requested + +Approve the package and lifecycle naming before adding more tool kinds: + +- `agents/tools/` +- `*Config` for persisted declarations. +- `*Spec` for runnable contracts. +- `ToolResolver` with injected providers. +- typed errors and explicit policy. + +If this is accepted, the next implementation PR should be a package-only refactor with no +wire or persisted-schema changes. diff --git a/docs/design/agent-workflows/sdk-local-tools/plan.md b/docs/design/agent-workflows/sdk-local-tools/plan.md new file mode 100644 index 0000000000..4ee5ddaf26 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/plan.md @@ -0,0 +1,175 @@ +# Plan: settled decisions and phased direction + +The reviewer settled the four open decisions on 2026-06-19. This page records the answers, +then the phases that follow from them. Where a decision changed the prior framing, the +change is called out so the research and status pages stay consistent. + +## Settled decisions + +### Decision 1: Where does resolution live? Settled: split by tool kind. + +Resolution splits on the executor `type` (`sdks/python/agenta/sdk/agents/tool_defs.py`). +Only one kind needs Agenta. The rest resolve locally. + +- **`gateway` (Composio) stays in the backend.** Resolving a gateway tool validates a + Composio connection and holds the provider key and OAuth, so it cannot move to the SDK. + The call routes back through `/tools/call` and the key never leaves the server. This is the + tool kind a user sometimes calls "the built-in Agenta tool," but in the taxonomy it is + `gateway`, not `builtin`. +- **`builtin`, `code`, `client`, and MCP resolve locally.** A `builtin` is just a name. A + `code` tool's spec is built from local data. A `client` tool is name plus schema. None of + them needs Agenta to resolve. MCP is local too, but it ships later (Decision 2). + +So we build both shapes behind one interface: + +- An offline **`LocalToolResolver`** (the default) resolves `builtin` + `code` + `client` + (and MCP later) from local data, with no network call. +- An opt-in **`AgentaToolResolver`** resolves `gateway` by posting refs to the Agenta public + API. A user who needs a gateway tool opts into this and accepts the network call. + +A user with only built-in and code tools stays fully offline. The goal the reviewer set is +plain: you should be able to run everything except gateway without Agenta, using your own +secrets. + +One caveat on `client` tools. They resolve locally, but a headless standalone run has no +browser to fulfil them, and the in-process Pi engine skips them +(`services/agent/src/engines/pi.ts:165`). They are locally resolvable, not locally runnable, +unless the user supplies a fulfiller. We document this as a limitation, not a blocker. + +**Dependency direction is fixed regardless: `service -> SDK`, never the reverse.** The +service already imports the SDK's tool vocabulary +(`services/oss/src/agent/tools.py:23`). When spec-building logic moves into the SDK, the +service consumes it. The sibling effort locks the same rule +([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md), +"Dependency direction"). + +### Decision 2: How do code and MCP executors run locally? Settled: code reuses the Pi engine; MCP is out of scope, wired but flag-gated off. + +`LocalBackend`'s Pi path is the bundled in-process Pi engine, which already executes `code` +and `callback` tools (`services/agent/src/engines/pi.ts:144`). So code execution needs no new +executor under Pi. It needs the bundle the sibling effort ships, plus a correctly filled code +spec from the resolver. + +MCP is out of scope for the first releases. We keep the functions and flows in place (parse, +the `mcp_servers` field on `SessionConfig`, a resolver path) but keep them a no-op behind a +feature flag that defaults to off, for both Pi and Claude. This matches the code today: the +in-process Pi engine already hard-codes `mcpTools: false` +(`services/agent/src/engines/pi.ts:58`), so Pi is already a no-op for MCP. The flag +formalizes that and gives Claude the same off-by-default treatment. The business logic stays +dormant behind the flag until a later phase turns it on. + +### Decision 3: How are secrets supplied locally? Settled: env by default, behind a pluggable `SecretResolver`. The vault is a later, connected path. + +First, the secret model, because the prior framing called part of this a bug and it is not. + +Secrets live in the project vault. One table, one CRUD stack +(`POST/GET/PUT/DELETE /secrets` under `/vault/v1`), encrypted at rest. A secret's `kind` +decides how it behaves at run time. Two kinds matter here: + +- **`provider_key`**: a provider-indexed LLM credential. `data` is + `{kind: "openai", provider: {key: "..."}}`, where the inner `kind` is a fixed provider + enum. It is consumed today by `get_user_llm_providers_secrets()` + (`api/oss/src/core/secrets/utils.py:54`), which maps each provider to a fixed env var + (`openai -> OPENAI_API_KEY`). The user never names these. The identity is the provider. +- **`custom_secret`**: a free-text `name -> value` entry. `header.name` is the user-chosen + name (for example `GITHUB_TOKEN`); `data` is `{secret: {key: "..."}}`, with no provider + enum. It is storage-only this iteration by design. Nothing reads it, there is no env-var + mapping, and it is not injected into completions or the agent runtime. + +Same endpoint, same table, same encryption and CRUD. The only difference is the `kind` and +what consumes the value. + +Now the problem this effort must solve. A `code` tool names the secrets it needs +(`secrets: ["GITHUB_TOKEN"]`), and an MCP server names an env-var-to-secret-name map. The +config stores only the name, never the value. At run time something must turn the name into a +value and inject it as an env var into the sandbox subprocess (code) or the server process +(MCP). On the Agenta server path that consumer would read the `custom_secret` rows by name, +but it is not built yet, because `custom_secret` is storage-only this iteration. So +`resolve_named_secrets` (`services/oss/src/agent/secrets.py:75`) anticipates a consumer that +does not exist; until it does, declared custom secrets resolve to `{}` and the tool runs +without those env vars. This is the expected current state given the storage-only design, not +a bug to fix in this effort. + +For a standalone run we do not want to depend on the vault at all. The SDK has no secret +resolution of any kind today. So a standalone agent has no way to supply `GITHUB_TOKEN` to +its code tool. The settled answer: + +- **Env is the offline default.** A code tool's declared secret name reads from the process + environment. `GITHUB_TOKEN` comes from `os.environ`. This is what "use your own secrets and + run" means. +- **A pluggable `SecretResolver` is the interface env plugs into.** The user can back it with + a `.env` file, a secret manager, or their own vault. Env is just the built-in default + implementation. We ship the interface in the first slice. +- **The Agenta vault over HTTP is a later, connected path.** It reads `custom_secret` rows by + name and depends on the future consumer endpoint being built. That work is tracked as an + open issue ([../open-issues.md](../open-issues.md)) and is tied to the named-secrets effort + (`docs/design/vault-named-secrets/`). + +### Decision 4: What is the minimum first slice? Settled: `LocalBackend` (Pi) + built-in + code + env secrets, offline. + +A standalone agent that reads files, runs bash, and runs its own code tools, with zero Agenta +calls. It excludes gateway (server-bound), client (needs a browser), and MCP (flag-gated +off). This implies building the `LocalBackend` Pi path, which the sibling effort owns as its +Phase 0. + +## Phased direction + +The phases assume the settled decisions above. + +### Phase 0 (prerequisite, owned elsewhere): `LocalBackend` runs a tool-free agent + +`LocalBackend` is a stub that raises (`sdks/python/agenta/sdk/agents/adapters/local.py:30`). +The sibling effort ([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md)) +must ship at least the Pi path (the bundled JS runner) before any tool work runs end to end. + +### Phase 1: built-in tools, offline + +The smallest real step. Built-in tools are just names that run in-harness +(`services/oss/src/agent/tools.py:183`). A standalone user lists built-in names on the +`AgentConfig`, and the harness runs them. This is mostly a wiring check that the names flow +onto `SessionConfig.builtin_tools` under `LocalBackend`. + +### Phase 2: a local tool resolver, code tools, env secrets + +Build the `LocalToolResolver`. It reuses the per-kind spec builders that +`services/oss/src/agent/tools.py` already has (`_resolve_code` at `:127`, `_client_spec` at +`:153`), moved into the SDK so the service and the standalone user call the same code, with +the dependency pointing `service -> SDK`. Add a `SecretResolver` with an env default. At the +end of this phase a standalone agent runs its code tools offline, because the in-process Pi +engine already executes them (`services/agent/src/engines/pi.ts:169`). + +This is the first slice with real value. It is the phase to aim the first PR at. + +### Phase 3: opt-in Agenta-backed resolver for gateway tools + +Add `AgentaToolResolver`. It posts gateway refs to the Agenta public API and wires the +`/tools/call` callback, exactly as `_resolve_gateway` (`services/oss/src/agent/tools.py:67`) +does, but driven from the SDK. This is opt-in and network-gated, and it keeps the Composio key +server-side. It is connected-standalone: no sidecar, but a real Agenta API call. Client tools +surface here as a documented limitation (no browser to fulfil them in a headless local run). + +### Phase 4: land the vault consumer for named secrets + +Add the runtime consumer that resolves `custom_secret` rows by name, so the connected +resolver can read the vault and the server path stops resolving declared secrets to `{}`. +This is co-owned with the named-secrets effort (`docs/design/vault-named-secrets/`), whose +current iteration is storage-only by design. Tracked in +[../open-issues.md](../open-issues.md). + +### Phase 5: MCP, and Claude locally + +The last and least certain. Turning on local MCP needs an executor the in-process Pi engine +does not have (`services/agent/src/engines/pi.ts:58`); it stays flag-gated off until then. +Claude locally needs the `claude-agent-sdk` path the sibling effort owns, plus MCP-style tool +delivery on top. Both are real work and neither blocks the value in phases 1 and 2. + +## Phase summary + +| Phase | Delivers | Network | Blocked on | +| --- | --- | --- | --- | +| 0 | tool-free local Pi run | none | sibling effort (`LocalBackend`) | +| 1 | built-in tools | none | Phase 0 | +| 2 | code tools + env secrets | none | Phase 1; `LocalToolResolver`, `SecretResolver` | +| 3 | gateway tools (opt-in) | Agenta API | Phase 2; `AgentaToolResolver` | +| 4 | vault named-secret consumer | Agenta API | API work; named-secrets coordination | +| 5 | MCP + Claude locally | varies | Phases 2-4; new executor, MCP flag flipped on | diff --git a/docs/design/agent-workflows/sdk-local-tools/research.md b/docs/design/agent-workflows/sdk-local-tools/research.md new file mode 100644 index 0000000000..9704e3cb26 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/research.md @@ -0,0 +1,170 @@ +# Research: the verified current state + +This is the map a prior review drew and this document re-verified against the code on +2026-06-19. Every claim cites `file:line`. Read it as the ground truth the plan builds on. + +The headline: the tool **vocabulary** already lives in the SDK and parses standalone, but +tool **resolution** and **secret resolution** live only in the service, behind HTTP calls to +the Agenta API. A standalone run has the words but not the machinery. + +## The pipeline, end to end + +A configured tool travels through four stages before a harness can call it. + +1. **Parse**: turn a loose config entry into a typed tool def. *In the SDK already.* +2. **Resolve**: turn a typed def into a wire-ready runnable spec (a name, a description, an + input schema, and either a `callRef`, a code snippet, or a client marker). *Service only.* +3. **Supply secrets**: fetch the vault values a code tool or MCP server needs. *Service + only, and partly broken even there (see below).* +4. **Execute**: actually run the call when the model invokes the tool. *Lives in the TS + runner; the in-process Pi engine already runs code and callback tools.* + +Stage 1 is portable today. Stages 2 and 3 are the gap. Stage 4 is closer than it looks. + +## Stage 1: parsing (already in the SDK, standalone-ready) + +The neutral tool vocabulary and its parser live in the published SDK, with no dependency on +the Agenta API. + +- `sdks/python/agenta/sdk/agents/tool_defs.py:174`: `parse_tool_def` coerces one loose + config entry into a typed def. `parse_tool_defs` (`:211`) does a list; + `parse_mcp_servers` (`:223`) does MCP servers. +- The tool union spans four executors, discriminated on `type`: + `BuiltinToolDef` (`:55`), `GatewayToolDef` (`:63`), `CodeToolDef` (`:79`), + `ClientToolDef` (`:93`), joined as `AgentToolDef` (`:104`). +- Two orthogonal fields ride on every tool: `needs_approval` and `render`, both on + `_ToolDefBase` (`:43`). +- `McpServer` (`:117`) is a sibling of `tools`, not a tool variant. It declares a transport + (`stdio` or `http`), a command, and an env-var-to-vault-secret-name map. + +A standalone user can already call `parse_tool_defs(agent.tools)` and get typed defs. The +parser is permissive on input and strict on output, and it never touches the network. + +## Stage 2: resolution (service only) + +Resolution lives in the service, not the SDK. + +- `services/oss/src/agent/tools.py:164`: `resolve_tools` splits the parsed defs by executor + and resolves each kind: + - **gateway** (`:67`, `_resolve_gateway`): POSTs the refs to the Agenta API + `POST /tools/resolve` (`:89`), then sets a callback to `POST /tools/call` (`:120`). The + Composio key and connection auth stay server-side by design. + - **code** (`:127`, `_resolve_code`): builds the spec locally from the def, then resolves + the tool's declared secrets into a scoped `env` (`:134`). + - **client** (`:153`, `_client_spec`): builds a name-plus-schema spec locally, no callback. +- `resolve_mcp_servers` (`:211`) resolves each declared server's secret env and returns + wire-ready server dicts. + +Read closely, only the **gateway** branch truly needs the network. The code and client +branches build their specs from local data; the only network call inside them is the secret +fetch (stage 3). That is the seam the plan exploits: most of `resolve_tools` is already +offline-capable logic that simply lives in the wrong layer. + +The service feeds the resolved output onto the `SessionConfig` and never the SDK. The wiring +is in `services/oss/src/agent/app.py`: `resolve_tools` and `resolve_mcp_servers` are +awaited (`:91`-`:92`), then their output is placed on `SessionConfig` fields `builtin_tools`, +`custom_tools`, `tool_callback` (`:99`-`:101`). The SDK consumes these; it never produces +them. + +## Stage 3: secrets (service only, and the named-secret consumer is not built yet) + +Secrets live in one project vault. One table, one CRUD stack, encrypted at rest. A secret's +`kind` decides how it behaves at run time. Two kinds matter here. + +- **`provider_key`**: a provider-indexed LLM credential. `data` is + `{kind: "openai", provider: {key: "..."}}`, the inner `kind` a fixed provider enum. The + service consumes it in `resolve_harness_secrets` (`services/oss/src/agent/secrets.py:38`), + which fetches the vault via `GET /secrets/` (`:54`) and maps each provider to its env var + (`openai -> OPENAI_API_KEY`). This path works. The same filter-and-map lives in the API at + `get_user_llm_providers_secrets` (`api/oss/src/core/secrets/utils.py:54`). +- **`custom_secret`**: a free-text `name -> value` entry. `header.name` is the user-chosen + name (for example `GITHUB_TOKEN`); `data` is `{secret: {key: "..."}}`, with no provider + enum. It is storage-only this iteration by design. Nothing reads it, there is no env-var + mapping, and it is not injected into the agent runtime. + +Same endpoint, same table, same encryption and CRUD. The only difference is the `kind` and +what consumes the value. + +A code tool and an MCP server name `custom_secret` entries. `resolve_named_secrets` +(`services/oss/src/agent/secrets.py:75`) is what they would use: it POSTs the requested names +to a consumer endpoint (`:97`) and reads back `{name: value}`. That consumer is not built yet. +The vault router exposes only CRUD (`api/oss/src/apis/fastapi/vault/router.py:36`-`:74`: +create, list, read, update, delete) and no resolve route. The named-secrets effort is +storage-only this iteration and says so at `docs/design/vault-named-secrets/context.md:23`: +"Do **not** inject these secrets into the agent runtime, sandbox, or any invocation." + +So `resolve_named_secrets` swallows the missing-endpoint response and returns `{}` +(`secrets.py:101`-`:107`), logging a warning. A code tool or MCP server that declares a custom +secret gets an empty env today, on every path including the server. This is the expected +current state given the storage-only design, not a bug to fix in this effort. Building the +consumer is later, coordinated work (the plan's Phase 4 and +[open-issues.md](../open-issues.md)). For the standalone slice, secrets come from a local +`SecretResolver` (env by default), so the offline path does not wait on the vault consumer at +all. + +## Stage 4: execution (the TS runner, and the in-process Pi engine already does most of it) + +Execution lives in the TypeScript runner under `services/agent/src/tools/`: +`code.ts` runs a code snippet, `mcp-server.ts` bridges resolved tools over MCP, `execute.ts` +(`runResolvedTool`) is the shared dispatcher, `client.ts` and `relay.ts` round it out. + +The important and under-appreciated fact: the **in-process Pi engine already executes tools**. +`services/agent/src/engines/pi.ts` builds Pi custom tools and branches on the executor +`kind` (`pi.ts:144`): + +- `code`: runs the snippet in a sandbox subprocess with its scoped secret env (`pi.ts:169`). +- `callback`: POSTs back to Agenta's `/tools/call` (`pi.ts:179`). +- `client`: skipped in-process; there is no browser to answer (`pi.ts:165`). + +`code.ts`'s own header says it is "Shared by every delivery path that runs code locally: +engines/pi.ts (in-process Pi), extensions/agenta.ts (Pi under rivet), tools/mcp-server.ts." +This matters because `LocalBackend`'s Pi path is the **bundled in-process Pi engine**. So once +`LocalBackend` ships that bundle, code-tool *execution* comes nearly for free. The work is to +hand the engine a correctly resolved code spec with its env filled in, which is stages 2 and 3. + +What the in-process Pi engine does **not** do is MCP. `pi.ts:58` hard-codes +`mcpTools: false`. MCP delivery and execution live only on the rivet path (`rivet.ts`, the +capability gate at `:799`), which the standalone user does not have. MCP local support is +therefore a later phase that needs its own executor, not a reuse of the Pi engine. + +## The delivery contract the SDK already speaks + +The resolved output rides on neutral types the SDK owns, so a local resolver has a clear +target to produce. + +- `SessionConfig` (`sdks/python/agenta/sdk/agents/dtos.py:498`) carries `builtin_tools`, + `custom_tools`, `tool_callback`, and `mcp_servers` (`:512`-`:517`). +- The harness adapters in `sdks/python/agenta/sdk/agents/adapters/harnesses.py` only *shape* + already-resolved specs; they never resolve. `_normalize_tool_specs` (`:65`) fills defaults + and passes the executor fields (`kind`, `runtime`, `code`, `env`, `callRef`) through to the + wire (`:54`, `_TOOL_SPEC_PASSTHROUGH`). `PiHarness` (`:93`) keeps built-ins and delivers + specs natively; `ClaudeHarness` (`:114`) drops built-ins and routes over MCP. + +So the local resolver's job is well defined: produce the same `builtin_tools` / +`custom_tools` / `tool_callback` / `mcp_servers` shapes the service produces, from local data. +Everything downstream already accepts them. + +## The current state, per tool kind + +| Kind | Resolve today | Execute under `LocalBackend` (Pi) | Standalone reach | +| --- | --- | --- | --- | +| **builtin** | trivial; just the name (`tools.py:183`) | yes, runs in-harness | **fully offline.** Closest to free. | +| **code** | spec built locally; only the secret fetch is remote (`tools.py:127`) | yes, `pi.ts:169` already runs it | **offline if secrets come from env.** The spec builder must move to the SDK; secrets need a local source. | +| **client** | spec built locally (`tools.py:153`) | skipped in-process (`pi.ts:165`) | needs a browser/fulfiller; **out of offline reach** without a UI. | +| **gateway** | requires `POST /tools/resolve` + `/tools/call` (`tools.py:67`) | callback to Agenta (`pi.ts:179`) | **server-bound by design.** Connected-standalone only; the provider key stays server-side. | +| **MCP** | spec built locally; secret env is remote (`tools.py:211`) | not supported in-process (`pi.ts:58`) | needs a local MCP executor; **later phase.** | + +## What this means for the plan + +Three findings shape the phasing: + +1. **Built-in and code tools are within reach of a first offline slice.** Built-in is just a + name. Code's spec builder is local logic in the wrong layer, and the in-process Pi engine + already executes code. The only true blocker for code is secrets, which env solves. +2. **Named-secret consumption is not built, by design.** Custom secrets are storage-only this + iteration, so any path that relies on the vault for code or MCP secrets either waits on the + future consumer endpoint or supplies secrets another way (env, a pluggable resolver). The + offline slice takes the second route. +3. **Gateway and MCP are genuinely harder.** Gateway must stay server-bound. MCP needs an + executor the in-process Pi engine does not have. Both belong in later phases, framed as + open decisions, not promised in the first slice. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/app-mcp-reassign.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/app-mcp-reassign.md new file mode 100644 index 0000000000..e4ed98202b --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/app-mcp-reassign.md @@ -0,0 +1,14 @@ +# Evidence – F-006 + +`services/oss/src/agent/app.py:91-92` +```python + resolved = await resolve_tools(agent_config.tools) + resolved.mcp_servers = await resolve_mcp_servers(agent_config.mcp_servers) +``` + +`resolve_tools` already returns mcp_servers empty by construction: + +`services/oss/src/agent/tools.py:147-158` (docstring + body) returns `ResolvedTools()` / +`ResolvedTools` whose `mcp_servers` defaults to `[]`; the function never populates it. Relevance: +the overwrite is redundant and relies on an implicit "resolve_tools never sets mcp_servers" +contract; info-level readability/coupling note only. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/attach-orthogonal-mutation.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/attach-orthogonal-mutation.md new file mode 100644 index 0000000000..77b299368c --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/attach-orthogonal-mutation.md @@ -0,0 +1,21 @@ +# Evidence – F-005 + +`sdks/python/agenta/sdk/agents/tool_resolution.py:89-96` +```python +def attach_orthogonal(entry: Dict[str, Any], tool: Any) -> Dict[str, Any]: + """Carry the orthogonal axes (approval, render) onto a wire spec when set.""" + if getattr(tool, "needs_approval", False): + entry["needsApproval"] = True + render = getattr(tool, "render", None) + if render is not None: + entry["render"] = render + return entry +``` + +It is a public helper imported by the service: + +`services/oss/src/agent/tools.py:36` -> `from agenta.sdk.agents.tool_resolution import (... attach_orthogonal)` + +Mutates `entry` in place AND returns it. All current callers pass a fresh dict and use the +return value, so no live bug. Relevance: latent footgun for a future caller that passes a shared +dict; dual mutate-and-return contract is inconsistent with the package's other pure builders. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/description-default-inconsistency.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/description-default-inconsistency.md new file mode 100644 index 0000000000..decca6d700 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/description-default-inconsistency.md @@ -0,0 +1,29 @@ +# Evidence – F-004 + +SDK builders default description to the tool name: + +`sdks/python/agenta/sdk/agents/tool_resolution.py:99-108` +```python +def code_tool_spec(tool: CodeToolDef, env: Dict[str, str]) -> Dict[str, Any]: + entry: Dict[str, Any] = { + "name": tool.name, + "description": tool.description or tool.name, + ... +``` + +Gateway remap copies the backend field, which may be None: + +`services/oss/src/agent/tools.py:131-133` +```python + entry = { + "name": name, + "description": spec.get("description"), +``` + +Runner absorbs a None description downstream: + +`services/agent/src/engines/pi.ts:162` -> `description: spec.description ?? spec.name` +`services/agent/src/tools/mcp-server.ts:67` -> `description: s.description ?? s.name` + +Relevance: same "no description" condition yields `name` for code/client but `None` for gateway +on the wire; harmless at runtime, inconsistent on the contract. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-no-logging.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-no-logging.md new file mode 100644 index 0000000000..b7e040c160 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-no-logging.md @@ -0,0 +1,38 @@ +# Evidence – F-002 + +`services/oss/src/agent/tools.py:97-107` +```python + async with httpx.AsyncClient(timeout=TOOLS_TIMEOUT) as client: + response = await client.post( + f"{api_base}/tools/resolve", + json={"tools": refs}, + headers=headers, + ) + if response.status_code >= 400: + raise ToolResolutionError( + f"Tool resolution failed (HTTP {response.status_code}): {response.text[:500]}", + status=response.status_code, + ) +``` + +No `try/except` around the POST: a transport error raises a raw `httpx` exception (not +`ToolResolutionError`). No `log.*` on any branch in `_resolve_gateway`. + +Contrast the established pattern in the adjacent (out-of-scope) secret resolver: + +`services/oss/src/agent/secrets.py:94-111` +```python + try: + async with httpx.AsyncClient(timeout=TOOLS_TIMEOUT) as client: + response = await client.post(...) + if response.status_code >= 400: + log.warning("agent: named-secret resolve HTTP %s for %s", response.status_code, names) + return {} + except Exception: # pylint: disable=broad-except + log.warning("agent: named-secret resolve failed for %s", names, exc_info=True) + return {} +``` + +Relevance: the gateway path neither logs failures nor normalizes transport errors to the typed +error, breaking the "fail-fast raises a typed `ToolResolutionError`" invariant for the +transport-failure case. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-orthogonal-untested.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-orthogonal-untested.md new file mode 100644 index 0000000000..3ec4b17039 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/gateway-orthogonal-untested.md @@ -0,0 +1,28 @@ +# Evidence – F-001 + +Gateway specs get the orthogonal axes via `attach_orthogonal` at the end of the gateway remap: + +`services/oss/src/agent/tools.py:131-138` +```python + entry = { + "name": name, + "description": spec.get("description"), + "inputSchema": spec.get("input_schema"), + "callRef": call_ref, + "kind": "callback", + } + custom_tools.append(attach_orthogonal(entry, tool)) +``` + +The integration fixture has no orthogonal fields, so this branch never runs with the axes set: + +`services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py:22` +```python +_GATEWAY = {"function": {"name": "tools__composio__github__GET_USER__c1"}} +``` + +A grep for `needs_approval|needsApproval|render` combined with `gateway|composio|callback` in +the gateway test files returns nothing — only the local code/client path asserts carry-back +(`test_tool_refs.py` / `test_tool_resolution.py`, the `pick` client tool). Relevance: the +gateway branch of `attach_orthogonal` is unexercised, so an approval-drop regression on +server-side tools ships green. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/evidence/handler-resolution-error.md b/docs/design/agent-workflows/sdk-local-tools/review/evidence/handler-resolution-error.md new file mode 100644 index 0000000000..ba93a65d91 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/evidence/handler-resolution-error.md @@ -0,0 +1,20 @@ +# Evidence – F-003 + +`services/oss/src/agent/app.py:90-114` +```python + msgs = to_messages(messages or (inputs or {}).get("messages") or []) + resolved = await resolve_tools(agent_config.tools) # can raise ToolResolutionError + resolved.mcp_servers = await resolve_mcp_servers(agent_config.mcp_servers) + + session_config = SessionConfig(...) + harness = make_harness(selection.harness, Environment(select_backend(selection))) + + if stream: + return _agent_stream(harness, session_config, msgs) +``` + +Resolution runs before the `stream` branch, so a `ToolResolutionError` is raised by the `_agent` +coroutine itself — for the stream path that is the coroutine the endpoint awaits to *get* the +async generator, not a part yielded inside the stream. No test drives `_agent(stream=True)` with +a raising `resolve_tools`. Relevance: failure-path surfacing on `/messages` is unspecified and +unpinned; it may differ from the JSON path. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/findings.md b/docs/design/agent-workflows/sdk-local-tools/review/findings.md new file mode 100644 index 0000000000..5859999acd --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/findings.md @@ -0,0 +1,282 @@ +# findings.md – Confirmed Findings + +Sync metadata: scan-codebase, `depth=deep`, +`path=docs/design/agent-workflows/sdk-local-tools/review/`, branch `gitbutler/workspace`. +Severity scheme: critical / high / medium / low / info (harness `deliverables.md`). Each +finding also carries `Confidence` and `Status` from the shared findings schema. + +## Summary + +Six findings were identified and all six are resolved by the tools package refactor. + +There are no critical findings. There are no high findings. + +## Resolved Findings + +### F-001 [RESOLVED] – Gateway orthogonal-axis carry-back (needsApproval / render) is untested + +| Field | Value | +|---|---| +| Severity | medium | +| Confidence | high | +| Status | resolved | +| Origin | scan | +| Lens | verification | +| Category | Testing, Completeness | +| Criteria | general G-13/G-16 (Testing); Completeness (2) | +| Location | `services/oss/src/agent/tools.py:138`; `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py:47-85` | +| Files | tools.py, test_resolve_tools_http.py, test_tool_refs.py | + +**Condition:** `_resolve_gateway` calls `attach_orthogonal(entry, tool)` on every resolved +gateway spec (`tools.py:138`), so a gateway tool that declares `needs_approval` or `render` +should carry `needsApproval` / `render` onto its `kind="callback"` wire spec. No test sets +those fields on a gateway tool. The integration fixture `_GATEWAY` +(`test_resolve_tools_http.py:22`) is a bare picker slug with no orthogonal fields, and +`test_tool_refs.py` only exercises `_gateway_ref` (which does not apply the orthogonal axes — +those land later in `_resolve_gateway`). The gateway branch of `attach_orthogonal` is therefore +never executed under test. + +**Cause:** The test that locks orthogonal carry-back +(`test_resolve_tools_splits_builtin_code_client_offline`) covers only the local code/client +kinds. The gateway integration tests focus on the network path, count mismatch, and +incomplete-spec handling, and skip the approval/render axes. + +**Consequence:** A regression that drops `needsApproval` on a gateway tool (for example a +human-in-the-loop Composio action) would ship green. Approval is a safety gate; silently +dropping it on the gateway kind is the worst kind to lose, because those tools execute +server-side with the provider key. + +**Evidence:** evidence/gateway-orthogonal-untested.md + +**Suggested Fix:** + +- **Option A (preferred):** Extend `test_resolves_gateway_and_remaps_to_wire_shape` to pass a + gateway tool dict carrying `needs_approval: true` and a `render` dict, and assert the resolved + callback spec carries `needsApproval` and `render`. A typed gateway dict + (`{"type": "gateway", ...}`) is the cleanest vehicle. +- **Option B:** Add a focused unit test in `test_tool_refs.py` that calls `_resolve_gateway` + with a stubbed HTTP response and a gateway def with the axes set, asserting carry-back by + position. + +**Resolution:** `test_gateway_metadata_and_description_fallback_are_preserved` and +`test_gateway_specs_are_joined_by_call_ref_not_position` cover metadata carry-back. + +--- + +### F-002 [RESOLVED] – Gateway HTTP/timeout failures are raised but not logged + +| Field | Value | +|---|---| +| Severity | medium | +| Confidence | high | +| Status | resolved | +| Origin | scan | +| Lens | verification | +| Category | Observability | +| Criteria | Observability (10); services SV-26 | +| Location | `services/oss/src/agent/tools.py:97-120` | +| Files | tools.py | + +**Condition:** `_resolve_gateway` makes an outbound HTTP call to `POST /tools/resolve` with no +`try/except` around the `httpx` call and no log on any failure path. A network error (DNS, +connection refused, timeout) raises a raw `httpx` exception, not a `ToolResolutionError`, and +nothing is logged. The HTTP-status, count-mismatch, and incomplete-spec paths raise +`ToolResolutionError` but emit no log line either. + +**Cause:** The module logs only the unsupported-provider skip (`tools.py:165`). The gateway call +leans on the exception message reaching the caller, but the caller (`app.py:91`) does not catch +or log it. The adjacent secret resolver (`secrets.py`) wraps every `httpx` call in `try/except` +with a `log.warning` plus `exc_info`; the gateway path does not follow that established pattern. + +**Consequence:** When gateway resolution fails in production, the operator sees a generic +framework error with no structured breadcrumb, and a raw `httpx.ConnectError` / +`httpx.ReadTimeout` is not normalized to the typed `ToolResolutionError` the rest of the path +promises. The invariant "gateway fail-fast paths raise a typed `ToolResolutionError`" does not +hold for transport-level failures. + +**Evidence:** evidence/gateway-no-logging.md + +**Suggested Fix:** + +- **Option A (preferred):** Wrap the `httpx` POST in `try/except httpx.HTTPError` and re-raise as + `ToolResolutionError("Tool resolution request failed: ...")` so transport failures match the + typed-error invariant. Add `log.warning("agent: gateway tool resolution failed for %d + ref(s)", len(gateway_defs), exc_info=True)` on the failure paths, mirroring `secrets.py`. +- **Option B:** At minimum add a `log.warning` on the HTTP-status, count-mismatch, and + incomplete-spec branches so each fail-fast leaves a breadcrumb, and leave transport-error + normalization for a follow-up. + +**Resolution:** `AgentaGatewayToolResolver` logs each failure branch and wraps +`httpx.HTTPError` as `GatewayToolResolutionError`; integration tests verify both. + +--- + +### F-003 [RESOLVED] – Handler does not handle ToolResolutionError; stream path may surface it differently + +| Field | Value | +|---|---| +| Severity | medium | +| Confidence | medium | +| Status | resolved | +| Origin | scan | +| Lens | verification | +| Category | Completeness, Soundness | +| Criteria | Completeness (2); services SV-2 | +| Location | `services/oss/src/agent/app.py:91-92` | +| Files | app.py | + +**Condition:** `_agent` calls `resolve_tools` (and `resolve_mcp_servers`) at lines 91-92, before +it branches on `stream`. A `ToolResolutionError` raised here propagates straight out of `_agent`. +For the JSON `/invoke` path this is the intended fail-fast: the workflow wrapper turns the raised +exception into an error response. For the SSE `/messages` stream path, the exception is raised +from the coroutine the endpoint awaits to obtain the async generator, so the failure surfaces as +a coroutine raise, not as a stream `error` part. The two paths report the same root failure +through different channels, and there is no test asserting the stream path on a resolution +failure. + +**Cause:** Tool resolution is deliberately hoisted above the `stream` branch so both paths +fail-fast before any backend setup. That is the right call for fail-fast, but it leaves the +stream-path error-surfacing shape unspecified and untested. + +**Consequence:** A client on the streaming endpoint may receive a transport-level error instead +of a well-formed UI Message Stream `error` part, depending on how the endpoint adapter handles a +raising coroutine. This is a UX inconsistency on the failure path, not a data-loss bug. + +**Evidence:** evidence/handler-resolution-error.md + +**Suggested Fix:** + +- **Option A:** Confirm with the endpoint-adapter author how a raising `_agent` coroutine is + rendered on the `/messages` SSE path. If it is not a clean stream `error` part, catch + `ToolResolutionError` for the stream branch and emit an error part. +- **Option B:** Add a handler-level test that drives `_agent(stream=True)` with a `resolve_tools` + stub that raises `ToolResolutionError`, asserting the documented surfacing shape. + +**Resolution:** the protocol requires pre-stream failures to remain JSON. The messages route +now preserves batch error responses before applying streaming negotiation. SDK routing and +service handler tests cover both layers. + +--- + +### F-004 [RESOLVED] – description default differs between code tools and gateway tools + +| Field | Value | +|---|---| +| Severity | low | +| Confidence | high | +| Status | resolved | +| Origin | scan | +| Lens | verification | +| Category | Consistency | +| Criteria | Consistency (3); general G-7 | +| Location | `sdks/python/agenta/sdk/agents/tool_resolution.py:103`; `services/oss/src/agent/tools.py:133` | +| Files | tool_resolution.py, tools.py | + +**Condition:** `code_tool_spec` sets `"description": tool.description or tool.name` +(`tool_resolution.py:103`) and `client_tool_spec` does the same (`:117`). The gateway path sets +`"description": spec.get("description")` (`tools.py:133`), which can be `None`. So a code/client +tool with no description ships its name; a gateway tool with no backend description ships `None`. + +**Cause:** The SDK builders default to the name; the gateway remap copies the backend field +verbatim. + +**Consequence:** Low. The TS runner already defaults `description ?? name` when building Pi/MCP +tools (`services/agent/src/engines/pi.ts:162`, `mcp-server.ts:67`), so a `None` description is +absorbed downstream. The only visible effect is a slightly different wire payload for the same +"no description" condition across kinds, which contradicts the "identical wire specs" framing for +the description field specifically. + +**Evidence:** evidence/description-default-inconsistency.md + +**Suggested Fix:** + +- **Option A:** In `_resolve_gateway`, default the description to the resolved name: + `"description": spec.get("description") or name`, matching the SDK builders. +- **Option B:** Leave as-is and document that `description` may be `None` on the gateway wire + spec, relying on the runner default. + +**Resolution:** gateway specs now use `description or name`. + +--- + +### F-005 [RESOLVED] – attach_orthogonal mutates its argument in place and also returns it + +| Field | Value | +|---|---| +| Severity | low | +| Confidence | high | +| Status | resolved | +| Origin | scan | +| Lens | verification | +| Category | Soundness, Consistency | +| Criteria | Soundness (4); sdk SK-3 | +| Location | `sdks/python/agenta/sdk/agents/tool_resolution.py:89-96` | +| Files | tool_resolution.py | + +**Condition:** `attach_orthogonal(entry, tool)` mutates `entry` in place (sets +`entry["needsApproval"]` / `entry["render"]`) and returns the same dict. It is a public helper +(imported by the service at `tools.py:36`). The mutate-and-return pattern is easy to misuse: a +caller that reuses the input `entry` after the call sees it mutated, and the return value invites +the false belief that the input is untouched. + +**Cause:** Shaped for the call sites that always do `return attach_orthogonal(entry, tool)` with +a fresh `entry`, where in-place mutation is harmless. + +**Consequence:** Low. All three current call sites pass a freshly-built `entry` and use the +return value, so there is no live bug. The risk is latent and the dual contract is slightly +inconsistent with a package whose other builders return fresh dicts. + +**Evidence:** evidence/attach-orthogonal-mutation.md + +**Suggested Fix:** + +- **Option A:** Make it pure — `return {**entry, **extras}`. Removes the latent footgun at no + real cost. +- **Option B:** Keep in-place mutation but drop the return value and name it + `_attach_orthogonal_inplace`, signalling the contract. + +**Resolution:** canonical spec construction uses immutable Pydantic models and +`model_copy`; the legacy helper returns a copied dictionary. + +--- + +### F-006 [RESOLVED] – app.py redundantly re-assigns resolved.mcp_servers + +| Field | Value | +|---|---| +| Severity | info | +| Confidence | high | +| Status | resolved | +| Origin | scan | +| Lens | verification | +| Category | Consistency, Complexity | +| Criteria | Complexity (5); general G-9 | +| Location | `services/oss/src/agent/app.py:91-92` | +| Files | app.py, tools.py | + +**Condition:** `_agent` does `resolved = await resolve_tools(...)` then immediately +`resolved.mcp_servers = await resolve_mcp_servers(...)`. `resolve_tools` already returns +`ResolvedTools` with `mcp_servers=[]` (documented to leave MCP empty), so the handler overwrites +a field that is empty by construction, relying on an implicit contract. + +**Cause:** MCP resolution is intentionally a separate call (different flag gate), so the handler +stitches the two results together. + +**Consequence:** None functionally. Minor readability cost and a coupling that would silently +double-resolve if `resolve_tools` ever started populating `mcp_servers`. + +**Evidence:** evidence/app-mcp-reassign.md + +**Suggested Fix:** + +- **Option A:** Provide a single `resolve_all(tools, mcp_servers)` in `tools.py` that returns a + fully-populated `ResolvedTools`, keeping the two flag gates internal. +- **Option B:** Leave as-is; add a one-line comment that `resolve_tools` never sets + `mcp_servers`, making the contract explicit at the call site. + +**Resolution:** `resolve_agent_resources` returns a complete `ResolvedAgentResources` value and +the handler performs no post-resolution mutation. + +## Open Findings + +None. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/metadata.json b/docs/design/agent-workflows/sdk-local-tools/review/metadata.json new file mode 100644 index 0000000000..e648a06cb3 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/metadata.json @@ -0,0 +1,40 @@ +{ + "review_id": "sdk-local-tools-tool-resolution-2026-06-19", + "created_at": "2026-06-19T00:00:00Z", + "updated_at": "2026-06-19T00:00:00Z", + "scope_ref": "scope.md", + "activity": "scan-codebase", + "depth": "deep", + "path": "docs/design/agent-workflows/sdk-local-tools/review/", + "branch": "gitbutler/workspace", + "rubrics": ["sdk", "services", "general", "security", "architecture"], + "reviewed_files": [ + "sdks/python/agenta/sdk/agents/tool_resolution.py", + "sdks/python/agenta/sdk/agents/errors.py", + "sdks/python/agenta/sdk/agents/__init__.py", + "services/oss/src/agent/tools.py", + "services/oss/src/agent/app.py", + "sdks/python/oss/tests/pytest/unit/agents/test_tool_resolution.py", + "services/oss/tests/pytest/unit/agent/test_tool_refs.py", + "services/oss/tests/pytest/unit/agent/test_invoke_handler.py", + "services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py" + ], + "files_in_scope": 9, + "files_reviewed": 9, + "coverage_pct": 100, + "finding_counts": { + "critical": 0, + "high": 0, + "medium": 3, + "low": 2, + "info": 1 + }, + "open_risks": 1, + "open_questions": 2, + "tests": { + "sdk_agents_unit": "118 passed", + "service_agent_unit_integration": "33 passed" + }, + "verdict": "pass with conditions", + "complete": true +} diff --git a/docs/design/agent-workflows/sdk-local-tools/review/plan.md b/docs/design/agent-workflows/sdk-local-tools/review/plan.md new file mode 100644 index 0000000000..609a0d4ea0 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/plan.md @@ -0,0 +1,45 @@ +# plan.md – Review Plan + +## Strategy + +Findings-driven scan (`scan-codebase`, `depth=deep`) over the sdk-local-tools tool-resolution +change set. Rubrics applied: `sdk.md` (the SDK resolver + its public surface), `services.md` +(the server-side orchestration in `tools.py` / `app.py`), `general.md` as the baseline pass, +with `security.md` and `architecture.md` at baseline depth (the change touches secrets and a +service->SDK boundary). All 10 universal criteria applied. + +The change set is ~420 lines of meaningful logic across 5 implementation files plus 4 test +files. It sits just under the 400-line split threshold, so it is reviewed as one pass grouped +by risk. + +## Pass order (highest risk first) + +| # | Pass | Files | Rubric mapping | Budget | +|---|---|---|---|---| +| 1 | Gateway network path + fail-fast | `services/oss/src/agent/tools.py` (`_resolve_gateway`, `resolve_tools`) | services SV-2/6/10, security S-14/15, general G-1/2/3 | done | +| 2 | Local resolver + secret scoping | `tool_resolution.py` (`LocalToolResolver`, `SecretResolver`, spec builders) | sdk SK-3/22/26/27, security S-14, general G-2/3/5/10 | done | +| 3 | Typed error contract | `errors.py` (`ToolResolutionError`) | sdk SK-7/17, general G-7 | done | +| 4 | Public surface + exports | `__init__.py` | sdk SK-7/15 | done | +| 5 | Handler integration | `app.py` (`_agent`, `_agent_stream`) | services SV-2, general G-3/5 | done | +| 6 | Test coverage adequacy | the 4 test files | general G-13..G-17, sdk SK-22 | done | + +## Prerequisite automated checks + +- Run service agent unit + integration suites: `cd services && uv run python -m pytest + oss/tests/pytest/unit/agent oss/tests/pytest/integration/agent -n0 -q` — ran, 33 passed. +- Run SDK agents unit suite: `cd sdks/python && uv run python -m pytest + oss/tests/pytest/unit/agents -n0 -q` — ran, 118 passed. +- Secret-logging scan over the changed + adjacent files — ran, no secret values logged. + +## Invariants to verify (from the design) + +1. Dependency direction is `service -> SDK` only; the SDK never imports the service. +2. builtin/code/client resolve locally; only gateway hits the network. +3. The service produces identical wire specs (kind/runtime/code/env/callRef/needsApproval/render). +4. MCP is a no-op unless `AGENTA_AGENT_ENABLE_MCP` is set. +5. Secret env is scoped per tool; unresolved secrets are omitted, not errored. +6. Gateway fail-fast paths raise a typed `ToolResolutionError`. + +## Deferrals + +None. The change set fit one pass within budget. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/progress.md b/docs/design/agent-workflows/sdk-local-tools/review/progress.md new file mode 100644 index 0000000000..8319d405e0 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/progress.md @@ -0,0 +1,37 @@ +# progress.md – Review Progress + +| Pass / file group | Status | Notes | +|---|---|---| +| Read harness (instructions, criteria, guidelines, deliverables, rubrics, templates) | done | sdk/services/general/security/architecture rubrics loaded | +| Read scan-codebase skill + findings schema + lifecycle | done | depth=deep, path given | +| `services/oss/src/agent/tools.py` | done | gateway path, fail-fast, local delegation, MCP gating | +| `sdks/python/agenta/sdk/agents/tool_resolution.py` | done | resolver, secret scoping, spec builders | +| `sdks/python/agenta/sdk/agents/errors.py` | done | `ToolResolutionError` typed context | +| `sdks/python/agenta/sdk/agents/__init__.py` | done | exports complete and consistent | +| `services/oss/src/agent/app.py` | done | handler integration, fail-fast propagation | +| `test_tool_resolution.py` (SDK) | done | offline split, secret omit, MCP gate | +| `test_tool_refs.py` (service) | done | service split, MCP wire, gateway ref shape | +| `test_invoke_handler.py` (service) | done | cross-harness identity through handler | +| `test_resolve_tools_http.py` (integration) | done | gateway HTTP mock, count/incomplete fail-fast | +| Context: dtos.py, tool_defs.py, secrets.py, client.py, protocol.ts | done | wire shape parity confirmed | +| Run service unit + integration suites | done | 33 passed | +| Run SDK agents unit suite | done | 118 passed | +| Secret-logging scan | done | names only, no values | +| Invariant verification (6) | done | all hold; one positional-ordering coupling noted | +| Promote findings | done | 6 findings, 1 risk | +| Synthesis (summary, scorecard) | done | verdict: pass with conditions | + +Review complete. No items deferred. No items out of scope skipped within the named set. + +## Remediation + +| Item | Status | +|---|---| +| F-001 through F-006 | resolved | +| R-001 | mitigated | +| Q-001 and Q-002 | resolved | +| SDK agent/routing tests | 146 passed | +| Service agent tests | 34 passed | +| API unit tests | 859 passed | +| TypeScript tool tests | 3 passed | +| Extension bundle | built | diff --git a/docs/design/agent-workflows/sdk-local-tools/review/questions.md b/docs/design/agent-workflows/sdk-local-tools/review/questions.md new file mode 100644 index 0000000000..3e6d34609f --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/questions.md @@ -0,0 +1,6 @@ +# questions.md – Resolved Questions + +| ID | Question | Context / Evidence | Provisional suggestions | Status | +|---|---|---|---|---| +| Q-001 | How is a `ToolResolutionError` raised by the `_agent` coroutine surfaced on the `/messages` SSE path versus the JSON `/invoke` path? | Failures before stream creation are protocol-level JSON errors. Failures after stream creation use SSE error parts. The routing layer now preserves the former. | Covered by routing and handler tests. | resolved | +| Q-002 | Does `/tools/resolve` guarantee response ordering matches request order, and can it echo a ref identity for key-based matching? | Ordering is no longer trusted. The existing resolved `call_ref` is normalized and used as the correlation key. | Covered by a reversed-response integration test. | resolved | diff --git a/docs/design/agent-workflows/sdk-local-tools/review/risks.md b/docs/design/agent-workflows/sdk-local-tools/review/risks.md new file mode 100644 index 0000000000..7be024966b --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/risks.md @@ -0,0 +1,9 @@ +# risks.md – Systemic Risks + +| ID | Category | Description | Likelihood | Impact | Evidence | Mitigation | Status | +|---|---|---|---|---|---|---|---| +| R-001 | Architecture | Gateway resolution formerly carried approval/render metadata by response position. | Low | Medium | `services/oss/src/agent/tools/gateway.py` | Resolved specs are now matched to configs by normalized `call_ref`; a reversed-response integration test pins the behavior. | mitigated | + +## Notes + +- No open architecture risk remains from this review. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/scope.md b/docs/design/agent-workflows/sdk-local-tools/review/scope.md new file mode 100644 index 0000000000..da85db0056 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/scope.md @@ -0,0 +1,82 @@ +# scope.md – Review Scope + +## Objectives + +- Verify the sdk-local-tools tool-resolution change set preserves its design invariants: + dependency direction `service -> SDK`, local resolution of builtin/code/client, network only + for gateway, identical wire specs across paths, MCP gated off by default, per-tool scoped + secret env with omit-on-missing, and typed `ToolResolutionError` on gateway fail-fast paths. +- Surface correctness, soundness, completeness, consistency, security, and testing findings in + the changed files, judged against the surrounding code they integrate with. +- Confirm the two gaps flagged by the prior conventions review are resolved (Pydantic + `ResolvedTools`, typed error) and that the tests lock the behaviour they claim to. + +## Codebase / branch + +| Field | Value | +|---|---| +| Repository | agenta (monorepo) | +| Branch / commit | `gitbutler/workspace` (working tree; scope files are new vs `main`) | +| Review type | combination: sdk + services + security (baseline) + architecture (baseline) | + +## Inclusions + +Implementation: + +- `sdks/python/agenta/sdk/agents/tool_resolution.py` +- `sdks/python/agenta/sdk/agents/errors.py` +- `sdks/python/agenta/sdk/agents/__init__.py` +- `services/oss/src/agent/tools.py` +- `services/oss/src/agent/app.py` + +Tests: + +- `sdks/python/oss/tests/pytest/unit/agents/test_tool_resolution.py` +- `services/oss/tests/pytest/unit/agent/test_tool_refs.py` +- `services/oss/tests/pytest/unit/agent/test_invoke_handler.py` +- `services/oss/tests/pytest/integration/agent/test_resolve_tools_http.py` + +## Exclusions + +Read for integration judgement only, not reviewed as scope: + +- `sdks/python/agenta/sdk/agents/dtos.py` (`SessionConfig`, `ToolCallback`), `tool_defs.py`, + `adapters/harnesses.py` +- `services/oss/src/agent/secrets.py`, `services/oss/src/agent/client.py` +- `services/agent/src/protocol.ts` and the TS runner tool builders (wire-shape consumers) +- All other unrelated workspace changes on the branch. + +## Constraints and assumptions + +- Scope files are new on this branch; there is no `main` baseline to diff against, so + "identical wire specs" is judged as cross-path identity (local vs server-side), which the + tests assert, plus parity against the TS `ResolvedToolSpec` contract. +- Review is verification-oriented (scan-codebase, `depth=deep`). Runtime validation is limited + to the two named pytest targets; no live LLM, runner, or backend was exercised. + +## Stakeholders + +| Role | Name / Team | +|---|---| +| Requestor | mahmoud@agenta.ai | +| Author(s) | agent-workflows contributors | +| Reviewer(s) | `agent` (scan-codebase, code-review harness) | +| Decision owner | mahmoud@agenta.ai | + +## Timeline + +| Milestone | Target date | +|---|---| +| Review start | 2026-06-19 | +| Final deliverables | 2026-06-19 | + +## Success criteria + +- Every in-scope file read in full and judged against the 10 universal criteria. +- All design invariants explicitly confirmed or flagged. +- Findings carry severity, confidence, real `file:line`, evidence, and ≥1 remediation. +- Both named test suites run and their pass/fail recorded. + +## Approval + +Requestor reviews `summary.md` + `scorecard.md` and closes the review. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/scorecard.md b/docs/design/agent-workflows/sdk-local-tools/review/scorecard.md new file mode 100644 index 0000000000..5afe196e7b --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/scorecard.md @@ -0,0 +1,30 @@ +# scorecard.md – Review Scorecard + +## Metrics + +| Metric | Value | Interpretation | +|---|---|---| +| Critical findings | 0 | No release-blocking defects. | +| High findings | 0 | No significant functional/security defects. | +| Medium findings | 3 | Two gateway-path gaps (test, observability) + one unspecified stream-failure surface. | +| Low findings | 2 | Cross-kind description default; mutate-and-return helper. | +| Info findings | 1 | Redundant `mcp_servers` re-assign. | +| Open risks | 1 | R-001 positional gateway carry-back (architecture/soundness). | +| Open questions | 2 | Q-001 stream error surface; Q-002 resolve ordering contract. | +| Files reviewed | 9 | 5 implementation + 4 tests. | +| Files in scope | 9 | | +| Review coverage | 100% | Every in-scope file read in full. | +| Tests run | 151 passed | SDK agents unit 118 + service unit/integration 33. | +| **Overall verdict** | **PASS WITH CONDITIONS** | Invariants hold; address the three mediums + R-001 before rollout. | + +## Verdict criteria + +| Verdict | Condition | +|---|---| +| Pass | No critical findings; <= 2 high findings with remediation plans in progress | +| Pass with conditions | No critical findings; > 2 high findings OR open questions blocking release | +| Fail | >= 1 critical finding OR scope not fully reviewed | + +Applied: zero critical and zero high, but two open questions (Q-001, Q-002) gate the disposition +of F-003 and R-001, and three medium findings warrant fixes before rollout. Verdict: pass with +conditions. diff --git a/docs/design/agent-workflows/sdk-local-tools/review/summary.md b/docs/design/agent-workflows/sdk-local-tools/review/summary.md new file mode 100644 index 0000000000..8f3eb763bd --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/review/summary.md @@ -0,0 +1,115 @@ +# summary.md – Review Summary + +## Overview + +**Reviewed:** the sdk-local-tools tool-resolution change set (see `scope.md`): the new SDK +`LocalToolResolver` / `SecretResolver` / `ResolvedTools` / spec builders +(`tool_resolution.py`), the typed `ToolResolutionError` (`errors.py`), the SDK exports +(`__init__.py`), the refactored server-side orchestration (`services/oss/src/agent/tools.py`), +the handler that consumes it (`app.py`), and the four accompanying test files. +**Goals:** verify the design invariants hold; surface correctness, soundness, completeness, +consistency, security, and testing findings; confirm the prior conventions review's two gaps are +closed. +**Constraints:** verification-oriented scan (`depth=deep`); runtime validation limited to the +two named pytest targets; no live LLM/runner/backend. +**Review type(s):** sdk + services + general baseline, with security and architecture at +baseline depth. +**Date range:** 2026-06-19 – 2026-06-19. + +--- + +## Health verdict + +> **PASS — CONDITIONS RESOLVED** + +All review findings and the architecture risk were addressed in the subsequent organization +refactor. The canonical implementation now lives in `agenta.sdk.agents.tools`, MCP is a sibling +subsystem, the service uses explicit gateway/vault adapters, and the runner file names describe +dispatch and callback responsibilities. + +Validation after the fixes: + +- SDK agent/routing tests: 146 passed. +- Service agent unit/integration tests: 34 passed. +- API unit tests: 859 passed. +- TypeScript tool dispatch/bridge/MCP tests: 3 passed. +- TypeScript extension bundle: built successfully. + +--- + +## Key findings + +### Critical and high severity + +- None. There are no critical findings and no high findings. + +### Medium severity + +- **F-001 resolved:** gateway approval/render metadata is tested, including a reordered + multi-tool response. +- **F-002 resolved:** HTTP failures are logged and normalized to + `GatewayToolResolutionError`. +- **F-003 resolved:** failures before stream creation remain JSON error envelopes with their + original HTTP status; routing and handler tests pin this contract. + +### Low / info severity + +- **F-004 resolved:** gateway descriptions default to the resolved name. +- **F-005 resolved:** canonical spec construction is immutable; the unshipped compatibility + helper and former flat modules were removed. +- **F-006 resolved:** `resolve_agent_resources` returns tools and MCP resources in one result. + +### Positive observations + +- All six invariants verified: service->SDK dependency direction (the SDK never imports the + service), local resolution for builtin/code/client with only gateway on the network, wire-spec + parity with the TS `ResolvedToolSpec` contract (kind/runtime/code/env/callRef/needsApproval/ + render), MCP gated off by `AGENTA_AGENT_ENABLE_MCP`, per-tool scoped secret env with + omit-on-missing, and typed `ToolResolutionError` on the gateway count/incomplete fail-fast + paths. +- The two prior conventions-review gaps are closed: `ResolvedToolSet` is a Pydantic model, and + the service raises a typed `ToolResolutionError(RuntimeError)` (so existing `except + RuntimeError` callers keep working) carrying structured `status` / `ref_count` / `spec_count`. +- Secrets are scoped per tool and never logged (only secret *names* appear in warnings); the + union-resolve-then-filter approach correctly prevents cross-tool env leakage. +- Test design is honest: the cross-harness handler test asserts both identical bodies *and* + divergent per-harness configs at the backend boundary, avoiding the tautology the comment calls + out; the integration suite covers the gateway count-mismatch and incomplete-spec fail-fast. +- Both suites pass: SDK agents unit 118 passed; service agent unit + integration 33 passed. + +--- + +## Key risks + +- **R-001 mitigated:** gateway responses are joined to configs by normalized `call_ref`, not + response position. + +--- + +## Open questions + +- **Q-001 answered:** pre-stream failures use JSON errors; only failures after streaming begins + become SSE error parts. +- **Q-002 answered:** ordering is no longer trusted; the existing `call_ref` is the correlation + identity. + +--- + +## Coverage and metrics + +| Metric | Value | +|---|---| +| Files in scope | 9 (5 implementation + 4 tests) | +| Files reviewed | 9 | +| Coverage | 100% | +| Critical findings | 0 | +| High findings | 0 | +| Medium findings | 3 | +| Low findings | 2 | +| Info findings | 1 | + +--- + +## Recommended next steps + +No review remediation remains. diff --git a/docs/design/agent-workflows/sdk-local-tools/status.md b/docs/design/agent-workflows/sdk-local-tools/status.md new file mode 100644 index 0000000000..5628574334 --- /dev/null +++ b/docs/design/agent-workflows/sdk-local-tools/status.md @@ -0,0 +1,104 @@ +# Status + +The source of truth for this effort. Update this page as the work moves. + +## Where this stands (2026-06-19) + +**Stage: organization refactor implemented and review findings resolved.** The SDK tool +runtime now uses `agenta.sdk.agents.tools`; MCP is a sibling `agenta.sdk.agents.mcp` +subsystem; the service owns HTTP/vault adapters under `services/oss/src/agent/tools/`. +Persisted discriminator values remain unchanged. The former flat `tool_defs.py` and +`tool_resolution.py` modules were removed because this API had not shipped; legacy persisted +configuration shapes are handled only at the explicit compatibility boundary in +`agents/tools/compat.py`. + +The prerequisite is also not done: `LocalBackend` is a stub that raises +(`sdks/python/agenta/sdk/agents/adapters/local.py:30`). The sibling effort owns it +([`../scratch/sdk-local-backend/status.md`](../scratch/sdk-local-backend/status.md)). + +## Organization review (2026-06-19) + +- Canonical declarations use `*Config`; runnable values use `*Spec`. +- Strict parsing and legacy compatibility are separate. +- Missing secrets, unsupported providers, duplicate names, and invalid gateway responses use + typed errors. +- API core reuses SDK-owned neutral tool config classes. +- Gateway metadata correlation uses `call_ref`, not response position. +- Pre-stream `/messages` failures preserve JSON error responses. +- TypeScript files are named `callback.ts` and `dispatch.ts`. + +## Behavior decisions retained (2026-06-19) + +The earlier review settled these product and dependency decisions. They remain useful, but +class names such as `LocalToolResolver` and `SecretResolver` below are historical placeholders; +the organization proposal recommends better public names. + +1. **Where resolution lives.** Both, behind one interface: an offline `LocalToolResolver` + (built-in, code, client, MCP later) and an opt-in `AgentaToolResolver` for gateway. + Gateway is the only kind that stays in the backend. Dependency direction `service -> SDK`. +2. **Code and MCP executors.** Code reuses the bundled in-process Pi engine. MCP is out of + scope for the first releases: wired but a no-op behind a feature flag that defaults to off, + for both Pi and Claude. +3. **Secrets.** Env by default, behind a pluggable `SecretResolver`. The vault is a later, + connected path that reads `custom_secret` rows by name once the consumer is built + ([open-issues.md](../open-issues.md)). Custom secrets are storage-only today by design. +4. **First slice.** `LocalBackend` (Pi) + built-in + code + env secrets, offline. + +## Validation + +- SDK agent/routing tests: 146 passed. +- Service agent tests: 34 passed. +- API unit tests: 859 passed. +- TypeScript tool tests: 3 passed. +- TypeScript extension bundle: built successfully. + +## Remaining prerequisite + +`LocalBackend` is still a separate prerequisite. End-to-end standalone execution remains +blocked on that sibling effort, not on tool organization or resolution. + +## Key files (the map for whoever continues) + +**SDK (where the new work lands):** +- `sdks/python/agenta/sdk/agents/tools/`: tool configuration, runtime specs, parsing, + compatibility conversion, resolution, wire serialization, and errors. +- `sdks/python/agenta/sdk/agents/mcp/`: the sibling MCP configuration and resolution domain. +- `sdks/python/agenta/sdk/agents/dtos.py`: `SessionConfig` and its tool fields (the delivery + contract, `:498`). +- `sdks/python/agenta/sdk/agents/adapters/harnesses.py`: the harness adapters that shape + already-resolved specs (`:65`). +- `sdks/python/agenta/sdk/agents/adapters/local.py`: `LocalBackend`, the stub + (`:24`, prerequisite). + +**Service (the reference resolver to share from, never to depend back on):** +- `services/oss/src/agent/tools/resolver.py`: service composition of SDK tool and MCP + resolvers. +- `services/oss/src/agent/tools/gateway.py`: Agenta gateway HTTP adapter. +- `services/oss/src/agent/tools/secrets.py`: named-secret HTTP adapter for tools and MCP. +- `services/oss/src/agent/secrets.py`: provider-key resolution for harness authentication. +- `services/oss/src/agent/app.py`: where the service builds `SessionConfig` (`:91`). + +**Runner (execution; already handles code/callback in-process):** +- `services/agent/src/engines/pi.ts`: the in-process Pi engine, `mcpTools: false` (`:58`), + branches on tool `kind` (`:144`). +- `services/agent/src/tools/code.ts`, `dispatch.ts`, `callback.ts`, `mcp-server.ts`: the + executors and callback adapter. + +**API (the future named-secret consumer):** +- `api/oss/src/apis/fastapi/vault/router.py`: CRUD only, no name-based resolve route yet + (lines 36 to 74). +- `docs/design/vault-named-secrets/context.md`: the named-secrets effort, + runtime-consumption out of scope this iteration (`:23`). + +## Known traps + +- **The vault named-secret consumer is not built.** Declared code/MCP custom secrets resolve + to `{}` today, server path included. This is the storage-only design this iteration, not a + bug (research.md, stage 3). The offline slice supplies secrets from env via `SecretResolver` + instead of the vault. +- **The in-process Pi engine has no MCP.** Do not assume bundling the Pi runner brings MCP; + it does not (`services/agent/src/engines/pi.ts:58`). +- **Client tools need a browser.** A headless local run cannot fulfil them; the in-process Pi + engine skips them (`services/agent/src/engines/pi.ts:165`). +- **Dependency direction.** The SDK must never call the service. Share spec-building logic by + moving it into the SDK, not by importing the service from the SDK. diff --git a/sdks/python/agenta/sdk/agents/__init__.py b/sdks/python/agenta/sdk/agents/__init__.py index 9df80c3fd9..19d0ac77b0 100644 --- a/sdks/python/agenta/sdk/agents/__init__.py +++ b/sdks/python/agenta/sdk/agents/__init__.py @@ -88,7 +88,11 @@ parse_tool_config, parse_tool_configs, ) -from .ui_messages import from_ui_messages, to_ui_message, ui_message_stream +from .adapters.vercel import ( + from_ui_messages, + to_ui_message, + ui_message_stream, +) __all__ = [ # DTOs @@ -107,7 +111,7 @@ "AgentEvent", "AgentResult", "AgentRun", - # UI message codec (the /messages egress adapter) + # Former flat Vercel adapter names (compatibility; new code uses adapters.vercel) "from_ui_messages", "to_ui_message", "ui_message_stream", diff --git a/sdks/python/agenta/sdk/agents/adapters/__init__.py b/sdks/python/agenta/sdk/agents/adapters/__init__.py index e8002be7af..30e555d82b 100644 --- a/sdks/python/agenta/sdk/agents/adapters/__init__.py +++ b/sdks/python/agenta/sdk/agents/adapters/__init__.py @@ -3,6 +3,7 @@ - Backend adapters: ``RivetBackend`` (rivet over ACP), ``InProcessPiBackend`` (in-process Pi, the reference backend), ``LocalBackend`` (standalone SDK runs; not yet implemented). - Harness adapters: ``PiHarness``, ``ClaudeHarness``, ``AgentaHarness`` (+ ``make_harness``). +- HTTP/browser protocol adapters live in subpackages, e.g. ``adapters.vercel``. Shared plumbing for the runner-backed adapters lives in ``agents/utils``. """ diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/__init__.py b/sdks/python/agenta/sdk/agents/adapters/vercel/__init__.py new file mode 100644 index 0000000000..8778b98fc4 --- /dev/null +++ b/sdks/python/agenta/sdk/agents/adapters/vercel/__init__.py @@ -0,0 +1,35 @@ +"""Vercel AI SDK adapters for the agent runtime. + +The neutral agent runtime speaks ``Message``, ``AgentEvent``, and ``AgentRun``. This package +is the browser protocol adapter: Vercel ``UIMessage`` request bodies, UI Message Stream parts, +SSE framing, and the ``/messages`` route helpers. +""" + +from .messages import ( + from_ui_messages, + message_to_vercel_ui_message, + to_ui_message, + vercel_ui_messages_to_messages, +) +from .routing import ( + inject_stream_session_id, + register_agent_message_routes, + resolve_session_id, +) +from .sse import VERCEL_UI_MESSAGE_STREAM_HEADERS, vercel_sse_stream +from .stream import agent_run_to_vercel_parts, ui_message_stream + +__all__ = [ + "vercel_ui_messages_to_messages", + "message_to_vercel_ui_message", + "agent_run_to_vercel_parts", + "VERCEL_UI_MESSAGE_STREAM_HEADERS", + "vercel_sse_stream", + "resolve_session_id", + "inject_stream_session_id", + "register_agent_message_routes", + # Former flat-module names. + "from_ui_messages", + "to_ui_message", + "ui_message_stream", +] diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/messages.py b/sdks/python/agenta/sdk/agents/adapters/vercel/messages.py new file mode 100644 index 0000000000..7f718b9032 --- /dev/null +++ b/sdks/python/agenta/sdk/agents/adapters/vercel/messages.py @@ -0,0 +1,219 @@ +"""Vercel ``UIMessage`` conversion at the agent HTTP edge. + +This adapter translates between the Vercel AI SDK ``UIMessage`` parts shape and the +neutral agent runtime ``Message`` / ``ContentBlock`` types. The neutral DTOs stay the port; +Vercel-specific part names live here. +""" + +from __future__ import annotations + +from typing import Any, Dict, List, Optional + +from ...dtos import AgentResult, ContentBlock, Message + +TOOL_APPROVAL_REQUEST = "tool-approval-request" +TOOL_APPROVAL_RESPONSE = "tool-approval-response" +TOOL_OUTPUT_AVAILABLE = "tool-output-available" + + +def vercel_ui_messages_to_messages(raw: Optional[List[Any]]) -> List[Message]: + """Coerce inbound Vercel ``UIMessage`` objects into neutral messages.""" + messages: List[Message] = [] + for item in raw or []: + message = _ui_message_to_message(item) + if message is not None: + messages.append(message) + return messages + + +def _ui_message_to_message(raw: Any) -> Optional[Message]: + if isinstance(raw, Message): + return raw + if not isinstance(raw, dict) or "role" not in raw: + return None + role = str(raw["role"]) + + parts = raw.get("parts") + if parts is None: + return Message.from_raw(raw) + + blocks: List[ContentBlock] = [] + for part in parts or []: + blocks.extend(_part_to_blocks(part)) + + if not blocks: + return Message(role=role, content="") + if all(block.type == "text" for block in blocks): + return Message(role=role, content="".join(block.text or "" for block in blocks)) + return Message(role=role, content=blocks) + + +def _part_to_blocks(part: Any) -> List[ContentBlock]: + if not isinstance(part, dict): + return [] + ptype = str(part.get("type", "")) + + if ptype == "text": + text = part.get("text") + return [ContentBlock(type="text", text=text)] if text is not None else [] + + if ptype == "file": + media = part.get("mediaType") or part.get("mimeType") + kind = ( + "image" + if isinstance(media, str) and media.startswith("image/") + else "resource" + ) + return [ + ContentBlock( + type=kind, + uri=part.get("url") or part.get("uri"), + data=part.get("data"), + mime_type=media, + ) + ] + + if ptype == TOOL_APPROVAL_REQUEST: + return [] + + if ptype == TOOL_APPROVAL_RESPONSE: + return _approval_response_blocks(part) + + if ( + ptype == TOOL_OUTPUT_AVAILABLE + or ptype == "dynamic-tool" + or ptype.startswith("tool-") + ): + return _tool_part_blocks(part, ptype) + + return [] + + +def _tool_part_blocks(part: Dict[str, Any], ptype: str) -> List[ContentBlock]: + """A Vercel tool part -> neutral tool-call/result content blocks.""" + tool_call_id = part.get("toolCallId") or part.get("tool_call_id") + tool_name = part.get("toolName") or part.get("tool_name") + if ( + tool_name is None + and ptype.startswith("tool-") + and ptype != TOOL_OUTPUT_AVAILABLE + ): + tool_name = ptype[len("tool-") :] + + blocks: List[ContentBlock] = [] + if ptype != TOOL_OUTPUT_AVAILABLE or "input" in part: + blocks.append( + ContentBlock( + type="tool_call", + tool_call_id=tool_call_id, + tool_name=tool_name, + input=part.get("input"), + ) + ) + + state = part.get("state") + error_text = part.get("errorText") + if error_text is not None or state == "output-error": + blocks.append( + ContentBlock( + type="tool_result", + tool_call_id=tool_call_id, + tool_name=tool_name, + output=error_text if error_text is not None else part.get("output"), + is_error=True, + ) + ) + elif "output" in part or state == "output-available": + blocks.append( + ContentBlock( + type="tool_result", + tool_call_id=tool_call_id, + tool_name=tool_name, + output=part.get("output"), + is_error=False, + ) + ) + return blocks + + +def _approval_response_blocks(part: Dict[str, Any]) -> List[ContentBlock]: + """A cross-turn approval reply -> a tool-result block keyed by toolCallId.""" + tool_call_id = ( + part.get("toolCallId") or part.get("tool_call_id") or part.get("approvalId") + ) + output = part.get("output") + if output is None: + approved = part.get("approved") + output = {"approved": approved} if approved is not None else part.get("reason") + return [ContentBlock(type="tool_result", tool_call_id=tool_call_id, output=output)] + + +def message_to_vercel_ui_message( + source: Any, + *, + message_id: str = "msg-1", +) -> Dict[str, Any]: + """Render an ``AgentResult`` or neutral ``Message`` as one Vercel ``UIMessage``.""" + if isinstance(source, AgentResult): + return { + "id": message_id, + "role": "assistant", + "parts": [{"type": "text", "text": source.output or ""}], + } + if isinstance(source, Message): + return { + "id": message_id, + "role": source.role, + "parts": _content_to_parts(source.content), + } + raise TypeError( + "message_to_vercel_ui_message expects an AgentResult or Message, " + f"got {type(source).__name__!r}" + ) + + +def _content_to_parts(content: Any) -> List[Dict[str, Any]]: + if isinstance(content, str): + return [{"type": "text", "text": content}] if content else [] + parts: List[Dict[str, Any]] = [] + for block in content or []: + parts.extend(_block_to_parts(block)) + return parts + + +def _block_to_parts(block: ContentBlock) -> List[Dict[str, Any]]: + if block.type == "text": + return [{"type": "text", "text": block.text or ""}] + if block.type in ("image", "resource"): + part: Dict[str, Any] = {"type": "file"} + if block.uri is not None: + part["url"] = block.uri + if block.mime_type is not None: + part["mediaType"] = block.mime_type + if block.data is not None: + part["data"] = block.data + return [part] + if block.type == "tool_call": + return [ + { + "type": f"tool-{block.tool_name or 'tool'}", + "toolCallId": block.tool_call_id, + "state": "input-available", + "input": block.input, + } + ] + if block.type == "tool_result": + return [ + { + "type": f"tool-{block.tool_name or 'tool'}", + "toolCallId": block.tool_call_id, + "state": "output-error" if block.is_error else "output-available", + "output": block.output, + } + ] + return [] + + +# Back-compat aliases for the former flat module API. +from_ui_messages = vercel_ui_messages_to_messages +to_ui_message = message_to_vercel_ui_message diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/routing.py b/sdks/python/agenta/sdk/agents/adapters/vercel/routing.py new file mode 100644 index 0000000000..99e16cef58 --- /dev/null +++ b/sdks/python/agenta/sdk/agents/adapters/vercel/routing.py @@ -0,0 +1,175 @@ +"""FastAPI route wiring for the agent ``/messages`` Vercel adapter.""" + +from __future__ import annotations + +import re +from typing import Any, Callable, Collection, Optional +from uuid import uuid4 + +from fastapi import Request +from fastapi.responses import JSONResponse, Response + +from agenta.sdk.contexts.tracing import tracing_context_manager +from agenta.sdk.models.workflows import ( + LoadSessionRequest, + WorkflowBatchResponse, + WorkflowInvokeRequest, + WorkflowRequestData, + WorkflowStreamingResponse, +) + +from .messages import vercel_ui_messages_to_messages + +# An opaque, project-scoped session id (RFC §4.1): bounded length, restricted charset. +_SESSION_ID_RE = re.compile(r"^[A-Za-z0-9._:-]{1,128}$") + + +def resolve_session_id(session_id: Optional[str]) -> Optional[str]: + """Mint a new id when absent, echo a valid one, or return ``None`` when invalid.""" + if session_id is None: + return "sess_" + uuid4().hex + return session_id if _SESSION_ID_RE.match(session_id) else None + + +def inject_stream_session_id( + response: WorkflowStreamingResponse, + session_id: str, +) -> None: + """Stamp ``messageMetadata.sessionId`` onto the first Vercel ``start`` part.""" + original = response.generator + + async def generator(): + stamped = False + async for part in original(): + if not stamped and isinstance(part, dict) and part.get("type") == "start": + part.setdefault("messageMetadata", {})["sessionId"] = session_id + stamped = True + yield part + + response.generator = generator + + +def make_messages_endpoint( + *, + wf: Any, + get_request_tracing_context: Callable[[Request], Any], + parse_accept: Callable[[Request], Optional[str]], + stream_media_types: Collection[str], + make_json_response: Callable[[WorkflowBatchResponse], Response], + make_not_acceptable_response: Callable[[str, Any], Response], + make_stream_response: Callable[[WorkflowStreamingResponse, str], Response], + handle_failure: Callable[[Exception], Any], +): + """Build the ``POST /messages`` endpoint for one routed agent workflow.""" + + async def messages_endpoint(req: Request, request: WorkflowInvokeRequest): + credentials = req.state.auth.get("credentials") + + session_id = resolve_session_id(request.session_id) + if session_id is None: + return JSONResponse( + status_code=400, + content={"detail": "session_id violates the allowed charset/length"}, + ) + + try: + request.session_id = session_id + if request.data is None: + request.data = WorkflowRequestData() + + request.data.messages = [ + message.to_wire() + for message in vercel_ui_messages_to_messages(request.data.messages) + ] + + requested = parse_accept(req) + want_stream = requested in stream_media_types + request.data.stream = want_stream + + with tracing_context_manager(get_request_tracing_context(req)): + response = await wf.invoke( + request=request, + secrets=None, + credentials=credentials, + ) + + if isinstance(response, (WorkflowBatchResponse, WorkflowStreamingResponse)): + response.session_id = session_id + + if ( + isinstance(response, WorkflowBatchResponse) + and response.status + and response.status.code is not None + and response.status.code >= 400 + ): + return make_json_response(response) + + if want_stream: + if not isinstance(response, WorkflowStreamingResponse): + return make_not_acceptable_response(str(requested), response) + inject_stream_session_id(response, session_id) + return make_stream_response(response, "vercel") + + if not isinstance(response, WorkflowBatchResponse): + return make_not_acceptable_response( + requested or "application/json", response + ) + return make_json_response(response) + + except Exception as exception: + return await handle_failure(exception) + + return messages_endpoint + + +def make_load_session_endpoint(): + """Build the v1 ``POST /load-session`` stub endpoint.""" + + async def load_session_endpoint(req: Request, request: LoadSessionRequest): + return JSONResponse( + content={"session_id": request.session_id, "messages": []}, + ) + + return load_session_endpoint + + +def register_agent_message_routes( + target: Any, + prefix: str, + *, + wf: Any, + invoke_responses: dict, + get_request_tracing_context: Callable[[Request], Any], + parse_accept: Callable[[Request], Optional[str]], + stream_media_types: Collection[str], + make_json_response: Callable[[WorkflowBatchResponse], Response], + make_not_acceptable_response: Callable[[str, Any], Response], + make_stream_response: Callable[[WorkflowStreamingResponse, str], Response], + handle_failure: Callable[[Exception], Any], +) -> None: + """Register ``/messages`` and ``/load-session`` on a FastAPI app/router target.""" + target.add_api_route( + prefix + "/messages", + make_messages_endpoint( + wf=wf, + get_request_tracing_context=get_request_tracing_context, + parse_accept=parse_accept, + stream_media_types=stream_media_types, + make_json_response=make_json_response, + make_not_acceptable_response=make_not_acceptable_response, + make_stream_response=make_stream_response, + handle_failure=handle_failure, + ), + methods=["POST"], + responses=invoke_responses, + ) + target.add_api_route( + prefix + "/load-session", + make_load_session_endpoint(), + methods=["POST"], + ) + + +# Back-compat aliases for the former routing-private helper names. +_resolve_session_id = resolve_session_id +_inject_stream_session_id = inject_stream_session_id diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/sse.py b/sdks/python/agenta/sdk/agents/adapters/vercel/sse.py new file mode 100644 index 0000000000..942d8e03d6 --- /dev/null +++ b/sdks/python/agenta/sdk/agents/adapters/vercel/sse.py @@ -0,0 +1,30 @@ +"""SSE framing for the Vercel AI SDK UI Message Stream.""" + +from __future__ import annotations + +from json import dumps +from typing import Any, AsyncGenerator + +# Headers the Vercel AI SDK client and intermediaries require for a UI Message Stream. +# ``x-accel-buffering: no`` stops a proxy from re-buffering the SSE so parts flush live. +VERCEL_UI_MESSAGE_STREAM_HEADERS = { + "x-vercel-ai-ui-message-stream": "v1", + "cache-control": "no-cache", + "x-accel-buffering": "no", +} + + +def vercel_sse_stream(aiter: AsyncGenerator[Any, None]): + """Frame Vercel UI Message Stream parts as SSE and append ``[DONE]``.""" + + async def gen(): + async for chunk in aiter: + yield "data: " + dumps(chunk, ensure_ascii=False) + "\n\n" + yield "data: [DONE]\n\n" + + return gen() + + +# Back-compat aliases for the former routing-private names. +_VERCEL_UI_MESSAGE_STREAM_HEADERS = VERCEL_UI_MESSAGE_STREAM_HEADERS +_vercel_sse_stream = vercel_sse_stream diff --git a/sdks/python/agenta/sdk/agents/adapters/vercel/stream.py b/sdks/python/agenta/sdk/agents/adapters/vercel/stream.py new file mode 100644 index 0000000000..6d0e1526b2 --- /dev/null +++ b/sdks/python/agenta/sdk/agents/adapters/vercel/stream.py @@ -0,0 +1,216 @@ +"""Encode neutral agent run events as Vercel UI Message Stream parts.""" + +from __future__ import annotations + +from typing import Any, AsyncIterator, Dict, Optional + +from ...dtos import AgentResult +from ...streaming import AgentRun +from .messages import TOOL_APPROVAL_REQUEST + + +async def agent_run_to_vercel_parts( + run: AgentRun, + *, + session_id: Optional[str] = None, + message_id: str = "msg-1", + trace_id: Optional[str] = None, +) -> AsyncIterator[Dict[str, Any]]: + """Project a live ``AgentRun`` into Vercel UI Message Stream part dictionaries.""" + start: Dict[str, Any] = {"type": "start", "messageId": message_id} + if session_id is not None: + start["messageMetadata"] = {"sessionId": session_id} + yield start + yield {"type": "start-step"} + + text_seq = 0 + reasoning_seq = 0 + usage: Optional[Dict[str, Any]] = None + stop_reason: Optional[str] = None + + try: + async for event in run: + etype = event.type + data = event.data + + if etype == "message": + text_seq += 1 + tid = f"text-{text_seq}" + yield {"type": "text-start", "id": tid} + yield {"type": "text-delta", "id": tid, "delta": data.get("text", "")} + yield {"type": "text-end", "id": tid} + elif etype == "message_start": + yield {"type": "text-start", "id": data.get("id")} + elif etype == "message_delta": + yield { + "type": "text-delta", + "id": data.get("id"), + "delta": data.get("delta", ""), + } + elif etype == "message_end": + yield {"type": "text-end", "id": data.get("id")} + elif etype == "thought": + reasoning_seq += 1 + rid = f"reasoning-{reasoning_seq}" + yield {"type": "reasoning-start", "id": rid} + yield { + "type": "reasoning-delta", + "id": rid, + "delta": data.get("text", ""), + } + yield {"type": "reasoning-end", "id": rid} + elif etype == "reasoning_start": + yield {"type": "reasoning-start", "id": data.get("id")} + elif etype == "reasoning_delta": + yield { + "type": "reasoning-delta", + "id": data.get("id"), + "delta": data.get("delta", ""), + } + elif etype == "reasoning_end": + yield {"type": "reasoning-end", "id": data.get("id")} + elif etype == "tool_call": + tool_call_id = data.get("id") + tool_name = data.get("name") + yield { + "type": "tool-input-start", + "toolCallId": tool_call_id, + "toolName": tool_name, + } + available: Dict[str, Any] = { + "type": "tool-input-available", + "toolCallId": tool_call_id, + "toolName": tool_name, + "input": data.get("input"), + } + if data.get("render") is not None: + available["render"] = data["render"] + yield available + elif etype == "tool_result": + tool_call_id = data.get("id") + if data.get("denied"): + yield { + "type": "tool-output-denied", + "toolCallId": tool_call_id, + } + elif data.get("isError"): + yield { + "type": "tool-output-error", + "toolCallId": tool_call_id, + "errorText": _as_text(data.get("output")), + } + else: + structured = data.get("data") + out = structured if structured is not None else data.get("output") + available = { + "type": "tool-output-available", + "toolCallId": tool_call_id, + "output": out, + } + if data.get("render") is not None: + available["render"] = data["render"] + yield available + elif etype == "interaction_request": + yield _interaction_part(data) + elif etype == "data": + part: Dict[str, Any] = { + "type": f"data-{data.get('name', 'data')}", + "data": data.get("data"), + } + if data.get("transient"): + part["transient"] = True + yield part + elif etype == "file": + yield { + "type": "file", + "url": data.get("url"), + "mediaType": data.get("mediaType"), + } + elif etype == "usage": + usage = _usage_metadata(data) + elif etype == "error": + yield {"type": "error", "errorText": data.get("message", "")} + elif etype == "done": + stop_reason = data.get("stopReason") + except Exception as exc: + yield {"type": "error", "errorText": str(exc)} + return + + if usage is None or trace_id is None: + result = _safe_result(run) + if result is not None: + if usage is None: + usage = _usage_metadata(result.usage or {}) + if stop_reason is None: + stop_reason = result.stop_reason + if trace_id is None: + trace_id = result.trace_id + + yield {"type": "finish-step"} + finish: Dict[str, Any] = {"type": "finish"} + if stop_reason is not None: + finish["finishReason"] = stop_reason + metadata: Dict[str, Any] = {} + if usage: + metadata["usage"] = usage + if trace_id is not None: + metadata["traceId"] = trace_id + if metadata: + finish["messageMetadata"] = metadata + yield finish + + +def _interaction_part(data: Dict[str, Any]) -> Dict[str, Any]: + """Project a neutral ``interaction_request`` event to a Vercel stream part.""" + kind = data.get("kind") + payload = data.get("payload") or {} + if kind == "permission": + return { + "type": TOOL_APPROVAL_REQUEST, + "approvalId": data.get("id"), + "toolCallId": _approval_tool_call_id(payload), + "availableReplies": payload.get("availableReplies"), + "toolCall": payload.get("toolCall"), + } + if kind == "input": + return {"type": "data-input-request", "id": data.get("id"), "data": payload} + return { + "type": "data-interaction", + "id": data.get("id"), + "data": {"kind": kind, "payload": payload}, + } + + +def _approval_tool_call_id(payload: Dict[str, Any]) -> Optional[Any]: + tool_call_id = payload.get("toolCallId") + if tool_call_id is not None: + return tool_call_id + tool_call = payload.get("toolCall") + if isinstance(tool_call, dict): + return tool_call.get("id") or tool_call.get("toolCallId") + return None + + +def _usage_metadata(data: Dict[str, Any]) -> Dict[str, Any]: + return { + key: data[key] + for key in ("input", "output", "total", "cost") + if data.get(key) is not None + } + + +def _as_text(value: Any) -> str: + if value is None: + return "" + return value if isinstance(value, str) else str(value) + + +def _safe_result(run: AgentRun) -> Optional[AgentResult]: + try: + return run.result() + except Exception: + return None + + +# Back-compat alias for the former flat module API. +ui_message_stream = agent_run_to_vercel_parts diff --git a/sdks/python/agenta/sdk/agents/ui_messages.py b/sdks/python/agenta/sdk/agents/ui_messages.py index 41517245aa..1e19af0348 100644 --- a/sdks/python/agenta/sdk/agents/ui_messages.py +++ b/sdks/python/agenta/sdk/agents/ui_messages.py @@ -1,491 +1,24 @@ -"""UI message codec: translate between the Vercel AI SDK ``UIMessage`` wire shape and the -neutral agent runtime types (``Message`` / ``AgentEvent`` / ``AgentResult``). +"""Compatibility imports for the Vercel UI Message adapter. -This is the ``/messages`` egress adapter, the parts-aware sibling of -:func:`agenta.sdk.agents.to_messages` (which only understands the ``/invoke`` ``{role, -content}`` shape). The neutral types in ``dtos.py`` stay the port; this module is one more -adapter behind that seam, so the ``/run`` runner wire (``utils/wire.py``, ``{role, content}``) -is unchanged — the Vercel shape lives only at the HTTP edge. - -Three directions: - -- :func:`from_ui_messages` — inbound ``UIMessage[]`` -> ``List[Message]``. Text and file parts - fold into content blocks; tool and approval parts are PRESERVED as structured ``tool_call`` / - ``tool_result`` :class:`~agenta.sdk.agents.dtos.ContentBlock`s (never flattened to text), so a - cross-turn human-in-the-loop reply replays as a real tool turn and the model resumes from the - result. The runner's message transcript renders these blocks into the cold replay. -- :func:`to_ui_message` — outbound ``AgentResult`` / ``Message`` -> one ``UIMessage`` dict, for - the ``load-session`` history. -- :func:`ui_message_stream` — the streaming edge: a live - :class:`~agenta.sdk.agents.streaming.AgentRun` -> Vercel UI Message Stream parts - (``start`` ... ``finish``). The SSE framing and the terminal ``data: [DONE]`` are added by the - routing layer (``_vercel_sse_stream``); this generator yields the part dicts only. +New code should import from :mod:`agenta.sdk.agents.adapters.vercel`. """ from __future__ import annotations -from typing import Any, AsyncIterator, Dict, List, Optional - -from .dtos import AgentResult, ContentBlock, Message -from .streaming import AgentRun - -# Inbound UIMessage part type names handled specially (the rest of ``tool-*`` is a tool call). -_TOOL_APPROVAL_REQUEST = "tool-approval-request" -_TOOL_APPROVAL_RESPONSE = "tool-approval-response" -_TOOL_OUTPUT_AVAILABLE = "tool-output-available" - - -# --------------------------------------------------------------------------- -# Inbound: UIMessage[] -> List[Message] -# --------------------------------------------------------------------------- - - -def from_ui_messages(raw: Optional[List[Any]]) -> List[Message]: - """Coerce inbound Vercel ``UIMessage`` objects into neutral :class:`Message` objects. - - The parts-aware sibling of :func:`agenta.sdk.agents.to_messages`. Tool and approval parts - are preserved as structured tool-call / tool-result content blocks (never dropped), so a - cross-turn human-in-the-loop reply resumes the pending interaction on the next turn. - """ - messages: List[Message] = [] - for item in raw or []: - message = _ui_message_to_message(item) - if message is not None: - messages.append(message) - return messages - - -def _ui_message_to_message(raw: Any) -> Optional[Message]: - if isinstance(raw, Message): - return raw - if not isinstance(raw, dict) or "role" not in raw: - return None - role = str(raw["role"]) - - parts = raw.get("parts") - if parts is None: - # Not a parts-based UIMessage — fall back to the {role, content} shape so a mixed - # history still parses. - return Message.from_raw(raw) - - blocks: List[ContentBlock] = [] - for part in parts or []: - blocks.extend(_part_to_blocks(part)) - - if not blocks: - return Message(role=role, content="") - # Collapse an all-text message to a plain string (the shape the runner replays); keep the - # block list when any structured (file / tool) content is present. - if all(block.type == "text" for block in blocks): - return Message(role=role, content="".join(block.text or "" for block in blocks)) - return Message(role=role, content=blocks) - - -def _part_to_blocks(part: Any) -> List[ContentBlock]: - if not isinstance(part, dict): - return [] - ptype = str(part.get("type", "")) - - if ptype == "text": - text = part.get("text") - return [ContentBlock(type="text", text=text)] if text is not None else [] - - if ptype == "file": - media = part.get("mediaType") or part.get("mimeType") - kind = ( - "image" - if isinstance(media, str) and media.startswith("image/") - else "resource" - ) - return [ - ContentBlock( - type=kind, - uri=part.get("url") or part.get("uri"), - data=part.get("data"), - mime_type=media, - ) - ] - - if ptype == _TOOL_APPROVAL_REQUEST: - # The server's own request, echoed back in history; regenerated on replay, not input. - return [] - - if ptype == _TOOL_APPROVAL_RESPONSE: - return _approval_response_blocks(part) - - if ( - ptype == _TOOL_OUTPUT_AVAILABLE - or ptype == "dynamic-tool" - or ptype.startswith("tool-") - ): - return _tool_part_blocks(part, ptype) - - # reasoning / step-start / data-* parts are the assistant's own output or transient UI; - # they are not model input on replay, so they are dropped. - return [] - - -def _tool_part_blocks(part: Dict[str, Any], ptype: str) -> List[ContentBlock]: - """A Vercel tool part -> a ``tool_call`` block plus, when resolved, a ``tool_result``. - - Field names match what the runner's transcript renders: ``toolCallId`` / ``toolName`` / - ``input`` / ``output`` / ``isError`` (via :meth:`ContentBlock.to_wire`). - """ - tool_call_id = part.get("toolCallId") or part.get("tool_call_id") - tool_name = part.get("toolName") or part.get("tool_name") - if ( - tool_name is None - and ptype.startswith("tool-") - and ptype != _TOOL_OUTPUT_AVAILABLE - ): - tool_name = ptype[len("tool-") :] - - blocks: List[ContentBlock] = [] - - # The call itself (a bare tool-output-available part carries only a result). - if ptype != _TOOL_OUTPUT_AVAILABLE or "input" in part: - blocks.append( - ContentBlock( - type="tool_call", - tool_call_id=tool_call_id, - tool_name=tool_name, - input=part.get("input"), - ) - ) - - state = part.get("state") - error_text = part.get("errorText") - if error_text is not None or state == "output-error": - blocks.append( - ContentBlock( - type="tool_result", - tool_call_id=tool_call_id, - tool_name=tool_name, - output=error_text if error_text is not None else part.get("output"), - is_error=True, - ) - ) - elif "output" in part or state == "output-available": - blocks.append( - ContentBlock( - type="tool_result", - tool_call_id=tool_call_id, - tool_name=tool_name, - output=part.get("output"), - is_error=False, - ) - ) - return blocks - - -def _approval_response_blocks(part: Dict[str, Any]) -> List[ContentBlock]: - """A cross-turn ``tool-approval-response`` reply -> a ``tool_result`` keyed by toolCallId, - so the runtime matches the pending interaction and resumes (the resolve step is joint).""" - tool_call_id = ( - part.get("toolCallId") or part.get("tool_call_id") or part.get("approvalId") - ) - output = part.get("output") - if output is None: - approved = part.get("approved") - output = {"approved": approved} if approved is not None else part.get("reason") - return [ContentBlock(type="tool_result", tool_call_id=tool_call_id, output=output)] - - -# --------------------------------------------------------------------------- -# Outbound (batch): AgentResult / Message -> one UIMessage dict -# --------------------------------------------------------------------------- - - -def to_ui_message(source: Any, *, message_id: str = "msg-1") -> Dict[str, Any]: - """Render an :class:`AgentResult` or :class:`Message` as one Vercel ``UIMessage`` dict - (the shape ``load-session`` returns and ``useChat`` takes as its initial messages).""" - if isinstance(source, AgentResult): - return { - "id": message_id, - "role": "assistant", - "parts": [{"type": "text", "text": source.output or ""}], - } - if isinstance(source, Message): - return { - "id": message_id, - "role": source.role, - "parts": _content_to_parts(source.content), - } - raise TypeError( - f"to_ui_message expects an AgentResult or Message, got {type(source).__name__!r}" - ) - - -def _content_to_parts(content: Any) -> List[Dict[str, Any]]: - if isinstance(content, str): - return [{"type": "text", "text": content}] if content else [] - parts: List[Dict[str, Any]] = [] - for block in content or []: - parts.extend(_block_to_parts(block)) - return parts - - -def _block_to_parts(block: ContentBlock) -> List[Dict[str, Any]]: - if block.type == "text": - return [{"type": "text", "text": block.text or ""}] - if block.type in ("image", "resource"): - part: Dict[str, Any] = {"type": "file"} - if block.uri is not None: - part["url"] = block.uri - if block.mime_type is not None: - part["mediaType"] = block.mime_type - if block.data is not None: - part["data"] = block.data - return [part] - if block.type == "tool_call": - return [ - { - "type": f"tool-{block.tool_name or 'tool'}", - "toolCallId": block.tool_call_id, - "state": "input-available", - "input": block.input, - } - ] - if block.type == "tool_result": - return [ - { - "type": f"tool-{block.tool_name or 'tool'}", - "toolCallId": block.tool_call_id, - "state": "output-error" if block.is_error else "output-available", - "output": block.output, - } - ] - return [] - - -# --------------------------------------------------------------------------- -# Streaming edge: a live AgentRun -> Vercel UI Message Stream parts -# --------------------------------------------------------------------------- - - -async def ui_message_stream( - run: AgentRun, - *, - session_id: Optional[str] = None, - message_id: str = "msg-1", - trace_id: Optional[str] = None, -) -> AsyncIterator[Dict[str, Any]]: - """Encode a live :class:`AgentRun` as Vercel UI Message Stream part dicts. - - Consumes the run's live ``AgentEvent`` stream and yields parts as they arrive: ``start`` - (carrying ``messageMetadata.sessionId``) first, then the body, then ``finish`` (carrying - ``messageMetadata.traceId`` so the client can open the run's OTel trace, RFC §6.1). The SSE - framing and the terminal ``data: [DONE]`` are added by the routing layer - (``_vercel_sse_stream``). On a terminal run failure the run raises while iterating; that is - surfaced as an ``error`` part (RFC §8.2) and the stream ends without a ``finish``. - """ - start: Dict[str, Any] = {"type": "start", "messageId": message_id} - if session_id is not None: - start["messageMetadata"] = {"sessionId": session_id} - yield start - yield {"type": "start-step"} - - text_seq = 0 - reasoning_seq = 0 - usage: Optional[Dict[str, Any]] = None - stop_reason: Optional[str] = None - - try: - async for event in run: - etype = event.type - data = event.data - - if etype == "message": - text_seq += 1 - tid = f"text-{text_seq}" - yield {"type": "text-start", "id": tid} - yield {"type": "text-delta", "id": tid, "delta": data.get("text", "")} - yield {"type": "text-end", "id": tid} - elif etype == "message_start": - yield {"type": "text-start", "id": data.get("id")} - elif etype == "message_delta": - yield { - "type": "text-delta", - "id": data.get("id"), - "delta": data.get("delta", ""), - } - elif etype == "message_end": - yield {"type": "text-end", "id": data.get("id")} - elif etype == "thought": - reasoning_seq += 1 - rid = f"reasoning-{reasoning_seq}" - yield {"type": "reasoning-start", "id": rid} - yield { - "type": "reasoning-delta", - "id": rid, - "delta": data.get("text", ""), - } - yield {"type": "reasoning-end", "id": rid} - elif etype == "reasoning_start": - yield {"type": "reasoning-start", "id": data.get("id")} - elif etype == "reasoning_delta": - yield { - "type": "reasoning-delta", - "id": data.get("id"), - "delta": data.get("delta", ""), - } - elif etype == "reasoning_end": - yield {"type": "reasoning-end", "id": data.get("id")} - elif etype == "tool_call": - tool_call_id = data.get("id") - tool_name = data.get("name") - yield { - "type": "tool-input-start", - "toolCallId": tool_call_id, - "toolName": tool_name, - } - available: Dict[str, Any] = { - "type": "tool-input-available", - "toolCallId": tool_call_id, - "toolName": tool_name, - "input": data.get("input"), - } - if data.get("render") is not None: - available["render"] = data["render"] - yield available - elif etype == "tool_result": - tool_call_id = data.get("id") - if data.get("denied"): - # A human denied the tool, so it never ran (RFC: emit tool-output-denied - # instead of tool-output-available; the FE renders the output-denied state). - yield { - "type": "tool-output-denied", - "toolCallId": tool_call_id, - } - elif data.get("isError"): - yield { - "type": "tool-output-error", - "toolCallId": tool_call_id, - "errorText": _as_text(data.get("output")), - } - else: - # Prefer the structured object (generative UI); fall back to the text form. - structured = data.get("data") - out = structured if structured is not None else data.get("output") - available = { - "type": "tool-output-available", - "toolCallId": tool_call_id, - "output": out, - } - if data.get("render") is not None: - available["render"] = data["render"] - yield available - elif etype == "interaction_request": - yield _interaction_part(data) - elif etype == "data": - part: Dict[str, Any] = { - "type": f"data-{data.get('name', 'data')}", - "data": data.get("data"), - } - if data.get("transient"): - part["transient"] = True - yield part - elif etype == "file": - yield { - "type": "file", - "url": data.get("url"), - "mediaType": data.get("mediaType"), - } - elif etype == "usage": - usage = _usage_metadata(data) - elif etype == "error": - yield {"type": "error", "errorText": data.get("message", "")} - elif etype == "done": - stop_reason = data.get("stopReason") - # unknown event types are ignored - except Exception as exc: # AgentRun raises on a terminal ok=false result - yield {"type": "error", "errorText": str(exc)} - return - - # Pull usage and the trace id from the terminal result when not already known, the same - # fallback both lean on (RFC §6.1: the finish trace id matches the JSON response's). - if usage is None or trace_id is None: - result = _safe_result(run) - if result is not None: - if usage is None: - usage = _usage_metadata(result.usage or {}) - if stop_reason is None: - stop_reason = result.stop_reason - if trace_id is None: - trace_id = result.trace_id - - yield {"type": "finish-step"} - finish: Dict[str, Any] = {"type": "finish"} - if stop_reason is not None: - finish["finishReason"] = stop_reason - # usage and traceId coexist under messageMetadata; the client reads message.metadata.traceId. - metadata: Dict[str, Any] = {} - if usage: - metadata["usage"] = usage - if trace_id is not None: - metadata["traceId"] = trace_id - if metadata: - finish["messageMetadata"] = metadata - yield finish - - -def _interaction_part(data: Dict[str, Any]) -> Dict[str, Any]: - """Project an ``interaction_request`` event to a Vercel part. Permission -> an approval - request; input -> a forward-looking input part; any other kind (e.g. ``client_tool``) -> - a generic interaction part so it is surfaced, not dropped (the resolve step is joint).""" - kind = data.get("kind") - payload = data.get("payload") or {} - if kind == "permission": - return { - "type": _TOOL_APPROVAL_REQUEST, - "approvalId": data.get("id"), - # REQUIRED alongside approvalId (RFC / AI SDK chunk): the gated tool's call id, so - # the FE binds the approval to its existing tool part and the inbound - # tool-approval-response (keyed by toolCallId) correlates back for the cross-turn - # resume. Prefer the top-level toolCallId the runner emits; fall back to the nested - # ACP toolCall detail (id / toolCallId). - "toolCallId": _approval_tool_call_id(payload), - "availableReplies": payload.get("availableReplies"), - "toolCall": payload.get("toolCall"), - } - if kind == "input": - return {"type": "data-input-request", "id": data.get("id"), "data": payload} - return { - "type": "data-interaction", - "id": data.get("id"), - "data": {"kind": kind, "payload": payload}, - } - - -def _approval_tool_call_id(payload: Dict[str, Any]) -> Optional[Any]: - """The gated tool's call id for a ``tool-approval-request``. The runner stamps a top-level - ``toolCallId`` on the permission payload; if it is absent, dig it out of the nested ACP - ``toolCall`` detail (``id`` / ``toolCallId``).""" - tool_call_id = payload.get("toolCallId") - if tool_call_id is not None: - return tool_call_id - tool_call = payload.get("toolCall") - if isinstance(tool_call, dict): - return tool_call.get("id") or tool_call.get("toolCallId") - return None - - -def _usage_metadata(data: Dict[str, Any]) -> Dict[str, Any]: - return { - key: data[key] - for key in ("input", "output", "total", "cost") - if data.get(key) is not None - } - - -def _as_text(value: Any) -> str: - if value is None: - return "" - return value if isinstance(value, str) else str(value) - - -def _safe_result(run: AgentRun) -> Optional[AgentResult]: - try: - return run.result() - except Exception: # result not available (stream not fully consumed / failed) - return None +from .adapters.vercel import ( + agent_run_to_vercel_parts, + from_ui_messages, + message_to_vercel_ui_message, + to_ui_message, + ui_message_stream, + vercel_ui_messages_to_messages, +) + +__all__ = [ + "vercel_ui_messages_to_messages", + "message_to_vercel_ui_message", + "agent_run_to_vercel_parts", + "from_ui_messages", + "to_ui_message", + "ui_message_stream", +] diff --git a/sdks/python/agenta/sdk/decorators/routing.py b/sdks/python/agenta/sdk/decorators/routing.py index 4a57846d6e..8b2f1d6136 100644 --- a/sdks/python/agenta/sdk/decorators/routing.py +++ b/sdks/python/agenta/sdk/decorators/routing.py @@ -20,6 +20,15 @@ WorkflowBaseResponse, WorkflowServiceResponseData, ) +from agenta.sdk.agents.adapters.vercel.routing import ( + inject_stream_session_id, + register_agent_message_routes, + resolve_session_id, +) +from agenta.sdk.agents.adapters.vercel.sse import ( + VERCEL_UI_MESSAGE_STREAM_HEADERS as _VERCEL_UI_MESSAGE_STREAM_HEADERS, + vercel_sse_stream as _vercel_sse_stream, +) from agenta.sdk.middlewares.routing.cors import CORSMiddleware from agenta.sdk.middlewares.routing.auth import AuthMiddleware from agenta.sdk.middlewares.routing.otel import OTelMiddleware @@ -29,12 +38,25 @@ from agenta.sdk.engines.running.errors import ErrorStatus +def _resolve_session_id(session_id: Optional[str]) -> Optional[str]: + """Compatibility wrapper for the Vercel adapter helper.""" + return resolve_session_id(session_id) + + +def _inject_stream_session_id( + response: WorkflowStreamingResponse, + session_id: str, +) -> None: + """Compatibility wrapper for the Vercel adapter helper.""" + inject_stream_session_id(response, session_id) + + # --------------------------------------------------------------------------- # Reserved path segments — may not appear anywhere in a route path. # These names are used by the per-route namespace triple itself. # --------------------------------------------------------------------------- -_RESERVED_PATHS = {"invoke", "inspect"} +_RESERVED_PATHS = {"invoke", "inspect", "messages", "load-session"} def _validate_path(path: str) -> None: @@ -195,15 +217,27 @@ def _make_stream_response( ) -> StreamingResponse: aiter = response.iterator() - if wire_format == "sse": - media_type = "text/event-stream" - res = StreamingResponse(_sse_stream(aiter), media_type=media_type) + if wire_format == "vercel": + # The Vercel UI Message Stream: SSE framing terminated by `data: [DONE]`, plus the + # headers the AI SDK client and proxies require. Endpoint-selected (the agent + # `/messages` route passes "vercel"), not derived from Accept — a Vercel UI message + # stream and a plain SSE stream share the `text/event-stream` media type, so the + # choice cannot come from the Accept header alone. + res = StreamingResponse( + _vercel_sse_stream(aiter), media_type="text/event-stream" + ) + for key, value in _VERCEL_UI_MESSAGE_STREAM_HEADERS.items(): + res.headers.setdefault(key, value) + elif wire_format == "sse": + res = StreamingResponse(_sse_stream(aiter), media_type="text/event-stream") elif wire_format == "ndjson": - media_type = "application/x-ndjson" - res = StreamingResponse(_ndjson_stream(aiter), media_type=media_type) + res = StreamingResponse( + _ndjson_stream(aiter), media_type="application/x-ndjson" + ) else: - media_type = "application/x-ndjson" - res = StreamingResponse(_ndjson_stream(aiter), media_type=media_type) + res = StreamingResponse( + _ndjson_stream(aiter), media_type="application/x-ndjson" + ) return _set_common_headers(res, response) # type: ignore @@ -451,6 +485,10 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest): except Exception as exception: return await handle_inspect_failure(exception) + # Agent-only endpoints are Vercel/browser-protocol adapters. Keep their request + # folding, session id handling, and UI Message Stream details out of the generic + # workflow route decorator. + invoke_responses: dict = { 200: { "description": "Negotiated response — format determined by Accept header", @@ -489,6 +527,25 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest): }, } + agent_enabled = bool(self.flags and self.flags.get("is_agent")) + + def _add_agent_routes(target: Any, prefix: str) -> None: + """Register the agent-only /messages + /load-session routes on a target + (sub-app / router / mount root), mirroring how /invoke + /inspect are added.""" + register_agent_message_routes( + target, + prefix, + wf=wf, + invoke_responses=invoke_responses, + get_request_tracing_context=_get_request_tracing_context, + parse_accept=_parse_accept, + stream_media_types=STREAM_MEDIA_TYPES, + make_json_response=_make_json_response, + make_not_acceptable_response=_make_not_acceptable_response, + make_stream_response=_make_stream_response, + handle_failure=handle_invoke_failure, + ) + # ------------------------------------------------------------------ # Legacy path: router= was provided. # Registers prefixed routes on the APIRouter without isolation. @@ -506,6 +563,8 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest): methods=["POST"], response_model=WorkflowInvokeRequest, ) + if agent_enabled: + _add_agent_routes(self.router_fallback, self.path) return foo # ------------------------------------------------------------------ @@ -528,6 +587,8 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest): methods=["POST"], response_model=WorkflowInvokeRequest, ) + if agent_enabled: + _add_agent_routes(self.mount_root, "") return foo @@ -545,6 +606,8 @@ async def inspect_endpoint(req: Request, request: WorkflowInspectRequest): methods=["POST"], response_model=WorkflowInvokeRequest, ) + if agent_enabled: + _add_agent_routes(sub_app, "") self.mount_root.mount(self.path, sub_app) diff --git a/sdks/python/agenta/sdk/models/workflows.py b/sdks/python/agenta/sdk/models/workflows.py index a9437342df..0cb751e9dc 100644 --- a/sdks/python/agenta/sdk/models/workflows.py +++ b/sdks/python/agenta/sdk/models/workflows.py @@ -79,6 +79,7 @@ class WorkflowFlags(BaseModel): # interface-derived ## schema is_chat: bool = False + is_agent: bool = False ## hook has_url: bool = False ## code @@ -106,6 +107,7 @@ class WorkflowQueryFlags(BaseModel): # interface-derived ## schema is_chat: Optional[bool] = None + is_agent: Optional[bool] = None ## hook has_url: Optional[bool] = None ## code @@ -209,6 +211,15 @@ class WorkflowRequestData(BaseModel): # testcase: Optional[dict] = None inputs: Optional[dict] = None + # The agent ``/messages`` egress lifts the conversation out of ``inputs`` to this + # first-class member, in the Vercel ``UIMessage`` shape; ``/invoke`` ignores it. + messages: Optional[list] = None + # Transport mode for the agent ``/messages`` route: the endpoint sets this from the Accept + # negotiation so the shared agent handler streams (returns an async generator) instead of + # returning a batch dict. A sibling of ``messages`` / ``inputs`` / ``parameters`` on purpose + # — it must not live in ``parameters``, where it would leak into agent config / revision + # state / trace inputs. ``/invoke`` leaves it unset (batch). + stream: Optional[bool] = None # trace: Optional[dict] = None outputs: Optional[Any] = None @@ -233,6 +244,10 @@ class WorkflowBaseRequest(Metadata): secrets: Optional[Dict[str, Any]] = None credentials: Optional[str] = None + # The agent ``/messages`` session this turn belongs to (opaque, project-scoped). Optional; + # absent on ``/invoke`` and on the first turn of a server-minted session. + session_id: Optional[str] = None + @model_validator(mode="before") def _coerce_nested_models(cls, values: Dict[str, Any]) -> Dict[str, Any]: if "references" in values and isinstance(values["references"], dict): @@ -291,6 +306,10 @@ class WorkflowBaseResponse(TraceID, SpanID): status: Optional[WorkflowServiceStatus] = WorkflowServiceStatus() + # The resolved agent session id (minted or echoed) on the ``/messages`` response, alongside + # ``trace_id`` / ``span_id``. ``None`` for plain ``/invoke`` responses. + session_id: Optional[str] = None + # back-compat alias WorkflowServiceBaseResponse = WorkflowBaseResponse @@ -324,6 +343,20 @@ async def iterator(self): ] +class LoadSessionRequest(BaseModel): + """``POST /load-session`` body. The session id is required (RFC §7.1).""" + + session_id: str + + +class LoadSessionResponse(BaseModel): + """``POST /load-session`` response: a session's history as Vercel ``UIMessage`` objects, + the shape ``useChat`` takes as its initial ``messages``.""" + + session_id: str + messages: List[Dict[str, Any]] = Field(default_factory=list) + + # aliases ---------------------------------------------------------------------- diff --git a/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py b/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py index aaf9afec10..f7cce7d31c 100644 --- a/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py +++ b/sdks/python/oss/tests/pytest/unit/agents/test_ui_messages.py @@ -1,12 +1,13 @@ -"""Tests for the UI message codec (``agenta.sdk.agents.ui_messages``), the ``/messages`` -egress adapter between the Vercel ``UIMessage`` shape and the neutral runtime types. +"""Tests for the Vercel UI message adapter, the ``/messages`` egress adapter between the +Vercel ``UIMessage`` shape and the neutral runtime types. Three directions: -- ``from_ui_messages`` — inbound parts -> ``Message``; tool/approval parts are preserved as - structured ``tool_call`` / ``tool_result`` content blocks (the HITL reply channel). -- ``to_ui_message`` — outbound ``AgentResult`` / ``Message`` -> one ``UIMessage`` dict. -- ``ui_message_stream`` — a live ``AgentRun`` -> Vercel UI Message Stream parts. +- ``vercel_ui_messages_to_messages`` — inbound parts -> ``Message``; tool/approval parts are + preserved as structured ``tool_call`` / ``tool_result`` content blocks. +- ``message_to_vercel_ui_message`` — outbound ``AgentResult`` / ``Message`` -> one + ``UIMessage`` dict. +- ``agent_run_to_vercel_parts`` — a live ``AgentRun`` -> Vercel UI Message Stream parts. The stream tests fabricate an ``AgentRun`` from a fixed record list (the same trick ``test_streaming.py`` uses), so they are pure and need no backend. @@ -17,10 +18,10 @@ from typing import Any, Dict, List from agenta.sdk.agents import AgentRun, AgentResult, Message -from agenta.sdk.agents.ui_messages import ( - from_ui_messages, - to_ui_message, - ui_message_stream, +from agenta.sdk.agents.adapters.vercel import ( + agent_run_to_vercel_parts, + message_to_vercel_ui_message, + vercel_ui_messages_to_messages, ) @@ -37,17 +38,17 @@ def _run(events: List[Dict[str, Any]], result: Dict[str, Any]) -> AgentRun: async def _collect(run: AgentRun, **kwargs) -> List[Dict[str, Any]]: - return [part async for part in ui_message_stream(run, **kwargs)] + return [part async for part in agent_run_to_vercel_parts(run, **kwargs)] # --------------------------------------------------------------------------- -# from_ui_messages +# vercel_ui_messages_to_messages # --------------------------------------------------------------------------- class TestFromUIMessages: def test_all_text_message_collapses_to_string(self): - msgs = from_ui_messages( + msgs = vercel_ui_messages_to_messages( [{"id": "m1", "role": "user", "parts": [{"type": "text", "text": "hi"}]}] ) assert len(msgs) == 1 @@ -55,7 +56,7 @@ def test_all_text_message_collapses_to_string(self): assert msgs[0].content == "hi" def test_file_part_becomes_image_or_resource_block(self): - msgs = from_ui_messages( + msgs = vercel_ui_messages_to_messages( [ { "id": "m1", @@ -75,7 +76,7 @@ def test_file_part_becomes_image_or_resource_block(self): def test_tool_part_is_preserved_as_structured_blocks(self): # A resolved tool part -> a tool_call block plus a tool_result block, keyed by # toolCallId, with the field names the runner transcript renders. - msgs = from_ui_messages( + msgs = vercel_ui_messages_to_messages( [ { "id": "m2", @@ -110,7 +111,7 @@ def test_tool_part_is_preserved_as_structured_blocks(self): ] def test_tool_error_part_sets_is_error(self): - msgs = from_ui_messages( + msgs = vercel_ui_messages_to_messages( [ { "id": "m2", @@ -134,7 +135,7 @@ def test_tool_error_part_sets_is_error(self): def test_approval_response_becomes_tool_result_keyed_by_call_id(self): # The cross-turn HITL reply: a tool_result keyed by toolCallId so the runtime resumes. - msgs = from_ui_messages( + msgs = vercel_ui_messages_to_messages( [ { "id": "m3", @@ -156,7 +157,7 @@ def test_approval_response_becomes_tool_result_keyed_by_call_id(self): def test_approval_request_part_is_dropped_on_replay(self): # The server's own request, echoed back; regenerated on replay, not model input. - msgs = from_ui_messages( + msgs = vercel_ui_messages_to_messages( [ { "id": "m4", @@ -172,18 +173,18 @@ def test_approval_request_part_is_dropped_on_replay(self): def test_plain_role_content_message_still_parses(self): # A non-parts {role, content} message in a mixed history falls back cleanly. - msgs = from_ui_messages([{"role": "user", "content": "hello"}]) + msgs = vercel_ui_messages_to_messages([{"role": "user", "content": "hello"}]) assert msgs[0].content == "hello" # --------------------------------------------------------------------------- -# to_ui_message +# message_to_vercel_ui_message # --------------------------------------------------------------------------- class TestToUIMessage: def test_agent_result_becomes_assistant_text_message(self): - ui = to_ui_message(AgentResult(output="Paris."), message_id="m9") + ui = message_to_vercel_ui_message(AgentResult(output="Paris."), message_id="m9") assert ui == { "id": "m9", "role": "assistant", @@ -204,14 +205,14 @@ def test_message_with_tool_blocks_round_trips_to_parts(self): ), ], ) - ui = to_ui_message(msg) + ui = message_to_vercel_ui_message(msg) assert ui["role"] == "assistant" assert ui["parts"][0]["type"] == "tool-getWeather" assert ui["parts"][0]["toolCallId"] == "c1" # --------------------------------------------------------------------------- -# ui_message_stream +# agent_run_to_vercel_parts # --------------------------------------------------------------------------- @@ -421,7 +422,7 @@ async def test_terminal_failure_emits_error_part_and_no_finish(self): {"kind": "result", "result": {"ok": False, "error": "kaboom"}}, ] run = AgentRun(_from_list(records)) - parts = [part async for part in ui_message_stream(run, session_id="s1")] + parts = [part async for part in agent_run_to_vercel_parts(run, session_id="s1")] types = [p["type"] for p in parts] assert types[0] == "start" assert "finish" not in types diff --git a/sdks/python/oss/tests/pytest/utils/test_messages_endpoint.py b/sdks/python/oss/tests/pytest/utils/test_messages_endpoint.py new file mode 100644 index 0000000000..e2c6cdfe71 --- /dev/null +++ b/sdks/python/oss/tests/pytest/utils/test_messages_endpoint.py @@ -0,0 +1,226 @@ +"""Tests for the agent ``/messages`` + ``/load-session`` endpoints. + +Two layers: + +- Direct unit tests of the two pure Vercel routing helpers (``resolve_session_id``, + ``inject_stream_session_id``). +- HTTP tests over a Starlette ``TestClient`` driving the real ``route(flags={"is_agent": + True})`` wiring with a fake agent handler (no harness/runner). Registering on a bare + ``FastAPI`` app keeps the auth middleware out; a stand-in sets ``request.state.auth``. The + offline tracing mock (mirroring ``test_negotiation_integration``) lets ``wf.invoke`` run + without ``ag.init()``. +""" + +from unittest.mock import MagicMock, patch + +import pytest +from fastapi import FastAPI +from fastapi.testclient import TestClient + +from agenta.sdk.agents.adapters.vercel.routing import ( + inject_stream_session_id, + resolve_session_id, +) +from agenta.sdk.decorators.routing import route +from agenta.sdk.models.workflows import ( + WorkflowBatchResponse, + WorkflowServiceStatus, + WorkflowStreamingResponse, +) + + +# --------------------------------------------------------------------------- +# Pure helpers +# --------------------------------------------------------------------------- + + +def test_resolve_session_id_mints_echoes_and_validates(): + assert resolve_session_id("sess_ok") == "sess_ok" + assert resolve_session_id(None).startswith("sess_") + assert resolve_session_id("bad id!") is None # space + '!' are out of charset + assert resolve_session_id("x" * 200) is None # over the length bound + + +@pytest.mark.asyncio +async def test_inject_stream_session_id_stamps_first_start_part(): + async def base(): + yield {"type": "start", "messageId": "m1"} + yield {"type": "text-delta", "id": "t1", "delta": "x"} + + resp = WorkflowStreamingResponse(generator=base) + inject_stream_session_id(resp, "sess_z") + + parts = [p async for p in resp.iterator()] + assert parts[0]["messageMetadata"]["sessionId"] == "sess_z" + assert parts[1] == {"type": "text-delta", "id": "t1", "delta": "x"} + + +# --------------------------------------------------------------------------- +# HTTP wiring +# --------------------------------------------------------------------------- + + +_UI_MESSAGE = {"role": "user", "parts": [{"type": "text", "text": "hello"}]} + + +def _build_client() -> TestClient: + app = FastAPI() + + # Stand in for AuthMiddleware (omitted by using a bare app): the endpoints read + # ``request.state.auth``. No credentials needed — the fake handler runs locally. + @app.middleware("http") + async def _fake_auth(request, call_next): + request.state.auth = {} + return await call_next(request) + + @route("/", app=app, flags={"is_agent": True}) + async def agent(messages=None, inputs=None, parameters=None, stream=None): + if stream: + + async def gen(): + yield {"type": "start", "messageId": "m1"} + yield {"type": "text-start", "id": "t1"} + yield {"type": "text-delta", "id": "t1", "delta": "hi"} + yield {"type": "text-end", "id": "t1"} + yield {"type": "finish"} + + return gen() + return {"role": "assistant", "content": "hi", "echoed": messages} + + return TestClient(app) + + +def _build_failing_client() -> TestClient: + app = FastAPI() + + @app.middleware("http") + async def _fake_auth(request, call_next): + request.state.auth = {} + return await call_next(request) + + @route("/", app=app, flags={"is_agent": True}) + async def failing_agent(messages=None, inputs=None, parameters=None, stream=None): + return WorkflowBatchResponse( + status=WorkflowServiceStatus( + code=500, + message="tool resolution failed before stream", + type="https://agenta.ai/docs/errors#v1:sdk:tool-resolution-error", + ) + ) + + return TestClient(app) + + +@pytest.fixture() +def client(): + """A TestClient with the offline tracing mock active so ``wf.invoke`` runs without + ``ag.init()`` (same approach as ``test_negotiation_integration``).""" + with ( + patch("agenta.sdk.decorators.tracing.ag") as mock_ag, + patch("agenta.sdk.decorators.running.ag") as mock_run_ag, + ): + mock_span = MagicMock() + mock_span.is_recording.return_value = False + mock_span.get_span_context.return_value = MagicMock(trace_id=0, span_id=0) + mock_ag.tracing = MagicMock() + mock_ag.tracing.get_current_span.return_value = mock_span + mock_ag.tracing.redact = None + mock_tracer = MagicMock() + mock_tracer.start_as_current_span.return_value.__enter__ = MagicMock( + return_value=mock_span + ) + mock_tracer.start_as_current_span.return_value.__exit__ = MagicMock( + return_value=None + ) + mock_ag.tracer = mock_tracer + mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE = MagicMock() + mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.api_key = None + yield _build_client() + + +def test_messages_json_mints_session_and_folds_conversation(client): + res = client.post("/messages", json={"data": {"messages": [_UI_MESSAGE]}}) + assert res.status_code == 200 + body = res.json() + assert body["session_id"].startswith("sess_") + assert body["data"]["outputs"]["content"] == "hi" + # The Vercel UIMessage was folded to a neutral {role, content} message for the handler. + assert body["data"]["outputs"]["echoed"] == [{"role": "user", "content": "hello"}] + + +def test_messages_echoes_supplied_session_id(client): + res = client.post( + "/messages", + json={"session_id": "sess_keep", "data": {"messages": [_UI_MESSAGE]}}, + ) + assert res.status_code == 200 + assert res.json()["session_id"] == "sess_keep" + + +def test_messages_sse_streams_with_done_and_session_in_start(client): + res = client.post( + "/messages", + headers={"accept": "text/event-stream"}, + json={"session_id": "sess_abc", "data": {"messages": [_UI_MESSAGE]}}, + ) + assert res.status_code == 200 + assert res.headers["x-vercel-ai-ui-message-stream"] == "v1" + text = res.text + assert '"sessionId": "sess_abc"' in text # stamped onto the start part + assert '"type": "text-delta"' in text + assert "data: [DONE]" in text + + +def test_messages_sse_preserves_json_error_before_stream(): + with ( + patch("agenta.sdk.decorators.tracing.ag") as mock_ag, + patch("agenta.sdk.decorators.running.ag") as mock_run_ag, + ): + mock_span = MagicMock() + mock_span.is_recording.return_value = False + mock_span.get_span_context.return_value = MagicMock(trace_id=0, span_id=0) + mock_ag.tracing = MagicMock() + mock_ag.tracing.get_current_span.return_value = mock_span + mock_ag.tracing.redact = None + mock_tracer = MagicMock() + mock_tracer.start_as_current_span.return_value.__enter__ = MagicMock( + return_value=mock_span + ) + mock_tracer.start_as_current_span.return_value.__exit__ = MagicMock( + return_value=None + ) + mock_ag.tracer = mock_tracer + mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE = MagicMock() + mock_run_ag.DEFAULT_AGENTA_SINGLETON_INSTANCE.api_key = None + client = _build_failing_client() + + response = client.post( + "/messages", + headers={"accept": "text/event-stream"}, + json={ + "session_id": "sess_error", + "data": {"messages": [_UI_MESSAGE]}, + }, + ) + + assert response.status_code == 500 + assert response.headers["content-type"].startswith("application/json") + assert "x-vercel-ai-ui-message-stream" not in response.headers + body = response.json() + assert body["status"]["code"] == 500 + assert "tool resolution failed before stream" in body["status"]["message"] + assert body["session_id"] == "sess_error" + assert "[DONE]" not in response.text + + +def test_messages_rejects_invalid_session_id(client): + res = client.post( + "/messages", json={"session_id": "bad id!", "data": {"messages": []}} + ) + assert res.status_code == 400 + + +def test_load_session_returns_stub_history(client): + res = client.post("/load-session", json={"session_id": "sess_abc"}) + assert res.status_code == 200 + assert res.json() == {"session_id": "sess_abc", "messages": []} diff --git a/sdks/python/oss/tests/pytest/utils/test_routing.py b/sdks/python/oss/tests/pytest/utils/test_routing.py index 0bed6e7922..a1851e5907 100644 --- a/sdks/python/oss/tests/pytest/utils/test_routing.py +++ b/sdks/python/oss/tests/pytest/utils/test_routing.py @@ -7,6 +7,8 @@ 3. router= param — issues DeprecationWarning, falls back to prefixed registration """ +import asyncio +import json import warnings import pytest @@ -15,10 +17,13 @@ from agenta.sdk.decorators.routing import ( _RESERVED_PATHS, + _make_stream_response, _validate_path, create_app, route, ) +from agenta.sdk.agents.adapters.vercel.sse import vercel_sse_stream +from agenta.sdk.models.workflows import WorkflowStreamingResponse # --------------------------------------------------------------------------- @@ -127,6 +132,49 @@ async def foo(): assert "/foo/invoke" not in parent_schema.get("paths", {}) +# --------------------------------------------------------------------------- +# 2b. Agent-only endpoints (/messages + /load-session), gated on is_agent +# --------------------------------------------------------------------------- + + +class TestAgentEndpoints: + def test_is_agent_sub_app_has_messages_and_load_session(self): + app = create_app() + + @route("/chat", app=app, flags={"is_agent": True}) + async def chat(): + return {"role": "assistant", "content": "hi"} + + schema = _mounts(app)["/chat"].app.openapi() + assert "/messages" in schema["paths"] + assert "/load-session" in schema["paths"] + assert "/invoke" in schema["paths"] # the base routes are still present + + def test_non_agent_route_has_no_agent_endpoints(self): + app = create_app() + + @route("/qa", app=app) + async def qa(): + return "answer" + + schema = _mounts(app)["/qa"].app.openapi() + assert "/messages" not in schema["paths"] + assert "/load-session" not in schema["paths"] + + def test_root_agent_route_registers_on_mount_root(self): + # The agent app uses route("/", app=app, flags={"is_agent": True}); the endpoints + # land on the app itself, not a mounted sub-app. + app = create_app() + + @route("/", app=app, flags={"is_agent": True}) + async def agent(): + return {"role": "assistant", "content": "hi"} + + schema = app.openapi() + assert "/messages" in schema["paths"] + assert "/load-session" in schema["paths"] + + # --------------------------------------------------------------------------- # 3. router= deprecation warning # --------------------------------------------------------------------------- @@ -179,3 +227,76 @@ async def noisy(): mounts_after = set(_mount_paths(default_app)) # No new mounts should have appeared on default_app assert mounts_after == mounts_before + + +# --------------------------------------------------------------------------- +# 4. Reserved agent paths (/messages, /load-session) +# --------------------------------------------------------------------------- + + +class TestReservedAgentPaths: + def test_agent_endpoint_names_are_reserved(self): + assert {"messages", "load-session"} <= _RESERVED_PATHS + + @pytest.mark.parametrize("reserved", ["messages", "load-session"]) + def test_route_rejects_reserved_agent_path(self, reserved): + with pytest.raises(ValueError, match=reserved): + route(f"/{reserved}") + + +# --------------------------------------------------------------------------- +# 5. Vercel UI Message Stream framing +# --------------------------------------------------------------------------- + + +async def _collect(aiter): + return [chunk async for chunk in aiter] + + +def _sse_payload(chunk: str) -> str: + """The JSON body of one `data: \\n\\n` SSE event.""" + assert chunk.startswith("data: ") and chunk.endswith("\n\n") + return chunk[len("data: ") : -2] + + +class TestVercelUIMessageStream: + def test_framing_wraps_each_part_and_appends_done(self): + async def parts(): + yield {"type": "start", "messageMetadata": {"sessionId": "sess_1"}} + yield {"type": "text-delta", "id": "t1", "delta": "hi"} + yield {"type": "finish"} + + chunks = asyncio.run(_collect(vercel_sse_stream(parts()))) + + # one SSE event per part, plus the terminal [DONE] + assert len(chunks) == 4 + assert json.loads(_sse_payload(chunks[0])) == { + "type": "start", + "messageMetadata": {"sessionId": "sess_1"}, + } + assert json.loads(_sse_payload(chunks[1])) == { + "type": "text-delta", + "id": "t1", + "delta": "hi", + } + assert chunks[-1] == "data: [DONE]\n\n" + + def test_done_is_emitted_for_an_empty_stream(self): + async def parts(): + return + yield # pragma: no cover — makes this an async generator + + chunks = asyncio.run(_collect(vercel_sse_stream(parts()))) + assert chunks == ["data: [DONE]\n\n"] + + def test_make_stream_response_vercel_sets_headers_and_media_type(self): + async def parts(): + yield {"type": "start"} + + response = WorkflowStreamingResponse(generator=lambda: parts()) + res = _make_stream_response(response, "vercel") + + assert res.media_type == "text/event-stream" + assert res.headers["x-vercel-ai-ui-message-stream"] == "v1" + assert res.headers["cache-control"] == "no-cache" + assert res.headers["x-accel-buffering"] == "no" diff --git a/services/agent/README.md b/services/agent/README.md index 8c8e8f949d..4a96328479 100644 --- a/services/agent/README.md +++ b/services/agent/README.md @@ -84,7 +84,7 @@ live workflow span. Tools are resolved in the Python backend and arrive on the request as `customTools` plus a `toolCallback`. Delivery is capability-routed: the Pi extension registers them natively; other harnesses get them over MCP through `tools/mcp-bridge.ts` + `tools/mcp-server.ts`. -Either way each call POSTs back to Agenta's `/tools/call` (`tools/client.ts`), so the +Either way each call POSTs back to Agenta's `/tools/call` (`tools/callback.ts`), so the provider key and connection auth stay server-side. ## The extension bundle diff --git a/services/agent/src/engines/pi.ts b/services/agent/src/engines/pi.ts index 91b63e0a24..cd1ce93505 100644 --- a/services/agent/src/engines/pi.ts +++ b/services/agent/src/engines/pi.ts @@ -43,7 +43,8 @@ import { type ToolCallbackContext, resolvePromptText, } from "../protocol.ts"; -import { EMPTY_OBJECT_SCHEMA, callAgentaTool } from "../tools/client.ts"; +import { EMPTY_OBJECT_SCHEMA } from "../tools/callback.ts"; +import { runResolvedTool } from "../tools/dispatch.ts"; /** What the in-process Pi engine supports. Static (no daemon to probe, unlike rivet). */ const PI_CAPABILITIES: HarnessCapabilities = { @@ -140,43 +141,63 @@ function lastStopReason(messages: any[]): string | undefined { } /** - * Turn resolved tool specs into Pi customTools. Each tool's `execute` does one POST back - * through Agenta's /tools/call, so Pi runs the loop while the Composio key and connection - * auth stay server-side. A failed call throws, which Pi turns into a tool-error result - * (the loop continues) rather than a run failure. + * Turn resolved tool specs into Pi customTools, branching on the executor `kind`: + * - `callback` (default): `execute` POSTs back through Agenta's /tools/call, so the Composio + * key and connection auth stay server-side. + * - `code`: `execute` runs the snippet in a sandbox subprocess with its scoped secret env. + * - `client`: browser-fulfilled, so skipped on the in-process path (no browser to answer). + * + * A failed `execute` throws, which Pi turns into a tool-error result (the loop continues) + * rather than a run failure. Pi accepts a plain JSON Schema for `parameters` (non-TypeBox path). */ export function buildCustomTools( specs: ResolvedToolSpec[], callback: ToolCallbackContext | undefined, ): any[] { - if (specs.length === 0) return []; - if (!callback?.endpoint) { - log(`skipping ${specs.length} custom tool(s): missing toolCallback endpoint`); - return []; + const tools: any[] = []; + for (const spec of specs) { + const base = { + name: spec.name, + label: spec.name, + description: spec.description ?? spec.name, + parameters: (spec.inputSchema as any) ?? EMPTY_OBJECT_SCHEMA, + }; + if (spec.kind === "client") { + log(`skipping client tool '${spec.name}' (browser-fulfilled; not available in-process)`); + continue; + } + if (spec.kind === "code") { + tools.push({ + ...base, + async execute(toolCallId: string, params: unknown, signal?: AbortSignal) { + const text = await runResolvedTool(spec, params, { toolCallId, signal }); + return { content: [{ type: "text", text }], details: { kind: "code" } }; + }, + }); + continue; + } + // callback (default): route back to Agenta's /tools/call. + if (!callback?.endpoint) { + log(`skipping callback tool '${spec.name}': missing toolCallback endpoint`); + continue; + } + tools.push({ + ...base, + async execute(toolCallId: string, params: unknown, signal?: AbortSignal) { + const text = await runResolvedTool(spec, params, { + toolCallId, + endpoint: callback.endpoint, + authorization: callback.authorization, + signal, + }); + return { + content: [{ type: "text", text }], + details: { callRef: spec.callRef }, + }; + }, + }); } - - return specs.map((spec) => ({ - name: spec.name, - label: spec.name, - description: spec.description ?? spec.name, - // Pi accepts a plain JSON Schema for `parameters` (its validator has a non-TypeBox - // path); the schema is resolved live from the provider catalog. - parameters: (spec.inputSchema as any) ?? EMPTY_OBJECT_SCHEMA, - async execute(toolCallId: string, params: unknown, signal?: AbortSignal) { - const text = await callAgentaTool( - callback.endpoint, - callback.authorization, - spec.callRef, - toolCallId, - params, - signal, - ); - return { - content: [{ type: "text", text }], - details: { callRef: spec.callRef }, - }; - }, - })); + return tools; } export async function runPi( @@ -261,101 +282,114 @@ export async function runPi( log(`custom tools: ${customTools.map((t) => t.name).join(", ")}`); } - const { session } = await createAgentSession({ - cwd, - model, - authStorage, - modelRegistry, - tools: toolAllowlist, - customTools, - sessionManager: SessionManager.inMemory(cwd), - settingsManager: SettingsManager.inMemory(), - resourceLoader: loader, - }); + // Created before the prompt so a throw mid-run still flushes the partial trace and + // disposes the session (the inner finally below). Mirrors the rivet engine's pattern. + let session: Awaited>["session"] | undefined; + try { + ({ session } = await createAgentSession({ + cwd, + model, + authStorage, + modelRegistry, + tools: toolAllowlist, + customTools, + sessionManager: SessionManager.inMemory(cwd), + settingsManager: SettingsManager.inMemory(), + resourceLoader: loader, + })); - // Hand the session id + model to the extension so spans carry them. - otel.config.sessionId = session.sessionId; - otel.config.provider = model.provider; - otel.config.requestModel = model.id; + // Hand the session id + model to the extension so spans carry them. + otel.config.sessionId = session.sessionId; + otel.config.provider = model.provider; + otel.config.requestModel = model.id; - // Accumulate streamed text as the primary output channel. On the streaming path, flush - // each Pi `text_delta` as a `message_delta` live (Pi deltas are already pure, so they - // emit verbatim); the block opens on the first delta and closes after the run. - let streamed = ""; - let piTextId: string | undefined; - session.subscribe((event: any) => { - if ( - event.type === "message_update" && - event.assistantMessageEvent?.type === "text_delta" - ) { - const delta = event.assistantMessageEvent.delta ?? ""; - if (!delta) return; - streamed += delta; - if (emit) { - if (piTextId === undefined) { - piTextId = "msg-0"; - emit({ type: "message_start", id: piTextId }); + // Accumulate streamed text as the primary output channel. On the streaming path, flush + // each Pi `text_delta` as a `message_delta` live (Pi deltas are already pure, so they + // emit verbatim); the block opens on the first delta and closes after the run. + let streamed = ""; + let piTextId: string | undefined; + session.subscribe((event: any) => { + if ( + event.type === "message_update" && + event.assistantMessageEvent?.type === "text_delta" + ) { + const delta = event.assistantMessageEvent.delta ?? ""; + if (!delta) return; + streamed += delta; + if (emit) { + if (piTextId === undefined) { + piTextId = "msg-0"; + emit({ type: "message_start", id: piTextId }); + } + emit({ type: "message_delta", id: piTextId, delta }); } - emit({ type: "message_delta", id: piTextId, delta }); } - } - }); + }); - await session.prompt(prompt); + await session.prompt(prompt); - const output = streamed.trim() || extractAssistantText(session.messages); - const sessionId = session.sessionId; - const stopReason = lastStopReason(session.messages); - const usage = otel.usage(); - session.dispose(); + const output = streamed.trim() || extractAssistantText(session.messages); + const sessionId = session.sessionId; + const stopReason = lastStopReason(session.messages); + const usage = otel.usage(); - // Ship this run's trace before the result is returned (and before the CLI process - // exits): invoke_agent has a remote parent, so the per-trace flush is what exports it. - await otel.flush(); + // Ship this run's trace before the result is returned (and before the CLI process + // exits): invoke_agent has a remote parent, so the per-trace flush is what exports it. + await otel.flush(); - // The structured stream is thinner here than on the rivet path: Pi's in-process tool - // events feed the trace spans, while the result-level event log carries the final - // message, usage, and stop reason (enough for the platform without double-plumbing). - // - // On the streaming path the events were flushed live via `emit`, so the result log stays - // empty; here we only close the open text block (or synthesize one when the text never - // streamed) and flush the tail usage/done events. - const events: AgentEvent[] = []; - const emitOrLog = (event: AgentEvent): void => { - if (emit) emit(event); - else events.push(event); - }; - if (emit) { - if (piTextId !== undefined) { - emit({ type: "message_end", id: piTextId }); + // The structured stream is thinner here than on the rivet path: Pi's in-process tool + // events feed the trace spans, while the result-level event log carries the final + // message, usage, and stop reason (enough for the platform without double-plumbing). + // + // On the streaming path the events were flushed live via `emit`, so the result log stays + // empty; here we only close the open text block (or synthesize one when the text never + // streamed) and flush the tail usage/done events. + const events: AgentEvent[] = []; + const emitOrLog = (event: AgentEvent): void => { + if (emit) emit(event); + else events.push(event); + }; + if (emit) { + if (piTextId !== undefined) { + emit({ type: "message_end", id: piTextId }); + } else if (output) { + emit({ type: "message_start", id: "msg-0" }); + emit({ type: "message_delta", id: "msg-0", delta: output }); + emit({ type: "message_end", id: "msg-0" }); + } } else if (output) { - emit({ type: "message_start", id: "msg-0" }); - emit({ type: "message_delta", id: "msg-0", delta: output }); - emit({ type: "message_end", id: "msg-0" }); + events.push({ type: "message", text: output }); } - } else if (output) { - events.push({ type: "message", text: output }); - } - if (usage.total > 0) emitOrLog({ type: "usage", ...usage }); - emitOrLog({ type: "done", stopReason }); + if (usage.total > 0) emitOrLog({ type: "usage", ...usage }); + emitOrLog({ type: "done", stopReason }); - const messages: ChatMessage[] = output - ? [{ role: "assistant", content: output }] - : []; + const messages: ChatMessage[] = output + ? [{ role: "assistant", content: output }] + : []; - return { - ok: true, - output, - messages, - events, - usage, - stopReason, - // `streamingDeltas` is only honest when a live sink carried the deltas end-to-end. - capabilities: { ...PI_CAPABILITIES, streamingDeltas: !!emit }, - sessionId, - model: `${model.provider}/${model.id}`, - traceId: otel.config.traceId, - }; + return { + ok: true, + output, + messages, + events, + usage, + stopReason, + // `streamingDeltas` is only honest when a live sink carried the deltas end-to-end. + capabilities: { ...PI_CAPABILITIES, streamingDeltas: !!emit }, + sessionId, + model: `${model.provider}/${model.id}`, + traceId: otel.config.traceId, + }; + } catch (err) { + // Flush the partial trace before the error propagates so a failed run is still + // observable (the happy-path flush above never ran). Best-effort: never mask `err`. + await otel.flush().catch(() => {}); + throw err; + } finally { + // Pi keeps the in-memory session alive until disposed; release it on every exit + // (success or throw). Guarded for the case where createAgentSession itself threw. + session?.dispose(); + } } finally { try { rmSync(cwd, { recursive: true, force: true }); diff --git a/services/agent/src/engines/rivet.ts b/services/agent/src/engines/rivet.ts index 62f978a5c8..6437d7502c 100644 --- a/services/agent/src/engines/rivet.ts +++ b/services/agent/src/engines/rivet.ts @@ -45,14 +45,22 @@ import { local } from "sandbox-agent/local"; import { daytona } from "sandbox-agent/daytona"; import { createRivetOtel } from "../tracing/otel.ts"; -import { buildToolMcpServers } from "../tools/mcp-bridge.ts"; +import { buildToolMcpServers, type McpServerStdio } from "../tools/mcp-bridge.ts"; import { startToolRelay } from "../tools/relay.ts"; +import { + PolicyResponder, + decisionToReply, + policyFromRequest, + type Responder, +} from "../responder.ts"; import { type AgentRunRequest, type AgentRunResult, type ChatMessage, + type ContentBlock, type EmitEvent, type HarnessCapabilities, + type McpServerConfig, type ResolvedToolSpec, type ToolCallbackContext, messageText, @@ -252,19 +260,60 @@ function priorMessages(request: AgentRunRequest): ChatMessage[] { return lastMatch === -1 ? messages : messages.filter((_, i) => i !== lastMatch); } +function safeJson(value: unknown): string { + if (value === undefined || value === null) return ""; + try { + return typeof value === "string" ? value : JSON.stringify(value); + } catch { + return String(value); + } +} + +/** + * Render one message for the replayed transcript, INCLUDING resolved tool turns. Under the + * cold model the harness rebuilds context from this text, and ACP prompt content blocks + * cannot carry tool calls/results — so a resolved interaction (an approved tool that ran, a + * client-fulfilled tool) is encoded here as text, letting the model resume from the result + * instead of re-asking. This is the cross-turn HITL continuation substrate: the `/messages` + * egress folds inbound UIMessage tool/approval parts into `tool_call` / `tool_result` content + * blocks, and they survive into the replay here. Plain string / text blocks pass through; + * image/resource blocks are summarized. + */ +export function messageTranscript(content: string | ContentBlock[] | undefined): string { + if (!content) return ""; + if (typeof content === "string") return content; + const parts: string[] = []; + for (const block of content) { + if (!block) continue; + if (block.type === "text" && typeof block.text === "string") { + parts.push(block.text); + } else if (block.type === "tool_call") { + parts.push(`[called ${block.toolName ?? "tool"}(${safeJson(block.input)})]`); + } else if (block.type === "tool_result") { + const body = safeJson(block.output); + parts.push(`[${block.toolName ?? "tool"} ${block.isError ? "error" : "returned"}: ${body}]`); + } else if (block.type === "image") { + parts.push("[image]"); + } else if (block.type === "resource") { + parts.push(block.uri ? `[resource: ${block.uri}]` : "[resource]"); + } + } + return parts.filter(Boolean).join("\n"); +} + /** * The text sent over ACP for this turn. Each invoke is a cold sandbox, so prior turns * are replayed as transcript context ahead of the latest user message — this is the * "persisted message history replayed" model, with the client/playground holding the * history. Capped by AGENTA_AGENT_HISTORY_MAX_CHARS so replay tokens stay bounded. */ -function buildTurnText(request: AgentRunRequest): string { +export function buildTurnText(request: AgentRunRequest): string { const latest = resolvePrompt(request); - const history = priorMessages(request).filter((m) => messageText(m.content)); + const history = priorMessages(request).filter((m) => messageTranscript(m.content)); if (history.length === 0) return latest; const maxChars = Number(process.env.AGENTA_AGENT_HISTORY_MAX_CHARS ?? 24000); - let transcript = history.map((m) => `${m.role}: ${messageText(m.content)}`).join("\n"); + let transcript = history.map((m) => `${m.role}: ${messageTranscript(m.content)}`).join("\n"); if (transcript.length > maxChars) transcript = transcript.slice(-maxChars); return ( `Conversation so far:\n${transcript}\n\n` + @@ -272,6 +321,33 @@ function buildTurnText(request: AgentRunRequest): string { ); } +/** + * Convert user-declared MCP servers (already resolved server-side, secrets injected into + * `env`) into ACP stdio entries. Only `stdio` is delivered over ACP today; `http`/remote + * carries no auth on the wire by design and is skipped. The per-server `tools` allowlist is + * NOT enforced over ACP in v1 — the harness lists all of a server's tools — so it is dropped + * with a log rather than silently implying a filter that does not happen. + */ +export function toAcpMcpServers(servers: McpServerConfig[] | undefined): McpServerStdio[] { + const out: McpServerStdio[] = []; + for (const s of servers ?? []) { + if ((s.transport ?? "stdio") !== "stdio" || !s.command) { + log(`skipping non-stdio MCP server '${s?.name ?? "?"}' (remote transport deferred)`); + continue; + } + if (s.tools && s.tools.length > 0) { + log(`MCP server '${s.name}': per-server tool allowlist not enforced over ACP (v1)`); + } + out.push({ + name: s.name, + command: s.command, + args: s.args ?? [], + env: Object.entries(s.env ?? {}).map(([name, value]) => ({ name, value: String(value) })), + }); + } + return out; +} + /** * Pick the harness-specific model id for a requested name. Harnesses expose their own * ids (Pi: "openai-codex/gpt-5.5"; Claude: its own). Match exact, then by the id after @@ -715,12 +791,25 @@ export async function runRivet( // forward MCP, Claude/Codex do). const capabilities = await probeCapabilities(sandbox, harness); const toolSpecs = (request.customTools as ResolvedToolSpec[]) ?? []; + const userMcpCount = request.mcpServers?.length ?? 0; + // MCP delivery is gated on `mcpTools`: pi-acp does not forward MCP, Claude/Codex do. The + // synthesized `agenta-tools` server (gateway/code tools) and the user-declared servers + // ride the same gate. const mcpServers = !isPi && capabilities.mcpTools - ? buildToolMcpServers(toolSpecs, request.toolCallback as ToolCallbackContext | undefined) + ? [ + ...buildToolMcpServers( + toolSpecs, + request.toolCallback as ToolCallbackContext | undefined, + ), + ...toAcpMcpServers(request.mcpServers), + ] : []; - if (!isPi && toolSpecs.length > 0 && !capabilities.mcpTools) { - log(`harness '${harness}' lacks MCP tool support; ${toolSpecs.length} tool(s) not delivered`); + if (!isPi && (toolSpecs.length > 0 || userMcpCount > 0) && !capabilities.mcpTools) { + log( + `harness '${harness}' lacks MCP support; ${toolSpecs.length} tool(s) and ` + + `${userMcpCount} user MCP server(s) not delivered`, + ); } const session = await sandbox.createSession({ @@ -759,19 +848,37 @@ export async function runRivet( if (update) run.handleUpdate(update); }); - // Auto-approve permission requests so a permission-gating harness (e.g. Claude - // Code) does not block on tool use. Tools are backend-resolved and trusted; the run - // is headless so there is no human to prompt. The per-run `permissionPolicy` (or the - // AGENTA_RIVET_DENY_PERMISSIONS env) flips this to reject. - const denyPermissions = - request.permissionPolicy === "deny" || - process.env.AGENTA_RIVET_DENY_PERMISSIONS === "true"; + // Permission gating, behind the Responder seam. Pi never gates; a permission-gating + // harness (e.g. Claude) raises a request, which we (a) surface as an `interaction_request` + // event so the egress can project it (Vercel `tool-approval-request`) and the trace can + // record it, and (b) resolve via the responder. The headless `PolicyResponder` keeps the + // prior behavior: auto-allow trusted backend tools, or deny per `permissionPolicy` / + // AGENTA_RIVET_DENY_PERMISSIONS. A cross-turn responder (true HITL) slots in here later + // without touching the harness. Tools are backend-resolved and trusted; the run is headless. + const responder: Responder = new PolicyResponder(policyFromRequest(request.permissionPolicy)); session.onPermissionRequest((req: any) => { - const replies: string[] = req?.availableReplies ?? []; - const reply = denyPermissions - ? "reject" - : replies.find((r) => r === "always") ?? replies.find((r) => r === "once") ?? "once"; - if (req?.id) session.respondPermission(req.id, reply as any).catch(() => {}); + const id = String(req?.id ?? ""); + const availableReplies: string[] = req?.availableReplies ?? []; + run.emitEvent({ + type: "interaction_request", + id, // ACP permission id -> Vercel approvalId + kind: "permission", + payload: { + // toolCallId of the gated tool, so the cross-turn approval reply correlates back to + // its tool call (and the #6 resume finds it). `toolCall` is the ACP ToolCallUpdate. + toolCallId: req?.toolCall?.toolCallId, + toolCall: req?.toolCall, + availableReplies, + options: req?.options, + }, + }); + void responder + .onPermission({ id, availableReplies, raw: req }) + .then((decision) => { + if (!req?.id) return; + return session.respondPermission(req.id, decisionToReply(decision, availableReplies) as any); + }) + .catch(() => {}); }); if (useToolRelay) { diff --git a/services/agent/src/extensions/agenta.ts b/services/agent/src/extensions/agenta.ts index e410b9d46e..5e50b47ceb 100644 --- a/services/agent/src/extensions/agenta.ts +++ b/services/agent/src/extensions/agenta.ts @@ -24,71 +24,19 @@ * it (local, the docker sidecar, a Daytona snapshot). Default export is the Pi * ExtensionFactory. */ -import { existsSync, mkdirSync, readFileSync, unlinkSync, writeFileSync } from "node:fs"; +import { writeFileSync } from "node:fs"; import type { ExtensionAPI } from "@earendil-works/pi-coding-agent"; import { createAgentaOtel } from "../tracing/otel.ts"; import type { ResolvedToolSpec } from "../protocol.ts"; -import { EMPTY_OBJECT_SCHEMA, callAgentaTool } from "../tools/client.ts"; -import { - RELAY_POLL_MS, - RELAY_REQ_SUFFIX, - RELAY_RES_SUFFIX, - RELAY_TIMEOUT_MS, - sanitizeRelayId, - sleep, - type RelayResponse, -} from "../tools/relay.ts"; +import { EMPTY_OBJECT_SCHEMA } from "../tools/callback.ts"; +import { runResolvedTool } from "../tools/dispatch.ts"; function log(message: string): void { process.stderr.write(`[agenta-pi-ext] ${message}\n`); } -/** - * Daytona tool call: the in-sandbox process can't reach Agenta, so write the request to a - * file the runner watches and poll for the response it writes back (see tools/relay.ts). - */ -async function relayToolCall( - dir: string, - callRef: string, - toolCallId: string, - params: unknown, - signal?: AbortSignal, -): Promise { - const id = sanitizeRelayId(toolCallId); - const reqPath = `${dir}/${id}${RELAY_REQ_SUFFIX}`; - const resPath = `${dir}/${id}${RELAY_RES_SUFFIX}`; - try { - mkdirSync(dir, { recursive: true }); - } catch { - // The runner also creates it; a race here is harmless. - } - writeFileSync(reqPath, JSON.stringify({ callRef, toolCallId, args: params ?? {} }), "utf-8"); - - const deadline = Date.now() + RELAY_TIMEOUT_MS; - while (Date.now() < deadline) { - if (signal?.aborted) throw new Error("aborted"); - if (existsSync(resPath)) { - const res = JSON.parse(readFileSync(resPath, "utf-8")) as RelayResponse; - try { - unlinkSync(reqPath); - } catch { - /* best-effort cleanup */ - } - try { - unlinkSync(resPath); - } catch { - /* best-effort cleanup */ - } - if (res.ok) return res.text ?? ""; - throw new Error(res.error || `tool relay failed for ${callRef}`); - } - await sleep(RELAY_POLL_MS); - } - throw new Error(`tool relay timed out for ${callRef}`); -} - /** Register the resolved tools (from env) as Pi tools that call back to Agenta. */ function registerTools(pi: ExtensionAPI): void { const raw = process.env.AGENTA_TOOL_SPECS; @@ -107,7 +55,13 @@ function registerTools(pi: ExtensionAPI): void { // the runner via files in this dir. Unset for local runs (direct /tools/call). const relayDir = process.env.AGENTA_TOOL_RELAY_DIR; + let registered = 0; for (const spec of specs) { + // `client` tools are browser-fulfilled; there is no browser in the sandbox to answer. + if (spec.kind === "client") { + log(`skipping client tool '${spec.name}' (browser-fulfilled)`); + continue; + } pi.registerTool({ name: spec.name, label: spec.name, @@ -115,14 +69,24 @@ function registerTools(pi: ExtensionAPI): void { // Pi accepts plain JSON Schema here (non-TypeBox validation path). parameters: (spec.inputSchema as any) ?? EMPTY_OBJECT_SCHEMA, async execute(toolCallId: string, params: unknown, signal?: AbortSignal) { - const text = relayDir - ? await relayToolCall(relayDir, spec.callRef, toolCallId, params, signal) - : await callAgentaTool(endpoint, authorization, spec.callRef, toolCallId, params, signal); - return { content: [{ type: "text", text }], details: { callRef: spec.callRef } }; + // `code` runs the snippet locally in the sandbox (no relay, even on Daytona); the + // callback path relays through the runner on Daytona, else POSTs to /tools/call. + const text = await runResolvedTool(spec, params, { + toolCallId, + endpoint, + authorization, + relayDir, + signal, + }); + return { + content: [{ type: "text", text }], + details: spec.kind === "code" ? { kind: "code" } : { callRef: spec.callRef }, + }; }, } as any); + registered += 1; } - log(`registered ${specs.length} tool(s) -> ${relayDir ? `relay ${relayDir}` : endpoint}`); + log(`registered ${registered} tool(s) -> ${relayDir ? `relay ${relayDir}` : endpoint}`); } /** The Pi ExtensionFactory: tools + (env-driven) tracing + usage writeback. */ diff --git a/services/agent/src/protocol.ts b/services/agent/src/protocol.ts index 98a33ad876..086e0654b1 100644 --- a/services/agent/src/protocol.ts +++ b/services/agent/src/protocol.ts @@ -11,11 +11,20 @@ /** One piece of a message. `text` is all the playground sends today; the rest is plumbed. */ export interface ContentBlock { - type: "text" | "image" | "resource" | string; + type: "text" | "image" | "resource" | "tool_call" | "tool_result" | string; text?: string; data?: string; mimeType?: string; uri?: string; + // Tool-turn carriers, used for structured-message continuation (cross-turn HITL): a + // resolved tool call replays as a `tool_call` block plus a `tool_result` block so the + // model resumes from the result instead of re-asking. The `/messages` egress folds the + // inbound UIMessage tool/approval parts into these (it must not drop them). + toolCallId?: string; + toolName?: string; + input?: unknown; + output?: unknown; + isError?: boolean; } export interface ChatMessage { @@ -38,15 +47,32 @@ export interface TraceContext { } /** - * A runnable tool the backend already resolved from the agent config: name + description + - * JSON-Schema params for the model, plus the `callRef` slug the execution bridge sends back - * to Agenta's /tools/call. The Composio key and connection auth stay server-side. + * A runnable tool the backend already resolved from the agent config. + * + * Three orthogonal axes: + * - `kind` (executor): how the runner fulfils a call. `callback` POSTs back through Agenta's + * /tools/call (gateway tools; the Composio key stays server-side); `code` runs `code` in a + * sandbox subprocess with `env` (resolved secrets, scoped to the subprocess); `client` is + * fulfilled by the browser across a turn boundary. Absent = `callback` (back-compat). + * - `needsApproval`: gate the call on a human yes/no (mechanics owned by the run-event layer). + * - `render`: a generative-UI hint (see `RenderHint`). + * + * `callRef` is set for `callback` tools (the slug the bridge sends back to /tools/call); + * `runtime`/`code`/`env` for `code` tools. The Composio key and connection auth stay + * server-side. */ export interface ResolvedToolSpec { name: string; description?: string; inputSchema?: Record | null; - callRef: string; + /** Set for `callback` (gateway) tools only; absent for `code` / `client`. */ + callRef?: string; + kind?: "callback" | "code" | "client"; + runtime?: "python" | "node"; + code?: string; + env?: Record; + needsApproval?: boolean; + render?: RenderHint; } /** Where and how to route a tool call back through Agenta. */ @@ -55,6 +81,21 @@ export interface ToolCallbackContext { authorization?: string; } +/** + * A user-declared MCP server attached to the run. `stdio` launches `command`/`args` with + * `env` (secret env already resolved server-side); `tools` is an optional allowlist (empty = + * all). Remote (`http`) carries no auth on the wire by design. + */ +export interface McpServerConfig { + name: string; + transport?: "stdio" | "http"; + command?: string; + args?: string[]; + env?: Record; + url?: string; + tools?: string[]; +} + /** * What a harness can do, probed from the runtime (rivet `AgentCapabilities`). The runner * branches on these flags instead of the harness name, and returns them in the result. @@ -82,6 +123,18 @@ export interface HarnessCapabilities { * are emitted live on the streaming path; a consumer that sees the delta family for a block * never also sees a coalesced `message` for it (see `createRivetOtel.finish`). */ +/** + * A generative-UI hint stamped onto a tool's events so the frontend can render it. The + * tool-definition plan adds the matching `render?` field to `ResolvedToolSpec`; the runner + * copies it onto `tool_call` / `tool_result` so the egress can project it without a spec + * lookup. `component` is a prebuilt client component (no code execution); `source` ships + * code rendered in a sandbox; `spec` is a declarative UI tree (data, not code). + */ +export type RenderHint = + | { kind: "component"; component: string } + | { kind: "source"; runtime: "react" | "html"; source: string | string[] } + | { kind: "spec"; schema: string }; + export type AgentEvent = | { type: "message"; text: string } | { type: "thought"; text: string } @@ -91,8 +144,29 @@ export type AgentEvent = | { type: "reasoning_start"; id: string } | { type: "reasoning_delta"; id: string; delta: string } | { type: "reasoning_end"; id: string } - | { type: "tool_call"; id?: string; name?: string; input?: unknown } - | { type: "tool_result"; id?: string; output?: string; isError?: boolean } + | { type: "tool_call"; id?: string; name?: string; input?: unknown; render?: RenderHint } + | { + type: "tool_result"; + id?: string; + output?: string; + /** Structured output (object), used for generative UI; `output` stays the text form. */ + data?: unknown; + isError?: boolean; + render?: RenderHint; + } + // A human-in-the-loop request the harness raised (ACP reverse-RPC). The egress projects + // it to a Vercel `tool-approval-request` (permission) or an input/data part (elicitation); + // the reply returns cross-turn in the next `/messages` message history, matched by `id`. + | { + type: "interaction_request"; + id: string; + kind: "permission" | "input" | "client_tool"; + payload?: unknown; + } + // One-way generative-UI payloads (not tied to a tool result). `data` -> Vercel `data-`, + // `file` -> Vercel `file`. + | { type: "data"; name: string; data: unknown; transient?: boolean } + | { type: "file"; url: string; mediaType: string } | { type: "usage"; input?: number; output?: number; total?: number; cost?: number } | { type: "error"; message: string } | { type: "done"; stopReason?: string }; @@ -149,6 +223,8 @@ export interface AgentRunRequest { skills?: string[]; /** Resolved runnable tools (WP-7). */ customTools?: ResolvedToolSpec[]; + /** User-declared MCP servers, resolved (secret env injected). Omitted when there are none. */ + mcpServers?: McpServerConfig[]; /** Where customTools route their calls back to. Required when customTools is set. */ toolCallback?: ToolCallbackContext; /** How a permission-gating harness handles tool-use prompts: "auto" (default) | "deny". */ diff --git a/services/agent/src/responder.ts b/services/agent/src/responder.ts new file mode 100644 index 0000000000..6af4132841 --- /dev/null +++ b/services/agent/src/responder.ts @@ -0,0 +1,77 @@ +/** + * The interaction responder seam. + * + * A harness (the ACP "Agent") does not only emit tool calls. It also raises typed + * reverse-RPC interaction requests that something must answer: permission gates today, + * elicitation (input) and client-side tools later. Today the rivet runner answered the + * permission gate inline with a hardcoded auto-approve. This module lifts that decision + * behind a `Responder` interface so it is pluggable: + * + * - `PolicyResponder` is the headless answer (a fixed `auto` / `deny` policy, no human). + * It reproduces the previous behavior exactly and is what `/invoke` uses. + * - A cross-turn responder (the `/messages` HITL path) slots in here later: it surfaces the + * request to the browser, ends the turn, and resolves on the next turn's reply. The + * harness adapter does not change when the responder does. + * + * Resolution is modeled as `allow` / `deny`; the adapter maps that onto the harness's + * available ACP replies via `decisionToReply`. + */ + +export type PermissionPolicy = "auto" | "deny"; + +export type PermissionDecision = "allow" | "deny"; + +/** A permission gate raised by the harness, normalized from the ACP request. */ +export interface PermissionRequest { + /** The ACP permission id; reused as the `interaction_request` event id for reply matching. */ + id: string; + /** Replies the harness offers (e.g. "always" | "once" | "reject"). */ + availableReplies: string[]; + /** The original ACP request, for responders that want the tool-call detail. */ + raw?: unknown; +} + +/** + * Answers interaction requests the harness raises. Permission is the only kind wired today; + * `input` (elicitation) and `client_tool` are forward-looking and will extend this interface + * alongside the cross-turn responder. + */ +export interface Responder { + onPermission(request: PermissionRequest): Promise; +} + +/** Headless responder: a fixed policy, no human in the loop. */ +export class PolicyResponder implements Responder { + constructor(private readonly policy: PermissionPolicy) {} + + async onPermission(_request: PermissionRequest): Promise { + return this.policy === "deny" ? "deny" : "allow"; + } +} + +/** + * Resolve the permission policy with the same precedence as before: an explicit per-run + * `permissionPolicy: "deny"` or the `AGENTA_RIVET_DENY_PERMISSIONS` env flips to deny; the + * default is auto-allow, because backend-resolved tools are trusted and the run is headless. + */ +export function policyFromRequest(permissionPolicy?: string): PermissionPolicy { + if (permissionPolicy === "deny" || process.env.AGENTA_RIVET_DENY_PERMISSIONS === "true") { + return "deny"; + } + return "auto"; +} + +/** Map an allow/deny decision onto the harness's available ACP replies. */ +export function decisionToReply( + decision: PermissionDecision, + availableReplies: string[], +): string { + if (decision === "deny") { + return availableReplies.find((r) => r === "reject") ?? "reject"; + } + return ( + availableReplies.find((r) => r === "always") ?? + availableReplies.find((r) => r === "once") ?? + "once" + ); +} diff --git a/services/agent/src/tools/client.ts b/services/agent/src/tools/callback.ts similarity index 98% rename from services/agent/src/tools/client.ts rename to services/agent/src/tools/callback.ts index db6a71538f..0f0bae533c 100644 --- a/services/agent/src/tools/client.ts +++ b/services/agent/src/tools/callback.ts @@ -1,5 +1,5 @@ /** - * Shared Agenta /tools/call client. + * Shared Agenta /tools/call callback transport. * * One implementation of the tool round-trip used by every delivery path: * - engines/pi.ts buildCustomTools (in-process Pi customTools) diff --git a/services/agent/src/tools/code.ts b/services/agent/src/tools/code.ts new file mode 100644 index 0000000000..da8115c94a --- /dev/null +++ b/services/agent/src/tools/code.ts @@ -0,0 +1,197 @@ +/** + * Code-tool executor: run a resolved `code` tool's snippet in the agent sandbox. + * + * A code tool ships a snippet (`code`) + a runtime (`python` | `node`) + a scoped `env` (the + * tool's declared vault secrets, resolved server-side). Unlike a `callback` tool, it never + * touches Agenta's /tools/call — it runs locally where the harness runs, which is exactly why + * its secrets are injected here as subprocess env (and nowhere else). + * + * Entry convention (same for both runtimes): the snippet defines a top-level `main`. A bare + * `def main(**inputs)` / `function main(inputs)` is found automatically; an explicit export + * (`module.exports.main` / `exports.main` / `module.exports = fn` in Node) is also accepted. + * Python calls `main(**inputs)` (keyword args from the tool input object); Node calls + * `main(inputs)` (the input object) and may return a promise. The return value is + * JSON-serialized and handed to the model as the tool result. + * + * Shared by every delivery path that runs code locally: engines/pi.ts (in-process Pi), + * extensions/agenta.ts (Pi under rivet), tools/mcp-server.ts (the MCP bridge for other + * harnesses). + */ +import { spawn } from "node:child_process"; +import { mkdtempSync, rmSync, writeFileSync } from "node:fs"; +import { tmpdir } from "node:os"; +import { join } from "node:path"; + +/** Per-call budget for a code tool. Surfaced as a tool error on timeout. */ +export const CODE_TOOL_TIMEOUT_MS = Number( + process.env.AGENTA_AGENT_CODE_TOOL_TIMEOUT_MS ?? 30000, +); + +// argv[1] is the snippet path (python `-c`/node `-e` put the first extra arg at argv[1]); +// the tool input arrives as JSON on stdin. Both bootstraps evaluate the snippet in a fresh +// scope and pick up a top-level `main` (a bare `def main`/`function main`), falling back to an +// explicit export. Either way the contract is: define a callable `main`. +const PY_BOOTSTRAP = `import sys, json +_path = sys.argv[1] +with open(_path) as _f: + _src = _f.read() +_ns = {} +exec(compile(_src, _path, "exec"), _ns) +if not callable(_ns.get("main")): + sys.stderr.write("code tool must define a callable main(**inputs)") + sys.exit(1) +_args = json.loads(sys.stdin.read() or "{}") +_out = _ns["main"](**_args) +sys.stdout.write(json.dumps(_out)) +`; + +// `require(path)` would only see CommonJS exports, so a bare top-level `function main` (which +// exports nothing under CommonJS) would be invisible. Instead read the source and evaluate it +// in a scope that captures a top-level `main`, while still honoring an explicit +// `module.exports.main` / `exports.main` / `module.exports = fn`. +const NODE_BOOTSTRAP = `const fs = require("fs"); +const path = process.argv[1]; +const src = fs.readFileSync(path, "utf8"); +const mod = { exports: {} }; +const factory = new Function( + "exports", + "require", + "module", + "__filename", + "__dirname", + src + + "\\n;return (typeof main !== 'undefined' ? main : (module.exports && (module.exports.main || module.exports.default)) || module.exports);", +); +const fn = factory(mod.exports, require, mod, path, require("path").dirname(path)); +if (typeof fn !== "function") { + process.stderr.write("code tool must define or export a callable main(inputs) function"); + process.exit(1); +} +const args = JSON.parse(fs.readFileSync(0, "utf8") || "{}"); +Promise.resolve(fn(args)) + .then((out) => process.stdout.write(JSON.stringify(out === undefined ? null : out))) + .catch((err) => { process.stderr.write(String((err && err.stack) || err)); process.exit(1); }); +`; + +export type CodeRuntime = "python" | "node"; + +// The minimal set of host env vars a python3/node runtime needs to start. Deliberately +// excludes everything secret-bearing or sidecar-specific: no AGENTA_*, no *_API_KEY / +// *_TOKEN, no COMPOSIO_* / DAYTONA_*, no provider keys that the in-process Pi path writes +// into process.env before a run. Only the tool's declared scoped `env` is layered on top. +const BASE_ENV_ALLOWLIST = [ + "PATH", + "HOME", + "LANG", + "LC_ALL", + "LC_CTYPE", + "TZ", + "TMPDIR", + "TEMP", + "TMP", + "NODE_PATH", + // Windows essentials, copied only when present. + "SystemRoot", + "ComSpec", +]; + +/** Build the child env from a minimal allowlist (copied only when set) plus scoped secrets. */ +function buildChildEnv( + env: Record | undefined, +): Record { + const base: Record = {}; + for (const key of BASE_ENV_ALLOWLIST) { + const value = process.env[key]; + if (value !== undefined) base[key] = value; + } + return { ...base, ...(env ?? {}) }; +} + +/** + * Run a code tool's snippet and return its JSON-serialized output as text. Throws on a + * non-zero exit, a timeout, or an abort; callers turn the throw into a tool-error result so + * the model loop continues. + */ +export async function runCodeTool( + runtime: CodeRuntime | undefined, + code: string, + env: Record | undefined, + args: unknown, + signal?: AbortSignal, +): Promise { + const dir = mkdtempSync(join(tmpdir(), "agenta-code-")); + try { + const isNode = runtime === "node"; + const scriptPath = join(dir, isNode ? "tool.js" : "tool.py"); + writeFileSync(scriptPath, code ?? "", "utf8"); + + const command = isNode ? "node" : "python3"; + const childArgs = isNode + ? ["-e", NODE_BOOTSTRAP, scriptPath] + : ["-c", PY_BOOTSTRAP, scriptPath]; + + return await new Promise((resolve, reject) => { + const child = spawn(command, childArgs, { + // The child inherits ONLY a minimal startup allowlist (PATH, HOME, locale/temp, and + // Windows essentials when present) plus the tool's declared scoped secrets. It does + // NOT inherit the sidecar's process.env, so provider keys (OPENAI_API_KEY, etc.) that + // the in-process Pi path writes into process.env, AGENTA_* config, and other secret- + // bearing vars never reach an author-supplied snippet. Nothing is written to the + // agent-visible filesystem beyond the temp dir. + env: buildChildEnv(env), + stdio: ["pipe", "pipe", "pipe"], + }); + + let stdout = ""; + let stderr = ""; + let settled = false; + const finish = (fn: () => void) => { + if (settled) return; + settled = true; + clearTimeout(timer); + if (signal) signal.removeEventListener("abort", onAbort); + fn(); + }; + + const timer = setTimeout(() => { + child.kill("SIGKILL"); + finish(() => + reject(new Error(`code tool timed out after ${CODE_TOOL_TIMEOUT_MS}ms`)), + ); + }, CODE_TOOL_TIMEOUT_MS); + + const onAbort = () => { + child.kill("SIGKILL"); + finish(() => reject(new Error("aborted"))); + }; + if (signal) { + if (signal.aborted) onAbort(); + else signal.addEventListener("abort", onAbort, { once: true }); + } + + child.stdout.on("data", (d) => (stdout += d)); + child.stderr.on("data", (d) => (stderr += d)); + child.on("error", (err) => finish(() => reject(err))); + child.on("close", (exitCode) => + finish(() => { + if (exitCode === 0) resolve(stdout.trim()); + else + reject( + new Error( + `code tool exited ${exitCode}: ${stderr.slice(0, 500) || "(no stderr)"}`, + ), + ); + }), + ); + + child.stdin.write(JSON.stringify(args ?? {})); + child.stdin.end(); + }); + } finally { + try { + rmSync(dir, { recursive: true, force: true }); + } catch { + // best-effort cleanup of the throwaway snippet dir + } + } +} diff --git a/services/agent/src/tools/dispatch.ts b/services/agent/src/tools/dispatch.ts new file mode 100644 index 0000000000..f948501265 --- /dev/null +++ b/services/agent/src/tools/dispatch.ts @@ -0,0 +1,129 @@ +/** + * Shared tool dispatch: execute one backend-resolved tool, branching on its executor `kind`. + * + * The same "branch on spec.kind to run a resolved tool" logic was duplicated across every + * delivery path (engines/pi.ts in-process Pi, extensions/agenta.ts Pi-under-rivet, + * tools/mcp-server.ts the MCP bridge). This module owns that dispatch ONCE so a change to how + * a kind is executed is a one-line edit, not three. Each call site still keeps its OWN + * result-wrapping shape (Pi customTool details, the MCP `content` envelope) and its OWN + * advertise/skip behavior for `client` tools — only the execution itself is shared. + * + * The three executor kinds (see `ResolvedToolSpec`): + * - `code`: run the snippet in a sandbox subprocess with its scoped secret `env`. + * - `client`: browser-fulfilled across a turn boundary; never executed in-sandbox (throws). + * - `callback` (default): POST back through Agenta's /tools/call so the Composio key and + * connection auth stay server-side. On Daytona the in-sandbox process can't reach Agenta, + * so the call is relayed through the runner via files (see tools/relay.ts) when `relayDir` + * is set; otherwise it POSTs directly. + * + * `relayToolCall` lives here (not in extensions/agenta.ts) so this module is the single + * dispatch home with no import cycle back into a call site. + */ +import { existsSync, mkdirSync, readFileSync, unlinkSync, writeFileSync } from "node:fs"; + +import type { ResolvedToolSpec } from "../protocol.ts"; +import { callAgentaTool } from "./callback.ts"; +import { runCodeTool } from "./code.ts"; +import { + RELAY_POLL_MS, + RELAY_REQ_SUFFIX, + RELAY_RES_SUFFIX, + RELAY_TIMEOUT_MS, + sanitizeRelayId, + sleep, + type RelayResponse, +} from "./relay.ts"; + +/** Options for executing a resolved tool. `endpoint`/`authorization`/`relayDir` only matter for callbacks. */ +export interface RunResolvedToolOpts { + /** Stable id for this tool call (used as the /tools/call id and the relay filename). */ + toolCallId: string; + /** /tools/call endpoint for `callback` tools. */ + endpoint?: string; + /** Authorization header for the callback. */ + authorization?: string; + /** Daytona relay dir: when set, callback calls are relayed through the runner via files. */ + relayDir?: string; + /** Caller cancellation, combined with the per-tool timeout. */ + signal?: AbortSignal; +} + +/** + * Daytona tool call: the in-sandbox process can't reach Agenta, so write the request to a + * file the runner watches and poll for the response it writes back (see tools/relay.ts). + */ +export async function relayToolCall( + dir: string, + callRef: string, + toolCallId: string, + params: unknown, + signal?: AbortSignal, +): Promise { + const id = sanitizeRelayId(toolCallId); + const reqPath = `${dir}/${id}${RELAY_REQ_SUFFIX}`; + const resPath = `${dir}/${id}${RELAY_RES_SUFFIX}`; + try { + mkdirSync(dir, { recursive: true }); + } catch { + // The runner also creates it; a race here is harmless. + } + writeFileSync(reqPath, JSON.stringify({ callRef, toolCallId, args: params ?? {} }), "utf-8"); + + const deadline = Date.now() + RELAY_TIMEOUT_MS; + while (Date.now() < deadline) { + if (signal?.aborted) throw new Error("aborted"); + if (existsSync(resPath)) { + const res = JSON.parse(readFileSync(resPath, "utf-8")) as RelayResponse; + try { + unlinkSync(reqPath); + } catch { + /* best-effort cleanup */ + } + try { + unlinkSync(resPath); + } catch { + /* best-effort cleanup */ + } + if (res.ok) return res.text ?? ""; + throw new Error(res.error || `tool relay failed for ${callRef}`); + } + await sleep(RELAY_POLL_MS); + } + throw new Error(`tool relay timed out for ${callRef}`); +} + +/** + * Execute one resolved tool and return its result text. Throws on failure; every call site + * turns the throw into a tool-error result so the model loop continues rather than crashing. + * + * - `code` → run the snippet locally (scoped secret env), no callback/relay. + * - `client` → throw: browser-fulfilled, never executed in-sandbox. + * - default/`callback` → relay through the runner when `opts.relayDir` is set (Daytona), + * else POST directly to `opts.endpoint`. + */ +export async function runResolvedTool( + spec: ResolvedToolSpec, + params: unknown, + opts: RunResolvedToolOpts, +): Promise { + if (spec.kind === "code") { + return runCodeTool(spec.runtime, spec.code ?? "", spec.env, params, opts.signal); + } + if (spec.kind === "client") { + throw new Error( + `client tool '${spec.name}' is browser-fulfilled and cannot be executed in-sandbox`, + ); + } + // callback (default): route back to Agenta's /tools/call (directly or via the Daytona relay). + if (opts.relayDir) { + return relayToolCall(opts.relayDir, spec.callRef ?? "", opts.toolCallId, params, opts.signal); + } + return callAgentaTool( + opts.endpoint ?? "", + opts.authorization, + spec.callRef ?? "", + opts.toolCallId, + params, + opts.signal, + ); +} diff --git a/services/agent/src/tools/mcp-bridge.ts b/services/agent/src/tools/mcp-bridge.ts index b83e71fe6a..eaf5683a4d 100644 --- a/services/agent/src/tools/mcp-bridge.ts +++ b/services/agent/src/tools/mcp-bridge.ts @@ -49,26 +49,53 @@ export interface McpServerStdio { /** * Build the ACP `mcpServers` list that exposes the resolved tools to the harness. - * Empty when there are no tools or no callback (the no-tools path stays untouched). + * + * Attachment is decided per tool kind, not on the callback endpoint alone (see protocol.ts + * `ResolvedToolSpec.kind`; absent kind means `callback` for back-compat): + * - `client` tools are browser-fulfilled and not advertised by this server (mcp-server.ts + * filters them from tools/list), so they never justify attaching the bridge on their own. + * - "Executable here" = non-client (`code` and `callback`). With zero executable specs we + * return [] (the no-tools path stays untouched). + * - `code` tools run locally in mcp-server.ts (runCodeTool) and need NO callback endpoint, so + * we attach `agenta-tools` whenever there is at least one executable spec. + * - Only `callback` tools require `callback.endpoint`. If callback tools are present but the + * endpoint is missing, we do NOT drop the whole server (that would silently lose the `code` + * tools too): we still attach it and warn, naming the callback tools whose `tools/call` will + * fail. The endpoint/auth env entries are pushed only when the endpoint actually exists. */ export function buildToolMcpServers( specs: ResolvedToolSpec[], callback: ToolCallbackContext | undefined, ): McpServerStdio[] { if (!specs || specs.length === 0) return []; - if (!callback?.endpoint) { + + // Absent kind defaults to `callback` (back-compat); `client` is the only non-executable kind. + const executable = specs.filter((s) => (s.kind ?? "callback") !== "client"); + if (executable.length === 0) return []; + + // The callback subset is the only thing that needs the endpoint to function. + const callbackSpecs = executable.filter((s) => (s.kind ?? "callback") === "callback"); + const hasEndpoint = Boolean(callback?.endpoint); + + if (callbackSpecs.length > 0 && !hasEndpoint) { + const names = callbackSpecs.map((s) => s.name).join(", "); process.stderr.write( - `[tool-bridge] skipping ${specs.length} tool(s): missing toolCallback endpoint\n`, + `[tool-bridge] missing toolCallback endpoint: ${callbackSpecs.length} callback tool(s) ` + + `will fail (${names}); still attaching server for the other tool(s)\n`, ); - return []; } + // Pass every executable spec; mcp-server.ts dispatches per kind (code runs locally, callback + // routes to the endpoint). const env: EnvVariable[] = [ - { name: "AGENTA_TOOL_SPECS", value: JSON.stringify(specs) }, - { name: "AGENTA_TOOL_CALLBACK_ENDPOINT", value: callback.endpoint }, + { name: "AGENTA_TOOL_SPECS", value: JSON.stringify(executable) }, ]; - if (callback.authorization) { - env.push({ name: "AGENTA_TOOL_CALLBACK_AUTH", value: callback.authorization }); + // Only carry the callback env when there is an endpoint to call back to. + if (hasEndpoint) { + env.push({ name: "AGENTA_TOOL_CALLBACK_ENDPOINT", value: callback!.endpoint }); + if (callback!.authorization) { + env.push({ name: "AGENTA_TOOL_CALLBACK_AUTH", value: callback!.authorization }); + } } const { command, args } = bridgeLauncher(); diff --git a/services/agent/src/tools/mcp-server.ts b/services/agent/src/tools/mcp-server.ts index 09512bfa36..98a240c50e 100644 --- a/services/agent/src/tools/mcp-server.ts +++ b/services/agent/src/tools/mcp-server.ts @@ -16,8 +16,11 @@ * initialize, tools/list, tools/call; ignores notifications. stdout carries protocol * messages only; logs go to stderr. */ +import { randomUUID } from "node:crypto"; + import type { ResolvedToolSpec } from "../protocol.ts"; -import { EMPTY_OBJECT_SCHEMA, callAgentaTool } from "./client.ts"; +import { EMPTY_OBJECT_SCHEMA } from "./callback.ts"; +import { runResolvedTool } from "./dispatch.ts"; const SPECS: ResolvedToolSpec[] = JSON.parse(process.env.AGENTA_TOOL_SPECS ?? "[]"); const ENDPOINT = process.env.AGENTA_TOOL_CALLBACK_ENDPOINT ?? ""; @@ -58,7 +61,8 @@ async function handle(message: any): Promise { jsonrpc: "2.0", id, result: { - tools: SPECS.map((s) => ({ + // `client` tools are browser-fulfilled, so this server does not advertise them. + tools: SPECS.filter((s) => s.kind !== "client").map((s) => ({ name: s.name, description: s.description ?? s.name, inputSchema: (s.inputSchema as Record) ?? EMPTY_OBJECT_SCHEMA, @@ -74,13 +78,14 @@ async function handle(message: any): Promise { return { jsonrpc: "2.0", id, error: { code: -32602, message: `unknown tool: ${name}` } }; } try { - const text = await callAgentaTool( - ENDPOINT, - AUTH, - spec.callRef, - `tool-${Date.now()}`, - params?.arguments, - ); + // `code` runs the snippet locally (scoped secret env); everything else routes back to + // Agenta's /tools/call. A unique id per call so two parallel calls in the same + // millisecond don't collide (Date.now() would). + const text = await runResolvedTool(spec, params?.arguments, { + toolCallId: randomUUID(), + endpoint: ENDPOINT, + authorization: AUTH, + }); return { jsonrpc: "2.0", id, result: { content: [{ type: "text", text }] } }; } catch (err) { // Surface as an MCP tool error (isError) so the model can recover, not a crash. diff --git a/services/agent/src/tools/relay.ts b/services/agent/src/tools/relay.ts index c182e4e7f9..952ff8893a 100644 --- a/services/agent/src/tools/relay.ts +++ b/services/agent/src/tools/relay.ts @@ -16,7 +16,7 @@ * Local runs keep the direct path (the in-process / local-daemon extension reaches Agenta); * the relay is only wired when AGENTA_TOOL_RELAY_DIR is set (Daytona + Pi + tools). */ -import { callAgentaTool } from "./client.ts"; +import { callAgentaTool } from "./callback.ts"; import type { ToolCallbackContext } from "../protocol.ts"; export const RELAY_REQ_SUFFIX = ".req.json"; diff --git a/services/agent/src/tracing/otel.ts b/services/agent/src/tracing/otel.ts index 6f3ddb2848..2815564a1d 100644 --- a/services/agent/src/tracing/otel.ts +++ b/services/agent/src/tracing/otel.ts @@ -684,6 +684,12 @@ export interface RivetOtel { start(input: { prompt?: string; messages?: any[]; sessionId?: string }): void; /** Feed one ACP `session/update` payload (the `update` object). */ handleUpdate(update: any): void; + /** + * Record an event the ACP stream does not carry (e.g. an `interaction_request` raised via + * the permission callback). Routes through the same choke point as stream events, so it + * lands in both the live sink and the batch `events()` log in build order. + */ + emitEvent(event: AgentEvent): void; /** End all open spans. Returns the accumulated assistant text. */ finish(): string; /** Flush this run's trace to Agenta (invoke_agent has a remote parent). */ @@ -981,6 +987,7 @@ export function createRivetOtel(init: RivetOtelInit): RivetOtel { return { start, handleUpdate, + emitEvent: record, finish, flush: () => flushTrace(runTraceId), traceId: () => runTraceId, diff --git a/services/agent/test/code-tool.test.ts b/services/agent/test/code-tool.test.ts new file mode 100644 index 0000000000..0711f57b41 --- /dev/null +++ b/services/agent/test/code-tool.test.ts @@ -0,0 +1,92 @@ +/** + * Unit test for the code-tool executor (runCodeTool). + * + * Exercises both runtimes end-to-end through real subprocesses: a python tool, node tools + * written as a bare top-level `function main` (the F2 regression) and as an explicit + * `module.exports.main`, an async node `main`, the F3 env-isolation guarantee (provider keys + * do NOT leak in; declared scoped secrets DO), and the non-zero-exit reject path. + * + * Run: pnpm exec tsx test/code-tool.test.ts + */ +import assert from "node:assert/strict"; + +import { runCodeTool } from "../src/tools/code.ts"; + +// --- Python: bare `def main(**kw)` ------------------------------------------ +{ + const code = 'def main(**kw):\n return {"sum": kw.get("a", 0) + kw.get("b", 0)}\n'; + const out = await runCodeTool("python", code, undefined, { a: 2, b: 3 }); + assert.deepEqual(JSON.parse(out), { sum: 5 }, "python bare main returns the right JSON"); +} + +// --- Node: bare top-level `function main` (F2 regression) ------------------- +{ + const code = "function main(inputs) { return { got: inputs }; }"; + const out = await runCodeTool("node", code, undefined, { hello: "world" }); + assert.deepEqual( + JSON.parse(out), + { got: { hello: "world" } }, + "node bare function main executes and echoes the input", + ); +} + +// --- Node: explicit `module.exports.main` ----------------------------------- +{ + const code = "module.exports.main = function (inputs) { return { via: 'exports', got: inputs }; };"; + const out = await runCodeTool("node", code, undefined, { x: 1 }); + assert.deepEqual( + JSON.parse(out), + { via: "exports", got: { x: 1 } }, + "node module.exports.main works", + ); +} + +// --- Node: async `main` returning a Promise --------------------------------- +{ + const code = + "async function main(inputs) { await new Promise((r) => setTimeout(r, 5)); return { doubled: inputs.n * 2 }; }"; + const out = await runCodeTool("node", code, undefined, { n: 21 }); + assert.deepEqual(JSON.parse(out), { doubled: 42 }, "node async main resolves"); +} + +// --- F3: provider keys do NOT leak; scoped secrets DO ----------------------- +{ + const hadKey = "OPENAI_API_KEY" in process.env; + const prevKey = process.env.OPENAI_API_KEY; + process.env.OPENAI_API_KEY = "leak-me-xyz"; + try { + // The provider key sits in process.env but must not reach the snippet. + const leakCode = "function main() { return { key: process.env.OPENAI_API_KEY ?? 'absent' }; }"; + const leakOut = await runCodeTool("node", leakCode, undefined, {}); + assert.deepEqual( + JSON.parse(leakOut), + { key: "absent" }, + "F3: OPENAI_API_KEY did NOT leak into the snippet env", + ); + + // A secret declared on the tool (passed via the scoped `env` arg) must be visible. + const scopedCode = + "function main() { return { secret: process.env.MY_TOOL_SECRET ?? 'absent' }; }"; + const scopedOut = await runCodeTool("node", scopedCode, { MY_TOOL_SECRET: "ok" }, {}); + assert.deepEqual( + JSON.parse(scopedOut), + { secret: "ok" }, + "F3: scoped MY_TOOL_SECRET IS visible to the snippet", + ); + } finally { + if (hadKey) process.env.OPENAI_API_KEY = prevKey; + else delete process.env.OPENAI_API_KEY; + } +} + +// --- Non-zero exit / throw rejects ------------------------------------------ +{ + const code = "function main() { throw new Error('boom'); }"; + await assert.rejects( + () => runCodeTool("node", code, undefined, {}), + /boom|exited/, + "a throwing snippet rejects", + ); +} + +console.log("code-tool.test.ts: all assertions passed"); diff --git a/services/agent/test/continuation.test.ts b/services/agent/test/continuation.test.ts new file mode 100644 index 0000000000..0d4af7948f --- /dev/null +++ b/services/agent/test/continuation.test.ts @@ -0,0 +1,55 @@ +/** + * Unit tests for the cross-turn HITL continuation substrate. + * + * Under the cold model the harness rebuilds context from the replayed transcript, and ACP + * prompt content blocks cannot carry tool calls/results. So a resolved interaction (an + * approved tool that ran, a client-fulfilled tool) must survive into the replay as text. + * `messageTranscript` encodes tool turns; `buildTurnText` keeps them in the replayed history. + * + * Run: pnpm exec tsx test/continuation.test.ts + */ +import assert from "node:assert/strict"; + +import { messageTranscript, buildTurnText } from "../src/engines/rivet.ts"; +import type { AgentRunRequest, ContentBlock } from "../src/protocol.ts"; + +// --- messageTranscript ------------------------------------------------------- +assert.equal(messageTranscript("hello"), "hello"); +assert.equal(messageTranscript([{ type: "text", text: "a" }, { type: "text", text: "b" }]), "a\nb"); +assert.equal( + messageTranscript([{ type: "tool_call", toolName: "getWeather", input: { city: "Paris" } }]), + '[called getWeather({"city":"Paris"})]', +); +assert.equal( + messageTranscript([{ type: "tool_result", toolName: "getWeather", output: { temp: 24 } }]), + '[getWeather returned: {"temp":24}]', +); +assert.equal( + messageTranscript([{ type: "tool_result", toolName: "send", output: "boom", isError: true }]), + "[send error: boom]", +); + +// --- buildTurnText keeps a resolved tool turn in the replay ------------------ +{ + const req: AgentRunRequest = { + messages: [ + { role: "user", content: "weather in Paris?" }, + { + role: "assistant", + content: [{ type: "tool_call", toolName: "getWeather", input: { city: "Paris" } } as ContentBlock], + }, + { + role: "tool", + content: [{ type: "tool_result", toolName: "getWeather", output: { temp: 24 } } as ContentBlock], + }, + { role: "user", content: "and tomorrow?" }, + ], + }; + const text = buildTurnText(req); + assert.ok(text.includes("called getWeather"), "tool call survives replay"); + assert.ok(text.includes("getWeather returned"), "tool result survives replay"); + assert.ok(text.includes("and tomorrow?"), "latest user prompt is the live turn"); + assert.ok(text.startsWith("Conversation so far:"), "transcript header present"); +} + +console.log("continuation.test.ts: all assertions passed"); diff --git a/services/agent/test/mcp-servers.test.ts b/services/agent/test/mcp-servers.test.ts new file mode 100644 index 0000000000..97e821429f --- /dev/null +++ b/services/agent/test/mcp-servers.test.ts @@ -0,0 +1,58 @@ +/** + * Unit tests for the user-declared MCP server conversion (Agent B's Slice 4, wired in rivet). + * + * Agent B's `resolve_mcp_servers` emits the McpServerConfig wire shape + * ({name,transport,command,args,env,url?,tools?}, env as a Record), pinned in the Python + * test_wire_contract. This covers the TS half: converting that to the ACP stdio entry the + * session consumes (env as a {name,value} list), skipping remote/http, and not enforcing the + * per-server tools allowlist over ACP in v1. + * + * Run: pnpm exec tsx test/mcp-servers.test.ts + */ +import assert from "node:assert/strict"; + +import { toAcpMcpServers } from "../src/engines/rivet.ts"; +import type { McpServerConfig } from "../src/protocol.ts"; + +assert.deepEqual(toAcpMcpServers(undefined), [], "undefined -> []"); +assert.deepEqual(toAcpMcpServers([]), [], "[] -> []"); + +// stdio server: env Record -> ACP {name,value} list; defaults applied. +{ + const servers: McpServerConfig[] = [ + { + name: "github", + transport: "stdio", + command: "npx", + args: ["-y", "@modelcontextprotocol/server-github"], + env: { GITHUB_PERSONAL_ACCESS_TOKEN: "ghp_x", LOG_LEVEL: "info" }, + tools: ["create_issue"], // allowlist not enforced over ACP v1 (logged), server still delivered + }, + ]; + const out = toAcpMcpServers(servers); + assert.equal(out.length, 1); + assert.equal(out[0].name, "github"); + assert.equal(out[0].command, "npx"); + assert.deepEqual(out[0].args, ["-y", "@modelcontextprotocol/server-github"]); + assert.deepEqual(out[0].env, [ + { name: "GITHUB_PERSONAL_ACCESS_TOKEN", value: "ghp_x" }, + { name: "LOG_LEVEL", value: "info" }, + ]); +} + +// remote/http is skipped (no auth on the wire by design); stdio without command is skipped. +{ + const out = toAcpMcpServers([ + { name: "remote", transport: "http", url: "https://example.com/mcp" }, + { name: "broken", transport: "stdio" }, // no command + ]); + assert.deepEqual(out, [], "http + command-less stdio both skipped"); +} + +// missing env / args default to empty. +{ + const out = toAcpMcpServers([{ name: "fs", transport: "stdio", command: "mcp-fs" }]); + assert.deepEqual(out, [{ name: "fs", command: "mcp-fs", args: [], env: [] }]); +} + +console.log("mcp-servers.test.ts: all assertions passed"); diff --git a/services/agent/test/responder.test.ts b/services/agent/test/responder.test.ts new file mode 100644 index 0000000000..e06ae43e00 --- /dev/null +++ b/services/agent/test/responder.test.ts @@ -0,0 +1,84 @@ +/** + * Unit tests for the interaction responder seam and the otel `emitEvent` hook. + * + * Covers the behavior parity of the responder (it replaces the old inline auto-approve in + * rivet.ts) and that an out-of-stream event (an `interaction_request`) routed through + * `emitEvent` lands in both the live sink and the batch `events()` log. No harness, no + * network. + * + * Run: pnpm exec tsx test/responder.test.ts + */ +import assert from "node:assert/strict"; + +import { createRivetOtel } from "../src/tracing/otel.ts"; +import type { AgentEvent } from "../src/protocol.ts"; +import { + PolicyResponder, + decisionToReply, + policyFromRequest, +} from "../src/responder.ts"; + +// --- policyFromRequest ------------------------------------------------------- +{ + delete process.env.AGENTA_RIVET_DENY_PERMISSIONS; + assert.equal(policyFromRequest(undefined), "auto"); + assert.equal(policyFromRequest("auto"), "auto"); + assert.equal(policyFromRequest("deny"), "deny"); + + process.env.AGENTA_RIVET_DENY_PERMISSIONS = "true"; + assert.equal(policyFromRequest(undefined), "deny", "env forces deny"); + assert.equal(policyFromRequest("auto"), "deny", "env overrides auto"); + delete process.env.AGENTA_RIVET_DENY_PERMISSIONS; +} + +// --- decisionToReply (parity with the old inline mapping) -------------------- +{ + assert.equal(decisionToReply("allow", ["always", "once", "reject"]), "always"); + assert.equal(decisionToReply("allow", ["once", "reject"]), "once"); + assert.equal(decisionToReply("allow", []), "once", "allow falls back to once"); + assert.equal(decisionToReply("deny", ["always", "once", "reject"]), "reject"); + assert.equal(decisionToReply("deny", []), "reject", "deny falls back to reject"); +} + +// --- PolicyResponder --------------------------------------------------------- +{ + const auto = new PolicyResponder("auto"); + const deny = new PolicyResponder("deny"); + const req = { id: "p1", availableReplies: ["once", "reject"] }; + assert.equal(await auto.onPermission(req), "allow"); + assert.equal(await deny.onPermission(req), "deny"); +} + +// --- emitEvent: streaming path (sink + batch) -------------------------------- +{ + const emitted: AgentEvent[] = []; + const run = createRivetOtel({ harness: "claude", model: "anthropic/x", emit: (e) => emitted.push(e) }); + run.start({ prompt: "hi" }); + const interaction: AgentEvent = { + type: "interaction_request", + id: "p1", + kind: "permission", + payload: { availableReplies: ["once", "reject"] }, + }; + run.emitEvent(interaction); + + const live = emitted.find((e) => e.type === "interaction_request"); + assert.ok(live, "interaction_request flushed to the live sink"); + assert.equal((live as any).id, "p1"); + assert.ok( + run.events().some((e) => e.type === "interaction_request"), + "interaction_request also recorded in the batch log", + ); +} + +// --- emitEvent: one-shot path (batch only) ----------------------------------- +{ + const run = createRivetOtel({ harness: "claude", model: "anthropic/x" }); + run.start({ prompt: "hi" }); + run.emitEvent({ type: "data", name: "weather", data: { temp: 24 } }); + const ev = run.events().find((e) => e.type === "data"); + assert.ok(ev, "data event recorded with no live sink"); + assert.equal((ev as any).name, "weather"); +} + +console.log("responder.test.ts: all assertions passed"); diff --git a/services/agent/test/tool-bridge.test.ts b/services/agent/test/tool-bridge.test.ts new file mode 100644 index 0000000000..1838177918 --- /dev/null +++ b/services/agent/test/tool-bridge.test.ts @@ -0,0 +1,151 @@ +/** + * Unit tests for buildToolMcpServers (the tool MCP bridge attachment decision). + * + * Regression cover for F4: attachment must be decided per tool kind, not on the callback + * endpoint alone. A `code` tool runs locally in mcp-server.ts and needs no endpoint, so a run + * whose tools are all `code` must still attach the `agenta-tools` server. Only `callback`-kind + * tools require AGENTA_TOOL_CALLBACK_ENDPOINT; missing it must degrade those tools, not drop the + * whole server. `client` tools are browser-fulfilled and never justify attaching the bridge. + * + * Run: pnpm exec tsx test/tool-bridge.test.ts + */ +import assert from "node:assert/strict"; + +import { buildToolMcpServers } from "../src/tools/mcp-bridge.ts"; +import type { ResolvedToolSpec, ToolCallbackContext } from "../src/protocol.ts"; + +/** Look up an env var value by name in the ACP {name,value} list (undefined if absent). */ +function envValue( + env: { name: string; value: string }[], + name: string, +): string | undefined { + return env.find((e) => e.name === name)?.value; +} + +// code-only specs + NO callback -> one server, with AGENTA_TOOL_SPECS but no endpoint (F4). +{ + const specs: ResolvedToolSpec[] = [ + { name: "adder", kind: "code", runtime: "python", code: "def main(**k): return 1" }, + ]; + const out = buildToolMcpServers(specs, undefined); + assert.equal(out.length, 1, "code-only run still attaches the server"); + assert.equal(out[0].name, "agenta-tools"); + assert.ok( + envValue(out[0].env, "AGENTA_TOOL_SPECS") !== undefined, + "AGENTA_TOOL_SPECS is set", + ); + assert.equal( + envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"), + undefined, + "no endpoint env for code-only run", + ); + // The full executable spec list round-trips through AGENTA_TOOL_SPECS. + assert.deepEqual(JSON.parse(envValue(out[0].env, "AGENTA_TOOL_SPECS")!), specs); +} + +// callback specs + a callback with endpoint -> one server carrying endpoint (+ auth). +{ + const specs: ResolvedToolSpec[] = [ + { name: "search", kind: "callback", callRef: "composio.search" }, + ]; + const callback: ToolCallbackContext = { + endpoint: "https://agenta.example/tools/call", + authorization: "Bearer tok", + }; + const out = buildToolMcpServers(specs, callback); + assert.equal(out.length, 1); + assert.equal( + envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"), + "https://agenta.example/tools/call", + "endpoint env set for callback tools", + ); + assert.equal( + envValue(out[0].env, "AGENTA_TOOL_CALLBACK_AUTH"), + "Bearer tok", + "auth env set when provided", + ); +} + +// callback spec + endpoint but no authorization -> no AUTH env entry. +{ + const specs: ResolvedToolSpec[] = [ + { name: "search", kind: "callback", callRef: "composio.search" }, + ]; + const out = buildToolMcpServers(specs, { endpoint: "https://agenta.example/tools/call" }); + assert.equal(out.length, 1); + assert.equal( + envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"), + "https://agenta.example/tools/call", + ); + assert.equal( + envValue(out[0].env, "AGENTA_TOOL_CALLBACK_AUTH"), + undefined, + "no AUTH env when authorization absent", + ); +} + +// absent kind defaults to callback (back-compat): endpoint still wired when present. +{ + const specs: ResolvedToolSpec[] = [{ name: "legacy", callRef: "composio.legacy" }]; + const out = buildToolMcpServers(specs, { endpoint: "https://agenta.example/tools/call" }); + assert.equal(out.length, 1, "back-compat (no kind) attaches as a callback tool"); + assert.equal( + envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"), + "https://agenta.example/tools/call", + ); +} + +// mixed code+callback specs + NO endpoint -> still one server (so code works), endpoint omitted. +{ + const specs: ResolvedToolSpec[] = [ + { name: "adder", kind: "code", runtime: "python", code: "def main(**k): return 1" }, + { name: "search", kind: "callback", callRef: "composio.search" }, + ]; + const out = buildToolMcpServers(specs, undefined); + assert.notDeepEqual(out, [], "mixed run with no endpoint must not return []"); + assert.equal(out.length, 1, "still attaches the server so the code tool works"); + assert.equal( + envValue(out[0].env, "AGENTA_TOOL_CALLBACK_ENDPOINT"), + undefined, + "endpoint env omitted when missing", + ); + // Both executable specs are still passed to the bridge. + assert.deepEqual(JSON.parse(envValue(out[0].env, "AGENTA_TOOL_SPECS")!), specs); +} + +// empty specs -> []. +assert.deepEqual(buildToolMcpServers([], undefined), [], "empty specs -> []"); + +// client-only specs -> [] (no executable tools; the bridge does not advertise client tools). +{ + const specs: ResolvedToolSpec[] = [{ name: "confirm", kind: "client" }]; + assert.deepEqual( + buildToolMcpServers(specs, undefined), + [], + "client-only -> [] (nothing executable here)", + ); + // Even with an endpoint, client-only stays empty. + assert.deepEqual( + buildToolMcpServers(specs, { endpoint: "https://agenta.example/tools/call" }), + [], + "client-only -> [] even with an endpoint", + ); +} + +// client tools alongside an executable one are dropped from AGENTA_TOOL_SPECS, server attaches. +{ + const specs: ResolvedToolSpec[] = [ + { name: "confirm", kind: "client" }, + { name: "adder", kind: "code", runtime: "python", code: "def main(**k): return 1" }, + ]; + const out = buildToolMcpServers(specs, undefined); + assert.equal(out.length, 1, "executable spec attaches the server"); + const passed: ResolvedToolSpec[] = JSON.parse(envValue(out[0].env, "AGENTA_TOOL_SPECS")!); + assert.deepEqual( + passed.map((s) => s.name), + ["adder"], + "client spec excluded from the executable list passed to the bridge", + ); +} + +console.log("tool-bridge.test.ts: all assertions passed"); diff --git a/services/agent/test/tool-dispatch.test.ts b/services/agent/test/tool-dispatch.test.ts new file mode 100644 index 0000000000..8ec779d396 --- /dev/null +++ b/services/agent/test/tool-dispatch.test.ts @@ -0,0 +1,85 @@ +/** + * Unit tests for the shared tool-dispatch module (tools/dispatch.ts) and its routing. + * + * The kind-dispatch ("branch on spec.kind to execute a resolved tool") used to be duplicated + * across engines/pi.ts, extensions/agenta.ts, and tools/mcp-server.ts. It now lives once in + * `runResolvedTool`. These tests cover both the routing into that function and the call-site + * advertising behavior that stays per-site: + * - buildCustomTools (pi.ts) skips `client` specs, builds a tool per `code`/`callback` spec, + * and skips a `callback` spec with no callback endpoint. + * - runResolvedTool runs a real `code` snippet end-to-end (python) and throws for `client`. + * + * No network and no harness: the `code` path shells out to python3 (available locally); the + * `callback`/relay paths are not exercised here (they need a live /tools/call or a relay dir). + * + * Run: pnpm exec tsx test/tool-dispatch.test.ts + */ +import assert from "node:assert/strict"; + +import { buildCustomTools } from "../src/engines/pi.ts"; +import { runResolvedTool } from "../src/tools/dispatch.ts"; +import type { ResolvedToolSpec, ToolCallbackContext } from "../src/protocol.ts"; + +const callback: ToolCallbackContext = { endpoint: "https://agenta.test/tools/call" }; + +const clientSpec: ResolvedToolSpec = { name: "client_tool", kind: "client" }; +const codeSpec: ResolvedToolSpec = { + name: "code_tool", + kind: "code", + runtime: "python", + code: 'def main(**kw):\n return {"echo": kw}\n', +}; +const callbackSpec: ResolvedToolSpec = { + name: "callback_tool", + kind: "callback", + callRef: "composio.SOME_ACTION", +}; + +// --- buildCustomTools routing ----------------------------------------------- +{ + const tools = buildCustomTools([clientSpec, codeSpec, callbackSpec], callback); + const names = tools.map((t) => t.name); + + // `client` is browser-fulfilled, so it is never registered in-process. + assert.ok(!names.includes("client_tool"), "client spec is skipped"); + // `code` and `callback` each produce exactly one tool with the spec's name. + assert.ok(names.includes("code_tool"), "code spec produces a tool"); + assert.ok(names.includes("callback_tool"), "callback spec produces a tool"); + assert.equal(tools.length, 2, "only the two executable specs produce tools"); +} + +// A `callback` spec with no callback endpoint is skipped (logged), but a sibling `code` +// spec still registers (code never needs the endpoint). +{ + const tools = buildCustomTools([codeSpec, callbackSpec], undefined); + const names = tools.map((t) => t.name); + assert.ok(names.includes("code_tool"), "code spec still registers without an endpoint"); + assert.ok( + !names.includes("callback_tool"), + "callback spec is skipped when no callback endpoint", + ); + assert.equal(tools.length, 1, "only the code spec registers without an endpoint"); +} + +// --- runResolvedTool: code executes; client throws -------------------------- +{ + const text = await runResolvedTool(codeSpec, { greeting: "hi", n: 3 }, { + toolCallId: "call-1", + }); + const parsed = JSON.parse(text); + assert.deepEqual( + parsed, + { echo: { greeting: "hi", n: 3 } }, + "code tool runs the snippet and returns its JSON output containing the input", + ); +} + +{ + await assert.rejects( + () => runResolvedTool(clientSpec, {}, { toolCallId: "call-2" }), + /browser-fulfilled/, + "client tool throws (never executed in-sandbox)", + ); +} + +console.log("tool-dispatch.test.ts: all assertions passed"); diff --git a/services/oss/src/agent/app.py b/services/oss/src/agent/app.py index fe3e164788..6409065d55 100644 --- a/services/oss/src/agent/app.py +++ b/services/oss/src/agent/app.py @@ -27,8 +27,8 @@ SessionConfig, make_harness, to_messages, - ui_message_stream, ) +from agenta.sdk.agents.adapters.vercel import agent_run_to_vercel_parts from oss.src.agent.config import load_config, wrapper_dir from oss.src.agent.schemas import AGENT_SCHEMAS @@ -113,7 +113,7 @@ async def _agent( # generator (the normalizer turns it into a streaming response). `/invoke` and the # `/messages` JSON path leave it unset and take the batch path below. if stream: - return _agent_stream(harness, session_config, msgs) + return _agent_vercel_stream(harness, session_config, msgs) await harness.setup() try: @@ -125,7 +125,7 @@ async def _agent( return {"role": "assistant", "content": result.output} -async def _agent_stream(harness, session_config, msgs): +async def _agent_vercel_stream(harness, session_config, msgs): """Run one streaming turn and yield Vercel UI Message Stream parts. Owns the environment lifecycle (``setup`` / ``cleanup``); the per-turn session is torn @@ -135,7 +135,7 @@ async def _agent_stream(harness, session_config, msgs): await harness.setup() try: run = await harness.stream(session_config, msgs) - async for part in ui_message_stream(run): + async for part in agent_run_to_vercel_parts(run): yield part try: record_usage(run.result().usage)