Agenta-AI · mmabrouk · Jun 17, 2026
diff --git a/.gitignore b/.gitignore
@@ -12,6 +12,8 @@
 **/*dont_commit_me*
 web/packages/agenta-api-client/dist/
 web/tsconfig.tsbuildinfo
+# Agent Pi extension bundle, built by `pnpm run build:extension` and in the Docker image.
+services/agent/dist/
 
 __pycache__/
 **/__pycache__/

diff --git a/docs/design/agent-workflows/README.md b/docs/design/agent-workflows/README.md
@@ -116,6 +116,9 @@ running agent.
 - [`wp-7-tools/`](wp-7-tools/README.md) — make runnable tools part of the agent config; resolve
   Composio actions into Pi tools and route tool calls back through the existing
   `POST /tools/call`, with MCP and workflow-as-tool as future adapters.
+- [`wp-8-rivet-acp-runtime/`](wp-8-rivet-acp-runtime/README.md) — re-platform the service onto
+  `rivet-dev/sandbox-agent` so the agent is driven over ACP and the harness (Pi, Claude Code,
+  Codex) becomes a config value, running locally first; tools, Daytona, and the folder jail deferred.
 
 ## Related work
 

diff --git a/docs/design/agent-workflows/wp-8-rivet-acp-runtime/README.md b/docs/design/agent-workflows/wp-8-rivet-acp-runtime/README.md
@@ -0,0 +1,80 @@
+# WP-8: Rivet + ACP agent runtime
+
+Status: design ready to implement. Start at [`plan.md`](plan.md). Decisions and open
+items are in [`status.md`](status.md).
+
+This folder is self-contained. A new engineer should be able to read it and implement the
+work end to end without prior context. Read in this order: this README, then
+[`context.md`](context.md) (the code that exists today), [`research.md`](research.md)
+(verified facts about rivet, ACP, and the pattern we copy), [`architecture.md`](architecture.md)
+(the target design), and [`plan.md`](plan.md) (the phased build).
+
+## Summary
+
+Re-platform the agent workflow service (`services/oss/src/agent.py`) so it drives the
+agent over the **Agent Client Protocol (ACP)** through [`rivet-dev/sandbox-agent`](https://github.com/rivet-dev/sandbox-agent),
+instead of the bespoke Pi JSON protocol it uses today.
+
+The `/invoke` contract does not change. The handler still builds a user turn and returns
+`{"role": "assistant", "content": ...}`. What changes is the transport behind the existing
+`Harness` port: rivet runs the chosen harness (Pi, Claude Code) as an ACP session and
+streams the reply back. Picking a different harness becomes a config value, not new code.
+
+## The four requirements
+
+1. **Drive the agent over ACP**, not the Pi JSON protocol. Rivet speaks ACP to the
+   harness; our service drives rivet.
+2. **Swap harness as config.** The same agent config runs on Pi or Claude Code by setting
+   one value.
+3. **Run locally.** The same path runs on a dev machine with no container, using rivet's
+   `local` provider. The rivet server is open source, so running it locally is normal.
+4. **Defer tools.** Ship with no tools. The tool model is fixed (definition plus swappable
+   body, delivered per-harness over MCP), but nothing is built here.
+
+## The design in five lines
+
+- Keep `agent.py`, the `/invoke` contract, and the `Harness` port unchanged.
+- Add a `RivetHarness` adapter behind the port, plus a small TypeScript runner that wraps
+  the rivet SDK.
+- Run **one rivet daemon and one sandbox per invoke** (cold), then tear it down. This
+  copies the pattern Agenta already ships for code evaluators.
+- Inject the trace context as an environment variable **at the daemon's birth** (the
+  sandbox `env_vars` on Daytona, the SDK `env` option locally). No fork of rivet or the
+  adapters is needed under this per-invoke model.
+- Two axes swap independently: **sandbox** (local, daytona) and **harness** (pi, claude).
+
+## Agent configuration (the contract to rivet: filesystem plus config)
+
+- **AGENTS.md** — instructions, after variable substitution.
+- **Input variables** — substituted into AGENTS.md, like prompt-template variables.
+- **Skills** — laid into the workspace as files (path and format are per-harness).
+- **Tool definitions** — schema only, separate from bodies. Empty here.
+- **Harness** — `pi` / `claude`.
+- **Sandbox** — `local` / `daytona`.
+- **Secrets** — harness and LLM auth, passed as launch env, never written into the
+  agent-visible filesystem.
+
+## In scope
+
+ACP transport via rivet, harness swap (Pi and Claude Code), local run, and **tracing**
+(the agent's spans must nest under the `/invoke` span; standalone traces are not
+acceptable). Daytona and concurrency are described as the immediate follow-on phases.
+
+## Deferred (each its own follow-on)
+
+- **Tools** ([WP-7](../wp-7-tools/README.md)): the definition-plus-body model over MCP.
+- **Folder isolation (the jail)**: rivet has no filesystem confinement. Needed only when a
+  single warm daemon hosts many agents at once. A TypeScript-or-Rust change, deferred. See
+  [`isolation-and-fork.md`](isolation-and-fork.md).
+- **Multi-turn and streaming to the client** ([WP-4](../wp-4-multi-message-output/README.md)):
+  WP-8 keeps one turn in, one message out, matching today's behavior. When multi-turn
+  lands, a session is persisted message history replayed via ACP `session/load`.
+- **Standalone SDK runner**: run an agent from the SDK with a config. The adapters are
+  written to live in the SDK so this is a packaging step later, not a rewrite.
+
+## Why rivet
+
+Rivet is the thing we were about to hand-build in the `Harness` and `Runtime` ports: an
+ACP daemon that drives several harnesses, keyed by session, over a swappable sandbox
+(local, daytona) with an HTTP and SSE control plane. We adopt it unmodified (Apache-2.0).
+The one capability it lacks, filesystem confinement, we are deferring.
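The "Agent configuration" contract in the README above (files in the workspace plus launch env) can be sketched roughly as follows. This is a minimal illustration, not code from the PR: `AgentConfig`, `render_agents_md`, and `resolve` are hypothetical names, and the `{{name}}` placeholder syntax is an assumption, since the README only says variables behave like prompt-template variables.

```python
import re
from dataclasses import dataclass, field

# Assumption: input variables use a {{name}} placeholder syntax.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container for the fields listed under "Agent configuration"."""

    agents_md: str
    variables: dict[str, str] = field(default_factory=dict)
    skills: dict[str, str] = field(default_factory=dict)  # relative path -> content
    tool_definitions: list[dict] = field(default_factory=list)  # empty in WP-8
    harness: str = "pi"  # "pi" or "claude"
    sandbox: str = "local"  # "local" or "daytona"
    secrets: dict[str, str] = field(default_factory=dict)  # harness and LLM auth


def render_agents_md(template: str, variables: dict[str, str]) -> str:
    """Substitute {{name}} placeholders and fail loudly on a missing variable."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing input variable: {name}")
        return variables[name]

    return _PLACEHOLDER.sub(replace, template)


def resolve(config: AgentConfig) -> tuple[dict[str, str], dict[str, str]]:
    """Split a config into (workspace files, launch env).

    Secrets only ever land in the env mapping, never in the files mapping.
    """
    if config.harness not in {"pi", "claude"}:
        raise ValueError(f"unknown harness: {config.harness}")
    if config.sandbox not in {"local", "daytona"}:
        raise ValueError(f"unknown sandbox: {config.sandbox}")
    files = {"AGENTS.md": render_agents_md(config.agents_md, config.variables)}
    files.update(config.skills)
    return files, dict(config.secrets)
```

Keeping the two return values separate mirrors the rule that secrets travel as launch env and are never written where the agent can read them.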
diff --git a/docs/design/agent-workflows/wp-8-rivet-acp-runtime/architecture.md b/docs/design/agent-workflows/wp-8-rivet-acp-runtime/architecture.md
@@ -0,0 +1,176 @@
+# Architecture
+
+## Principle
+
+Keep the `Harness` port and the `/invoke` contract. Add one adapter behind the port that
+runs the agent through rivet over ACP, and a small TypeScript runner that wraps the rivet
+SDK. Everything Pi-specific moves below the port and becomes one harness choice.
+
+```
+                 unchanged
+  ┌───────────────────────────────────────────────┐
+  │ agent.py  (/invoke, /inspect, ag.create_app)   │
+  │   _resolve_run_config / _latest_user_message   │
+  │   _build_harness()  ── selects adapter by env  │
+  └───────────────────────────────────────────────┘
+                      │  Harness port (setup / invoke / shutdown)
+                      ▼
+  ┌───────────────────────────────────────────────┐
+  │ RivetHarness (new, Python)                     │   PiHarness / PiHttpHarness
+  │  maps HarnessRequest + {harness, sandbox} →    │   (kept; legacy path)
+  │  a one-shot rivet run; passes trace + secrets  │
+  └───────────────────────────────────────────────┘
+                      │  /run (HTTP or stdio), same contract family as runPi
+                      ▼
+  ┌───────────────────────────────────────────────┐
+  │ runRivet.ts  (services/agent, wraps rivet SDK) │
+  │  start({ sandbox, env }) → createSession({     │
+  │  agent, cwd }) → write AGENTS.md → prompt →     │
+  │  collect chunks → destroy                       │
+  └───────────────────────────────────────────────┘
+                      │  spawns the daemon (local subprocess, or in Daytona)
+                      ▼
+  ┌───────────────────────────────────────────────┐
+  │ sandbox-agent daemon (Rust, one per invoke)    │
+  └───────────────────────────────────────────────┘
+                      │  ACP (JSON-RPC: session/prompt, session/update)
+                      ▼
+  ┌───────────────────────────────────────────────┐
+  │ harness ACP adapter subprocess in cwd          │
+  │  pi-acp │ claude-code-acp                       │
+  └───────────────────────────────────────────────┘
+```
+
+The ACP boundary is daemon to harness. That is the requirement: the agent loop runs over
+ACP, not the Pi JSON envelope. The service-to-rivet hop is rivet's own control surface and
+stays harness-agnostic behind the port.
+
+## Two orthogonal swap axes
+
+These swap independently. Do not bundle them.
+
+- **Sandbox (where the daemon runs):** `local`, `daytona`. A config value passed to
+  `runRivet`, which selects the rivet provider.
+- **Harness (which engine):** `pi`, `claude`. A config value passed as the rivet `agent`.
+
+The demo proves each separately: swap `local` and `daytona` with the harness fixed, and
+swap `pi` and `claude` with the sandbox fixed.
+
+## Lifecycle: one daemon and one sandbox per invoke (cold)
+
+Each `/invoke` brings up its own daemon and sandbox, runs, and tears down. This copies the
+shipped code-evaluator pattern (`DaytonaRunner`: an ephemeral sandbox per execution from a
+snapshot, deleted in a `finally`). Two reasons it is the right default:
+
+- It makes the daemon's environment **per-invoke**, which is what makes tracing work
+  without forking anything (see below).
+- It needs no filesystem jail, because agents never share a daemon.
+
+The cost is acceptable. Locally the daemon is a Rust binary that boots in tens of
+milliseconds, so the per-invoke cost is dominated by the Node adapter spawn (~0.2 to
+0.5s). On Daytona, creating the sandbox adds ~1s. Concurrency is bounded the way
+evaluations already bound it (see Concurrency).
+
+## Tracing: inject at the daemon's birth
+
+The agent's spans must nest under the `/invoke` span. Standalone traces are not
+acceptable. The mechanism is uniform across sandboxes because each invoke owns its daemon:
+
+- The static OTLP target and auth (`OTEL_*`, the Agenta endpoint and `Authorization`) and
+  the per-invoke `traceparent` go into the daemon's environment when it is created.
+  - **Local:** the SDK `env` option on `start({ sandbox: local(), env })`.
+  - **Daytona:** the sandbox `env_vars`, exactly like `DaytonaRunner` injects `AGENTA_*`.
+- The daemon passes its env to the adapter subprocess, which passes it to the harness.
+- **Pi:** install the `agenta-otel` logic as a Pi extension in the environment (global
+  `~/.pi/agent/extensions`, or baked into the Daytona snapshot). Pi loads it and emits
+  spans under the injected `traceparent`.
+- **Claude Code:** set `CLAUDE_CODE_ENABLE_TELEMETRY=1`, `OTEL_*`, and `TRACEPARENT`, and
+  run it in `-p` / Agent-SDK mode.
+
+No fork of rivet or the adapters is needed under the per-invoke model. A fork is only
+needed if a later phase shares one warm daemon across concurrent invokes, and even then
+it touches the TypeScript adapter (to read ACP `_meta.traceparent`), not the Rust daemon.
+
+## Components
+
+### `RivetHarness` (Python, new)
+
+`services/oss/src/agent_pi/rivet_harness.py`, implements the `Harness` ABC. It holds the
+harness id and sandbox choice (from config) and the trace/secret context, and maps a
+`HarnessRequest` onto a `runRivet` `/run` call. Field mapping:
+
+| `HarnessRequest` | Becomes |
+| --- | --- |
+| `agents_md` | written as `AGENTS.md` into the session `cwd` |
+| `model` | session model where the harness honors it (the adapter normalizes this) |
+| `prompt` | the ACP prompt text |
+| `messages` | the MVP uses only the latest user turn; history replay comes later |
+| `tools` etc. | unused (empty) in WP-8 |
+| `trace` | injected as daemon env (`traceparent`, OTLP endpoint, auth) |
+
+### `runRivet.ts` (TypeScript, in `services/agent`)
+
+Wraps the rivet SDK. Selected by env (`AGENT_BACKEND=rivet`) and serves the same `/run`
+contract `runPi.ts` serves, so the Python side stays thin. Per invoke:
+
+1. `start({ sandbox: local() | daytona({...}), env })` (env carries trace + secrets).
+2. `createSession({ agent: <harness>, cwd })`.
+3. Write `AGENTS.md` (and later skills) into `cwd`.
+4. `prompt(sessionId, prompt)`, accumulate `agent_message_chunk` into the output.
+5. `destroy()`.
+6. Return `{ ok, output, sessionId, model }`.
+
+### `agent.py` selection
+
+Extend `_build_harness()` with `AGENTA_AGENT_RUNTIME=rivet` to return `RivetHarness`
+(harness from `AGENTA_AGENT_HARNESS`, sandbox from config, default `local`). Keep the Pi
+path as default so nothing regresses.
+
+## Agent configuration (the contract: filesystem plus config)
+
+Resolved before each run: AGENTS.md, input variables (substituted into AGENTS.md), skills
+(files in the workspace), tool definitions (empty here), harness, sandbox, secrets. The
+contract handed to rivet is files in `cwd` plus the session/daemon config. Secrets go as
+launch env, never as files, because there is no jail.
+
+## Tools: definition vs body (deferred, but shapes the seam)
+
+A tool splits into a **definition** (the schema the model sees, stored in a neutral
+OpenAI-function shape) and a **body** (the execution). The body is swappable: real,
+service-backed, or mock. A test variant of an agent swaps bodies without touching
+definitions. Delivery is per-harness over **MCP** (rivet's per-directory MCP config), not a
+raw OpenAI array. The body model is general and not Agenta-specific: a self-contained body
+runs in-process, a service-backed body (for example a Composio tool calling Agenta's
+`/tools/call`) needs its service reachable (a local or remote Agenta), and a mock needs
+nothing. WP-8 ships no tools; this is the shape to preserve, not build.
+
+## Sessions and state
+
+A session is the **stored message history**, not a kept-alive sandbox. Because we offer no
+persistent file writes, nothing on disk is worth keeping. So each turn gets an ephemeral
+sandbox, messages are persisted, and a conversation continues by replaying history with
+ACP `session/load` (Pi `resumeSession`, Claude Code `loadSession`). At-rest cost is zero.
+The history store is the backend DB on the platform and a local file when running
+standalone. Tradeoff: replaying a long history re-sends tokens, so cap it. Paused or
+FS-persisted sessions wait until we offer durable writes.
+
+## Concurrency
+
+Mirror evaluations. Do not run the agent inside the API request if a background path is
+available; dispatch it like an evaluation (taskiq worker on a Redis stream) and bound
+concurrency with a shared semaphore. Each concurrent slot is one ephemeral sandbox, so the
+semaphore caps how many sandboxes (and how much Daytona cost) run at once. Extra invokes
+queue. Locally a slot is a cheap subprocess.
+
+## Running standalone via the SDK (later)
+
+The harness and sandbox adapters are written to live in the SDK, so the backend service
+and a standalone run share one implementation. Running locally is not special: the rivet
+server is open source (Apache-2.0, a static binary), so a local run runs that server
+locally and the SDK wraps the rivet client. A standalone run fetches or loads a config,
+then calls the SDK runner.
+
+## What this does not change
+
+No new endpoints. No change to `/invoke` or `/inspect` shapes. No tools, no jail, no
+multi-turn, no client-side streaming. Each is its own follow-on.
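Three ideas in `architecture.md` above fit together: the cold per-invoke lifecycle, trace injection at the daemon's birth, and semaphore-bounded concurrency. The sketch below shows one way they might compose. It is an illustration under stated assumptions, not the PR's implementation: `FakeSandbox` stands in for the rivet daemon and sandbox, `Trace`, `daemon_env`, `invoke_once`, and `run_batch` are hypothetical names, and the concrete variable names (`OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_EXPORTER_OTLP_HEADERS`, `TRACEPARENT`) are one plausible reading of the document's `OTEL_*` and `traceparent`.

```python
import asyncio
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Trace:
    """Hypothetical subset of the trace context carried by a request."""

    traceparent: str
    endpoint: str
    authorization: str


def daemon_env(trace: Trace, secrets: dict[str, str]) -> dict[str, str]:
    """Environment fixed when the daemon is created: secrets, OTLP target, trace parent."""
    env = dict(secrets)
    env.update(
        {
            "OTEL_EXPORTER_OTLP_ENDPOINT": trace.endpoint,
            # Assumption: header values are percent-encoded, as in W3C Baggage.
            "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=" + quote(trace.authorization, safe=""),
            "TRACEPARENT": trace.traceparent,
        }
    )
    return env


class FakeSandbox:
    """Stand-in for one rivet daemon plus sandbox; counts live instances for the demo."""

    live = 0
    peak = 0

    def __init__(self, env: dict[str, str]) -> None:
        self.env = env

    async def start(self) -> None:
        FakeSandbox.live += 1
        FakeSandbox.peak = max(FakeSandbox.peak, FakeSandbox.live)

    async def prompt(self, text: str) -> str:
        await asyncio.sleep(0.01)  # pretend the harness is working
        return f"echo: {text}"

    async def destroy(self) -> None:
        FakeSandbox.live -= 1


async def invoke_once(prompt: str, trace: Trace, slots: asyncio.Semaphore) -> str:
    """One cold invoke: acquire a slot, create, run, always tear down."""
    async with slots:  # extra invokes queue here
        sandbox = FakeSandbox(daemon_env(trace, secrets={}))
        await sandbox.start()
        try:
            return await sandbox.prompt(prompt)
        finally:
            await sandbox.destroy()  # mirrors the delete-in-finally pattern


async def run_batch(prompts: list[str], limit: int = 2) -> list[str]:
    """Run many invokes while capping how many sandboxes exist at once."""
    slots = asyncio.Semaphore(limit)
    trace = Trace(
        traceparent="00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
        endpoint="http://localhost:4318",
        authorization="ApiKey demo",
    )
    return await asyncio.gather(*(invoke_once(p, trace, slots) for p in prompts))
```

Because each invoke builds its own environment before the daemon starts, the `traceparent` never has to travel over ACP, which is why the per-invoke model needs no fork.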
diff --git a/docs/design/agent-workflows/wp-8-rivet-acp-runtime/context.md b/docs/design/agent-workflows/wp-8-rivet-acp-runtime/context.md
@@ -0,0 +1,89 @@
+# Context: the code that exists today
+
+Read this to orient on the current service before changing it. All paths are in this repo
+(`/home/mahmoud/code/agenta`).
+
+## The agent service (WP-2)
+
+`services/oss/src/agent.py` is an Agenta app exposing `/invoke` and `/inspect`, like the
+chat and completion services. The handler `_agent(...)`:
+
+1. Resolves config with `_resolve_run_config(...)`: model, AGENTS.md (the system text),
+   and tools, from the request `parameters` or the file config.
+2. Builds the latest user turn with `_latest_user_message(...)`.
+3. Picks a harness adapter with `_build_harness()` and calls the `Harness` port
+   (`setup` / `invoke` / `shutdown`).
+4. Returns `{"role": "assistant", "content": result.output}`.
+
+Trace context is captured in `_trace_context()` and threaded into the harness so the
+agent's spans nest under the `/invoke` span.
+
+## The ports (the seam we keep)
+
+`services/oss/src/agent_pi/ports.py`:
+
+- `Harness` (ABC): `setup()`, `invoke(HarnessRequest) -> HarnessResult`, `shutdown()`.
+- `HarnessRequest`: `agents_md`, `model`, `prompt`, `messages`, `tools`, `custom_tools`,
+  `tool_callback`, `trace`.
+- `HarnessResult`: `output`, `session_id`, `model`.
+- `TraceContext`: `traceparent`, `baggage`, `endpoint` (OTLP), `authorization`,
+  `capture_content`. Has `to_wire()` (camelCase).
+- `Runtime` (ABC): the sandbox/environment seam for the legacy Pi path (`start`,
+  `shutdown`, `exec`). The rivet path does not use `Runtime.exec`; it selects a rivet
+  provider instead (see architecture).
+
+## The current Pi adapters (legacy, keep working)
+
+- `services/oss/src/agent_pi/pi_harness.py` (`PiHarness`): spawns the TypeScript Pi
+  wrapper as a subprocess, one JSON object over stdio.
+- `services/oss/src/agent_pi/pi_http_harness.py` (`PiHttpHarness`): POSTs the same JSON to
+  the wrapper running as an HTTP sidecar.
+- Both send a Pi-shaped envelope (`{agentsMd, model, prompt, messages, tools, customTools,
+  toolCallback, trace}`).
+
+## The TypeScript wrapper
+
+`services/agent/` is a small Node service.
+
+- `src/runPi.ts`: turns the envelope into direct Pi SDK calls (`createAgentSession`, ...).
+- `src/agenta-otel.ts`: a Pi OTel helper. Today `runPi.ts` imports it in-process and emits
+  `invoke_agent` as a child of the incoming `traceparent`. Under rivet this logic must
+  become a Pi **extension** installed in the environment (see architecture, tracing).
+- `src/server.ts` (HTTP `/run`) and `src/cli.ts` (stdio) are the two transports.
+
+## The pattern we copy: how code evaluators run in Daytona
+
+This is the shipped precedent for "ephemeral sandbox per execution", and the agent service
+mirrors it.
+
+- `sdks/python/agenta/sdk/engines/running/runners/` holds `base.py` (`CodeRunner`),
+  `local.py` (`LocalRunner`, in-process `exec`), `daytona.py` (`DaytonaRunner`, remote
+  sandbox), and `registry.py` (`get_runner()`).
+- Selection: env `AGENTA_SERVICES_CODE_SANDBOX_RUNNER` (`local` default, `daytona` in
+  cloud).
+- `DaytonaRunner.run()` creates an `ephemeral=True` sandbox from a snapshot
+  (`DAYTONA_SNAPSHOT`), runs, and deletes it in a `finally`. **One sandbox per execution.**
+  No warm pool, no shared instance. It injects `AGENTA_HOST`, `AGENTA_API_KEY`, and the
+  user's provider keys as the sandbox `env_vars`.
+- Concurrency is bounded by the evaluation engine, not the runner: a shared
+  `asyncio.Semaphore(batch_size)` (default 10) in
+  `sdks/python/agenta/sdk/evaluations/runtime/processor.py`. So at most ~10 ephemeral
+  sandboxes exist at once.
+- Daytona config lives in `api/oss/src/utils/env.py` (`DaytonaConfig`:
+  `DAYTONA_API_KEY`, `DAYTONA_API_URL`, `DAYTONA_SNAPSHOT`, `DAYTONA_TARGET`).
+
+## What we change and what we keep
+
+Change: the transport behind the `Harness` port becomes rivet over ACP, with harness and
+sandbox as config values.
+
+Keep: the `/invoke` and `/inspect` contract, the `Harness` port and its dataclasses, the
+config resolution in `agent.py`, and the env-driven adapter selection in
+`_build_harness()` (extended with a rivet branch). The legacy Pi adapters keep working so
+nothing regresses.
+
+## Conventions
+
+- Standalone scripts run with `uv run` and inline `# /// script` dependencies.
+- Python edits: `ruff format` then `ruff check --fix` before committing.
+- Local-server parity is a first-class requirement carried from WP-2.
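For orientation, the port shapes listed in `context.md` above and the env-driven selection in `_build_harness()` might look roughly like this. The field names come from the document; everything else is an assumption made for illustration: the field types and defaults, synchronous method signatures, the exact `to_wire()` output, the `pi` default when `AGENTA_AGENT_HARNESS` is unset, and the stub classes, which only echo the prompt.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass
class TraceContext:
    traceparent: Optional[str] = None
    baggage: Optional[str] = None
    endpoint: Optional[str] = None  # OTLP endpoint
    authorization: Optional[str] = None
    capture_content: bool = False

    def to_wire(self) -> dict:
        """camelCase form for the TypeScript side (exact shape is assumed)."""
        return {
            "traceparent": self.traceparent,
            "baggage": self.baggage,
            "endpoint": self.endpoint,
            "authorization": self.authorization,
            "captureContent": self.capture_content,
        }


@dataclass
class HarnessRequest:
    agents_md: str
    model: str
    prompt: str
    messages: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    custom_tools: list = field(default_factory=list)
    tool_callback: Optional[dict] = None
    trace: Optional[TraceContext] = None


@dataclass
class HarnessResult:
    output: str
    session_id: Optional[str] = None
    model: Optional[str] = None


class Harness(ABC):
    @abstractmethod
    def setup(self) -> None: ...

    @abstractmethod
    def invoke(self, request: HarnessRequest) -> HarnessResult: ...

    @abstractmethod
    def shutdown(self) -> None: ...


class _EchoHarness(Harness):
    """Hypothetical stub: echoes the prompt instead of running an agent."""

    label = "stub"

    def setup(self) -> None:
        pass

    def invoke(self, request: HarnessRequest) -> HarnessResult:
        return HarnessResult(output=f"[{self.label}] {request.prompt}", model=request.model)

    def shutdown(self) -> None:
        pass


class StubPiHarness(_EchoHarness):
    label = "pi-legacy"


class StubRivetHarness(_EchoHarness):
    def __init__(self, harness: str = "pi", sandbox: str = "local") -> None:
        self.harness, self.sandbox = harness, sandbox
        self.label = f"{harness}@{sandbox}"


def build_harness(environ: Mapping[str, str]) -> Harness:
    """Pick an adapter by env; the legacy Pi path stays the default."""
    if environ.get("AGENTA_AGENT_RUNTIME") == "rivet":
        return StubRivetHarness(harness=environ.get("AGENTA_AGENT_HARNESS", "pi"))
    return StubPiHarness()
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the selection logic easy to exercise in tests.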