Agenta-AI · mmabrouk · Jun 20, 2026 · Jun 20, 2026
diff --git a/.github/workflows/12-check-unit-tests.yml b/.github/workflows/12-check-unit-tests.yml
@@ -301,3 +301,76 @@ jobs:
           files: services/oss/tests/results/junit.xml
           check_name: Application services Unit Test Results
           comment_mode: off
+
+  run-services-node-unit-tests:
+    # The agent runner (services/agent) is a standalone Node/pnpm package, not part of the
+    # Python services suite above. It runs its own vitest unit tests plus a tsc typecheck gate.
+    # No "has_tests" guard on purpose: this suite is established, so a missing/empty suite must
+    # FAIL the job (vitest exits non-zero on no test files), not silently skip it.
+    if: |
+      github.event_name == 'workflow_dispatch' ||
+      !github.event.pull_request.draft
+    runs-on: ubuntu-latest
+    permissions:
+      checks: write
+      pull-requests: write
+      contents: read
+    env:
+      AGENTA_LICENSE: oss
+    steps:
+      - uses: actions/checkout@v6
+        with:
+          persist-credentials: false
+
+      - name: Skip when package selection excludes services
+        if: github.event_name == 'workflow_dispatch' && !contains(fromJSON('["all","services-only"]'), inputs.packages)
+        run: exit 0
+
+      - name: Set up Node.js
+        if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
+        uses: actions/setup-node@v4
+        with:
+          node-version: '24'
+
+      - name: Enable Corepack
+        if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
+        run: corepack enable
+
+      - name: Cache pnpm store
+        if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
+        uses: actions/cache@v4
+        with:
+          path: ~/.pnpm-store
+          key: ${{ runner.os }}-services-agent-pnpm-${{ hashFiles('services/agent/pnpm-lock.yaml') }}
+          restore-keys: |
+            ${{ runner.os }}-services-agent-pnpm-
+
+      - name: Set up pnpm store
+        if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
+        working-directory: services/agent
+        run: pnpm config set store-dir ~/.pnpm-store
+
+      - name: Install dependencies
+        if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
+        working-directory: services/agent
+        run: pnpm install --frozen-lockfile
+
+      - name: Typecheck (tsc --noEmit, src + tests + config)
+        if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
+        working-directory: services/agent
+        run: pnpm run typecheck
+
+      # The code-tool unit test spawns python3 and node end-to-end; both are preinstalled on
+      # ubuntu runners (node is also set up above), so no setup-python step is needed.
+      - name: Run agent runner unit tests
+        if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
+        working-directory: services/agent
+        run: pnpm run test:unit
+
+      - name: Publish agent runner unit test results
+        if: always() && (github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages))
+        uses: EnricoMi/publish-unit-test-result-action@v2
+        with:
+          files: services/agent/test-results/junit.xml
+          check_name: Agent Runner Unit Test Results
+          comment_mode: off
diff --git a/.gitignore b/.gitignore
@@ -65,6 +65,8 @@ sdks/python/oss/tests/results/
 sdks/python/ee/tests/results/
 services/oss/tests/results/
 services/ee/tests/results/
+services/agent/test-results/
+services/agent/coverage/
 .*
 !**/.gitkeep
 !.github/

diff --git a/docs/design/agent-workflows/typescript-structure/README.md b/docs/design/agent-workflows/typescript-structure/README.md
@@ -0,0 +1,35 @@
+# TypeScript structure for the agent runner
+
+Planning workspace for making the new TypeScript code in the agent-workflows project
+usable, maintainable, and testable, with tests that run easily and run in CI.
+
+The new TypeScript lives mostly in one place: `services/agent/` (the Node "agent runner"
+sidecar). This folder researches its current shape and proposes how to structure, test,
+and gate it the way the rest of the monorepo already handles Python and frontend code.
+
+## Files
+
+- [context.md](context.md) — why this work exists, goals, non-goals, who it is for.
+- [research.md](research.md) — what is actually in the repo today: where the TS lives, how
+  it builds, ships, and is (barely) tested; the conventions the repo already standardizes
+  for TS; a Python-to-TypeScript mental model; the gaps.
+- [plan.md](plan.md) — the phased plan to close the gaps, with concrete file changes,
+  scripts, and CI wiring.
+- [status.md](status.md) — source of truth for progress and open decisions. Read this
+  first to see where things stand.
+
+## TL;DR
+
+The runner code is well-organized (clear `engines/`, `tools/`, `tracing/` seams, a single
+`protocol.ts` wire contract). The weak spots are tooling, not architecture:
+
+1. Eight test files exist but there is **no test runner and no `pnpm test`**. Each test is
+   a hand-run `tsx` script.
+2. Those tests run in **no CI workflow**. The Node side is invisible to the unit-test gate.
+3. There is **no typecheck gate** even though the code is already `strict: true`.
+4. The TS side has **no test asserting the cross-language wire contract**, which is only
+   pinned from Python today.
+
+The plan adopts **vitest** (the runner `web/packages/*` already use), wires a Node job into
+`12-check-unit-tests.yml`, adds a `tsc --noEmit` gate, and adds a golden-fixture round-trip
+test so `protocol.ts` cannot drift from the Python wire silently.
diff --git a/docs/design/agent-workflows/typescript-structure/context.md b/docs/design/agent-workflows/typescript-structure/context.md
@@ -0,0 +1,49 @@
+# Context
+
+## Why this work exists
+
+The agent-workflows project introduced the first substantial server-side TypeScript in a
+repo that was Python on the backend and TypeScript only on the frontend. The new code is
+the agent runner sidecar at `services/agent/`. It drives the agent harnesses (Pi, Claude
+Code, rivet's `sandbox-agent`) because those are Node libraries with no Python SDK. The
+Python agent service calls into it over one JSON contract.
+
+This code grew fast during the build-out. It works and it is reasonably well-factored, but
+it sits outside the conventions the rest of the monorepo follows. The owner is a Python
+developer and wants this TypeScript to feel as routine to maintain and test as the Python
+does: a single command to run the tests, the tests running in CI, a typecheck gate, and a
+clear place for new code and new tests to go.
+
+## Goals
+
+1. **Testable, easily.** One command (`pnpm test`) runs every unit test for the runner.
+   Watch mode and coverage work. Writing a new test is obvious and low-ceremony.
+2. **Tested in CI.** The runner's tests run on every PR that touches it, with results
+   published the same way the Python and web suites are.
+3. **Typechecked.** The `strict` TypeScript already configured produces a CI signal, so a
+   type error fails the build instead of reaching the dockerized sidecar at runtime.
+4. **Contract-safe.** The wire contract between the Python service and the Node runner is
+   guarded from both sides, not just from Python.
+5. **Maintainable and discoverable.** A new contributor (or agent) can find where runner
+   code and runner tests belong, following the same instruction-layering the repo uses for
+   `web/` and `api/`.
+
+## Non-goals
+
+- Rewriting or re-architecting the runner. The `engines` / `tools` / `tracing` split and
+  the `protocol.ts` contract stay. This is about tooling and structure, not a redesign.
+- Folding `services/agent` into the `web/` pnpm workspace. It is a deployable sidecar with
+  its own Docker build and its own lockfile; it should stay a standalone package (see
+  research.md for the trade-off).
+- Changing the frontend TypeScript (`web/oss/src/components/AgentChatSlice/`). That code
+  already lives in the web app under established conventions (vitest, package practices).
+  It is out of scope here.
+- End-to-end / live-LLM acceptance tests for the runner. Those depend on real harness
+  credentials and are tracked separately in the agent-workflows test work. This plan is
+  about the fast unit/contract layer that can run on every PR with no secrets.
+
+## Who this is for
+
+The maintainer (Python-first) and any future contributor or agent touching
+`services/agent`. research.md includes a Python-to-TypeScript mental model so the tooling
+choices map onto things already familiar from the SDK and API side (uv, ruff, pytest).
diff --git a/docs/design/agent-workflows/typescript-structure/plan.md b/docs/design/agent-workflows/typescript-structure/plan.md
@@ -0,0 +1,173 @@
+# Plan
+
+Four phases, ordered so value lands early and nothing later depends on a refactor. Phases 1
+and 2 are the core ask (easy-to-run tests, tests in CI). Phase 3 protects the contract.
+Phase 4 is structure and maintainability, adopted progressively.
+
+Effort estimates assume one developer familiar with the runner. They are deliberate, not
+padded.
+
+## Phase 1 — Make the tests run with one command (~half day)
+
+Goal: `pnpm test` in `services/agent` runs every unit test, with watch and coverage.
+
+0. **Fix the latent bug the typecheck will expose.** `src/tools/dispatch.ts` references an
+   undefined `callRef` at lines 88 and 92 inside `relayToolCall`. Use the in-scope value
+   (`toolName`, or thread the spec's `callRef` in) so the error path stops throwing
+   `ReferenceError`. Found by Codex; this is the proof the typecheck gate has teeth.
+1. Add dev deps to `services/agent/package.json`: `vitest`, `@vitest/coverage-v8`, **and
+   `typescript`** (currently absent: `node_modules/.bin/tsc` does not exist, so `typecheck`
+   cannot run without it). Match the versions `web/packages/*` pin (`vitest` `^4.1.x`); align
+   `@types/node` with Node 24.
+2. Add `services/agent/vitest.config.ts`, modeled on `agenta-shared/vitest.config.ts`:
+   `include: ["tests/unit/**/*.test.ts"]`, `environment: "node"`,
+   `reporters: ["default", "junit"]` to `test-results/junit.xml`, v8 coverage over `src/`.
+3. Add scripts to `package.json`:
+
+   ```jsonc
+   "test": "pnpm run test:unit",
+   "test:unit": "vitest run",
+   "test:watch": "vitest",
+   "test:coverage": "vitest run --coverage",
+   "typecheck": "tsc --noEmit"
+   ```
+
+4. Move `test/*.test.ts` to `tests/unit/*.test.ts` and wrap the bare `{ ... }` blocks in
+   `describe` / `it` so reporting and junit are per-case. **Do not bother rewriting every
+   `assert` to `expect`** (Codex's point): vitest runs `node:assert` fine, so the conversion
+   is just adding `describe`/`it` wrappers, not touching assertions. Keep filenames. The
+   dynamic-import-after-env pattern (e.g. `skills.test.ts`) stays valid; add
+   `vi.resetModules()` only where a file needs a clean module per case.
+5. Update the `Run:` header comment in each test to `pnpm test` (or
+   `pnpm exec vitest run tests/unit/<name>.test.ts` for a single file).
+
+Done when: `pnpm test` is green locally and prints a single summary across all files.
+
+## Phase 2 — Run them in CI (~half day)
+
+Goal: the runner's tests gate every PR that touches `services/agent`.
+
+1. Add a `run-services-node-unit-tests` job to `.github/workflows/12-check-unit-tests.yml`,
+   mirroring the existing `run-web-unit-tests` setup but scoped to the package:
+   - `actions/setup-node@v4` with `node-version: '24'`, `corepack enable`.
+   - Cache the pnpm store keyed on `services/agent/pnpm-lock.yaml`.
+   - `working-directory: services/agent`, `pnpm install --frozen-lockfile`, then
+     `pnpm run typecheck` and `pnpm run test:unit`.
+   - **Ensure `python3` is on the runner.** `test/code-tool.test.ts` spawns `python3` (and
+     `node`) through `runCodeTool`. ubuntu-latest ships python3, but make it explicit, or
+     split the subprocess code-tool test into an integration test the unit job can skip.
+   - Publish `services/agent/test-results/junit.xml` with
+     `EnricoMi/publish-unit-test-result-action@v2`, `check_name: Agent Runner Unit Tests`.
+2. Path-filter the job. The workflow already triggers on `services/**`; gate the new job's
+   steps so it only does work when `services/agent/**` changed (the same `if:` pattern the
+   other jobs use for their package selection), to avoid installing Node on unrelated PRs.
+3. Decide whether `typecheck` failing fails the job. Recommendation: yes. The code is
+   already `strict`; a type error should not merge.
+
+Done when: a PR touching `services/agent` shows an "Agent Runner Unit Tests" check, and a
+deliberately broken type or assertion turns it red.
+
+## Phase 3 — Guard the wire contract from the TS side (~half day)
+
+Goal: a contract change must update Python and TypeScript together, or fail on both.
+
+**Codex correction (important):** `protocol.ts` is types only, erased at runtime. "Loading
+JSON and round-tripping it through an interface" validates nothing at runtime. The contract
+test needs real runtime checks, in two layers:
+
+1. Add `tests/utils/golden.ts` that loads the shared fixtures from
+   `sdks/python/oss/tests/pytest/unit/agents/golden/` (relative path from the runner, read
+   at test time). No copying; one source of truth.
+2. **Runtime validation, not type assertion.** Either (a) introduce a zod (or equivalent)
+   schema that mirrors `protocol.ts` and `parse()` each golden fixture in
+   `tests/unit/wire-contract.test.ts`, or (b) write explicit structural assertions (required
+   keys present, types correct, the `ok` discriminant). Option (a) doubles as a real runtime
+   guard the server can use on inbound requests; option (b) is lighter but only a test.
+3. **Type-level check, separately.** Use vitest's `expectTypeOf` (or a `tsd`-style check) so
+   a fixture that drifts from `AgentRunRequest` fails `typecheck`, independent of the runtime
+   assertions.
+4. Exercise the pure helpers in `protocol.ts` (`messageText`, `resolvePromptText`,
+   `resolveRunSessionId`) against fixture-derived inputs.
+5. Note in `protocol.ts` and Python `test_wire_contract.py` that the contract is now pinned
+   from both sides, so future editors look both ways.
+
+Done when: editing a field name in `protocol.ts` without updating the fixtures (or vice
+versa) fails this test, at runtime and at typecheck.
+
+## Phase 4 — Structure and maintainability (progressive, no big bang)
+
+Adopt as the runner is touched, not in one sweep.
+
+1. **Add `services/agent/AGENTS.md`** (with a `CLAUDE.md` symlink, matching `web/`, `api/`).
+   Keep it short: the package is a standalone pnpm project; how to run/serve/test/typecheck;
+   where runner code goes (`src/{engines,tools,tracing}`) and where tests go
+   (`tests/unit`, fixtures in `tests/utils`); the wire contract is mirrored in Python
+   `wire.py` and pinned by golden fixtures, so change both sides; vitest is the runner.
+   Add a thin `.claude/rules` / `.cursor/rules` pointer if the repo expects one.
+2. **Local typecheck gate (optional).** The root `.husky/pre-commit` already runs prettier
+   and gitleaks repo-wide. Optionally add `pnpm --dir services/agent typecheck` for changed
+   TS, or leave the gate to CI to keep commits fast. Recommendation: CI is the gate; skip
+   the local hook unless commits regularly land type errors.
+3. **Linting (optional, phase-2 nice-to-have).** There is no eslint outside `web/`.
+   `prettier` (global hook) covers formatting. A small `typescript-eslint` flat config for
+   `services/agent` would add real value for async runner code (`no-floating-promises`,
+   `no-misused-promises`). Treat as optional; `tsc --strict` + prettier is an acceptable
+   floor.
+4. **Extract a testability seam (Codex).** `server.ts` and `cli.ts` wire transport to the
+   engines inline, so HTTP/CLI behavior can only be tested with a live harness. Export
+   `createServer(runAgent)` and `runCli(runAgent)` that take the engine as an argument. Then
+   unit tests inject a fake engine returning a deterministic `AgentRunResult` and cover
+   `/health`, invalid-JSON handling, `POST /run`, NDJSON record ordering, and CLI exit codes,
+   with no Pi/Claude/rivet. This is the highest-value structural change for testability.
+5. **Decompose the two large files opportunistically.** When next editing `engines/rivet.ts`
+   or `tracing/otel.ts`, pull a cohesive seam into its own module and unit-test it, the way
+   `responder.ts` was extracted from `rivet.ts`. Not a scheduled refactor.
+
+## Phase 5 — Make it a versioned, supportable service (Codex's main gap)
+
+The review's core point: the plan above makes the runner testable but does not make it a
+first-class deployable. These items make the SDK and the sidecar safe to release on their
+own cadences. Scope and sequence with the platform/release owner; some are bigger than a
+half-day.
+
+1. **Protocol/version negotiation.** Add a `protocolVersion` (major) to the wire and have
+   `GET /health` (or a new `/capabilities`) return `runnerVersion`, `protocolVersion`,
+   supported engines, and harnesses. The Python adapter probes once and refuses an
+   incompatible major before the first run. Today `/health` returns only `{status:"ok"}` and
+   `package.json` is `0.0.0`.
+2. **Release ownership.** Decide whether the sidecar version tracks the Agenta release or is
+   versioned independently, and stop shipping `0.0.0`. The SDK should pin a compatible runner
+   *protocol* range, not a package-version equality.
+3. **Sidecar image publishing.** No CI publishes the runner image today (only api/web/services
+   images are built, e.g. in `42-railway-build.yml`). Add a build/publish job so the HTTP
+   sidecar (the production boundary) is actually distributable.
+4. **Local code-tool execution policy.** `runCodeTool` scopes secret env, but a `code` tool
+   still runs an arbitrary `python3`/`node` process in the sidecar. State the sandbox,
+   resource, and network policy (it is already sandboxed in Daytona; the local/in-sidecar
+   path needs an explicit stance), so this is a deliberate posture, not an oversight.
+5. **Config hygiene.** `services/oss/src/agent/app.py` reads `AGENTA_AGENT_*` via raw
+   `os.getenv`. The repo convention (root `AGENTS.md`) is to add config to
+   `api/oss/src/utils/env.py` and consume the shared `env` object. Align it.
+6. **Fix the stale `local.py` docstring.** `sdks/python/.../adapters/local.py` says the Pi
+   runner is "shipped inside the wheel," which is not true today and is the likely source of
+   the wheel confusion. Either implement that path deliberately (see the packaging options in
+   the answer to question 1) or correct the docstring to match reality.
+
+## Sequencing and ownership
+
+- Phases 1 to 3 are independent of any runtime change and can land as one small PR or three
+  tiny ones. They add no production code paths, only tooling and tests. Start here.
+- Phase 4 item 1 (`AGENTS.md`) is worth doing alongside Phase 1 so the new test location is
+  documented the moment it exists. Item 4 (the `createServer`/`runCli` seam) unblocks the
+  HTTP/CLI tests and is worth pulling forward.
+- Phase 5 is a separate track, owned with whoever owns releases and deployment. It does not
+  block Phases 1 to 4, but it is what turns "tested code" into "supportable service."
+- None of this blocks ongoing runner feature work; it runs in parallel.
+
+## What success looks like
+
+- `cd services/agent && pnpm test` runs the whole suite in one go, green, with a summary.
+- A PR touching the runner gets a red/green unit-test + typecheck check automatically.
+- `protocol.ts` cannot drift from the Python wire without a test failing.
+- A new contributor reads `services/agent/AGENTS.md` and knows where code and tests go and
+  how to run them, without reading the whole tree.