Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
73 changes: 73 additions & 0 deletions .github/workflows/12-check-unit-tests.yml
Original file line number Diff line number Diff line change
Expand Up @@ -301,3 +301,76 @@ jobs:
files: services/oss/tests/results/junit.xml
check_name: Application services Unit Test Results
comment_mode: off

run-services-node-unit-tests:
# The agent runner (services/agent) is a standalone Node/pnpm package, not part of the
# Python services suite above. It runs its own vitest unit tests plus a tsc typecheck gate.
# No "has_tests" guard on purpose: this suite is established, so a missing/empty suite must
# FAIL the job (vitest exits non-zero on no test files), not silently skip it.
if: |
github.event_name == 'workflow_dispatch' ||
!github.event.pull_request.draft
runs-on: ubuntu-latest
permissions:
checks: write
pull-requests: write
contents: read
env:
AGENTA_LICENSE: oss
steps:
- uses: actions/checkout@v6
Comment thread
mmabrouk marked this conversation as resolved.
with:
persist-credentials: false

- name: Skip when package selection excludes services
if: github.event_name == 'workflow_dispatch' && !contains(fromJSON('["all","services-only"]'), inputs.packages)
run: exit 0

- name: Set up Node.js
if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
uses: actions/setup-node@v4
with:
node-version: '24'

- name: Enable Corepack
if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
run: corepack enable

- name: Cache pnpm store
if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
uses: actions/cache@v4
with:
path: ~/.pnpm-store
key: ${{ runner.os }}-services-agent-pnpm-${{ hashFiles('services/agent/pnpm-lock.yaml') }}
restore-keys: |
${{ runner.os }}-services-agent-pnpm-

- name: Set up pnpm store
if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
working-directory: services/agent
run: pnpm config set store-dir ~/.pnpm-store

- name: Install dependencies
if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
working-directory: services/agent
run: pnpm install --frozen-lockfile

- name: Typecheck (tsc --noEmit, src + tests + config)
if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
working-directory: services/agent
run: pnpm run typecheck

# The code-tool unit test spawns python3 and node end-to-end; both are preinstalled on
# ubuntu runners (node is also set up above), so no setup-python step is needed.
- name: Run agent runner unit tests
if: github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages)
working-directory: services/agent
run: pnpm run test:unit

- name: Publish agent runner unit test results
if: always() && (github.event_name != 'workflow_dispatch' || contains(fromJSON('["all","services-only"]'), inputs.packages))
uses: EnricoMi/publish-unit-test-result-action@v2
with:
files: services/agent/test-results/junit.xml
check_name: Agent Runner Unit Test Results
comment_mode: off
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,8 @@ sdks/python/oss/tests/results/
sdks/python/ee/tests/results/
services/oss/tests/results/
services/ee/tests/results/
services/agent/test-results/
services/agent/coverage/
.*
!**/.gitkeep
!.github/
Expand Down
35 changes: 35 additions & 0 deletions docs/design/agent-workflows/typescript-structure/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# TypeScript structure for the agent runner

Planning workspace for making the new TypeScript code in the agent-workflows project
usable, maintainable, and testable, with tests that run easily and run in CI.

The new TypeScript lives mostly in one place: `services/agent/` (the Node "agent runner"
sidecar). This folder researches its current shape and proposes how to structure, test,
and gate it the way the rest of the monorepo already handles Python and frontend code.

## Files

- [context.md](context.md) — why this work exists, goals, non-goals, who it is for.
- [research.md](research.md) — what is actually in the repo today: where the TS lives, how
it builds, ships, and is (barely) tested; the conventions the repo already standardizes
for TS; a Python-to-TypeScript mental model; the gaps.
- [plan.md](plan.md) — the phased plan to close the gaps, with concrete file changes,
scripts, and CI wiring.
- [status.md](status.md) — source of truth for progress and open decisions. Read this
first to see where things stand.

## TL;DR

The runner code is well-organized (clear `engines/`, `tools/`, `tracing/` seams, a single
`protocol.ts` wire contract). The weak spots are tooling, not architecture:

1. Eight test files exist but there is **no test runner and no `pnpm test`**. Each test is
a hand-run `tsx` script.
2. Those tests run in **no CI workflow**. The Node side is invisible to the unit-test gate.
3. There is **no typecheck gate** even though the code is already `strict: true`.
4. The TS side has **no test asserting the cross-language wire contract**, which is only
pinned from Python today.

The plan adopts **vitest** (the runner `web/packages/*` already use), wires a Node job into
`12-check-unit-tests.yml`, adds a `tsc --noEmit` gate, and adds a golden-fixture round-trip
test so `protocol.ts` cannot drift from the Python wire silently.
49 changes: 49 additions & 0 deletions docs/design/agent-workflows/typescript-structure/context.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# Context

## Why this work exists

The agent-workflows project introduced the first substantial server-side TypeScript in a
repo that was Python on the backend and TypeScript only on the frontend. The new code is
the agent runner sidecar at `services/agent/`. It drives the agent harnesses (Pi, Claude
Code, rivet's `sandbox-agent`) because those are Node libraries with no Python SDK. The
Python agent service calls into it over one JSON contract.

This code grew fast during the build-out. It works and it is reasonably well-factored, but
it sits outside the conventions the rest of the monorepo follows. The owner is a Python
developer and wants this TypeScript to feel as routine to maintain and test as the Python
does: a single command to run the tests, the tests running in CI, a typecheck gate, and a
clear place for new code and new tests to go.

## Goals

1. **Testable, easily.** One command (`pnpm test`) runs every unit test for the runner.
Watch mode and coverage work. Writing a new test is obvious and low-ceremony.
2. **Tested in CI.** The runner's tests run on every PR that touches it, with results
published the same way the Python and web suites are.
3. **Typechecked.** The `strict` TypeScript already configured produces a CI signal, so a
type error fails the build instead of reaching the dockerized sidecar at runtime.
4. **Contract-safe.** The wire contract between the Python service and the Node runner is
guarded from both sides, not just from Python.
5. **Maintainable and discoverable.** A new contributor (or agent) can find where runner
code and runner tests belong, following the same instruction-layering the repo uses for
`web/` and `api/`.

## Non-goals

- Rewriting or re-architecting the runner. The `engines` / `tools` / `tracing` split and
the `protocol.ts` contract stay. This is about tooling and structure, not a redesign.
- Folding `services/agent` into the `web/` pnpm workspace. It is a deployable sidecar with
its own Docker build and its own lockfile; it should stay a standalone package (see
research.md for the trade-off).
- Changing the frontend TypeScript (`web/oss/src/components/AgentChatSlice/`). That code
already lives in the web app under established conventions (vitest, package practices).
It is out of scope here.
- End-to-end / live-LLM acceptance tests for the runner. Those depend on real harness
credentials and are tracked separately in the agent-workflows test work. This plan is
about the fast unit/contract layer that can run on every PR with no secrets.

## Who this is for

The maintainer (Python-first) and any future contributor or agent touching
`services/agent`. research.md includes a Python-to-TypeScript mental model so the tooling
choices map onto things already familiar from the SDK and API side (uv, ruff, pytest).
173 changes: 173 additions & 0 deletions docs/design/agent-workflows/typescript-structure/plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
# Plan

Four phases, ordered so value lands early and nothing later depends on a refactor. Phases 1
and 2 are the core ask (easy-to-run tests, tests in CI). Phase 3 protects the contract.
Phase 4 is structure and maintainability, adopted progressively.

Effort estimates assume one developer familiar with the runner. They are deliberate, not
padded.

## Phase 1 — Make the tests run with one command (~half day)

Goal: `pnpm test` in `services/agent` runs every unit test, with watch and coverage.

0. **Fix the latent bug the typecheck will expose.** `src/tools/dispatch.ts` references an
undefined `callRef` at lines 88 and 92 inside `relayToolCall`. Use the in-scope value
(`toolName`, or thread the spec's `callRef` in) so the error path stops throwing
`ReferenceError`. Found by Codex; this is the proof the typecheck gate has teeth.
1. Add dev deps to `services/agent/package.json`: `vitest`, `@vitest/coverage-v8`, **and
`typescript`** (currently absent: `node_modules/.bin/tsc` does not exist, so `typecheck`
cannot run without it). Match the versions `web/packages/*` pin (`vitest` `^4.1.x`); align
`@types/node` with Node 24.
2. Add `services/agent/vitest.config.ts`, modeled on `agenta-shared/vitest.config.ts`:
`include: ["tests/unit/**/*.test.ts"]`, `environment: "node"`,
`reporters: ["default", "junit"]` to `test-results/junit.xml`, v8 coverage over `src/`.
3. Add scripts to `package.json`:

```jsonc
"test": "pnpm run test:unit",
"test:unit": "vitest run",
"test:watch": "vitest",
"test:coverage": "vitest run --coverage",
"typecheck": "tsc --noEmit"
```

4. Move `test/*.test.ts` to `tests/unit/*.test.ts` and wrap the bare `{ ... }` blocks in
`describe` / `it` so reporting and junit are per-case. **Do not bother rewriting every
`assert` to `expect`** (Codex's point): vitest runs `node:assert` fine, so the conversion
is just adding `describe`/`it` wrappers, not touching assertions. Keep filenames. The
dynamic-import-after-env pattern (e.g. `skills.test.ts`) stays valid; add
`vi.resetModules()` only where a file needs a clean module per case.
5. Update the `Run:` header comment in each test to `pnpm test` (or
`pnpm exec vitest run tests/unit/<name>.test.ts` for a single file).

Done when: `pnpm test` is green locally and prints a single summary across all files.

## Phase 2 — Run them in CI (~half day)

Goal: the runner's tests gate every PR that touches `services/agent`.

1. Add a `run-services-node-unit-tests` job to `.github/workflows/12-check-unit-tests.yml`,
mirroring the existing `run-web-unit-tests` setup but scoped to the package:
- `actions/setup-node@v4` with `node-version: '24'`, `corepack enable`.
- Cache the pnpm store keyed on `services/agent/pnpm-lock.yaml`.
- `working-directory: services/agent`, `pnpm install --frozen-lockfile`, then
`pnpm run typecheck` and `pnpm run test:unit`.
- **Ensure `python3` is on the runner.** `test/code-tool.test.ts` spawns `python3` (and
`node`) through `runCodeTool`. ubuntu-latest ships python3, but make it explicit, or
split the subprocess code-tool test into an integration test the unit job can skip.
- Publish `services/agent/test-results/junit.xml` with
`EnricoMi/publish-unit-test-result-action@v2`, `check_name: Agent Runner Unit Tests`.
2. Path-filter the job. The workflow already triggers on `services/**`; gate the new job's
steps so it only does work when `services/agent/**` changed (the same `if:` pattern the
other jobs use for their package selection), to avoid installing Node on unrelated PRs.
3. Decide whether `typecheck` failing fails the job. Recommendation: yes. The code is
already `strict`; a type error should not merge.

Done when: a PR touching `services/agent` shows an "Agent Runner Unit Tests" check, and a
deliberately broken type or assertion turns it red.

## Phase 3 — Guard the wire contract from the TS side (~half day)

Goal: a contract change must update Python and TypeScript together, or fail on both.

**Codex correction (important):** `protocol.ts` is types only, erased at runtime. "Loading
JSON and round-tripping it through an interface" validates nothing at runtime. The contract
test needs real runtime checks, in two layers:

1. Add `tests/utils/golden.ts` that loads the shared fixtures from
`sdks/python/oss/tests/pytest/unit/agents/golden/` (relative path from the runner, read
at test time). No copying; one source of truth.
2. **Runtime validation, not type assertion.** Either (a) introduce a zod (or equivalent)
schema that mirrors `protocol.ts` and `parse()` each golden fixture in
`tests/unit/wire-contract.test.ts`, or (b) write explicit structural assertions (required
keys present, types correct, the `ok` discriminant). Option (a) doubles as a real runtime
guard the server can use on inbound requests; option (b) is lighter but only a test.
3. **Type-level check, separately.** Use vitest's `expectTypeOf` (or a `tsd`-style check) so
a fixture that drifts from `AgentRunRequest` fails `typecheck`, independent of the runtime
assertions.
4. Exercise the pure helpers in `protocol.ts` (`messageText`, `resolvePromptText`,
`resolveRunSessionId`) against fixture-derived inputs.
5. Note in `protocol.ts` and Python `test_wire_contract.py` that the contract is now pinned
from both sides, so future editors look both ways.

Done when: editing a field name in `protocol.ts` without updating the fixtures (or vice
versa) fails this test, at runtime and at typecheck.

## Phase 4 — Structure and maintainability (progressive, no big bang)

Adopt as the runner is touched, not in one sweep.

1. **Add `services/agent/AGENTS.md`** (with a `CLAUDE.md` symlink, matching `web/`, `api/`).
Keep it short: the package is a standalone pnpm project; how to run/serve/test/typecheck;
where runner code goes (`src/{engines,tools,tracing}`) and where tests go
(`tests/unit`, fixtures in `tests/utils`); the wire contract is mirrored in Python
`wire.py` and pinned by golden fixtures, so change both sides; vitest is the runner.
Add a thin `.claude/rules` / `.cursor/rules` pointer if the repo expects one.
2. **Local typecheck gate (optional).** The root `.husky/pre-commit` already runs prettier
and gitleaks repo-wide. Optionally add `pnpm --dir services/agent typecheck` for changed
TS, or leave the gate to CI to keep commits fast. Recommendation: CI is the gate; skip
the local hook unless commits regularly land type errors.
3. **Linting (optional, phase-2 nice-to-have).** There is no eslint outside `web/`.
`prettier` (global hook) covers formatting. A small `typescript-eslint` flat config for
`services/agent` would add real value for async runner code (`no-floating-promises`,
`no-misused-promises`). Treat as optional; `tsc --strict` + prettier is an acceptable
floor.
4. **Extract a testability seam (Codex).** `server.ts` and `cli.ts` wire transport to the
engines inline, so HTTP/CLI behavior can only be tested with a live harness. Export
`createServer(runAgent)` and `runCli(runAgent)` that take the engine as an argument. Then
unit tests inject a fake engine returning a deterministic `AgentRunResult` and cover
`/health`, invalid-JSON handling, `POST /run`, NDJSON record ordering, and CLI exit codes,
with no Pi/Claude/rivet. This is the highest-value structural change for testability.
5. **Decompose the two large files opportunistically.** When next editing `engines/rivet.ts`
or `tracing/otel.ts`, pull a cohesive seam into its own module and unit-test it, the way
`responder.ts` was extracted from `rivet.ts`. Not a scheduled refactor.

## Phase 5 — Make it a versioned, supportable service (Codex's main gap)

The review's core point: the plan above makes the runner testable but does not make it a
first-class deployable. These items make the SDK and the sidecar safe to release on their
own cadences. Scope and sequence with the platform/release owner; some are bigger than a
half-day.

1. **Protocol/version negotiation.** Add a `protocolVersion` (major) to the wire and have
`GET /health` (or a new `/capabilities`) return `runnerVersion`, `protocolVersion`,
supported engines, and harnesses. The Python adapter probes once and refuses an
incompatible major before the first run. Today `/health` returns only `{status:"ok"}` and
`package.json` is `0.0.0`.
2. **Release ownership.** Decide whether the sidecar version tracks the Agenta release or is
versioned independently, and stop shipping `0.0.0`. The SDK should pin a compatible runner
*protocol* range, not a package-version equality.
3. **Sidecar image publishing.** No CI publishes the runner image today (only api/web/services
images are built, e.g. in `42-railway-build.yml`). Add a build/publish job so the HTTP
sidecar (the production boundary) is actually distributable.
4. **Local code-tool execution policy.** `runCodeTool` scopes secret env, but a `code` tool
still runs an arbitrary `python3`/`node` process in the sidecar. State the sandbox,
resource, and network policy (it is already sandboxed in Daytona; the local/in-sidecar
path needs an explicit stance), so this is a deliberate posture, not an oversight.
5. **Config hygiene.** `services/oss/src/agent/app.py` reads `AGENTA_AGENT_*` via raw
`os.getenv`. The repo convention (root `AGENTS.md`) is to add config to
`api/oss/src/utils/env.py` and consume the shared `env` object. Align it.
6. **Fix the stale `local.py` docstring.** `sdks/python/.../adapters/local.py` says the Pi
runner is "shipped inside the wheel," which is not true today and is the likely source of
the wheel confusion. Either implement that path deliberately (see the packaging options in
the answer to question 1) or correct the docstring to match reality.

## Sequencing and ownership

- Phases 1 to 3 are independent of any runtime change and can land as one small PR or three
tiny ones. They add no production code paths, only tooling and tests. Start here.
- Phase 4 item 1 (`AGENTS.md`) is worth doing alongside Phase 1 so the new test location is
documented the moment it exists. Item 4 (the `createServer`/`runCli` seam) unblocks the
HTTP/CLI tests and is worth pulling forward.
- Phase 5 is a separate track, owned with whoever owns releases and deployment. It does not
block Phases 1 to 4, but it is what turns "tested code" into "supportable service."
- None of this blocks ongoing runner feature work; it runs in parallel.

## What success looks like

- `cd services/agent && pnpm test` runs the whole suite in one go, green, with a summary.
- A PR touching the runner gets a red/green unit-test + typecheck check automatically.
- `protocol.ts` cannot drift from the Python wire without a test failing.
- A new contributor reads `services/agent/AGENTS.md` and knows where code and tests go and
how to run them, without reading the whole tree.
Loading
Loading