-
Notifications
You must be signed in to change notification settings - Fork 555
test(agent): vitest suite + CI for the agent runner; fix relay error bug #4784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
35 changes: 35 additions & 0 deletions
35
docs/design/agent-workflows/typescript-structure/README.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| # TypeScript structure for the agent runner | ||
|
|
||
| Planning workspace for making the new TypeScript code in the agent-workflows project | ||
| usable, maintainable, and testable, with tests that run easily and run in CI. | ||
|
|
||
| The new TypeScript lives mostly in one place: `services/agent/` (the Node "agent runner" | ||
| sidecar). This folder researches its current shape and proposes how to structure, test, | ||
| and gate it the way the rest of the monorepo already handles Python and frontend code. | ||
|
|
||
| ## Files | ||
|
|
||
| - [context.md](context.md) — why this work exists, goals, non-goals, who it is for. | ||
| - [research.md](research.md) — what is actually in the repo today: where the TS lives, how | ||
| it builds, ships, and is (barely) tested; the conventions the repo already standardizes | ||
| for TS; a Python-to-TypeScript mental model; the gaps. | ||
| - [plan.md](plan.md) — the phased plan to close the gaps, with concrete file changes, | ||
| scripts, and CI wiring. | ||
| - [status.md](status.md) — source of truth for progress and open decisions. Read this | ||
| first to see where things stand. | ||
|
|
||
| ## TL;DR | ||
|
|
||
| The runner code is well-organized (clear `engines/`, `tools/`, `tracing/` seams, a single | ||
| `protocol.ts` wire contract). The weak spots are tooling, not architecture: | ||
|
|
||
| 1. Eight test files exist but there is **no test runner and no `pnpm test`**. Each test is | ||
| a hand-run `tsx` script. | ||
| 2. Those tests run in **no CI workflow**. The Node side is invisible to the unit-test gate. | ||
| 3. There is **no typecheck gate** even though the code is already `strict: true`. | ||
| 4. The TS side has **no test asserting the cross-language wire contract**, which is only | ||
| pinned from Python today. | ||
|
|
||
| The plan adopts **vitest** (the runner `web/packages/*` already use), wires a Node job into | ||
| `12-check-unit-tests.yml`, adds a `tsc --noEmit` gate, and adds a golden-fixture round-trip | ||
| test so `protocol.ts` cannot drift from the Python wire silently. |
49 changes: 49 additions & 0 deletions
49
docs/design/agent-workflows/typescript-structure/context.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,49 @@ | ||
| # Context | ||
|
|
||
| ## Why this work exists | ||
|
|
||
| The agent-workflows project introduced the first substantial server-side TypeScript in a | ||
| repo that was Python on the backend and TypeScript only on the frontend. The new code is | ||
| the agent runner sidecar at `services/agent/`. It drives the agent harnesses (Pi, Claude | ||
| Code, rivet's `sandbox-agent`) because those are Node libraries with no Python SDK. The | ||
| Python agent service calls into it over one JSON contract. | ||
|
|
||
| This code grew fast during the build-out. It works and it is reasonably well-factored, but | ||
| it sits outside the conventions the rest of the monorepo follows. The owner is a Python | ||
| developer and wants this TypeScript to feel as routine to maintain and test as the Python | ||
| does: a single command to run the tests, the tests running in CI, a typecheck gate, and a | ||
| clear place for new code and new tests to go. | ||
|
|
||
| ## Goals | ||
|
|
||
| 1. **Testable, easily.** One command (`pnpm test`) runs every unit test for the runner. | ||
| Watch mode and coverage work. Writing a new test is obvious and low-ceremony. | ||
| 2. **Tested in CI.** The runner's tests run on every PR that touches it, with results | ||
| published the same way the Python and web suites are. | ||
| 3. **Typechecked.** The `strict` TypeScript already configured produces a CI signal, so a | ||
| type error fails the build instead of reaching the dockerized sidecar at runtime. | ||
| 4. **Contract-safe.** The wire contract between the Python service and the Node runner is | ||
| guarded from both sides, not just from Python. | ||
| 5. **Maintainable and discoverable.** A new contributor (or agent) can find where runner | ||
| code and runner tests belong, following the same instruction-layering the repo uses for | ||
| `web/` and `api/`. | ||
|
|
||
| ## Non-goals | ||
|
|
||
| - Rewriting or re-architecting the runner. The `engines` / `tools` / `tracing` split and | ||
| the `protocol.ts` contract stay. This is about tooling and structure, not a redesign. | ||
| - Folding `services/agent` into the `web/` pnpm workspace. It is a deployable sidecar with | ||
| its own Docker build and its own lockfile; it should stay a standalone package (see | ||
| research.md for the trade-off). | ||
| - Changing the frontend TypeScript (`web/oss/src/components/AgentChatSlice/`). That code | ||
| already lives in the web app under established conventions (vitest, package practices). | ||
| It is out of scope here. | ||
| - End-to-end / live-LLM acceptance tests for the runner. Those depend on real harness | ||
| credentials and are tracked separately in the agent-workflows test work. This plan is | ||
| about the fast unit/contract layer that can run on every PR with no secrets. | ||
|
|
||
| ## Who this is for | ||
|
|
||
| The maintainer (Python-first) and any future contributor or agent touching | ||
| `services/agent`. research.md includes a Python-to-TypeScript mental model so the tooling | ||
| choices map onto things already familiar from the SDK and API side (uv, ruff, pytest). |
173 changes: 173 additions & 0 deletions
173
docs/design/agent-workflows/typescript-structure/plan.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,173 @@ | ||
| # Plan | ||
|
|
||
| Four phases, ordered so value lands early and nothing later depends on a refactor. Phases 1 | ||
| and 2 are the core ask (easy-to-run tests, tests in CI). Phase 3 protects the contract. | ||
| Phase 4 is structure and maintainability, adopted progressively. | ||
|
|
||
| Effort estimates assume one developer familiar with the runner. They are deliberate, not | ||
| padded. | ||
|
|
||
| ## Phase 1 — Make the tests run with one command (~half day) | ||
|
|
||
| Goal: `pnpm test` in `services/agent` runs every unit test, with watch and coverage. | ||
|
|
||
| 0. **Fix the latent bug the typecheck will expose.** `src/tools/dispatch.ts` references an | ||
| undefined `callRef` at lines 88 and 92 inside `relayToolCall`. Use the in-scope value | ||
| (`toolName`, or thread the spec's `callRef` in) so the error path stops throwing | ||
| `ReferenceError`. Found by Codex; this is the proof the typecheck gate has teeth. | ||
| 1. Add dev deps to `services/agent/package.json`: `vitest`, `@vitest/coverage-v8`, **and | ||
| `typescript`** (currently absent: `node_modules/.bin/tsc` does not exist, so `typecheck` | ||
| cannot run without it). Match the versions `web/packages/*` pin (`vitest` `^4.1.x`); align | ||
| `@types/node` with Node 24. | ||
| 2. Add `services/agent/vitest.config.ts`, modeled on `agenta-shared/vitest.config.ts`: | ||
| `include: ["tests/unit/**/*.test.ts"]`, `environment: "node"`, | ||
| `reporters: ["default", "junit"]` to `test-results/junit.xml`, v8 coverage over `src/`. | ||
| 3. Add scripts to `package.json`: | ||
|
|
||
| ```jsonc | ||
| "test": "pnpm run test:unit", | ||
| "test:unit": "vitest run", | ||
| "test:watch": "vitest", | ||
| "test:coverage": "vitest run --coverage", | ||
| "typecheck": "tsc --noEmit" | ||
| ``` | ||
|
|
||
| 4. Move `test/*.test.ts` to `tests/unit/*.test.ts` and wrap the bare `{ ... }` blocks in | ||
| `describe` / `it` so reporting and junit are per-case. **Do not bother rewriting every | ||
| `assert` to `expect`** (Codex's point): vitest runs `node:assert` fine, so the conversion | ||
| is just adding `describe`/`it` wrappers, not touching assertions. Keep filenames. The | ||
| dynamic-import-after-env pattern (e.g. `skills.test.ts`) stays valid; add | ||
| `vi.resetModules()` only where a file needs a clean module per case. | ||
| 5. Update the `Run:` header comment in each test to `pnpm test` (or | ||
| `pnpm exec vitest run tests/unit/<name>.test.ts` for a single file). | ||
|
|
||
| Done when: `pnpm test` is green locally and prints a single summary across all files. | ||
|
|
||
| ## Phase 2 — Run them in CI (~half day) | ||
|
|
||
| Goal: the runner's tests gate every PR that touches `services/agent`. | ||
|
|
||
| 1. Add a `run-services-node-unit-tests` job to `.github/workflows/12-check-unit-tests.yml`, | ||
| mirroring the existing `run-web-unit-tests` setup but scoped to the package: | ||
| - `actions/setup-node@v4` with `node-version: '24'`, `corepack enable`. | ||
| - Cache the pnpm store keyed on `services/agent/pnpm-lock.yaml`. | ||
| - `working-directory: services/agent`, `pnpm install --frozen-lockfile`, then | ||
| `pnpm run typecheck` and `pnpm run test:unit`. | ||
| - **Ensure `python3` is on the runner.** `test/code-tool.test.ts` spawns `python3` (and | ||
| `node`) through `runCodeTool`. ubuntu-latest ships python3, but make it explicit, or | ||
| split the subprocess code-tool test into an integration test the unit job can skip. | ||
| - Publish `services/agent/test-results/junit.xml` with | ||
| `EnricoMi/publish-unit-test-result-action@v2`, `check_name: Agent Runner Unit Tests`. | ||
| 2. Path-filter the job. The workflow already triggers on `services/**`; gate the new job's | ||
| steps so it only does work when `services/agent/**` changed (the same `if:` pattern the | ||
| other jobs use for their package selection), to avoid installing Node on unrelated PRs. | ||
| 3. Decide whether `typecheck` failing fails the job. Recommendation: yes. The code is | ||
| already `strict`; a type error should not merge. | ||
|
|
||
| Done when: a PR touching `services/agent` shows an "Agent Runner Unit Tests" check, and a | ||
| deliberately broken type or assertion turns it red. | ||
|
|
||
| ## Phase 3 — Guard the wire contract from the TS side (~half day) | ||
|
|
||
| Goal: a contract change must update Python and TypeScript together, or fail on both. | ||
|
|
||
| **Codex correction (important):** `protocol.ts` is types only, erased at runtime. "Loading | ||
| JSON and round-tripping it through an interface" validates nothing at runtime. The contract | ||
| test needs real runtime checks, in two layers: | ||
|
|
||
| 1. Add `tests/utils/golden.ts` that loads the shared fixtures from | ||
| `sdks/python/oss/tests/pytest/unit/agents/golden/` (relative path from the runner, read | ||
| at test time). No copying; one source of truth. | ||
| 2. **Runtime validation, not type assertion.** Either (a) introduce a zod (or equivalent) | ||
| schema that mirrors `protocol.ts` and `parse()` each golden fixture in | ||
| `tests/unit/wire-contract.test.ts`, or (b) write explicit structural assertions (required | ||
| keys present, types correct, the `ok` discriminant). Option (a) doubles as a real runtime | ||
| guard the server can use on inbound requests; option (b) is lighter but only a test. | ||
| 3. **Type-level check, separately.** Use vitest's `expectTypeOf` (or a `tsd`-style check) so | ||
| a fixture that drifts from `AgentRunRequest` fails `typecheck`, independent of the runtime | ||
| assertions. | ||
| 4. Exercise the pure helpers in `protocol.ts` (`messageText`, `resolvePromptText`, | ||
| `resolveRunSessionId`) against fixture-derived inputs. | ||
| 5. Note in `protocol.ts` and Python `test_wire_contract.py` that the contract is now pinned | ||
| from both sides, so future editors look both ways. | ||
|
|
||
| Done when: editing a field name in `protocol.ts` without updating the fixtures (or vice | ||
| versa) fails this test, at runtime and at typecheck. | ||
|
|
||
| ## Phase 4 — Structure and maintainability (progressive, no big bang) | ||
|
|
||
| Adopt as the runner is touched, not in one sweep. | ||
|
|
||
| 1. **Add `services/agent/AGENTS.md`** (with a `CLAUDE.md` symlink, matching `web/`, `api/`). | ||
| Keep it short: the package is a standalone pnpm project; how to run/serve/test/typecheck; | ||
| where runner code goes (`src/{engines,tools,tracing}`) and where tests go | ||
| (`tests/unit`, fixtures in `tests/utils`); the wire contract is mirrored in Python | ||
| `wire.py` and pinned by golden fixtures, so change both sides; vitest is the runner. | ||
| Add a thin `.claude/rules` / `.cursor/rules` pointer if the repo expects one. | ||
| 2. **Local typecheck gate (optional).** The root `.husky/pre-commit` already runs prettier | ||
| and gitleaks repo-wide. Optionally add `pnpm --dir services/agent typecheck` for changed | ||
| TS, or leave the gate to CI to keep commits fast. Recommendation: CI is the gate; skip | ||
| the local hook unless commits regularly land type errors. | ||
| 3. **Linting (optional, phase-2 nice-to-have).** There is no eslint outside `web/`. | ||
| `prettier` (global hook) covers formatting. A small `typescript-eslint` flat config for | ||
| `services/agent` would add real value for async runner code (`no-floating-promises`, | ||
| `no-misused-promises`). Treat as optional; `tsc --strict` + prettier is an acceptable | ||
| floor. | ||
| 4. **Extract a testability seam (Codex).** `server.ts` and `cli.ts` wire transport to the | ||
| engines inline, so HTTP/CLI behavior can only be tested with a live harness. Export | ||
| `createServer(runAgent)` and `runCli(runAgent)` that take the engine as an argument. Then | ||
| unit tests inject a fake engine returning a deterministic `AgentRunResult` and cover | ||
| `/health`, invalid-JSON handling, `POST /run`, NDJSON record ordering, and CLI exit codes, | ||
| with no Pi/Claude/rivet. This is the highest-value structural change for testability. | ||
| 5. **Decompose the two large files opportunistically.** When next editing `engines/rivet.ts` | ||
| or `tracing/otel.ts`, pull a cohesive seam into its own module and unit-test it, the way | ||
| `responder.ts` was extracted from `rivet.ts`. Not a scheduled refactor. | ||
|
|
||
| ## Phase 5 — Make it a versioned, supportable service (Codex's main gap) | ||
|
|
||
| The review's core point: the plan above makes the runner testable but does not make it a | ||
| first-class deployable. These items make the SDK and the sidecar safe to release on their | ||
| own cadences. Scope and sequence with the platform/release owner; some are bigger than a | ||
| half-day. | ||
|
|
||
| 1. **Protocol/version negotiation.** Add a `protocolVersion` (major) to the wire and have | ||
| `GET /health` (or a new `/capabilities`) return `runnerVersion`, `protocolVersion`, | ||
| supported engines, and harnesses. The Python adapter probes once and refuses an | ||
| incompatible major before the first run. Today `/health` returns only `{status:"ok"}` and | ||
| `package.json` is `0.0.0`. | ||
| 2. **Release ownership.** Decide whether the sidecar version tracks the Agenta release or is | ||
| versioned independently, and stop shipping `0.0.0`. The SDK should pin a compatible runner | ||
| *protocol* range, not a package-version equality. | ||
| 3. **Sidecar image publishing.** No CI publishes the runner image today (only api/web/services | ||
| images are built, e.g. in `42-railway-build.yml`). Add a build/publish job so the HTTP | ||
| sidecar (the production boundary) is actually distributable. | ||
| 4. **Local code-tool execution policy.** `runCodeTool` scopes secret env, but a `code` tool | ||
| still runs an arbitrary `python3`/`node` process in the sidecar. State the sandbox, | ||
| resource, and network policy (it is already sandboxed in Daytona; the local/in-sidecar | ||
| path needs an explicit stance), so this is a deliberate posture, not an oversight. | ||
| 5. **Config hygiene.** `services/oss/src/agent/app.py` reads `AGENTA_AGENT_*` via raw | ||
| `os.getenv`. The repo convention (root `AGENTS.md`) is to add config to | ||
| `api/oss/src/utils/env.py` and consume the shared `env` object. Align it. | ||
| 6. **Fix the stale `local.py` docstring.** `sdks/python/.../adapters/local.py` says the Pi | ||
| runner is "shipped inside the wheel," which is not true today and is the likely source of | ||
| the wheel confusion. Either implement that path deliberately (see the packaging options in | ||
| the answer to question 1) or correct the docstring to match reality. | ||
|
|
||
| ## Sequencing and ownership | ||
|
|
||
| - Phases 1 to 3 are independent of any runtime change and can land as one small PR or three | ||
| tiny ones. They add no production code paths, only tooling and tests. Start here. | ||
| - Phase 4 item 1 (`AGENTS.md`) is worth doing alongside Phase 1 so the new test location is | ||
| documented the moment it exists. Item 4 (the `createServer`/`runCli` seam) unblocks the | ||
| HTTP/CLI tests and is worth pulling forward. | ||
| - Phase 5 is a separate track, owned with whoever owns releases and deployment. It does not | ||
| block Phases 1 to 4, but it is what turns "tested code" into "supportable service." | ||
| - None of this blocks ongoing runner feature work; it runs in parallel. | ||
|
|
||
| ## What success looks like | ||
|
|
||
| - `cd services/agent && pnpm test` runs the whole suite in one go, green, with a summary. | ||
| - A PR touching the runner gets a red/green unit-test + typecheck check automatically. | ||
| - `protocol.ts` cannot drift from the Python wire without a test failing. | ||
| - A new contributor reads `services/agent/AGENTS.md` and knows where code and tests go and | ||
| how to run them, without reading the whole tree. |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.