30 changes: 30 additions & 0 deletions .github/workflows/ci.yml
@@ -58,3 +58,33 @@ jobs:
      - run: npx playwright install --with-deps chromium

      - run: pnpm test:e2e --project=${{ matrix.project }}

  agent-loop:
    runs-on: ubuntu-latest
    needs: build-and-test
    steps:
      - uses: actions/checkout@v4

      - uses: pnpm/action-setup@v4
        with:
          version: 10

      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: pnpm

      - run: pnpm install --frozen-lockfile

      - run: pnpm build

      - run: npx playwright install --with-deps chromium

      - run: pnpm test:e2e:stress:annotask:agent-loop

      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agent-loop-metrics
          path: playgrounds/stress-test/e2e/annotask/reports/agent-loop/
          if-no-files-found: warn
137 changes: 137 additions & 0 deletions docs/agent-loop-evals.md
@@ -0,0 +1,137 @@
# Agent-loop evaluation harness

The agent-loop e2e suite measures the Annotask round-trip — user
annotates → task lands in MCP → coding agent applies the change →
re-render verifies the fix — for the three highest-leverage task types
on the stress-test playground.

It is the credibility artifact behind the public demo and the
design-partner pitch deck. The numbers it emits are how we'll know
whether shipping the next task type is helping or regressing the loop.

> **v1 scope.** The simulator that stands in for the coding agent is
> deterministic and rule-based — not LLM-driven. The harness is here
> to measure *plumbing reliability* (does the task land, do the MCP
> tools work, does HMR pick the fix up, do metrics persist) so we can
> ship the public demo without a paid-LLM dependency. The follow-up
> ticket on **agent-apply quality** (tracked under
> [ANN-1](/ANN/issues/ANN-1) child issues) is where the real LLM gets
> wired into this same harness.

## What each test proves

| Task type | Test surface | Round-trip assertion |
| ------------- | ----------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
| `style_update`| Tracer stylesheet on a known `data-agent-loop-target` element. | Iframe `getComputedStyle().color` flips after Vite HMR. |
| `a11y_fix` | `<img>` in the test-only target component with `alt` attribute removed. | `axe-core` rescan reports zero `image-alt` violations after the fix. |
| `error_fix` | `console.error(<tracer>)` injected into the target component. | Console listener sees zero tracer errors after the fix lands. |

All three tests run on both **React+Vite** (`react-workflows`, port
4210) and **Vue+Vite** (`vue-data-lab`, port 4220) MFEs. Adding a new
framework target is a single entry in
`playgrounds/stress-test/e2e/annotask/helpers/agent-loop/targets.ts`.
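
As a rough illustration of what such an entry might hold, here is a minimal sketch. The `AgentLoopTargetEntry` type, its field names, and the `targetUrl` helper are assumptions made for illustration, not the actual contents of `targets.ts`. Only the app ids, the ports, and the `#agent-loop-target` hash come from this PR.

```typescript
// Hypothetical entry shape; the real targets.ts may use different fields.
interface AgentLoopTargetEntry {
  appId: string; // MFE id, e.g. "react-workflows"
  framework: string; // label recorded in the metrics JSON
  port: number; // dev-server port of the MFE
}

const targets: AgentLoopTargetEntry[] = [
  { appId: "react-workflows", framework: "react+vite", port: 4210 },
  // "vue+vite" is an assumed label, mirroring "react+vite".
  { appId: "vue-data-lab", framework: "vue+vite", port: 4220 },
];

// The test-only component renders only under this URL hash.
function targetUrl(entry: AgentLoopTargetEntry): string {
  return `http://localhost:${entry.port}/#agent-loop-target`;
}
```

Under this assumed shape, a third framework would be one more object in the `targets` array.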

## How the simulator stands in for an agent

The agent simulator
(`playgrounds/stress-test/e2e/annotask/helpers/agent-loop/simulator.ts`)
calls the same `annotask` CLI flags a real coding agent would (`--mcp`,
`--server=…`) so we exercise the MCP-shaped tool surface end-to-end:

1. `annotask task <id> --mcp`: hydrate full task detail
2. `annotask update-task <id> --status=in_progress --mcp`: lock it
3. **Apply step (rule-based for v1):**
   - `style_update`: replace the `before` rgb literal with `after` in
     `agent-loop-target.css`
   - `a11y_fix`: for `rule: image-alt`, regex-inject `alt=""` on any
     `<img>` missing the attribute
   - `error_fix`: strip every line containing the test's tracer
     comment marker
4. `annotask update-task <id> --status=review --resolution="…" --mcp`

The apply step is what an LLM coding agent will replace in the v2
ticket. The rest of the loop — lock, fetch context, mark review,
re-fetch denied tasks — is the production path.
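
The three rule-based apply steps can be sketched as pure string transforms. This is an illustrative approximation, not the code in `simulator.ts`: the function names, the regex, and the marker argument are all assumptions.

```typescript
// style_update: swap every occurrence of the `before` rgb literal for `after`.
function applyStyleUpdate(css: string, before: string, after: string): string {
  return css.split(before).join(after);
}

// a11y_fix (image-alt): add alt="" to each <img> tag that has no alt attribute.
// Simplification: assumes attribute values contain no literal ">" or "alt=".
function applyImageAlt(markup: string): string {
  return markup.replace(
    /<img\b(?![^>]*\balt\s*=)([^>]*?)(\s*\/?>)/g,
    '<img$1 alt=""$2',
  );
}

// error_fix: drop every line that contains the tracer marker.
// The marker text is chosen by the test; its format is not assumed here.
function applyErrorFix(source: string, marker: string): string {
  return source
    .split("\n")
    .filter((line) => !line.includes(marker))
    .join("\n");
}
```

Keeping each rule a pure function of the file contents would let it be unit-tested without booting Vite or the MCP server, which is one reason a sketch like this separates the transform from the file I/O.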

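## Running the suite

```bash
pnpm build                                # CLI must be built first; simulator uses dist/cli.js
pnpm test:e2e:stress:annotask:agent-loop  # agent-loop specs only, via playgrounds/stress-test/e2e/agent-loop.config.ts
pnpm test:e2e:stress:annotask             # runs everything under playgrounds/stress-test/e2e/annotask/
```

The focused `agent-loop.config.ts` boots only the host shell and the
two target MFEs (`react-workflows`, `vue-data-lab`). The broader
Playwright config under `playgrounds/stress-test/e2e/` boots the host
shell, the seven stress MFEs, and the four fast native API services.
Both use `reuseExistingServer: true`. First boot takes about a minute
while Vite optimizes deps.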

The agent-loop specs run in `serial` mode per (framework × task type)
because each test mutates the AgentLoopTarget component file and
restores it in `afterEach`. Two concurrent style_update tests on the
same MFE would race on the file.
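
The capture-and-restore guarantee behind that constraint can be sketched as follows. The suite itself restores in `afterEach`; this sketch uses `try`/`finally` to show the same invariant in isolation. The `FileIO` interface, `withRestoredFile`, and the in-memory store are hypothetical stand-ins, not helpers from the repo.

```typescript
// Minimal file abstraction so the sketch runs without touching disk.
interface FileIO {
  read(path: string): string;
  write(path: string, contents: string): void;
}

// Capture the file, run the mutating test body, then always restore.
function withRestoredFile<T>(io: FileIO, path: string, run: () => T): T {
  const original = io.read(path);
  try {
    return run();
  } finally {
    io.write(path, original); // runs even if the body throws
  }
}

// In-memory stand-in for the tracer stylesheet.
const files = new Map<string, string>([
  ["agent-loop-target.css", "color: rgb(255, 0, 0);"],
]);

const memoryIO: FileIO = {
  read(path) {
    const contents = files.get(path);
    if (contents === undefined) throw new Error(`missing file: ${path}`);
    return contents;
  },
  write(path, contents) {
    files.set(path, contents);
  },
};
```

Because the snapshot is a single value per file, two bodies mutating the same file at once would each capture and restore a different "original", which is the race that serial mode avoids.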

## Reading the metrics output

Each test writes one JSON file under
`playgrounds/stress-test/e2e/annotask/reports/agent-loop/`:

```json
{
  "task_type": "style_update",
  "app_id": "react-workflows",
  "framework": "react+vite",
  "outcome": "success",
  "time_to_apply_ms": 412,
  "retries": 0,
  "denied_on_first_try": false,
  "task_id": "task-abc123",
  "resolution": "Swapped color from rgb(255, 0, 0) to rgb(0, 128, 0) in agent-loop-target.css",
  "error_message": null,
  "recorded_at": "2026-05-12T20:21:14.882Z"
}
```

Field meanings — useful when this seeds the eval dashboard:

- **outcome** — `success` if the round-trip assertion passes; otherwise
`failure` with `error_message` set.
- **time_to_apply_ms** — wall-clock from simulator start to task
transitioning to `review`. Not the full round-trip — HMR and re-scan
time are reported in the Playwright test duration, not here.
- **retries** — always `0` in v1 (simulator does not loop). When the
LLM agent lands, the simulator will increment this on `denied` →
`in_progress` cycles.
- **denied_on_first_try** — placeholder for the v2 LLM apply harness.
The deterministic simulator never gets denied today.
- **task_id** / **resolution** — copied from the MCP-CLI response to
make it easy to grep back to the originating task without re-running
the suite.
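
For a dashboard consumer, the record above might be typed and rolled up roughly like this. The interface mirrors the JSON fields shown in this section; the `summarize` helper and its output shape are assumptions for illustration, not part of the harness.

```typescript
// Mirrors the JSON record documented above.
interface AgentLoopMetric {
  task_type: string;
  app_id: string;
  framework: string;
  outcome: "success" | "failure";
  time_to_apply_ms: number;
  retries: number;
  denied_on_first_try: boolean;
  task_id: string;
  resolution: string;
  error_message: string | null;
  recorded_at: string;
}

// Hypothetical per-task-type rollup.
interface TaskTypeSummary {
  runs: number;
  successRate: number; // fraction in [0, 1]
  meanTimeToApplyMs: number;
}

function summarize(metrics: AgentLoopMetric[]): Map<string, TaskTypeSummary> {
  // Group records by task type.
  const groups = new Map<string, AgentLoopMetric[]>();
  metrics.forEach((metric) => {
    const group = groups.get(metric.task_type) ?? [];
    group.push(metric);
    groups.set(metric.task_type, group);
  });

  // Reduce each group to a success rate and a mean apply time.
  const summary = new Map<string, TaskTypeSummary>();
  groups.forEach((group, taskType) => {
    const successes = group.filter((m) => m.outcome === "success").length;
    const totalMs = group.reduce((sum, m) => sum + m.time_to_apply_ms, 0);
    summary.set(taskType, {
      runs: group.length,
      successRate: successes / group.length,
      meanTimeToApplyMs: totalMs / group.length,
    });
  });
  return summary;
}
```

Note that `time_to_apply_ms` excludes HMR and re-scan time, so a mean computed this way describes the simulator step only, not the full round-trip.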

## v1 caveats (what's *not* tested yet)

- The shell's inspector tool is not driven for `style_update` — tasks
are seeded via the per-MFE API. Driving the inspector tool is its own
test; the agent-loop suite focuses on what the agent does *after*
the task lands.
- The "Create Fix Task" button on `a11y_fix` is exercised in
`annotate.spec.ts`. The agent-loop suite seeds a deterministic task
shape directly so the simulator can run against a known anchor.
- The simulator's deterministic apply rules cover **one** failure mode
per task type. The v2 ticket on agent-apply quality expands rules
(or, more likely, replaces them with an LLM call) so the harness can
measure performance on the full task-type matrix.
- `retries` and `denied_on_first_try` are wired into the metric shape
but always zero/false in v1. The schema is locked so the dashboard
doesn't churn when the LLM agent ships.

## How to add a new task type

1. Add a deterministic apply function to `helpers/agent-loop/simulator.ts`.
2. Add a fixture to `AgentLoopTarget.{tsx,vue}` (or a sibling target
file) that the test can mutate to seed the failure mode.
3. Add a spec under `playgrounds/stress-test/e2e/annotask/agent-loop/`
following the same `capture → seed → drive shell → simulate →
verify → restore` pattern.
4. Extend `TaskTypeKey` in `helpers/agent-loop/metrics.ts` so the JSON
output stays type-checked.
5. Document the new task type in the table at the top of this file.
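
Step 4 could look something like the sketch below, assuming `TaskTypeKey` is a string-literal union. The actual definition in `metrics.ts` is not shown in this PR, so the `TASK_TYPES` array and the `isTaskTypeKey` guard are hypothetical.

```typescript
// Assumed single source of truth; a new task type is one more string here.
const TASK_TYPES = ["style_update", "a11y_fix", "error_fix"] as const;

// Union derived from the array: "style_update" | "a11y_fix" | "error_fix".
type TaskTypeKey = (typeof TASK_TYPES)[number];

// Runtime guard, useful when reading metrics JSON back from disk.
function isTaskTypeKey(value: string): value is TaskTypeKey {
  return (TASK_TYPES as readonly string[]).indexOf(value) !== -1;
}
```

Deriving the union from one array keeps the compile-time type and the runtime check from drifting apart when a task type is added.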
1 change: 1 addition & 0 deletions package.json
@@ -61,6 +61,7 @@
    "stress-test:down": "docker compose -f playgrounds/stress-test/docker-compose.yml down",
    "test:e2e:stress": "playwright test --config playgrounds/stress-test/e2e/playwright.config.ts",
    "test:e2e:stress:annotask": "playwright test --config playgrounds/stress-test/e2e/playwright.config.ts annotask/ || true",
    "test:e2e:stress:annotask:agent-loop": "playwright test --config playgrounds/stress-test/e2e/agent-loop.config.ts",
    "typecheck": "tsc --noEmit && vue-tsc --noEmit -p src/shell/tsconfig.json",
    "test": "vitest run",
    "test:watch": "vitest",
@@ -0,0 +1,46 @@
/**
 * Test-only target for agent-loop e2e tests.
 *
 * Always mounted but visually inert by default. The e2e tests in
 * `playgrounds/stress-test/e2e/annotask/agent-loop/` mutate
 * `agent-loop-target.css` to drive a known style change through Vite
 * HMR and verify the round-trip. They also mutate this file to seed
 * a11y violations and console errors, then run the agent simulator
 * to apply a fix and restore the file in `afterEach`.
 *
 * The "Agent-loop e2e target" landmark only renders when the URL hash
 * is `#agent-loop-target` so it stays invisible in normal stress-test
 * use.
 */
import { useEffect, useState } from 'react'
import './agent-loop-target.css'

function useShowTarget(): boolean {
  const [show, setShow] = useState(
    typeof window !== 'undefined' && window.location.hash === '#agent-loop-target',
  )
  useEffect(() => {
    const handler = () => setShow(window.location.hash === '#agent-loop-target')
    window.addEventListener('hashchange', handler)
    return () => window.removeEventListener('hashchange', handler)
  }, [])
  return show
}

export function AgentLoopTarget(): JSX.Element | null {
  const show = useShowTarget()
  if (!show) return null
  return (
    <section data-testid="agent-loop-target" aria-labelledby="agent-loop-target-heading">
      <h2 id="agent-loop-target-heading">Agent-loop e2e target</h2>
      <p data-agent-loop-target="paragraph">Tracer element for agent-loop e2e tests.</p>
      <img
        data-agent-loop-target="image"
        src="data:image/svg+xml;utf8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='8' height='8'%3E%3C/svg%3E"
        alt=""
        width={8}
        height={8}
      />
    </section>
  )
}
@@ -0,0 +1,8 @@
/*
 * Agent-loop e2e: tracer stylesheet. Rewritten by the simulator during
 * style_update tests, then restored in afterEach. Vite HMR picks up
 * each edit and the test asserts the iframe's computed style flipped.
 */
[data-agent-loop-target='paragraph'] {
  color: rgb(255, 0, 0);
}
12 changes: 12 additions & 0 deletions playgrounds/stress-test/apps/mfe-react-workflows/src/main.tsx
@@ -5,6 +5,7 @@ import { bootstrapTheme } from '@annotask/stress-ui-tokens'
import { StrictMode } from 'react'
import { createRoot } from 'react-dom/client'
import { Root } from './Root'
import { AgentLoopTarget } from './AgentLoopTarget'

bootstrapTheme()

@@ -13,3 +14,14 @@ createRoot(document.getElementById('app')!).render(
    <Root />
  </StrictMode>,
)

// Agent-loop e2e target — only renders when the page hash is
// `#agent-loop-target`. Inert otherwise.
const agentLoopHost = document.createElement('div')
agentLoopHost.id = 'agent-loop-host'
document.body.appendChild(agentLoopHost)
createRoot(agentLoopHost).render(
  <StrictMode>
    <AgentLoopTarget />
  </StrictMode>,
)
@@ -0,0 +1,38 @@
<!--
  Test-only target for agent-loop e2e tests. See the React sibling
  `AgentLoopTarget.tsx` for the full rationale. Only renders when the
  page is loaded with the `#agent-loop-target` hash.
-->
<script setup lang="ts">
import { onMounted, onUnmounted, ref } from 'vue'
import './agent-loop-target.css'

const show = ref(
  typeof window !== 'undefined' && window.location.hash === '#agent-loop-target',
)

function update() {
  show.value = window.location.hash === '#agent-loop-target'
}

onMounted(() => window.addEventListener('hashchange', update))
onUnmounted(() => window.removeEventListener('hashchange', update))
</script>

<template>
  <section
    v-if="show"
    data-testid="agent-loop-target"
    aria-labelledby="agent-loop-target-heading"
  >
    <h2 id="agent-loop-target-heading">Agent-loop e2e target</h2>
    <p data-agent-loop-target="paragraph">Tracer element for agent-loop e2e tests.</p>
    <img
      data-agent-loop-target="image"
      src="data:image/svg+xml;utf8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='8' height='8'%3E%3C/svg%3E"
      alt=""
      width="8"
      height="8"
    />
  </section>
</template>
@@ -0,0 +1,6 @@
/*
 * Agent-loop e2e: tracer stylesheet. See the React sibling's notes.
 */
[data-agent-loop-target='paragraph'] {
  color: rgb(255, 0, 0);
}
8 changes: 8 additions & 0 deletions playgrounds/stress-test/apps/mfe-vue-data-lab/src/main.ts
Expand Up @@ -2,7 +2,15 @@ import '@annotask/stress-ui-tokens/tokens.css'
import { bootstrapTheme } from '@annotask/stress-ui-tokens'
import { createApp } from 'vue'
import App from './App.vue'
import AgentLoopTarget from './AgentLoopTarget.vue'

bootstrapTheme()

createApp(App).mount('#app')

// Agent-loop e2e target — only renders when the page hash is
// `#agent-loop-target`. Inert otherwise.
const agentLoopHost = document.createElement('div')
agentLoopHost.id = 'agent-loop-host'
document.body.appendChild(agentLoopHost)
createApp(AgentLoopTarget).mount(agentLoopHost)
39 changes: 39 additions & 0 deletions playgrounds/stress-test/e2e/agent-loop.config.ts
@@ -0,0 +1,39 @@
/**
* Focused Playwright config for the agent-loop e2e suite. Only spins up
* the host shell plus the two target MFEs (react-workflows,
* vue-data-lab) — the rest of the stress cluster is overkill for these
* specs and would triple the CI runtime.
*
* If you need to run against the full stress cluster instead, use
* `pnpm test:e2e:stress:annotask` which loads the broader config.
*/
import { defineConfig, devices } from '@playwright/test'

const webServers = [
  { name: 'stress-host', command: 'pnpm dev:stress-host', url: 'http://localhost:4200' },
  { name: 'stress-react-workflows', command: 'pnpm dev:stress-react-workflows', url: 'http://localhost:4210' },
  { name: 'stress-vue-data-lab', command: 'pnpm dev:stress-vue-data-lab', url: 'http://localhost:4220' },
]

export default defineConfig({
  testDir: './annotask/agent-loop',
  timeout: 90_000,
  expect: { timeout: 15_000 },
  fullyParallel: false,
  workers: 1,
  retries: 0,
  reporter: [['list'], ['./annotask/reporter.ts']],
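  use: {
    // `retries` is 0 in this config, so 'on-first-retry' would never record
    // a trace; keep traces for failed tests instead.
    trace: 'retain-on-failure',
    baseURL: 'http://localhost:4200',
    ...devices['Desktop Chrome'],
  },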
  webServer: webServers.map(s => ({
    command: s.command,
    url: s.url,
    reuseExistingServer: true,
    timeout: 120_000,
    stdout: 'ignore',
    stderr: 'pipe',
  })),
})