kurtstohrer · kurtstohrer · May 12, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -58,3 +58,33 @@ jobs:
       - run: npx playwright install --with-deps chromium
 
       - run: pnpm test:e2e --project=${{ matrix.project }}
+
+  agent-loop:
+    runs-on: ubuntu-latest
+    needs: build-and-test
+    steps:
+      - uses: actions/checkout@v4
+
+      - uses: pnpm/action-setup@v4
+        with:
+          version: 10
+
+      - uses: actions/setup-node@v4
+        with:
+          node-version: 20
+          cache: pnpm
+
+      - run: pnpm install --frozen-lockfile
+
+      - run: pnpm build
+
+      - run: npx playwright install --with-deps chromium
+
+      - run: pnpm test:e2e:stress:annotask:agent-loop
+
+      - uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: agent-loop-metrics
+          path: playgrounds/stress-test/e2e/annotask/reports/agent-loop/
+          if-no-files-found: warn
diff --git a/docs/agent-loop-evals.md b/docs/agent-loop-evals.md
@@ -0,0 +1,137 @@
+# Agent-loop evaluation harness
+
+The agent-loop e2e suite measures the Annotask round-trip — user
+annotates → task lands in MCP → coding agent applies the change →
+re-render verifies the fix — for the three highest-leverage task types
+on the stress-test playground.
+
+It is the credibility artifact behind the public demo and the
+design-partner pitch deck. The numbers it emits are how we'll know
+whether shipping the next task type is helping or regressing the loop.
+
+> **v1 scope.** The simulator that stands in for the coding agent is
+> deterministic and rule-based — not LLM-driven. The harness is here
+> to measure *plumbing reliability* (does the task land, do the MCP
+> tools work, does HMR pick the fix up, do metrics persist) so we can
+> ship the public demo without a paid-LLM dependency. The follow-up
+> ticket on **agent-apply quality** (tracked under
+> [ANN-1](/ANN/issues/ANN-1) child issues) is where the real LLM gets
+> wired into this same harness.
+
+## What each test proves
+
+| Task type     | Test surface                                                            | Round-trip assertion                                                          |
+| ------------- | ----------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
+| `style_update`| Tracer stylesheet on a known `data-agent-loop-target` element.          | Iframe `getComputedStyle().color` flips after Vite HMR.                       |
+| `a11y_fix`    | `<img>` in the test-only target component with `alt` attribute removed. | `axe-core` rescan reports zero `image-alt` violations after the fix.           |
+| `error_fix`   | `console.error(<tracer>)` injected into the target component.           | Console listener sees zero tracer errors after the fix lands.                  |
+
+All three tests run on both **React+Vite** (`react-workflows`, port
+4210) and **Vue+Vite** (`vue-data-lab`, port 4220) MFEs. Adding a new
+framework target is a single entry in
+`playgrounds/stress-test/e2e/annotask/helpers/agent-loop/targets.ts`.
+
+## How the simulator stands in for an agent
+
+The agent simulator
+(`playgrounds/stress-test/e2e/annotask/helpers/agent-loop/simulator.ts`)
+calls the same `annotask` CLI flags a real coding agent would (`--mcp`,
+`--server=…`) so we exercise the MCP-shaped tool surface end-to-end:
+
+1. `annotask task <id> --mcp` — hydrate full task detail
+2. `annotask update-task <id> --status=in_progress --mcp` — lock it
+3. **Apply step (rule-based for v1):**
+   - `style_update` — replace the `before` rgb literal with `after` in
+     `agent-loop-target.css`
+   - `a11y_fix` — for `rule: image-alt`, regex-inject `alt=""` on any
+     `<img>` missing the attribute
+   - `error_fix` — strip every line containing the test's tracer
+     comment marker
+4. `annotask update-task <id> --status=review --resolution="…" --mcp`
+
+The apply step is what an LLM coding agent will replace in the v2
+ticket. The rest of the loop — lock, fetch context, mark review,
+re-fetch denied tasks — is the production path.
+
+## Running the suite
+
+```bash
+pnpm build                                  # CLI must be built first; simulator uses dist/cli.js
+pnpm test:e2e:stress:annotask:agent-loop    # agent-loop specs only, via agent-loop.config.ts
+```
+
+`pnpm test:e2e:stress:annotask:agent-loop` (the script CI runs) boots only the
+host shell and the two target MFEs. `pnpm test:e2e:stress:annotask` boots the host
+shell, the seven stress MFEs, and the four fast native API services. Both configs set
+`reuseExistingServer: true`. First boot of the full cluster takes about a minute while Vite optimizes deps.
+
+The agent-loop specs run in `serial` mode per (framework × task type)
+because each test mutates the AgentLoopTarget component file or its
+stylesheet and restores it in `afterEach`. Two concurrent style_update
+tests on the same MFE would race on the same file.
+
+## Reading the metrics output
+
+Each test writes one JSON file under
+`playgrounds/stress-test/e2e/annotask/reports/agent-loop/`:
+
+```json
+{
+  "task_type": "style_update",
+  "app_id": "react-workflows",
+  "framework": "react+vite",
+  "outcome": "success",
+  "time_to_apply_ms": 412,
+  "retries": 0,
+  "denied_on_first_try": false,
+  "task_id": "task-abc123",
+  "resolution": "Swapped color from rgb(255, 0, 0) to rgb(0, 128, 0) in agent-loop-target.css",
+  "error_message": null,
+  "recorded_at": "2026-05-12T20:21:14.882Z"
+}
+```
+
+Field meanings — useful when this seeds the eval dashboard:
+
+- **outcome** — `success` if the round-trip assertion passes; otherwise
+  `failure` with `error_message` set.
+- **time_to_apply_ms** — wall-clock from simulator start to task
+  transitioning to `review`. Not the full round-trip — HMR and re-scan
+  time are reported in the Playwright test duration, not here.
+- **retries** — always `0` in v1 (simulator does not loop). When the
+  LLM agent lands, the simulator will increment this on `denied` →
+  `in_progress` cycles.
+- **denied_on_first_try** — placeholder for the v2 LLM apply harness.
+  The deterministic simulator never gets denied today.
+- **task_id** / **resolution** — copied from the MCP-CLI response to
+  make it easy to grep back to the originating task without re-running
+  the suite.
+
+## v1 caveats (what's *not* tested yet)
+
+- The shell's inspector tool is not driven for `style_update` — tasks
+  are seeded via the per-MFE API. Driving the inspector tool is its own
+  test; the agent-loop suite focuses on what the agent does *after*
+  the task lands.
+- The "Create Fix Task" button on `a11y_fix` is exercised in
+  `annotate.spec.ts`. The agent-loop suite seeds a deterministic task
+  shape directly so the simulator can run against a known anchor.
+- The simulator's deterministic apply rules cover **one** failure mode
+  per task type. The v2 ticket on agent-apply quality expands rules
+  (or, more likely, replaces them with an LLM call) so the harness can
+  measure performance on the full task-type matrix.
+- `retries` and `denied_on_first_try` are wired into the metric shape
+  but always zero/false in v1. The schema is locked so the dashboard
+  doesn't churn when the LLM agent ships.
+
+## How to add a new task type
+
+1. Add a deterministic apply function to `helpers/agent-loop/simulator.ts`.
+2. Add a fixture to `AgentLoopTarget.{tsx,vue}` (or a sibling target
+   file) that the test can mutate to seed the failure mode.
+3. Add a spec under `playgrounds/stress-test/e2e/annotask/agent-loop/`
+   following the same `capture → seed → drive shell → simulate →
+   verify → restore` pattern.
+4. Extend `TaskTypeKey` in `helpers/agent-loop/metrics.ts` so the JSON
+   output stays type-checked.
+5. Document the new task type in the table under "What each test proves".
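
Reviewer sketch, not part of the PR: the "Reading the metrics output" section says the per-test JSON records will seed an eval dashboard. Assuming the records have already been parsed from the report directory, a roll-up per (framework, task type) could look roughly like this. `MetricRecord`, `SummaryRow`, and `summarize` are hypothetical names, and only the four fields the roll-up needs are typed.

```typescript
// Subset of the documented metrics record; field names match the sample JSON.
interface MetricRecord {
  task_type: string
  framework: string
  outcome: 'success' | 'failure'
  time_to_apply_ms: number
}

// Hypothetical roll-up row: one per (framework, task_type) pair.
interface SummaryRow {
  framework: string
  task_type: string
  runs: number
  successes: number
  mean_time_to_apply_ms: number
}

// Groups records by framework and task type, counting successes and
// keeping a running mean of time_to_apply_ms.
function summarize(records: MetricRecord[]): SummaryRow[] {
  const totals = new Map<string, { row: SummaryRow; totalMs: number }>()
  for (const r of records) {
    const key = `${r.framework}|${r.task_type}`
    let entry = totals.get(key)
    if (!entry) {
      entry = {
        row: {
          framework: r.framework,
          task_type: r.task_type,
          runs: 0,
          successes: 0,
          mean_time_to_apply_ms: 0,
        },
        totalMs: 0,
      }
      totals.set(key, entry)
    }
    entry.row.runs += 1
    if (r.outcome === 'success') entry.row.successes += 1
    entry.totalMs += r.time_to_apply_ms
    entry.row.mean_time_to_apply_ms = entry.totalMs / entry.row.runs
  }
  // Map preserves insertion order, so rows come out in first-seen order.
  return Array.from(totals.values(), e => e.row)
}
```

Because `time_to_apply_ms` excludes HMR and re-scan time, a mean built this way describes the simulator's apply step only, not the full round-trip.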
diff --git a/package.json b/package.json
@@ -61,6 +61,7 @@
     "stress-test:down": "docker compose -f playgrounds/stress-test/docker-compose.yml down",
     "test:e2e:stress": "playwright test --config playgrounds/stress-test/e2e/playwright.config.ts",
     "test:e2e:stress:annotask": "playwright test --config playgrounds/stress-test/e2e/playwright.config.ts annotask/ || true",
+    "test:e2e:stress:annotask:agent-loop": "playwright test --config playgrounds/stress-test/e2e/agent-loop.config.ts",
     "typecheck": "tsc --noEmit && vue-tsc --noEmit -p src/shell/tsconfig.json",
     "test": "vitest run",
     "test:watch": "vitest",

diff --git a/playgrounds/stress-test/apps/mfe-react-workflows/src/AgentLoopTarget.tsx b/playgrounds/stress-test/apps/mfe-react-workflows/src/AgentLoopTarget.tsx
@@ -0,0 +1,46 @@
+/**
+ * Test-only target for agent-loop e2e tests.
+ *
+ * Always mounted but visually inert by default. The e2e tests in
+ * `playgrounds/stress-test/e2e/annotask/agent-loop/` mutate
+ * `agent-loop-target.css` to drive a known style change through Vite
+ * HMR and verify the round-trip. They also mutate this file to seed
+ * a11y violations and console errors, then run the agent simulator
+ * to apply a fix and restore the file in `afterEach`.
+ *
+ * The "Agent-loop e2e target" landmark only renders when the URL hash
+ * is `#agent-loop-target` so it stays invisible in normal stress-test
+ * use.
+ */
+import { useEffect, useState } from 'react'
+import './agent-loop-target.css'
+
+function useShowTarget(): boolean {
+  const [show, setShow] = useState(
+    typeof window !== 'undefined' && window.location.hash === '#agent-loop-target',
+  )
+  useEffect(() => {
+    const handler = () => setShow(window.location.hash === '#agent-loop-target')
+    window.addEventListener('hashchange', handler)
+    return () => window.removeEventListener('hashchange', handler)
+  }, [])
+  return show
+}
+
+export function AgentLoopTarget(): JSX.Element | null {
+  const show = useShowTarget()
+  if (!show) return null
+  return (
+    <section data-testid="agent-loop-target" aria-labelledby="agent-loop-target-heading">
+      <h2 id="agent-loop-target-heading">Agent-loop e2e target</h2>
+      <p data-agent-loop-target="paragraph">Tracer element for agent-loop e2e tests.</p>
+      <img
+        data-agent-loop-target="image"
+        src="data:image/svg+xml;utf8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='8' height='8'%3E%3C/svg%3E"
+        alt=""
+        width={8}
+        height={8}
+      />
+    </section>
+  )
+}
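
Reviewer sketch, not part of the PR: the header comment says tests mutate this file to seed a11y violations and console errors, and the docs describe the v1 fixes as "regex-inject `alt=""`" and "strip every line containing the tracer comment marker". The simulator source is not in this excerpt, so the following is only a guess at what those two rules might look like as pure string transforms. Function names and the marker value are hypothetical.

```typescript
// a11y_fix sketch: add alt="" to every <img> tag that has no alt attribute.
// Limitation: assumes no literal ">" appears inside the tag's attributes.
function injectEmptyAlt(source: string): string {
  return source.replace(
    /<img\b(?![^>]*\salt\s*=)([^>]*?)(\s*\/?>)/g,
    '<img$1 alt=""$2',
  )
}

// error_fix sketch: drop every line that contains the tracer marker.
// The marker string is whatever the test injected; it is a parameter here
// because its real value is not shown in this PR excerpt.
function stripTracerLines(source: string, marker: string): string {
  return source
    .split('\n')
    .filter(line => !line.includes(marker))
    .join('\n')
}
```

Both functions leave input untouched when there is nothing to fix, which keeps a restore in `afterEach` safe even if a fix was applied twice.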
diff --git a/playgrounds/stress-test/apps/mfe-react-workflows/src/agent-loop-target.css b/playgrounds/stress-test/apps/mfe-react-workflows/src/agent-loop-target.css
@@ -0,0 +1,8 @@
+/*
+ * Agent-loop e2e: tracer stylesheet. Rewritten by the simulator during
+ * style_update tests, then restored in afterEach. Vite HMR picks up
+ * each edit and the test asserts the iframe's computed style flipped.
+ */
+[data-agent-loop-target='paragraph'] {
+  color: rgb(255, 0, 0);
+}
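
Reviewer sketch, not part of the PR: the docs describe the `style_update` rule as replacing the `before` rgb literal with `after` in this stylesheet. A minimal version of that swap, assuming plain literal matching, might look like the following. `swapColorLiteral` is a hypothetical name, and failing loudly when `before` is absent is my assumption rather than documented simulator behavior.

```typescript
// style_update sketch: replace every occurrence of one color literal with another.
function swapColorLiteral(css: string, before: string, after: string): string {
  if (!css.includes(before)) {
    // Assumption: a missing literal means the file is not in the seeded state,
    // so surface it instead of silently writing an unchanged stylesheet.
    throw new Error(`expected to find ${before} in the stylesheet`)
  }
  // split/join replaces all occurrences without regex escaping concerns.
  return css.split(before).join(after)
}
```

Called with the values from the sample resolution string in the docs (`rgb(255, 0, 0)` to `rgb(0, 128, 0)`), this rewrites the single `color` declaration above and leaves the selector and comment untouched.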
diff --git a/playgrounds/stress-test/apps/mfe-react-workflows/src/main.tsx b/playgrounds/stress-test/apps/mfe-react-workflows/src/main.tsx
@@ -5,6 +5,7 @@ import { bootstrapTheme } from '@annotask/stress-ui-tokens'
 import { StrictMode } from 'react'
 import { createRoot } from 'react-dom/client'
 import { Root } from './Root'
+import { AgentLoopTarget } from './AgentLoopTarget'
 
 bootstrapTheme()
 
@@ -13,3 +14,14 @@ createRoot(document.getElementById('app')!).render(
     <Root />
   </StrictMode>,
 )
+
+// Agent-loop e2e target — only renders when the page hash is
+// `#agent-loop-target`. Inert otherwise.
+const agentLoopHost = document.createElement('div')
+agentLoopHost.id = 'agent-loop-host'
+document.body.appendChild(agentLoopHost)
+createRoot(agentLoopHost).render(
+  <StrictMode>
+    <AgentLoopTarget />
+  </StrictMode>,
+)
diff --git a/playgrounds/stress-test/apps/mfe-vue-data-lab/src/AgentLoopTarget.vue b/playgrounds/stress-test/apps/mfe-vue-data-lab/src/AgentLoopTarget.vue
@@ -0,0 +1,38 @@
+<!--
+  Test-only target for agent-loop e2e tests. See the React sibling
+  `AgentLoopTarget.tsx` for the full rationale. Only renders when the
+  page is loaded with the `#agent-loop-target` hash.
+-->
+<script setup lang="ts">
+import { onMounted, onUnmounted, ref } from 'vue'
+import './agent-loop-target.css'
+
+const show = ref(
+  typeof window !== 'undefined' && window.location.hash === '#agent-loop-target',
+)
+
+function update() {
+  show.value = window.location.hash === '#agent-loop-target'
+}
+
+onMounted(() => window.addEventListener('hashchange', update))
+onUnmounted(() => window.removeEventListener('hashchange', update))
+</script>
+
+<template>
+  <section
+    v-if="show"
+    data-testid="agent-loop-target"
+    aria-labelledby="agent-loop-target-heading"
+  >
+    <h2 id="agent-loop-target-heading">Agent-loop e2e target</h2>
+    <p data-agent-loop-target="paragraph">Tracer element for agent-loop e2e tests.</p>
+    <img
+      data-agent-loop-target="image"
+      src="data:image/svg+xml;utf8,%3Csvg xmlns='http://www.w3.org/2000/svg' width='8' height='8'%3E%3C/svg%3E"
+      alt=""
+      width="8"
+      height="8"
+    />
+  </section>
+</template>
diff --git a/playgrounds/stress-test/apps/mfe-vue-data-lab/src/agent-loop-target.css b/playgrounds/stress-test/apps/mfe-vue-data-lab/src/agent-loop-target.css
@@ -0,0 +1,6 @@
+/*
+ * Agent-loop e2e: tracer stylesheet. See the React sibling's notes.
+ */
+[data-agent-loop-target='paragraph'] {
+  color: rgb(255, 0, 0);
+}
diff --git a/playgrounds/stress-test/apps/mfe-vue-data-lab/src/main.ts b/playgrounds/stress-test/apps/mfe-vue-data-lab/src/main.ts
@@ -2,7 +2,15 @@ import '@annotask/stress-ui-tokens/tokens.css'
 import { bootstrapTheme } from '@annotask/stress-ui-tokens'
 import { createApp } from 'vue'
 import App from './App.vue'
+import AgentLoopTarget from './AgentLoopTarget.vue'
 
 bootstrapTheme()
 
 createApp(App).mount('#app')
+
+// Agent-loop e2e target — only renders when the page hash is
+// `#agent-loop-target`. Inert otherwise.
+const agentLoopHost = document.createElement('div')
+agentLoopHost.id = 'agent-loop-host'
+document.body.appendChild(agentLoopHost)
+createApp(AgentLoopTarget).mount(agentLoopHost)
diff --git a/playgrounds/stress-test/e2e/agent-loop.config.ts b/playgrounds/stress-test/e2e/agent-loop.config.ts
@@ -0,0 +1,39 @@
+/**
+ * Focused Playwright config for the agent-loop e2e suite. Only spins up
+ * the host shell plus the two target MFEs (react-workflows,
+ * vue-data-lab) — the rest of the stress cluster is overkill for these
+ * specs and would triple the CI runtime.
+ *
+ * If you need to run against the full stress cluster instead, use
+ * `pnpm test:e2e:stress:annotask` which loads the broader config.
+ */
+import { defineConfig, devices } from '@playwright/test'
+
+const webServers = [
+  { name: 'stress-host', command: 'pnpm dev:stress-host', url: 'http://localhost:4200' },
+  { name: 'stress-react-workflows', command: 'pnpm dev:stress-react-workflows', url: 'http://localhost:4210' },
+  { name: 'stress-vue-data-lab', command: 'pnpm dev:stress-vue-data-lab', url: 'http://localhost:4220' },
+]
+
+export default defineConfig({
+  testDir: './annotask/agent-loop',
+  timeout: 90_000,
+  expect: { timeout: 15_000 },
+  fullyParallel: false,
+  workers: 1,
+  retries: 0,
+  reporter: [['list'], ['./annotask/reporter.ts']],
+  use: {
+    trace: 'retain-on-failure',
+    baseURL: 'http://localhost:4200',
+    ...devices['Desktop Chrome'],
+  },
+  webServer: webServers.map(s => ({
+    command: s.command,
+    url: s.url,
+    reuseExistingServer: true,
+    timeout: 120_000,
+    stdout: 'ignore',
+    stderr: 'pipe',
+  })),
+})