fix(evals): portable hooks + OpenRouter CI + code-exec grader — eval system review-ready#65
fix(evals): portable hooks + OpenRouter CI + code-exec grader — eval system review-ready#65skillets-the-skill-intern[bot] wants to merge 3 commits into
Conversation
Replaces hardcoded /Users/matt.apperson Mac paths and shell-chained 'cd X && cmd' hooks (which fail because waza runs hooks without a shell) with a single shell-less helper: evals/_hooks/prepare-skill.ts. The helper syncs skills/<name>/ into ~/.agents/skills/<name>/ and npm-installs bundled scripts, resolving the repo root from its own location. Applied to all five suites with hooks (models, images, stt, tts, video). Verified end-to-end on Linux against OpenRouter: hook populates the skill dir, installs deps, and the suite passes. Adds README documenting the hook gotchas + OpenRouter recipe.
There was a problem hiding this comment.
Perry's Review
Replaces five sets of hardcoded Mac-path + shell-chained hooks with a single portable prepare-skill.ts helper that is cwd-independent and shell-free.
Verdict: ✅ LGTM
Details
Risk: 🟢 Low — eval config and a new hook helper only; no auth/payment/production data path touched
CI: only the in-progress Perry check (no other CI configured on this repo)
Findings (see inline comment for full context):
- nit:
prepare-skill.ts:13— comment shows wrong hook path (see inline)
Codex: disabled (small tier)
Research: disabled (small tier)
Security: no concerns — skill name is hardcoded in each eval.yaml, never caller-supplied at runtime
Test coverage: no unit test for the hook helper; acceptable for a waza eval fixture script (integration-tested by running the actual eval)
Unresolved threads: none
Review: tier=small · model=claude-sonnet-latest · score=?
Replaces the Azure azd-extension install with the waza binary + Bun + Node, and drives waza through OpenRouter via the COPILOT_PROVIDER_* BYOK env (no GitHub Copilot account needed). Adds .github/scripts/run-evals.ts which runs every AUTHORED suite (skipping the 4 'TODO: scaffolding only' stubs), uploads per-suite results, and fails the build if any suite fails. A guard job skips the eval cleanly with a warning when the OPENROUTER_API_KEY repo secret is absent (a repo admin must add it). Adds workflow_dispatch with a model input. Verified locally: the exact waza invocation (hooks + graders + --output-dir artifact) passes end-to-end against OpenRouter.
Superseded by updated Perry review
There was a problem hiding this comment.
Perry's Review
Replaces the Azure azd waza CI path with a portable Bun-driven eval runner and an OpenRouter-BYOK workflow, making the waza eval suite runnable headless.
Verdict: ✅ LGTM
Details
Re-review: incremental — 133 new lines since prior review at 957d9fe (prior verdict was APPROVE). Net-new is two new files only: .github/scripts/run-evals.ts and the rewritten .github/workflows/eval.yml. The seven evals files were byte-identical to the prior review and are out of differential scope.
Risk: 🟢 Low — CI and eval-harness only; no production, auth, billing, or routing code touched. The workflow runs only on PRs targeting main, so it does not even execute on this stacked PR.
CI: all passing ✅ (Perry's own review check)
Findings (see inline comment):
- 🟡
.github/scripts/run-evals.ts:24— stub detection matches the marker anywhere in the file rather than the description field; low-probability false-skip.
Strengths: the new guard job (secret-presence check plus graceful warning skip), concurrency cancel-in-progress, and the shell-less prepare-skill.ts hook are all solid defensive CI design. Default model slug (anthropic/claude-opus-4.8) is consistent across the EVAL_MODEL env and the script fallback, and is a valid family-first OpenRouter ID.
Security: no concerns — the key is read via the repo secret, never echoed; the guard job emits only a boolean presence flag.
Test coverage: not applicable — this is test and eval infrastructure itself.
Blast radius: the sole caller of the new script is the workflow run-evals step. The discover routine will run every authored (non-stub) suite, including openrouter-oauth — note that suite carries no before_run sync hook (it predates this PR, from parent #43), so it relies on the skill already being present; not introduced or worsened here.
Stack: this PR sits on setup-waza (#43 — waza eval infra), which targets main.
Unresolved threads: one — Perry's own prior nit on evals/_hooks/prepare-skill.ts:13 (a comment-path detail). That file is unchanged in this push, so it is out of differential scope and not re-raised here.
Review: tier=medium · model=claude-opus-latest · score=4.5
|
|
||
| function isStub(specPath: string): boolean { | ||
| const body = readFileSync(specPath, "utf-8"); | ||
| return body.includes(STUB_MARKER); |
There was a problem hiding this comment.
[suggestion] isStub() matches the marker anywhere in the file, but the file-header comment says stubs are gated on the description starting with it — a stray match elsewhere would silently skip a real suite.
Details
Why: Line 10 documents the contract as "Skips stub suites (description starts with ...)", but body.includes(STUB_MARKER) returns true if the marker string appears ANYWHERE in the YAML — a task prompt, a grader rubric, or a comment that happens to quote "TODO: scaffolding only". An authored suite that mentions the phrase in passing would be silently dropped from CI, and the failure mode is invisible (it just prints skip (stub) and exits 0). Today every stub places the marker in the top-level description:, so it works — but the check is looser than its stated intent.
Fix: Parse the YAML and test the description field specifically, e.g. read description and check description.trimStart().startsWith(STUB_MARKER), falling back to the current whole-file scan only if parsing fails. A lighter-touch alternative: anchor the marker to a line start (/^\s*TODO: scaffolding only/m) so only a leading description line counts.
Prompt for agents
In .github/scripts/run-evals.ts, isStub() at line 22-25 reads the whole eval.yaml and returns body.includes(STUB_MARKER). The documented contract (line 10) is that a suite is a stub when its `description` field starts with "TODO: scaffolding only". Tighten isStub so it only treats a suite as a stub when the marker leads the description: either parse the YAML and check description.trimStart().startsWith(STUB_MARKER), or match the marker anchored to a line start with /^\s*TODO: scaffolding only/m. Keep behavior identical for the current stub suites.
Reviewed at 5d60778
Adds evals/_graders/verify-code-runs.ts — a waza `program` grader that reads the agent's response on stdin, extracts JS/TS code blocks, and syntax/type-checks them (node --check / tsc --noEmit). Closes the 'does the produced code actually run' gap from the OpenAI skill-eval methodology, so building-skill tasks verify the code parses rather than only that a judge likes the prose. Wired into the openrouter-oauth happy-path task as the first real usage; proven end-to-end on OpenRouter (produced_code_parses passed 1.00). Adds README documenting the grader + the waza program-grader stdin/stderr facts.
Makes the
setup-wazaeval system functional, headless-runnable, and methodologically complete on Linux/CI via OpenRouter. Three parts:1. Portable
before_runhooksReplaced hardcoded Mac paths + shell-chained hooks (which fail — waza runs hooks without a shell) with one shell-less helper
evals/_hooks/prepare-skill.ts. Applied to all 5 suites with hooks.2. CI rework — run waza on OpenRouter (no azd)
Swapped the Azure
azdinstall for waza binary + Bun + Node, driving waza through OpenRouter viaCOPILOT_PROVIDER_*BYOK env..github/scripts/run-evals.tsruns every authored suite (skips the 4 stubs), uploads results, fails on any suite failure. Guard job skips cleanly ifOPENROUTER_API_KEYsecret is absent.3. Code-execution verification grader (closes methodology gap #4)
evals/_graders/verify-code-runs.ts— aprogramgrader that reads the agent's response on stdin, extracts JS/TS code blocks, and syntax/type-checks them (node --check/tsc --noEmit). This closes the "does the produced code actually run" gap from OpenAI's skill-eval methodology: building-skill tasks now verify the code parses, not just that a judge likes the prose. Wired into the openrouter-oauth happy-path task as first usage.Verified live (all on OpenRouter)
~/.agents/skills, ran suite, hook rebuilt + installed deps, passed.run-evals.tspath end-to-end, exit 0.produced_code_parsespassed 1.00 on the oauth task in a real run.CI needs an
OPENROUTER_API_KEYrepo secret (Settings → Secrets and variables → Actions). Our GitHub App has no admin here. Until set, the eval job skips with a warning rather than failing.Follow-up (out of scope)
.waza.yamlMac-prototype model labels should be reconciled; remaining methodology gaps (efficiency assertions, more negative/contextual cases, failure→new-test-case habit) are per-skill task-authoring, added incrementally.View Slack thread
🤖 Generated with Claude Code