fix(evals): portable hooks + OpenRouter CI + code-exec grader — eval system review-ready by skillets-the-skill-intern[bot] · Pull Request #65 · OpenRouterTeam/skills

skillets-the-skill-intern · 2026-06-26T01:58:38Z

Makes the setup-waza eval system functional, headless-runnable, and methodologically complete on Linux/CI via OpenRouter. Three parts:

1. Portable `before_run` hooks

Replaced hardcoded Mac paths + shell-chained hooks (which fail — waza runs hooks without a shell) with one shell-less helper evals/_hooks/prepare-skill.ts. Applied to all 5 suites with hooks.

2. CI rework — run waza on OpenRouter (no azd)

Swapped the Azure azd install for waza binary + Bun + Node, driving waza through OpenRouter via COPILOT_PROVIDER_* BYOK env. .github/scripts/run-evals.ts runs every authored suite (skips the 4 stubs), uploads results, fails on any suite failure. Guard job skips cleanly if OPENROUTER_API_KEY secret is absent.

3. Code-execution verification grader (closes methodology gap #4)

evals/_graders/verify-code-runs.ts — a program grader that reads the agent's response on stdin, extracts JS/TS code blocks, and syntax/type-checks them (node --check / tsc --noEmit). This closes the "does the produced code actually run" gap from OpenAI's skill-eval methodology: building-skill tasks now verify the code parses, not just that a judge likes the prose. Wired into the openrouter-oauth happy-path task as first usage.

Verified live (all on OpenRouter)

Hooks: wiped ~/.agents/skills, ran suite, hook rebuilt + installed deps, passed.
CI invocation: ran the exact run-evals.ts path end-to-end, exit 0.
Gap-Add openrouter-oauth skill #4 grader: unit-tested (good→pass, syntax-error→fail, no-code→fail) and proven end-to-end — produced_code_parses passed 1.00 on the oauth task in a real run.

⚠️ Action needed from a repo admin

CI needs an OPENROUTER_API_KEY repo secret (Settings → Secrets and variables → Actions). Our GitHub App has no admin here. Until set, the eval job skips with a warning rather than failing.

Follow-up (out of scope)

.waza.yaml Mac-prototype model labels should be reconciled; remaining methodology gaps (efficiency assertions, more negative/contextual cases, failure→new-test-case habit) are per-skill task-authoring, added incrementally.

View Slack thread

🤖 Generated with Claude Code

Replaces hardcoded /Users/matt.apperson Mac paths and shell-chained 'cd X && cmd' hooks (which fail because waza runs hooks without a shell) with a single shell-less helper: evals/_hooks/prepare-skill.ts. The helper syncs skills/<name>/ into ~/.agents/skills/<name>/ and npm-installs bundled scripts, resolving the repo root from its own location. Applied to all five suites with hooks (models, images, stt, tts, video). Verified end-to-end on Linux against OpenRouter: hook populates the skill dir, installs deps, and the suite passes. Adds README documenting the hook gotchas + OpenRouter recipe.

perry-the-pr-maintainer

Perry's Review

Replaces five sets of hardcoded Mac-path + shell-chained hooks with a single portable prepare-skill.ts helper that is cwd-independent and shell-free.

Verdict: ✅ LGTM

Details

Risk: 🟢 Low — eval config and a new hook helper only; no auth/payment/production data path touched

CI: only the in-progress Perry check (no other CI configured on this repo)

Findings (see inline comment for full context):

nit: prepare-skill.ts:13 — comment shows wrong hook path (see inline)

Codex: disabled (small tier)

Research: disabled (small tier)

Security: no concerns — skill name is hardcoded in each eval.yaml, never caller-supplied at runtime

Test coverage: no unit test for the hook helper; acceptable for a waza eval fixture script (integration-tested by running the actual eval)

Unresolved threads: none

Review: tier=small · model=claude-sonnet-latest · score=?

Replaces the Azure azd-extension install with the waza binary + Bun + Node, and drives waza through OpenRouter via the COPILOT_PROVIDER_* BYOK env (no GitHub Copilot account needed). Adds .github/scripts/run-evals.ts which runs every AUTHORED suite (skipping the 4 'TODO: scaffolding only' stubs), uploads per-suite results, and fails the build if any suite fails. A guard job skips the eval cleanly with a warning when the OPENROUTER_API_KEY repo secret is absent (a repo admin must add it). Adds workflow_dispatch with a model input. Verified locally: the exact waza invocation (hooks + graders + --output-dir artifact) passes end-to-end against OpenRouter.

Superseded by updated Perry review

perry-the-pr-maintainer

Perry's Review

Replaces the Azure azd waza CI path with a portable Bun-driven eval runner and an OpenRouter-BYOK workflow, making the waza eval suite runnable headless.

Verdict: ✅ LGTM

Details

Re-review: incremental — 133 new lines since prior review at 957d9fe (prior verdict was APPROVE). Net-new is two new files only: .github/scripts/run-evals.ts and the rewritten .github/workflows/eval.yml. The seven evals files were byte-identical to the prior review and are out of differential scope.

Risk: 🟢 Low — CI and eval-harness only; no production, auth, billing, or routing code touched. The workflow runs only on PRs targeting main, so it does not even execute on this stacked PR.

CI: all passing ✅ (Perry's own review check)

Findings (see inline comment):

🟡 .github/scripts/run-evals.ts:24 — stub detection matches the marker anywhere in the file rather than the description field; low-probability false-skip.

Strengths: the new guard job (secret-presence check plus graceful warning skip), concurrency cancel-in-progress, and the shell-less prepare-skill.ts hook are all solid defensive CI design. Default model slug (anthropic/claude-opus-4.8) is consistent across the EVAL_MODEL env and the script fallback, and is a valid family-first OpenRouter ID.

Security: no concerns — the key is read via the repo secret, never echoed; the guard job emits only a boolean presence flag.

Test coverage: not applicable — this is test and eval infrastructure itself.

Blast radius: the sole caller of the new script is the workflow run-evals step. The discover routine will run every authored (non-stub) suite, including openrouter-oauth — note that suite carries no before_run sync hook (it predates this PR, from parent #43), so it relies on the skill already being present; not introduced or worsened here.

Stack: this PR sits on setup-waza (#43 — waza eval infra), which targets main.

Unresolved threads: one — Perry's own prior nit on evals/_hooks/prepare-skill.ts:13 (a comment-path detail). That file is unchanged in this push, so it is out of differential scope and not re-raised here.

Review: tier=medium · model=claude-opus-latest · score=4.5

perry-the-pr-maintainer · 2026-06-26T02:35:42Z

+
+function isStub(specPath: string): boolean {
+  const body = readFileSync(specPath, "utf-8");
+  return body.includes(STUB_MARKER);


[suggestion] isStub() matches the marker anywhere in the file, but the file-header comment says stubs are gated on the description starting with it — a stray match elsewhere would silently skip a real suite.

Details

Why: Line 10 documents the contract as "Skips stub suites (description starts with ...)", but body.includes(STUB_MARKER) returns true if the marker string appears ANYWHERE in the YAML — a task prompt, a grader rubric, or a comment that happens to quote "TODO: scaffolding only". An authored suite that mentions the phrase in passing would be silently dropped from CI, and the failure mode is invisible (it just prints skip (stub) and exits 0). Today every stub places the marker in the top-level description:, so it works — but the check is looser than its stated intent.

Fix: Parse the YAML and test the description field specifically, e.g. read description and check description.trimStart().startsWith(STUB_MARKER), falling back to the current whole-file scan only if parsing fails. A lighter-touch alternative: anchor the marker to a line start (/^\s*TODO: scaffolding only/m) so only a leading description line counts.

Prompt for agents

In .github/scripts/run-evals.ts, isStub() at line 22-25 reads the whole eval.yaml and returns body.includes(STUB_MARKER). The documented contract (line 10) is that a suite is a stub when its `description` field starts with "TODO: scaffolding only". Tighten isStub so it only treats a suite as a stub when the marker leads the description: either parse the YAML and check description.trimStart().startsWith(STUB_MARKER), or match the marker anchored to a line start with /^\s*TODO: scaffolding only/m. Keep behavior identical for the current stub suites.

Reviewed at 5d60778

Adds evals/_graders/verify-code-runs.ts — a waza `program` grader that reads the agent's response on stdin, extracts JS/TS code blocks, and syntax/type-checks them (node --check / tsc --noEmit). Closes the 'does the produced code actually run' gap from the OpenAI skill-eval methodology, so building-skill tasks verify the code parses rather than only that a judge likes the prose. Wired into the openrouter-oauth happy-path task as the first real usage; proven end-to-end on OpenRouter (produced_code_parses passed 1.00). Adds README documenting the grader + the waza program-grader stdin/stderr facts.

perry-the-pr-maintainer Bot previously approved these changes Jun 26, 2026

View reviewed changes

Comment thread evals/_hooks/prepare-skill.ts

skillets-the-skill-intern Bot changed the title ~~fix(evals): portable, headless-runnable before_run hooks~~ fix(evals): portable hooks + OpenRouter CI — make waza functional headless Jun 26, 2026

perry-the-pr-maintainer Bot approved these changes Jun 26, 2026

View reviewed changes

skillets-the-skill-intern Bot changed the title ~~fix(evals): portable hooks + OpenRouter CI — make waza functional headless~~ fix(evals): portable hooks + OpenRouter CI + code-exec grader — eval system review-ready Jun 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(evals): portable hooks + OpenRouter CI + code-exec grader — eval system review-ready#65

fix(evals): portable hooks + OpenRouter CI + code-exec grader — eval system review-ready#65
skillets-the-skill-intern[bot] wants to merge 3 commits into
setup-wazafrom
waza-openrouter-hooks

skillets-the-skill-intern Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

perry-the-pr-maintainer Bot left a comment

Uh oh!

Uh oh!

perry-the-pr-maintainer Bot left a comment

Uh oh!

perry-the-pr-maintainer Bot Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

Uh oh!

Conversation

skillets-the-skill-intern Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Portable before_run hooks

2. CI rework — run waza on OpenRouter (no azd)

3. Code-execution verification grader (closes methodology gap #4)

Verified live (all on OpenRouter)

⚠️ Action needed from a repo admin

Follow-up (out of scope)

Uh oh!

perry-the-pr-maintainer Bot left a comment

Choose a reason for hiding this comment

Perry's Review

Uh oh!

Uh oh!

perry-the-pr-maintainer Bot left a comment

Choose a reason for hiding this comment

Perry's Review

Uh oh!

perry-the-pr-maintainer Bot Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

skillets-the-skill-intern Bot commented Jun 26, 2026 •

edited

Loading

1. Portable `before_run` hooks