feat(eval): validate trace (behavioral) expectations (expect.trace) #16
Open
kunalkushwaha wants to merge 1 commit into
Implements the previously stubbed `expect.trace` assertions in the eval runner,
so tests can check *how* an agent produced an answer, not just the content.
- ValidateTrace: pure validator for tool_calls (subset), llm_calls (exact),
execution_path (ordered subsequence), and min/max_steps.
- buildObservedTrace: normalizes the EvalServer trace + invoke `tools_called`
into an ObservedTrace (distinct tools, LLM-call count, path, step count).
- HTTPTarget.FetchTrace: fetches a run's trace from GET /traces/{id}.
- Runner: after the content match, fetches the trace and validates it; tool
calls fall back to the invoke response when the trace can't be fetched.
- Docs: new "Trace (Behavioral) Assertions" section in docs/EVAL.md.
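The assertion semantics listed above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the field names on `TraceExpectation` and `ObservedTrace`, and the signature and return type of `ValidateTrace`, are assumptions made for the example.

```go
package main

import "fmt"

// TraceExpectation uses hypothetical field names; the PR names the type
// but its actual definition is not shown here.
type TraceExpectation struct {
	ToolCalls     []string // every listed tool must have been called (subset)
	LLMCalls      *int     // exact LLM-call count, checked only when set
	ExecutionPath []string // must appear in order, gaps allowed
	MinSteps      *int     // lower bound on step count, checked only when set
	MaxSteps      *int     // upper bound on step count, checked only when set
}

// ObservedTrace is likewise an assumed shape for the normalized trace.
type ObservedTrace struct {
	Tools    []string
	LLMCalls int
	Path     []string
	Steps    int
}

// isOrderedSubsequence reports whether want occurs within got in the same
// order, not necessarily contiguously.
func isOrderedSubsequence(want, got []string) bool {
	i := 0
	for _, g := range got {
		if i < len(want) && g == want[i] {
			i++
		}
	}
	return i == len(want)
}

// ValidateTrace is pure: it returns one message per failed assertion and
// an empty slice when everything passes.
func ValidateTrace(exp TraceExpectation, obs ObservedTrace) []string {
	var failures []string

	seen := make(map[string]bool, len(obs.Tools))
	for _, t := range obs.Tools {
		seen[t] = true
	}
	for _, t := range exp.ToolCalls {
		if !seen[t] {
			failures = append(failures, fmt.Sprintf("tool_calls: %q was not called", t))
		}
	}
	if exp.LLMCalls != nil && obs.LLMCalls != *exp.LLMCalls {
		failures = append(failures, fmt.Sprintf("llm_calls: want %d, got %d", *exp.LLMCalls, obs.LLMCalls))
	}
	if !isOrderedSubsequence(exp.ExecutionPath, obs.Path) {
		failures = append(failures, fmt.Sprintf("execution_path: %v not found in order in %v", exp.ExecutionPath, obs.Path))
	}
	if exp.MinSteps != nil && obs.Steps < *exp.MinSteps {
		failures = append(failures, fmt.Sprintf("min_steps: want >= %d, got %d", *exp.MinSteps, obs.Steps))
	}
	if exp.MaxSteps != nil && obs.Steps > *exp.MaxSteps {
		failures = append(failures, fmt.Sprintf("max_steps: want <= %d, got %d", *exp.MaxSteps, obs.Steps))
	}
	return failures
}

func main() {
	two, three := 2, 3
	exp := TraceExpectation{
		ToolCalls:     []string{"search"},
		LLMCalls:      &two,
		ExecutionPath: []string{"llm", "search"},
		MaxSteps:      &three,
	}
	obs := ObservedTrace{
		Tools:    []string{"search", "calc"},
		LLMCalls: 2,
		Path:     []string{"llm", "search", "llm"},
		Steps:    3,
	}
	fmt.Println("failures:", len(ValidateTrace(exp, obs)))
}
```

Keeping the validator free of I/O is what lets it be table-tested without an EvalServer; pointer fields are one way to distinguish "not asserted" from "assert zero".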
Adds the eval package's first unit tests (validator, normalizer, and an
httptest-backed FetchTrace), so this is verified without a live LLM.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
What
Implements the previously stubbed `expect.trace` assertions in the eval runner. Today the runner has a literal `// TODO: Validate trace expectations` and the fully-typed `TraceExpectation` struct goes unused. This wires it up so eval tests can check how an agent reached its answer, not just the output text.

This is feature B2 (behavioral assertions) from the `FEATURES.md` roadmap.

How it works
After the content match passes, the runner fetches the run's trace from the EvalServer (`GET /traces/{id}`) and evaluates the assertions. Tool calls also use the `tools_called` field from the `/invoke` response, so `tool_calls` is still checked even if the trace fetch fails.

- `tool_calls`: subset match
- `llm_calls`: exact match
- `execution_path`: ordered subsequence
- `min_steps` / `max_steps`: bounds on the step count

Changes
- `internal/eval/trace_validator.go`: `ValidateTrace` (pure) + `buildObservedTrace` normalizer + minimal decode types.
- `internal/eval/http_target.go`: `FetchTrace(traceID)` against `GET /traces/{id}`.
- `internal/eval/runner.go`: wires validation in after the content match (replaces the TODO).
- `docs/EVAL.md`: new "Trace (Behavioral) Assertions" section documenting the real `expect.trace` schema.

Testing
- `go build`, `go vet`, `go test ./...`, `gofmt` all green.
- `eval` package: `ValidateTrace` (11 cases), `buildObservedTrace` (with/without a trace), `isOrderedSubsequence`, and an `httptest`-backed `FetchTrace` (success + error paths), so the behavior is verified without a live LLM/EvalServer.

🤖 Generated with Claude Code
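
For readers unfamiliar with the pattern, an `httptest`-backed check of `FetchTrace` might look roughly like the sketch below. The `Trace` decode type, the `HTTPTarget` fields, the JSON shape, and the `newFakeEvalServer` helper are all hypothetical; only the `GET /traces/{id}` route and the `FetchTrace(traceID)` name come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// Trace is a hypothetical minimal decode type; the real trace schema
// is not shown in this PR description.
type Trace struct {
	ID    string   `json:"id"`
	Steps []string `json:"steps"`
}

// HTTPTarget mirrors the type name used in the PR; its fields are assumed.
type HTTPTarget struct {
	BaseURL string
	Client  *http.Client
}

// FetchTrace requests GET /traces/{id} and decodes the JSON body.
// Any non-200 status is reported as an error.
func (t *HTTPTarget) FetchTrace(traceID string) (*Trace, error) {
	resp, err := t.Client.Get(t.BaseURL + "/traces/" + url.PathEscape(traceID))
	if err != nil {
		return nil, fmt.Errorf("fetch trace %s: %w", traceID, err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch trace %s: unexpected status %d", traceID, resp.StatusCode)
	}
	var tr Trace
	if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
		return nil, fmt.Errorf("decode trace %s: %w", traceID, err)
	}
	return &tr, nil
}

// newFakeEvalServer (hypothetical helper) serves a single canned trace and
// answers 404 for every other path, standing in for a live EvalServer.
func newFakeEvalServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/traces/run-1", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(Trace{ID: "run-1", Steps: []string{"llm", "search", "llm"}})
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeEvalServer()
	defer srv.Close()
	target := &HTTPTarget{BaseURL: srv.URL, Client: srv.Client()}

	// Success path: the canned trace decodes cleanly.
	tr, err := target.FetchTrace("run-1")
	fmt.Println(tr.ID, len(tr.Steps), err)

	// Error path: an unknown ID yields a 404, surfaced as an error.
	_, err = target.FetchTrace("missing")
	fmt.Println(err != nil)
}
```

Because the fake server runs in-process, both the success and error paths can be exercised in `go test` without network access or a running agent.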