fix(agent): align results-history loader with results.json schema#219
Merged
… regression false-positive)
Four coordinated fixes to agent/sources/results_history.py (three schema-alignment fixes in _summarize, one in _collect_local_runs):
1. _summarize reads top-level aggregate_metrics (the field core/results.py actually writes), falling back to legacy metrics/run_metrics.
2. _summarize prefers summary.overall_passed for run_pass, falling back to legacy shapes.
3. _summarize orders runs by finished_at/started_at (the fields results.json actually contains), not just the legacy timestamp list.
4. _collect_local_runs now includes .agentops/results/latest/ when it is the only local results directory. In CI, generated workflows write only to that path (per services/cicd.py:28 _CI_EVAL_OUTPUT), so excluding it left local_runs empty and the regression check fell back on the stale Foundry cloud listing, flagging the previous PR's run as if it were current.
Adds 4 unit tests covering CI mode, dev-mode dedup, the metrics field, and finished_at ordering. Full suite: 817 passed, 1 skipped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix Doctor regression false-positive in CI
Problem
In CI, the AgentOps PR gate could flag the just-completed candidate's
metrics as a regression against itself. The failure looked like this in PO's
step 15 run:
1. agentops eval run wrote .agentops/results/latest/results.json with aggregate_metrics.coherence = 5.0, summary.overall_passed = true (the prompt fix worked).
2. agentops doctor then reported a critical regression on coherence, citing a previous run's id (evalrun_7223c507c18a4ccaacfde601e16e3990, the step-14 regression run that scored 2.67) as latest_run_id.
This blocked the PR for the wrong reason and broke the tutorial story.
Root cause
Three coordinated bugs in src/agentops/agent/sources/results_history.py:
1. _summarize read data.get("metrics") or data.get("run_metrics"), but the orchestrator writes top-level aggregate_metrics (see core/results.py:111). Every local RunSummary came back with metrics = {}, so the regression check could never see the current run's numbers.
2. _summarize read run_pass from the legacy metrics/summary shape, missing the canonical summary.overall_passed.
3. latest/ excluded in CI. _collect_local_runs unconditionally skipped .agentops/results/latest/. In dev that's correct (it's a pointer to the freshest timestamped dir). In CI the generated workflow runs agentops eval run --output .agentops/results/latest and writes nowhere else (see services/cicd.py:28), so local_runs was always []. Doctor then fell back to the Foundry cloud listing, which has an eventual-consistency lag of seconds to minutes, so latest = cloud_runs[-1] = the previous PR run.
Combined: latest had empty metrics → regression check couldn't see the current numbers → comparison ran against the previous regression run as if it were current → critical finding, PR blocked.
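To make the first two bugs concrete, here is a minimal sketch of the legacy lookups applied to a payload shaped like the one described above. Only the field names (aggregate_metrics, summary.overall_passed, metrics, run_metrics, run_pass) come from this PR; the payload itself is illustrative, and legacy_metrics / legacy_run_pass are hypothetical stand-ins, not the actual code in results_history.py.

```python
from typing import Any, Optional

# Illustrative payload in the shape this PR says results.json has.
results: dict = {
    "aggregate_metrics": {"coherence": 5.0},
    "summary": {"overall_passed": True},
}


def legacy_metrics(data: dict) -> dict:
    """Hypothetical stand-in for the old lookup: legacy keys only."""
    return data.get("metrics") or data.get("run_metrics") or {}


def legacy_run_pass(data: dict) -> Optional[bool]:
    """Hypothetical stand-in: looks for run_pass, never overall_passed."""
    summary: Any = data.get("summary") or {}
    metrics: Any = data.get("metrics") or {}
    if "run_pass" in summary:
        return summary["run_pass"]
    return metrics.get("run_pass")


# Neither lookup finds anything, even though the run scored 5.0 and passed.
print(legacy_metrics(results))
print(legacy_run_pass(results))
```

Under these assumptions the loader sees an empty metrics dict and no pass flag, which matches the "latest had empty metrics" link in the chain above.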
A secondary issue (also fixed): timestamp ordering used the field list timestamp / created_at / summary.timestamp, none of which results.json actually contains. The canonical fields are started_at and finished_at, so every run defaulted to epoch-zero ordering.
Fix
Schema-align results_history.py:
1. _summarize prefers aggregate_metrics, falls back to legacy metrics / run_metrics.
2. _summarize prefers summary.overall_passed, falls back to legacy summary.run_pass / metrics.run_pass.
3. _summarize orders by timestamp → finished_at → started_at → created_at → summary.timestamp.
4. _collect_local_runs captures latest/ into a separate slot and includes it only when no timestamped sibling exists. Dev-mode dedup is preserved; CI-mode emptiness is fixed.
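The three _summarize preference chains in the list above could look roughly like this. It is a sketch under assumptions, not the merged implementation: the helper names (pick_metrics, pick_run_pass, pick_sort_key) are hypothetical, and it assumes timestamps are ISO 8601 strings that datetime.fromisoformat can parse.

```python
from datetime import datetime, timezone
from typing import Any, Optional

EPOCH = datetime.fromtimestamp(0, tz=timezone.utc)


def _as_dict(value: Any) -> dict:
    """Return value if it is a dict, otherwise an empty dict."""
    return value if isinstance(value, dict) else {}


def pick_metrics(data: dict) -> dict:
    """Prefer aggregate_metrics, then the legacy metrics / run_metrics."""
    for key in ("aggregate_metrics", "metrics", "run_metrics"):
        value = data.get(key)
        if isinstance(value, dict) and value:
            return value
    return {}


def pick_run_pass(data: dict) -> Optional[bool]:
    """Prefer summary.overall_passed, then summary.run_pass, then metrics.run_pass."""
    summary = _as_dict(data.get("summary"))
    metrics = _as_dict(data.get("metrics"))
    for container, key in (
        (summary, "overall_passed"),
        (summary, "run_pass"),
        (metrics, "run_pass"),
    ):
        if isinstance(container.get(key), bool):
            return container[key]
    return None


def pick_sort_key(data: dict) -> datetime:
    """Use the first parseable timestamp in the documented order, else the epoch."""
    summary = _as_dict(data.get("summary"))
    candidates = (
        data.get("timestamp"),
        data.get("finished_at"),
        data.get("started_at"),
        data.get("created_at"),
        summary.get("timestamp"),
    )
    for raw in candidates:
        if not isinstance(raw, str):
            continue
        try:
            parsed = datetime.fromisoformat(raw)
        except ValueError:
            continue  # unparseable value: try the next field
        if parsed.tzinfo is None:
            # Assumption: treat naive timestamps as UTC so keys stay comparable.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed
    return EPOCH
```

With helpers like these, sorting runs with sorted(runs, key=pick_sort_key) would put a run that only carries finished_at after an older one, instead of leaving both at the epoch.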
No CLI flags added, no public contract changed, no exit-code semantics
touched.
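The latest/ handling from the Fix list can be sketched as follows. This is an assumption-laden illustration rather than the real _collect_local_runs: the function name collect_result_dirs is hypothetical, and it assumes each run directory under the results root holds a results.json.

```python
from pathlib import Path
from typing import Optional


def collect_result_dirs(results_root: Path) -> list:
    """Return run directories to load, keeping latest/ only as a last resort."""
    if not results_root.is_dir():
        return []

    timestamped = []
    latest: Optional[Path] = None  # separate slot for the latest/ directory

    for child in sorted(results_root.iterdir()):
        if not child.is_dir() or not (child / "results.json").is_file():
            continue
        if child.name == "latest":
            latest = child
        else:
            timestamped.append(child)

    if timestamped:
        # Dev mode: latest/ points at one of these, so skip it to avoid a duplicate.
        return timestamped
    # CI mode: latest/ is the only place results were written.
    return [latest] if latest is not None else []
```

Keeping latest/ in its own slot, rather than appending it to the same list, is what lets the dev-mode dedup and the CI-mode fallback coexist in one pass over the directory.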
Tests
Added 4 unit tests to tests/unit/test_agent_results_history.py:
1. test_collect_results_history_loads_latest_only_in_ci: latest/ is the only dir → loader returns it.
2. test_collect_results_history_prefers_timestamped_over_latest: when a timestamped sibling exists, latest/ is still skipped (existing dev-mode behavior).
3. test_collect_results_history_reads_aggregate_metrics_field: schema alignment for the metrics field.
4. test_collect_results_history_orders_by_finished_at: schema alignment for ordering.
Full suite: python -m pytest tests/ -x -q → 817 passed, 1 skipped.
Release urgency
Recommend cutting v0.3.3 immediately after merge. PO is mid-tutorial-recording and his repo's generated workflows install agentops-accelerator @ git+...@main, so the fix reaches his next PR run as soon as it lands on main. The PyPI release tag covers external users.
Out of scope (filed for later)
opex._check_flaky_metric and release_readiness._check_baseline may double-count if the same run shows up in both the local loader and the Foundry cloud listing. Low impact, deferred.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>