fix(agent): align results-history loader with results.json schema by placerda · Pull Request #219 · Azure/agentops

placerda · 2026-05-31T13:19:42Z

Fix Doctor regression false-positive in CI

Problem

In CI, the AgentOps PR gate could flag the just-completed candidate's
metrics as a regression against itself. The failure looked like this in PO's
step 15 run:

1. agentops eval run wrote .agentops/results/latest/results.json with
   aggregate_metrics.coherence = 5.0 and summary.overall_passed = true
   (the prompt fix worked).
2. agentops doctor then reported a critical regression on coherence,
   citing a previous run's id (evalrun_7223c507c18a4ccaacfde601e16e3990,
   the step-14 regression run that scored 2.67) as latest_run_id.

This blocked the PR for the wrong reason and broke the tutorial story.

Root cause

Three coordinated bugs in src/agentops/agent/sources/results_history.py:

1. Wrong metrics field. _summarize read data.get("metrics") or
   data.get("run_metrics"), but the orchestrator writes top-level
   aggregate_metrics (see core/results.py:111). Every local
   RunSummary came back with metrics = {}, so the regression check
   could never see the current run's numbers.
2. Wrong pass field. _summarize read run_pass from the legacy
   metrics/summary shape, missing the canonical
   summary.overall_passed.
3. latest/ excluded in CI. _collect_local_runs unconditionally
   skipped .agentops/results/latest/. In dev that's correct (it's a
   pointer to the freshest timestamped dir). In CI the generated
   workflow runs agentops eval run --output .agentops/results/latest
   and writes nowhere else (see services/cicd.py:28), so
   local_runs was always []. Doctor then fell back to the Foundry
   cloud listing, which has eventual-consistency lag of seconds to
   minutes, so latest = cloud_runs[-1] = the previous PR run.

Combined: latest had empty metrics → regression check couldn't see the
current numbers → comparison ran against the previous regression run as
if it were current → critical finding, PR blocked.
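The chain above can be reproduced in a few lines. This is a minimal sketch, not the code in results_history.py: legacy_metrics mirrors the lookup quoted in the root cause, while pick_latest, the run_id key, and the shape of the cloud entry are hypothetical stand-ins for illustration.

```python
# Payload trimmed to the two fields this PR names as canonical.
results_json = {
    "aggregate_metrics": {"coherence": 5.0},
    "summary": {"overall_passed": True},
}


def legacy_metrics(data):
    """Pre-fix lookup: neither key exists in the canonical schema."""
    return data.get("metrics") or data.get("run_metrics") or {}


def pick_latest(local_runs, cloud_runs):
    """Hypothetical fallback: use the cloud listing when no local run loaded."""
    runs = local_runs or cloud_runs
    return runs[-1] if runs else None


# The cloud listing lags, so its newest entry is still the step-14 run.
cloud_runs = [{"run_id": "step-14-run", "metrics": {"coherence": 2.67}}]

current_metrics = legacy_metrics(results_json)  # {} : the 5.0 is invisible
latest = pick_latest([], cloud_runs)            # the previous run, not the candidate
```

With latest/ skipped, the local list is empty, so the stale 2.67 score is what the regression check ends up treating as current.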

A secondary issue (also fixed): timestamp ordering read the fields
timestamp / created_at / summary.timestamp, none of which
results.json actually contains. The canonical fields are started_at
and finished_at, so every run fell back to the same epoch-zero sort key
and the sort could not tell newer runs from older ones.
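To illustrate the ordering problem, here is a small sketch comparing the legacy field list with the aligned one. The sort_key helper, the id key, and the ISO 8601 timestamp format are assumptions for the example; the real results.json timestamp format is not shown in this PR.

```python
from datetime import datetime

LEGACY_FIELDS = ("timestamp", "created_at")
ALIGNED_FIELDS = ("timestamp", "finished_at", "started_at", "created_at")


def sort_key(data, fields):
    """Return the first available timestamp as epoch seconds, else 0.0."""
    candidates = [data.get(name) for name in fields]
    # summary.timestamp is the last fallback in both field lists.
    candidates.append((data.get("summary") or {}).get("timestamp"))
    for value in candidates:
        if value:
            return datetime.fromisoformat(value).timestamp()
    return 0.0


older = {"id": "older", "finished_at": "2026-05-31T12:05:00+00:00"}
newer = {"id": "newer", "finished_at": "2026-05-31T13:05:00+00:00"}
runs = [newer, older]  # arbitrary discovery order

# Every legacy key is 0.0, so the stable sort just keeps the input order.
legacy_last = sorted(runs, key=lambda r: sort_key(r, LEGACY_FIELDS))[-1]["id"]
aligned_last = sorted(runs, key=lambda r: sort_key(r, ALIGNED_FIELDS))[-1]["id"]
```

Under these assumptions legacy_last is "older" and aligned_last is "newer": with identical keys, whichever run happened to be listed last is reported as the latest.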

Fix

Schema-align results_history.py:

1. _summarize prefers aggregate_metrics, falls back to legacy
   metrics / run_metrics.
2. _summarize prefers summary.overall_passed, falls back to legacy
   summary.run_pass / metrics.run_pass.
3. _summarize orders by timestamp → finished_at → started_at →
   created_at → summary.timestamp.
4. _collect_local_runs captures latest/ into a separate slot and
   includes it only when no timestamped sibling exists. Dev-mode dedup
   is preserved; CI-mode emptiness is fixed.

No CLI flags added, no public contract changed, no exit-code semantics
touched.
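The _summarize preferences described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged implementation: summarize, first_present, and the returned dict shape are hypothetical names, and only the field names and their fallback order come from this description.

```python
def first_present(*values):
    """Return the first value that is not None, else None."""
    for value in values:
        if value is not None:
            return value
    return None


def summarize(data):
    """Read canonical results.json fields first, legacy shapes second."""
    summary = data.get("summary") or {}
    legacy_metrics = data.get("metrics") or {}

    metrics = (
        data.get("aggregate_metrics")
        or legacy_metrics
        or data.get("run_metrics")
        or {}
    )
    # "is not None" matters here: a run that failed (False) must not be
    # skipped in favour of a later fallback.
    run_pass = first_present(
        summary.get("overall_passed"),
        summary.get("run_pass"),
        legacy_metrics.get("run_pass"),
    )
    timestamp = first_present(
        data.get("timestamp"),
        data.get("finished_at"),
        data.get("started_at"),
        data.get("created_at"),
        summary.get("timestamp"),
    )
    return {"metrics": metrics, "run_pass": run_pass, "timestamp": timestamp}


canonical = summarize({
    "aggregate_metrics": {"coherence": 5.0},
    "summary": {"overall_passed": True},
    "started_at": "2026-05-31T13:00:00+00:00",
    "finished_at": "2026-05-31T13:05:00+00:00",
})
legacy = summarize({"metrics": {"coherence": 2.67, "run_pass": False}})
```

Keeping the legacy keys as fallbacks means older results files still load, which is consistent with the statement that no public contract changed.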

Tests

Added 4 unit tests to tests/unit/test_agent_results_history.py:

1. test_collect_results_history_loads_latest_only_in_ci: latest/ is
   the only dir → loader returns it.
2. test_collect_results_history_prefers_timestamped_over_latest: when
   a timestamped sibling exists, latest/ is still skipped (existing
   dev-mode behavior).
3. test_collect_results_history_reads_aggregate_metrics_field: schema
   alignment for the metrics field.
4. test_collect_results_history_orders_by_finished_at: schema
   alignment for ordering.

Full suite: python -m pytest tests/ -x -q → 817 passed, 1 skipped.
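The directory-selection rule that the CI-mode and dev-mode tests exercise can be sketched against temporary directories. This is a stand-in, not the code in tests/unit/test_agent_results_history.py: select_run_dirs and write_run are hypothetical helpers, and the timestamped directory name is an assumed format.

```python
import json
import tempfile
from pathlib import Path


def select_run_dirs(results_root):
    """Prefer timestamped run dirs; use latest/ only when it stands alone."""
    timestamped, latest = [], None
    for child in sorted(Path(results_root).iterdir()):
        if not (child / "results.json").is_file():
            continue
        if child.name == "latest":
            latest = child  # held in a separate slot
        else:
            timestamped.append(child)
    if timestamped:
        return timestamped
    return [latest] if latest is not None else []


def write_run(results_root, name):
    """Create <root>/<name>/results.json with a minimal payload."""
    run_dir = Path(results_root) / name
    run_dir.mkdir(parents=True)
    payload = {"aggregate_metrics": {"coherence": 5.0}}
    (run_dir / "results.json").write_text(json.dumps(payload))
    return run_dir


# CI mode: the workflow writes only to latest/.
with tempfile.TemporaryDirectory() as root:
    write_run(root, "latest")
    ci_dirs = [p.name for p in select_run_dirs(root)]

# Dev mode: latest/ duplicates the freshest timestamped dir, so it is skipped.
with tempfile.TemporaryDirectory() as root:
    write_run(root, "latest")
    write_run(root, "2026-05-31_131500")
    dev_dirs = [p.name for p in select_run_dirs(root)]
```

Here ci_dirs is ["latest"] and dev_dirs is ["2026-05-31_131500"], matching the two behaviors the first two tests pin down.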

Release urgency

Recommend cutting v0.3.3 immediately after merge. PO is
mid-tutorial-recording and his repo's generated workflows install
agentops-accelerator @ git+...@main, so the fix reaches his next PR run
as soon as it lands on main. The PyPI release tag covers external users.

Out of scope (filed for later)

opex._check_flaky_metric and release_readiness._check_baseline
may double-count if the same run shows up in both the local loader and
the Foundry cloud listing. Low impact, deferred.
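For context on the deferred item, one possible guard is to merge the two sources by run id before counting. This is not part of this PR and not a description of opex or release_readiness; merge_runs and the run_id and source keys are hypothetical.

```python
def merge_runs(local_runs, cloud_runs):
    """Combine both sources, keeping the first copy of each run id."""
    seen, merged = set(), []
    # Local runs go first so the local copy wins over the cloud copy.
    for run in [*local_runs, *cloud_runs]:
        run_id = run.get("run_id")
        if run_id is not None:
            if run_id in seen:
                continue
            seen.add(run_id)
        merged.append(run)
    return merged


local_runs = [{"run_id": "a", "source": "local"}]
cloud_runs = [
    {"run_id": "a", "source": "cloud"},
    {"run_id": "b", "source": "cloud"},
]
merged = merge_runs(local_runs, cloud_runs)
```

Without the merge, run "a" would appear twice in a naive concatenation; with it, merged holds two runs and keeps the local copy of "a". Runs that lack an id are kept as-is rather than guessed at.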

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… regression false-positive)

Four coordinated schema-alignment fixes to agent/sources/results_history.py:

1. _summarize reads top-level aggregate_metrics (the field core/results.py actually writes), falling back to legacy metrics/run_metrics.
2. _summarize prefers summary.overall_passed for run_pass, falling back to legacy shapes.
3. _summarize orders runs by finished_at/started_at (the fields results.json actually contains), not just the legacy timestamp list.
4. _collect_local_runs now includes .agentops/results/latest/ when it is the only local results directory. In CI, generated workflows write only to that path (per services/cicd.py:28 _CI_EVAL_OUTPUT), so excluding it left local_runs empty and the regression check fell back on the stale Foundry cloud listing, flagging the previous PR's run as if it were current.

Adds 4 unit tests covering CI mode, dev-mode dedup, the metrics field, and finished_at ordering. Full suite: 817 passed, 1 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

placerda merged commit 5ea090c into develop May 31, 2026
12 checks passed

placerda deleted the feature/fix-regression-false-positive branch May 31, 2026 13:23
