Skip to content

fix(agent): align results-history loader with results.json schema#219

Merged
placerda merged 1 commit into
developfrom
feature/fix-regression-false-positive
May 31, 2026
Merged

fix(agent): align results-history loader with results.json schema#219
placerda merged 1 commit into
developfrom
feature/fix-regression-false-positive

Conversation

@placerda
Copy link
Copy Markdown
Contributor

Fix Doctor regression false-positive in CI

Problem

In CI, the AgentOps PR gate could flag the just-completed candidate's
metrics as a regression against itself. The failure looked like this in PO's
step 15 run:

  • agentops eval run wrote .agentops/results/latest/results.json with
    aggregate_metrics.coherence = 5.0, summary.overall_passed = true
    (the prompt fix worked).
  • agentops doctor then reported a critical regression on coherence,
    citing a previous run's id (evalrun_7223c507c18a4ccaacfde601e16e3990,
    the step-14 regression run that scored 2.67) as latest_run_id.

This blocked the PR for the wrong reason and broke the tutorial story.

Root cause

Three coordinated bugs in src/agentops/agent/sources/results_history.py:

  1. Wrong metrics field. _summarize read data.get("metrics") or data.get("run_metrics"), but the orchestrator writes top-level
    aggregate_metrics (see core/results.py:111). Every local
    RunSummary came back with metrics = {}, so the regression check
    could never see the current run's numbers.
  2. Wrong pass field. _summarize read run_pass from the legacy
    metrics/summary shape, missing the canonical
    summary.overall_passed.
  3. latest/ excluded in CI. _collect_local_runs unconditionally
    skipped .agentops/results/latest/. In dev that's correct (it's a
    pointer to the freshest timestamped dir). In CI the generated
    workflow runs agentops eval run --output .agentops/results/latest
    and writes nowhere else (see services/cicd.py:28), so
    local_runs was always []. Doctor then fell back to the Foundry
    cloud listing — which has eventual-consistency lag of seconds to
    minutes — so latest = cloud_runs[-1] = the previous PR run.

Combined: latest had empty metrics → regression check couldn't see the
current numbers → comparison ran against the previous regression run as
if it were current → critical finding, PR blocked.

A secondary issue (also fixed): timestamp ordering used the field list
timestamp / created_at / summary.timestamp — none of which
results.json actually contains. The canonical fields are started_at
and finished_at, so every run defaulted to epoch-zero ordering.

Fix

Schema-align results_history.py:

  1. _summarize prefers aggregate_metrics, falls back to legacy
    metrics / run_metrics.
  2. _summarize prefers summary.overall_passed, falls back to legacy
    summary.run_pass / metrics.run_pass.
  3. _summarize orders by timestampfinished_atstarted_at
    created_atsummary.timestamp.
  4. _collect_local_runs captures latest/ into a separate slot and
    includes it only when no timestamped sibling exists. Dev-mode dedup
    is preserved; CI-mode emptiness is fixed.

No CLI flags added, no public contract changed, no exit-code semantics
touched.

Tests

Added 4 unit tests to tests/unit/test_agent_results_history.py:

  • test_collect_results_history_loads_latest_only_in_cilatest/ is
    the only dir → loader returns it.
  • test_collect_results_history_prefers_timestamped_over_latest — when
    a timestamped sibling exists, latest/ is still skipped (existing
    dev-mode behavior).
  • test_collect_results_history_reads_aggregate_metrics_field — schema
    alignment for the metrics field.
  • test_collect_results_history_orders_by_finished_at — schema
    alignment for ordering.

Full suite: python -m pytest tests/ -x -q817 passed, 1 skipped.

Release urgency

Recommend cutting v0.3.3 immediately after merge. PO is mid-tutorial-
recording and his repo's generated workflows install
agentops-accelerator @ git+...@main, so the fix reaches his next PR run
as soon as it lands on main. The PyPI release tag covers external users.

Out of scope (filed for later)

  • opex._check_flaky_metric and release_readiness._check_baseline
    may double-count if the same run shows up in both the local loader and
    the Foundry cloud listing. Low impact, deferred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

… regression false-positive)

Three coordinated schema-alignment fixes to agent/sources/results_history.py:

1. _summarize reads top-level aggregate_metrics (the field core/results.py actually writes), falling back to legacy metrics/run_metrics.

2. _summarize prefers summary.overall_passed for run_pass, falling back to legacy shapes.

3. _summarize orders runs by finished_at/started_at (the fields results.json actually contains), not just the legacy timestamp list.

4. _collect_local_runs now includes .agentops/results/latest/ when it is the only local results directory. In CI, generated workflows write only to that path (per services/cicd.py:28 _CI_EVAL_OUTPUT), so excluding it left local_runs empty and the regression check fell back on the stale Foundry cloud listing - flagging the previous PR's run as if it were current.

Adds 4 unit tests covering CI mode, dev-mode dedup, the metrics field, and finished_at ordering. Full suite: 817 passed, 1 skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@placerda placerda merged commit 5ea090c into develop May 31, 2026
12 checks passed
@placerda placerda deleted the feature/fix-regression-false-positive branch May 31, 2026 13:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant