feat: add Scenario-Check-Pass-Rate metric and update baselines #13

Open

wumingxiami wants to merge 1 commit into main from scenario_check_0503

Conversation

@wumingxiami (Collaborator)

Summary

  • Add a new Scenario-Check-Pass-Rate metric to evaluate model behavior in realistic scenarios (see the sketch after this list)
  • Update M2.5 and M2.7 baseline metrics with May 2026 evaluation results
  • Add validator/scenario_check.py for scenario-based validation
  • Clean up obsolete output files (comparison_report, metrics_report_new, monitor.log, status.json)
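
For illustration only, a pass-rate metric of this kind typically reduces to passed checks over total scenarios. The helper below is a minimal sketch under that assumption; the function name, scenario shape, and check signature are hypothetical, not the actual validator/scenario_check.py API.

```python
from typing import Callable, Iterable

def scenario_check_pass_rate(
    scenarios: Iterable[dict],
    check: Callable[[dict], bool],
) -> float:
    """Hypothetical helper: fraction of scenarios whose check passes."""
    results = [check(s) for s in scenarios]
    return sum(results) / len(results) if results else 0.0
```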

Changes

  • validator/scenario_check.py — New scenario-check module
  • verify.py — Integrate the scenario check into the verification pipeline (one possible wiring is sketched after this list)
  • scripts/ — Update metric calculation and comparison scripts
  • sample.jsonl — Add new test case
  • README.md / README_CN.md — Document new metric
  • output-dir/ — Refresh M2.5/M2.7 baseline results
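
As a rough sketch of how these pieces could fit together, assuming a JSONL sample file and a per-sample pass flag — load_jsonl, the scenario_check_passed field, and report_metrics are illustrative names, not code from this PR:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per line, as in a file like sample.jsonl."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def report_metrics(samples: list[dict]) -> dict[str, float]:
    """Compute existing metrics, then add the new pass rate alongside them."""
    metrics: dict[str, float] = {}  # ToolCalls-Match-Rate etc. would be filled in here
    passed = sum(1 for s in samples if s.get("scenario_check_passed"))
    metrics["Scenario-Check-Pass-Rate"] = passed / len(samples) if samples else 0.0
    return metrics
```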

Test plan

  • Run python verify.py with M2.5 provider config and verify Scenario-Check-Pass-Rate is reported
  • Run python verify.py with M2.7 provider config and verify that the baseline comparison works (see the sketch after this list)
  • Verify existing metrics (ToolCalls-Match-Rate, ToolCalls-Accuracy) are unaffected
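
A minimal sketch of what that baseline comparison might flag, assuming baselines are stored as a flat JSON map of metric name to value — the file layout and tolerance are assumptions for illustration:

```python
import json

def metrics_drift(current: dict[str, float], baseline_path: str,
                  tolerance: float = 0.01) -> list[str]:
    """Return names of metrics that moved beyond the tolerance vs. the baseline."""
    with open(baseline_path, encoding="utf-8") as f:
        baseline = json.load(f)
    return [name for name, base in baseline.items()
            if name in current and abs(current[name] - base) > tolerance]
```

An empty result would confirm the third bullet: the existing ToolCalls metrics stayed within tolerance.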

…26 baseline

Add special scenario validation (Scenario-Check-Pass-Rate) as a new evaluation metric
and update baseline data with May 2026 test results (100 concurrency, 10 rounds).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>