feat: add Scenario-Check-Pass-Rate metric and update baselines by wumingxiami · Pull Request #13 · MiniMax-AI/MiniMax-Provider-Verifier

wumingxiami · 2026-05-03T05:23:58Z

Summary

Add new Scenario-Check-Pass-Rate metric to evaluate model behavior in realistic scenarios
Update M2.5 and M2.7 baseline metrics with May 2026 evaluation results
Add validator/scenario_check.py for scenario-based validation
Clean up obsolete output files (comparison_report, metrics_report_new, monitor.log, status.json)
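
The PR text does not spell out how Scenario-Check-Pass-Rate is computed. As a minimal sketch, assuming it is the fraction of scenario checks that pass, the calculation could look like the following. `ScenarioResult` and `scenario_check_pass_rate` are hypothetical names for illustration, not the contents of validator/scenario_check.py, and the sample results are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    """Outcome of one scenario check (hypothetical structure, not from the PR)."""
    case_id: str
    passed: bool


def scenario_check_pass_rate(results: list[ScenarioResult]) -> float:
    """Return the fraction of scenario checks that passed, or 0.0 if there are none."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Placeholder results for illustration only; these are not evaluation data.
results = [
    ScenarioResult("case-1", True),
    ScenarioResult("case-2", False),
    ScenarioResult("case-3", True),
    ScenarioResult("case-4", True),
]
print(f"Scenario-Check-Pass-Rate: {scenario_check_pass_rate(results):.2%}")
```

Returning 0.0 for an empty result set is one possible choice; the actual module may instead skip the metric or raise an error when no scenario cases are present.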

Changes

validator/scenario_check.py — New scenario check module
verify.py — Integrate scenario check into verification pipeline
scripts/ — Update metric calculation and comparison scripts
sample.jsonl — Add new test case
README.md / README_CN.md — Document new metric
output-dir/ — Refresh M2.5/M2.7 baseline results

Test plan

Run python verify.py with M2.5 provider config and verify Scenario-Check-Pass-Rate is reported
Run python verify.py with M2.7 provider config and verify baseline comparison works
Verify existing metrics (ToolCalls-Match-Rate, ToolCalls-Accuracy) are unaffected
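
The last check in the test plan could be automated by comparing metric reports from before and after the change. The sketch below assumes each report is a flat mapping from metric name to a float; that layout, the `changed_metrics` helper, and the numbers are assumptions for illustration and do not reflect the repository's actual output format or results.

```python
import math

# Metric names taken from the test plan above.
EXISTING_METRICS = ["ToolCalls-Match-Rate", "ToolCalls-Accuracy"]


def changed_metrics(before: dict, after: dict, names: list, tol: float = 1e-9) -> list:
    """Return the names of metrics that are missing or differ by more than tol."""
    changed = []
    for name in names:
        if name not in before or name not in after:
            changed.append(name)
        elif not math.isclose(before[name], after[name], rel_tol=0.0, abs_tol=tol):
            changed.append(name)
    return changed


# Placeholder values for illustration only; these are not real baseline numbers.
before = {"ToolCalls-Match-Rate": 0.90, "ToolCalls-Accuracy": 0.80}
after = {
    "ToolCalls-Match-Rate": 0.90,
    "ToolCalls-Accuracy": 0.80,
    "Scenario-Check-Pass-Rate": 0.75,  # the new metric is ignored by the comparison
}

diff = changed_metrics(before, after, EXISTING_METRICS)
print("unaffected" if not diff else f"changed: {diff}")
```

Because model outputs can vary between runs, an exact comparison like this may be too strict in practice; a larger `tol` would be a reasonable adjustment.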

…26 baseline

Add special scenario validation (Scenario-Check-Pass-Rate) as a new evaluation metric and update baseline data with May 2026 test results (100 concurrency, 10 rounds).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

wumingxiami wants to merge 1 commit into main from scenario_check_0503
