Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request consolidates the Llama benchmarking examples into a single directory and introduces a new configuration for Qwen models. The review feedback highlights a critical issue: an incompatible dataset preset for the Qwen model could lead to incorrect results. Other suggestions include improving documentation clarity for dataset downloads, reverting unnecessary import reordering in the download script to maintain PEP 8 compliance, and removing hardcoded dates from configuration files to improve reusability.
Pull request overview
This PR consolidates the separate Llama2-70B and Llama3.1-8B example folders into a single examples/05_Llama_Examples/ directory to simplify discovery and maintenance of Llama benchmarking configs/docs.
Changes:
- Update `examples/README.md` to point to a unified Llama examples folder.
- Add a consolidated Llama README plus new/relocated YAML configs and helper assets under `examples/05_Llama_Examples/`.
- Remove the old per-model example READMEs.
Reviewed changes
Copilot reviewed 7 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/README.md | Updates the examples index to reference the consolidated Llama examples folder. |
| examples/05_Llama_Examples/README.md | New consolidated instructions for running Llama 3.1-8B and Llama 2-70B benchmarks. |
| examples/05_Llama_Examples/offline_llama3_8b_cnn.yaml | Adds offline Llama 3.1-8B CNN/DailyMail config. |
| examples/05_Llama_Examples/online_llama3_8b_cnn.yaml | Adds online Llama 3.1-8B CNN/DailyMail config. |
| examples/05_Llama_Examples/online_llama2_70b_orca.yaml | Adds online (submission-mode) Llama 2-70B OpenOrca config. |
| examples/05_Llama_Examples/download_cnndm.py | Keeps/relocates CNN/DailyMail download helper script. |
| examples/05_Llama_Examples/calibration-list.txt | Adds a checked-in calibration ID list referenced by the Llama3.1 workflow. |
| examples/05_Llama_Examples/qwen-multiple-concurrency.yaml | Introduces a Qwen/SGLang config (currently schema-invalid and out of PR scope). |
| examples/05_Llama3.1-8B_Example/README.md | Removed in favor of the consolidated Llama README. |
| examples/06_Llama2-70B_Example/README.md | Removed in favor of the consolidated Llama README. |
### Offline mode

```bash
uv run inference-endpoint benchmark from-config -c offline_llama3_8b_cnn.yaml --timeout 600
```

### Online mode

```bash
uv run inference-endpoint benchmark from-config -c online_llama3_8b_cnn.yaml --timeout 600
```
The commands shown for the Llama-3.1 configs will run performance only by default: `benchmark from-config` defaults to `--mode perf` unless `type: submission` is set. Since the YAMLs include an accuracy dataset entry, readers may assume accuracy is being run when it isn't. Either document using `--mode both` (or `--mode acc`), or change these configs to `type: submission` with an appropriate `benchmark_mode` so `from-config` defaults to BOTH.
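The config-side fix suggested in this comment can be sketched as a small YAML change; the field names `type` and `benchmark_mode` are taken from the comment itself, but the exact accepted values are an assumption, not verified against the tool's schema:

```yaml
# Hypothetical fragment: marking the config as submission-type so that
# `benchmark from-config` defaults to running both perf and accuracy.
# Field values are unverified assumptions based on the review comment.
type: submission
benchmark_mode: both
```

Alternatively, leave the configs as-is and document passing `--mode both` (or `--mode acc`) explicitly on the command line.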
The Llama-3.1 YAML configs include an accuracy dataset using `eval_method: rouge`, but this section doesn't mention installing the ROUGE scoring dependencies (`nltk`, `evaluate`, `rouge_score`) that are required when running `--mode acc`/`--mode both` (the tool raises `ImportError` otherwise). Consider adding the same optional accuracy-setup snippet here (or explicitly stating these runs are perf-only).
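A minimal sketch of the fail-fast behavior this comment describes: accuracy runs need the ROUGE scoring packages, and the tool raises `ImportError` when they are absent. The function name and error wording here are hypothetical, not taken from the actual codebase:

```python
# Hypothetical dependency guard for accuracy mode (--mode acc / --mode both):
# check that the ROUGE scoring packages named in the review are importable,
# and raise ImportError with an install hint if any are missing.
def check_rouge_deps() -> None:
    missing = []
    for mod in ("nltk", "evaluate", "rouge_score"):
        try:
            __import__(mod)
        except ImportError:
            missing.append(mod)
    if missing:
        raise ImportError(
            "Accuracy mode requires: " + ", ".join(missing)
            + ". Install them with: uv pip install nltk evaluate rouge_score"
        )
```

Calling such a guard early (before loading the dataset) gives users the install hint up front instead of a deep stack trace mid-run.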
Pull request overview
Copilot reviewed 6 out of 10 changed files in this pull request and generated 1 comment.
For post-training quantization calibration, use the repository-provided [`calibration-list.txt`](./calibration-list.txt), which corresponds to the [cnn-dailymail-calibration-list](https://github.com/mlcommons/inference/blob/v4.0/calibration/CNNDailyMail/calibration-list.txt):

```bash
uv run python download_cnndm.py --save-dir data --calibration-ids-file calibration-list.txt --split train
```
This README instructs users to download calibration-list.txt via curl, but the repo now also includes calibration-list.txt in this directory. That’s redundant and can lead to confusion (or overwriting the checked-in file). Either remove the checked-in list and keep the download step, or update the instructions to use the bundled file and drop the curl command.
What does this PR do?
Consolidates the Llama2-70B and Llama3.1-8B examples into a single folder.
Type of change
Related issues
Testing
Checklist