[AMD] improve dsr1 fp4 disagg perf on mi355x #1236
billishyahao wants to merge 72 commits into main from
Conversation
…transformers v5

Transformers v5 incorrectly rebuilds the pre_tokenizer/decoder components for models like DeepSeek-R1 that use LlamaTokenizerFast with a non-Llama tokenizer architecture. The sglang server fixes this at startup, but the benchmark client loads the tokenizer without these fixes, causing a ~5x token-count inflation (e.g. 7000 tokens -> 35000 tokens) and false performance regressions in TTFT and throughput benchmarks. Apply the same tokenizer fixes (pre_tokenizer/decoder restoration and add_bos_token recovery) that the sglang server applies, so client and server tokenize identically. No-op on transformers v4.

Made-with: Cursor
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25248112683
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25254822788
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25255592771
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25267403349
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25268431600
Can we get a review for this patch? @functionstackx @Oseltamivir @cquil11
Sweep: 19 of 20 passed, 1 canceled by the user https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25241387090
Evals: all passed https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25268431600/
functionstackx
left a comment
Added a comment related to your current code of `if evals: set xyz`.
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25269775978
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25273191587
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25282687262
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284166545
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25284187965
DECODE_SERVER_CONFIG=$(echo "$DECODE_SERVER_CONFIG" | sed 's/--ep-dispatch-algorithm fake//g')
unset MORI_MOE_MAX_INPUT_TOKENS_PREFILL
unset MORI_MOE_MAX_INPUT_TOKENS_DECODE
unset SGLANG_MORI_FP8_COMB
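The eval-only branch shown in this diff amounts to stripping the fake EP dispatch flag from the decode server config and unsetting the MORI/fp8 environment variables. A rough Python equivalent, for illustration (the function name and structure are assumptions; the flag and env var names come from the diff):

```python
import os

def apply_eval_overrides(decode_server_config: str, is_eval: bool) -> str:
    """Sketch of the eval-specific overrides from the diff above.

    Hypothetical helper, not the actual script; shown only to make the
    shell logic explicit.
    """
    if not is_eval:
        return decode_server_config

    # Equivalent of: sed 's/--ep-dispatch-algorithm fake//g'
    decode_server_config = decode_server_config.replace(
        "--ep-dispatch-algorithm fake", "")

    # Equivalent of the three `unset` lines in the diff.
    for var in ("MORI_MOE_MAX_INPUT_TOKENS_PREFILL",
                "MORI_MOE_MAX_INPUT_TOKENS_DECODE",
                "SGLANG_MORI_FP8_COMB"):
        os.environ.pop(var, None)

    return decode_server_config
```

Writing it out this way makes the reviewer's concern below concrete: the fp8 combine setting is removed only on the eval path, while performance benchmark runs keep it.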
@billishyahao I don't understand why we are unsetting fp8 combine for evals only while keeping it for the performance benchmarks.
It seems like the only eval-specific change we should make is the context length, to fit the shots, not the fp8 combine setting.
Can you work with @Oseltamivir to figure it out? Happy to dedicate time on our end to work with you on it.
Replacement of #983.
The new patch adds the following optimization: