issue/350 - Support ChatGLM model by rubik-hua · Pull Request #374 · InfiniTensor/InfiniLM

rubik-hua · 2026-05-14T03:22:29Z

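1. The attention layer is identical to GLM4's, so it is reused.
2. Compared with GLM4, the decoder layer drops post_self_attn_layernorm and post_mlp_layernorm.
3. Layer names are mapped to their llama equivalents.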
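
To illustrate item 3, here is a minimal sketch of weight-name remapping. The real remap logic lives in python/infinilm/modeling_utils.py and is not quoted in this thread, so the rule table, the function name `remap_chatglm_key`, and the chosen target names are assumptions for illustration only. The source-side names are assumed to follow the layout of the public chatglm3-6b checkpoint. The fused `query_key_value` and `dense_h_to_4h` tensors would also need splitting or dedicated handling, which this sketch does not attempt.

```python
import re

# Hypothetical rename rules: (regex on a ChatGLM-style key, llama-style replacement).
# More specific rules come first; the last rule only rewrites the layer prefix.
_RULES = [
    (r"^transformer\.embedding\.word_embeddings\.", "model.embed_tokens."),
    (r"^transformer\.encoder\.final_layernorm\.", "model.norm."),
    (r"^transformer\.output_layer\.", "lm_head."),
    (r"^transformer\.encoder\.layers\.(\d+)\.self_attention\.dense\.",
     r"model.layers.\1.self_attn.o_proj."),
    (r"^transformer\.encoder\.layers\.(\d+)\.mlp\.dense_4h_to_h\.",
     r"model.layers.\1.mlp.down_proj."),
    (r"^transformer\.encoder\.layers\.(\d+)\.", r"model.layers.\1."),
]


def remap_chatglm_key(key: str) -> str:
    """Return the llama-style name for a ChatGLM-style key, or the key unchanged."""
    for pattern, replacement in _RULES:
        new_key, count = re.subn(pattern, replacement, key)
        if count:
            return new_key
    return key


print(remap_chatglm_key("transformer.encoder.layers.3.self_attention.dense.weight"))
```

Keeping the rules in an ordered table makes it easy to check that a change for one model family (for example GLM4) does not alter the mapping of another.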

Test command: python examples/test_infer.py --device nvidia --model=/data/rubik/models/chatglm3-6b/

Test command: python examples/test_infer.py --device nvidia --model=/data/rubik/models/chatglm3-6b/ --enable-paged-attn

Inference service test command: CUDA_VISIBLE_DEVICES=0,1 python python/infinilm/server/inference_server.py --device nvidia --model=/data/rubik/models/chatglm3-6b/ --enable-paged-attn --tp=2

The remap logic in python/infinilm/modeling_utils.py was modified; verified that GLM4 behavior is unchanged:
python examples/test_infer.py --device nvidia --model=/data/rubik/models/GLM-4-9B-0414/ --enable-paged-attn

Batch test:
python examples/test_infer.py --device nvidia --model=/data/rubik/models/chatglm3-6b/ --enable-paged-attn --batch-size=8 --prompt="山东最高的山是？"

Bench test:
python examples/bench.py --device nvidia --model=/data/rubik/models/chatglm3-6b/ --enable-paged-attn

Check for dropped spaces in English output: python examples/test_infer.py --device nvidia --model=/data/rubik/models/chatglm3-6b/ --enable-paged-attn --prompt="introduce yourself"

test_benchmark.py now runs: python test/bench/test_benchmark.py --device nvidia --model=/data/rubik/models/chatglm3-6b/ --enable-paged-attn --bench=mmlu --split=val

pengcheng888 · 2026-05-14T04:55:17Z

The previous ChatGLM PR, #353, also modified the tokenizer, adding the add_special_tokens variable among other changes.

Does this PR still need those tokenizer changes?

pengcheng888 · 2026-05-14T06:50:38Z

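The previous ChatGLM PR, #353, also modified the tokenizer, adding the add_special_tokens variable among other changes.

Does this PR still need those tokenizer changes?

@rubik-hua, could you please take a look at this question about the ChatGLM model?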

wooway777 · 2026-05-14T07:08:04Z

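1.1 English output has no spaces

1.2 The model produces no output with multiple batches

infer_backup.py keeps the original single-request entry point. Some of its steps probably need to be added to the common ChatGLM3 path so that every script behaves correctly.

bench.py is not correctly supported
The offline performance test currently fails, regardless of batch size.

test_benchmark.py is not correctly supported
The accuracy test fails, and the behavior is the same whether or not paged attention is enabled.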

rubik-hua · 2026-05-14T08:21:59Z

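I'm working on it. This model has some problems, in both offline inference and the inference service.
@pengcheng888

1. The attention layer is identical to GLM4's, so it is reused.
2. The decoder is the same as standard llama.
3. Layer names are mapped to their llama equivalents.
4. Fixed batch handling in examples/test_infer.py.
5. Fixed the inconsistent key names in the bench test.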

rubik-hua · 2026-05-15T09:32:18Z

All issues raised in the review comments above have been resolved, and verification screenshots have been pasted into the top comment.

wooway777 · 2026-05-15T09:47:10Z

The model's Chinese output is abnormal in:
bench.py

test_infer.py

The service + test_perf path works normally.

rubik-hua · 2026-05-15T11:10:25Z

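@wooway777 As I understand it, the two cases in your screenshots come from a problem with the model weights themselves. I noticed this over the past two days as well: bypassing InfiniLM and testing with transformers directly shows a similar problem.

Also, try ending your question with a question mark. The output then becomes normal, which should match the model's own behavior.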

wooway777 · 2026-05-15T11:14:30Z

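@wooway777 As I understand it, the two cases in your screenshots come from a problem with the model weights themselves. I noticed this over the past two days as well: bypassing InfiniLM and testing with transformers directly shows a similar problem. Also, try ending your question with a question mark. The output then becomes normal, which should match the model's own behavior.

Understood, thank you~~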

wooway777

See the latest comment about the square brackets in test_infer for details. Thanks.

rubik-hua · 2026-05-15T18:38:17Z

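I spent an evening studying this and still think that omitting the square brackets originally was a bug. Let me lay out my reasoning; please point out any holes you see.
First, when a test is launched, the call chain into the scheduler is:
LLM.chat() --> LLM.generate() --> LLMEngine.add_request() --> scheduler.add_request()

The subsequent flow is roughly: schedule() is triggered, the scheduling result is obtained, and the inference input is built:
model_input = self.processor.build_model_inputs(
    scheduler_output,
    self.config.temperature,
    self.config.top_p,
    self.config.top_k,
)

As I understand batch processing, whichever requests the scheduler's policy decides can be handled together make up its scheduling result.

The problem is that, in the original version without the square brackets:
for content in contents:
    request_id = f"cmpl-{uuid.uuid4().hex}"
    processed_inputs = None
    if apply_chat_template:
        prompt = self.engine.apply_chat_template(
            content, add_generation_prompt=True
        )
    ...
    requests.append(req)
    self.engine.add_request(req)
Here contents=[[{'role': 'user', 'content': [{'type': 'text', 'text': '山东最高的山是？'}]}, {'role': 'user', 'content': [{'type': 'text', 'text': '山东最高的山是？'}]}, {'role': 'user', 'content': [{'type': 'text', 'text': '山东最高的山是？'}]}]]
The `for content in contents:` loop actually runs only once, so a whole test sends just one request to the scheduler. When I reached this point yesterday it already looked wrong: a single request cannot be batched.
In the end, inside BasicLLMProcessor.apply_chat_template, normalized_conversation=[
    {"role": "user", "content": "山东最高的山是？"},
    {"role": "user", "content": "山东最高的山是？"},
    {"role": "user", "content": "山东最高的山是？"}
] is what gets fed to the model.

The code analysis above is why I think the original conversations value is problematic.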
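
The nesting argument above can be checked in isolation. The sketch below is not InfiniLM code: `make_message` and `count_requests` are hypothetical helpers, and the "with brackets" shape (one single-message conversation per request) is an assumption about what the fix in test_infer.py produces. It only shows how many times a `for content in contents:` loop would iterate for each shape.

```python
def make_message(prompt: str) -> dict:
    """Build one user message in the same shape as the quoted `contents`."""
    return {"role": "user", "content": [{"type": "text", "text": prompt}]}


def count_requests(contents: list) -> int:
    """Mimic `for content in contents:`: one request per outer-list element."""
    return sum(1 for _ in contents)


batch_size = 3
prompt = "山东最高的山是？"

# Without the extra brackets: ONE conversation that holds three user messages.
single_conversation = [[make_message(prompt) for _ in range(batch_size)]]

# With the extra brackets (assumed shape): THREE conversations of one message each.
separate_conversations = [[make_message(prompt)] for _ in range(batch_size)]

print(count_requests(single_conversation))     # 1
print(count_requests(separate_conversations))  # 3
```

Under this reading, the first shape yields a single request whose prompt contains three user turns, while the second yields three independent requests that a scheduler could group into one batch.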

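Next, some verification:
Test directly with a model known to work, DeepSeek-R1-Distill-Qwen-7B.
With conversations lacking the square brackets, and the output logging adjusted slightly:
for i, output in enumerate(outputs):
    print(f"Request {i}:")
    print("===Query===")
    print(output.prompt)
    print("===Response===")
    print("output.outputs len=", len(output.outputs))
    for comp_output in output.outputs:
        print(comp_output.text)
    print("")
The actual output is as follows:

There should be 3 responses. I had assumed they were nested inside output.outputs, but in fact there is only one response.

With the square brackets added to conversations, there are 3 outputs, which I believe is the correct behavior.

With the original conversations (no square brackets), my understanding is that there is really just one request, with the three messages concatenated, and the model somehow still manages to grind out an answer.

So please take a careful look at this part, @pengcheng888 @wooway777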

wooway777 · 2026-05-16T01:03:16Z

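This part of test_infer simulates a batch-processing scenario in which the batch has already been assembled: several identical requests are packed into one batch and a single inference pass is run. Only the first request's result is displayed.

Later, during the front-end refactor, it was changed to call LLM directly for convenience, and that is how the scheduler ended up in the path. We could indeed consider simplifying this back to the earlier version.

For now, ChatGLM does seem to behave more normally with the square brackets. For the other already-supported models, however, a batch size of N results in N separate single-request, batch-size-1 inference runs, which shows up directly in the elapsed time and does not match expectations.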

rubik-hua · 2026-05-16T02:37:05Z

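@wooway777 @pengcheng888
I still don't get your point. As I understand it, without the square brackets there is just one long request, batch-size times as long, and no batch-processing flow at all.
With the square brackets, judging from the code path, it is true batch processing.
As for "only the first request's result is displayed": as I understand it there is only one request, so you could not display two even if you wanted to.

Looking at actual runs, and again setting chatglm aside, I ran the already-supported model DeepSeek-R1-Distill-Qwen-7B directly.
If it were true that "for the other already-supported models, a batch size of N results in N separate single-request, batch-size-1 inference runs, which shows up directly in the elapsed time and does not match expectations", I would expect execution time to be linear: the larger the batch-size, the longer the test, roughly batch-size times as long.

But that is not what happens in practice.

For example:
batch-size=1: 11221.17 ms
batch-size=2: 11518.49 ms
batch-size=3: 10777.32 ms
batch-size=8: 11996.36 ms
batch-size=16: 12041.67 ms
Judging from these results, the times are essentially the same.

The scheduler's default max_batch_size is 16, so the time can be expected to grow once batch_size reaches 32:
batch-size=32: 13393.15 ms
batch-size=64: 16263.91 ms
batch-size=128: 19079.1 ms
batch-size=256: 26423.16 ms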
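
As a quick arithmetic check on the figures above, the sketch below divides each reported time by the batch-size=1 time and prints it next to the factor that fully serial execution would predict. The timings are copied from this comment; the helper name `slowdown_vs_single` is hypothetical, and the comparison assumes that the runs are otherwise comparable (same prompt, same generation length).

```python
# Total times reported in this comment (batch size -> milliseconds).
timings_ms = {
    1: 11221.17,
    2: 11518.49,
    3: 10777.32,
    8: 11996.36,
    16: 12041.67,
    32: 13393.15,
    64: 16263.91,
    128: 19079.1,
    256: 26423.16,
}


def slowdown_vs_single(timings: dict) -> dict:
    """Return each total time as a multiple of the batch-size=1 time."""
    base = timings[1]
    return {bs: t / base for bs, t in timings.items()}


for bs, ratio in slowdown_vs_single(timings_ms).items():
    print(f"batch-size={bs:>3}: {ratio:.2f}x the bs=1 time (fully serial would be {bs}x)")
```

Up to batch-size=16 the ratio stays close to 1, and even at batch-size=256 it is roughly 2.35 rather than 256, which is the sublinear growth the comment describes.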

Test screenshots followed for batch sizes 1, 2, 3, 8, 16, 32, 64, 128, and 256.

wooway777 · 2026-05-16T02:59:59Z

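I may have been careless when merging the code and introduced a mistake. I'll take another look.
On Monday I'll also ask the colleague who changed this part of the code to confirm.

That said, what I observe on my side is:
With brackets, bs=1:

With brackets, bs=8:

Judging by total elapsed time, this looks very serial.

Without brackets, bs=1:

Without brackets, bs=8:

Judging by total elapsed time, this does not look like one long request.

I'll double-check the actual code. Sorry about this.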

rubik-hua · 2026-05-16T03:19:15Z

@wooway777 Thank you very much. Please do double-check, and we could also consider setting up a video call to discuss.

One thing I forgot to mention in the comment above: batch processing requires paged attention, so --enable-paged-attn must be passed at run time.
With the default static cache, the max_batch_size: int = 16 set at LLMEngine initialization has no effect.
See around line 110 of python/infinilm/llm/llm.py.
107 # Initialize KV cache based on cache type
108 if config.cache_type == "static":
109     cache_config = StaticKVCacheConfig(
110         max_batch_size=1, max_cache_len=config.max_cache_len
111     )
112     self.scheduler = StaticScheduler(max_cache_len=config.max_cache_len)
113     logger.info(
114         f"Using Static KV Cache with max_cache_len={config.max_cache_len}"
115     )

Measurements agree with this as well.
batch-size=1: python examples/test_infer.py --device nvidia --model=/data/rubik/models/DeepSeek-R1-Distill-Qwen-7B/ --batch-size=1 --prompt="山东最高的山是？"
Execution time: 7861.05 ms

batch-size=8: python examples/test_infer.py --device nvidia --model=/data/rubik/models/DeepSeek-R1-Distill-Qwen-7B/ --batch-size=8 --prompt="山东最高的山是？"
Execution time: 61173.32 ms
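
One rough way to read these two measurements is a "waves" model: if at most `max_batch_size` requests run together and every wave costs about the same, the number of waves predicts the slowdown. This is a deliberately simplified sketch, not the scheduler's actual algorithm. The function names are hypothetical, and `"paged"` is an assumed label for the non-static cache type; only `"static"` appears in the quoted snippet.

```python
import math


def effective_max_batch(cache_type: str, configured_max_batch_size: int = 16) -> int:
    """Mirror the quoted snippet: the static cache is built with max_batch_size=1."""
    return 1 if cache_type == "static" else configured_max_batch_size


def scheduling_waves(num_requests: int, cache_type: str,
                     configured_max_batch_size: int = 16) -> int:
    """Number of equal-cost waves needed under the simplified model."""
    limit = effective_max_batch(cache_type, configured_max_batch_size)
    return math.ceil(num_requests / limit)


# Static cache, 8 requests: the model predicts 8 waves, i.e. roughly 8x the bs=1 time.
predicted = scheduling_waves(8, "static")
measured = 61173.32 / 7861.05  # ratio of the two execution times reported above
print(predicted, round(measured, 2))

# "paged" is an assumed cache-type label: 8 requests fit into a single wave of up to 16.
print(scheduling_waves(8, "paged"))
```

The measured ratio for the static cache is a little under 8, which fits the prediction of eight serial single-request runs.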

rubik-hua requested a review from a team May 14, 2026 03:22

pengcheng888 reviewed May 14, 2026

Comment thread csrc/models/chatglm/chatglm_for_causal_lm.hpp

rubik-hua force-pushed the chatglm branch from 8b31bc8 to 0efd13a May 14, 2026 05:02

issue/350 - Support ChatGLM model

3c6a0f2

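1. The attention layer is identical to GLM4's, so it is reused.
2. The decoder is the same as standard llama.
3. Layer names are mapped to their llama equivalents.
4. Fixed batch handling in examples/test_infer.py.
5. Fixed the inconsistent key names in the bench test.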

rubik-hua force-pushed the chatglm branch from 0efd13a to 3c6a0f2 May 15, 2026 09:24

3 participants
