
refactor: unify linear/quantization architecture and remove deprecated interfaces #366

Open
qinyiqun wants to merge 5 commits into main from refactor/unify-linear-quantization

Conversation

@qinyiqun
Contributor

@qinyiqun qinyiqun commented May 12, 2026

Summary

  • Move linear module from InfiniCore to InfiniLM with quantization-based dispatch
  • Add GPTQ->GPTQ_QY weight conversion gated by QY device type
  • Implement fused linear weight splitting and re-registration
  • Fix TP split dimensions for all quantization schemes
  • Add alpha scaling parameter and logical dim size delegation
  • Move set_zeros/set_minus_one to utils.hpp as shared utilities
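
The "quantization-based dispatch" and "GPTQ->GPTQ_QY conversion gated by QY device type" from the bullets above can be sketched roughly as follows. This is a minimal illustration, not the InfiniLM code: the enum values, class names, and `make_linear` factory are all hypothetical stand-ins for the actual linear-module constructors.

```cpp
#include <memory>
#include <string>

// Hypothetical quantization schemes; the real project has its own enum.
enum class QuantScheme { None, GPTQ, GPTQ_QY };

struct LinearImpl {
    virtual ~LinearImpl() = default;
    virtual std::string name() const = 0;
};

struct DenseLinear : LinearImpl {
    std::string name() const override { return "dense"; }
};
struct GptqLinear : LinearImpl {
    std::string name() const override { return "gptq"; }
};
struct GptqQyLinear : LinearImpl {
    std::string name() const override { return "gptq_qy"; }
};

// One construction path: the quantization scheme, together with the device
// type, selects the concrete linear implementation. On a QY device, GPTQ
// weights are first converted to the QY layout.
std::unique_ptr<LinearImpl> make_linear(QuantScheme scheme, bool is_qy_device) {
    if (scheme == QuantScheme::GPTQ && is_qy_device) {
        scheme = QuantScheme::GPTQ_QY; // Gate the GPTQ->GPTQ_QY conversion.
    }
    switch (scheme) {
    case QuantScheme::None:
        return std::make_unique<DenseLinear>();
    case QuantScheme::GPTQ:
        return std::make_unique<GptqLinear>();
    case QuantScheme::GPTQ_QY:
        return std::make_unique<GptqQyLinear>();
    }
    return nullptr;
}
```

The point of routing all schemes through one factory is that callers (attention, MLP, MoE) no longer need a manual `quant_scheme` switch of their own.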

Motivation

Linear/quantization should be moved to InfiniLM from InfiniCore.

Closes #

Type of Change

  • feat — new feature / new model
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Benchmark / Performance Impact

Notes for Reviewers


Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from main — the branch is rebased cleanly on top of the current main.
  • No fixup! / squash! / wip commits remain.
  • Existing PRs, branches, and commits that followed the legacy issue format are exempt.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

  • Code follows the Google C++ Style Guide strictly.
  • Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
  • Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
  • No raw new/delete; RAII / smart pointers / existing allocators are used.
  • Changed files are formatted by scripts/format.py.
  • No changes/reference to csrc/models/llama_legacy/.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Changed files are formatted by scripts/format.py.
  • No changes/reference to python/infinilm/auto_config.py.

Testing

  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • Passed single request test (examples/test_infer.py), or specify the reason for skipping.
  • Passed offline performance test (examples/bench.py), or specify the reason for skipping.
  • Passed sanity test (test/bench/test_benchmark.py), or specify the reason for skipping.
  • Passed service test (python/infinilm/server/inference_server.py + scripts/test_perf.py), or specify the reason for skipping.

Build, CI, and Tooling

  • The project builds cleanly from a fresh directory on at least one affected platform.

Documentation

  • README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
  • Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

refactor: unify linear/quantization architecture and remove deprecated interfaces

- Move linear module from InfiniCore to InfiniLM with quantization-based dispatch
- Add GPTQ->GPTQ_QY weight conversion gated by QY device type
- Implement fused linear weight splitting and re-registration
- Fix TP split dimensions for all quantization schemes
- Add alpha scaling parameter and logical dim size delegation
- Move set_zeros/set_minus_one to utils.hpp as shared utilities
@qinyiqun qinyiqun requested a review from a team May 12, 2026 03:27
@qinyiqun
Contributor Author

Question for discussion: should the module be declared and initialized in InfiniLM via macros, or should it exist as a smart pointer?

Comment thread csrc/cache/kv_cache.cpp
Comment thread csrc/engine/rank_worker.cpp
Comment thread csrc/layers/attention/attention.cpp
Comment thread csrc/layers/attention/attention.cpp
Comment thread csrc/layers/mlp/mlp.cpp
Comment thread csrc/layers/mlp/moe_mlp.cpp Outdated
Comment thread csrc/models/qwen3/qwen3_attention.cpp
Comment thread csrc/models/model_factory.hpp
Comment thread csrc/layers/linear/linear.hpp
Comment thread csrc/engine/compiler/paged_compiler.cpp
qinyiqun added 4 commits May 14, 2026 02:56
…ma_legacy

Extract the legacy QKVParallelLinear and GateUpParallelLinear into
legacy_fused_linear.hpp/cpp under llama_legacy, keeping them based on
infinicore::nn::ColumnParallelLinear and infinicore::quantization.
Update llama_legacy's attention and MLP to use these legacy classes
with INFINILM_LEGACY_* macros, preserving the original code structure.

Remove the manual quant_scheme switch in MoeMLP and pass quantization_method
to the linear layer constructors, consistent with the mlp.cpp pattern. Also move
the type aliases and the fused_linear include to the top of linear.hpp.

Move the fused_linear.hpp include and type aliases back after the class
definitions, since fused_linear.hpp depends on ColumnParallelLinear being defined.
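
The fused-linear splitting and re-registration mentioned in the summary can be sketched as below. This is an illustration only, assuming a dense row-major weight; `split_fused_rows` and its parameters are hypothetical names, and the real code additionally splits quantized tensors (e.g. qweight/scales/zeros) along scheme-specific dimensions, which is what the TP split-dimension fix addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Split a fused weight of shape [q_out + k_out + v_out, in] (row-major) into
// independent per-projection weights along the output dimension, as when a
// fused QKV or gate-up linear is re-registered as separate linears.
std::vector<std::vector<float>> split_fused_rows(
        const std::vector<float> &fused, std::size_t in_dim,
        const std::vector<std::size_t> &out_dims) {
    std::vector<std::vector<float>> parts;
    std::size_t row = 0;
    for (std::size_t out : out_dims) {
        // Copy rows [row, row + out) of the fused matrix into its own buffer.
        parts.emplace_back(fused.begin() + row * in_dim,
                           fused.begin() + (row + out) * in_dim);
        row += out;
    }
    assert(row * in_dim == fused.size()); // Every fused row must be consumed.
    return parts;
}
```

For quantized schemes the split offsets must follow the packed layout (e.g. several weights per packed integer), which is why the split dimension differs per quantization scheme.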
@qinyiqun qinyiqun requested a review from pengcheng888 May 14, 2026 06:43
Comment thread csrc/models/minicpm_sala/minicpm_sala_attention.cpp
Collaborator

@pengcheng888 pengcheng888 left a comment


(1) Please attach passing-test screenshots for the two typical models, fm9g and qwen3.
(2) Please reconfirm whether the INFINICORE_NN_MODULE_INIT initialization style should be removed.
(3) The linear module and the quantization code are substantial; they may all be migrations of existing code, so I did not review them in detail.
(4) A few typical cases should be run to check for performance loss; I am not sure whether this test is required.
(5) New PRs have been merged in the last few days; the final version needs to be rebased onto main.

Collaborator

@pengcheng888 pengcheng888 left a comment


Approval from other reviewers is required before this can be merged.

@qinyiqun
Contributor Author

(1) Please attach passing-test screenshots for the two typical models, fm9g and qwen3. (2) Please reconfirm whether the INFINICORE_NN_MODULE_INIT initialization style should be removed. (3) The linear module and the quantization code are substantial; they may all be migrations of existing code, so I did not review them in detail. (4) A few typical cases should be run to check for performance loss; I am not sure whether this test is required. (5) New PRs have been merged in the last few days; the final version needs to be rebased onto main.

[Two test screenshots attached.]

Collaborator

@wooway777 wooway777 left a comment


The initialization changes to non-linear modules such as rms norm, attention, and mlp are unnecessary and should not appear in this PR.

