webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32#27834

Open
qjia7 wants to merge 1 commit into main from webgpu-matmulnbits-step1-correct
Conversation


@qjia7 qjia7 commented Mar 25, 2026

Use tile_size_k_vec=32 (instead of 16) for MatMulNBits default kernel, doubling the number of threads working on K-dimension reduction per output row. This improves token generation throughput by ~3% on NVIDIA GPUs by better utilizing memory bandwidth.

Intel devices retain tile_size_k_vec=16 due to different subgroup and cache characteristics.

Changes:

  • matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to MatMulNBitsProgram constructor.
  • matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass to program constructor.
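The vendor-based selection described above can be sketched as follows. This is a minimal, hypothetical illustration of the logic, not the actual ONNX Runtime code: the function and constant names (`SelectTileSizeKVec`, `kDefaultTileSizeKVec`) are assumptions, and the real implementation queries the WebGPU adapter rather than taking a vendor string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Default matches the MatMulNBitsProgram constructor default described above.
constexpr uint32_t kDefaultTileSizeKVec = 16;

// Hypothetical helper: pick the K-reduction tile size from the adapter vendor.
// Intel keeps the smaller tile due to different subgroup and cache
// characteristics; other vendors double the number of threads reducing the
// K dimension per output row.
uint32_t SelectTileSizeKVec(const std::string& vendor) {
  return (vendor == "intel") ? kDefaultTileSizeKVec : 32u;
}
```

A caller constructing the default MatMulNBits program would then pass this value through to the constructor so the WGSL template can size its workgroup tiling accordingly.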

@qjia7 qjia7 marked this pull request as ready for review March 25, 2026 07:51
@qjia7 qjia7 requested a review from Copilot March 25, 2026 07:52
Copilot AI left a comment

Pull request overview

Updates the WebGPU MatMulNBits default shader variant to increase K-reduction parallelism on non-Intel GPUs by making tile_size_k_vec configurable and selecting a larger default for better throughput.

Changes:

  • Add a tile_size_k_vec parameter (default 16) to MatMulNBitsProgram so K-parallelism can be tuned per device.
  • Use tile_size_k_vec = 32 for non-Intel adapters and keep 16 for Intel adapters when constructing the default MatMulNBits program.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.h | Extends MatMulNBitsProgram to store a configurable tile_size_k_vec_ used during shader generation. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc | Plumbs tile_size_k_vec_ into the WGSL template parameters and selects 16 vs. 32 based on the adapter vendor. |

@qjia7 qjia7 force-pushed the webgpu-matmulnbits-step1-correct branch from 183b6e3 to 304383d Compare March 25, 2026 08:00
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Mar 25, 2026