webgpu: Increase MatMulNBits K-parallelism with tile_size_k_vec=32#27834

Open
qjia7 wants to merge 1 commit into main from webgpu-matmulnbits-step1-correct
Conversation


@qjia7 qjia7 commented Mar 25, 2026

Use tile_size_k_vec=32 (instead of 16) for MatMulNBits default kernel, doubling the number of threads working on K-dimension reduction per output row. This improves token generation throughput by ~3% on NVIDIA GPUs by better utilizing memory bandwidth.

Intel devices retain tile_size_k_vec=16 due to different subgroup and cache characteristics.

Changes:

  • matmul_nbits.h: Add tile_size_k_vec parameter (default 16) to MatMulNBitsProgram constructor.
  • matmul_nbits.cc: Select tile_size_k_vec=32 for non-Intel vendors, pass to program constructor.
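The vendor-based selection described above can be sketched as follows. This is a minimal, hypothetical illustration of the logic, not the actual ONNX Runtime code: the function and constant names (`SelectTileSizeKVec`, `kDefaultTileSizeKVec`) are assumptions, and the real implementation queries the WebGPU adapter rather than taking a vendor string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Default matches the MatMulNBitsProgram constructor default described above.
constexpr uint32_t kDefaultTileSizeKVec = 16;

// Hypothetical helper: pick the K-reduction tile size from the adapter vendor.
// Intel keeps the smaller tile due to different subgroup and cache
// characteristics; other vendors double the number of threads reducing the
// K dimension per output row.
uint32_t SelectTileSizeKVec(const std::string& vendor) {
  return (vendor == "intel") ? kDefaultTileSizeKVec : 32u;
}
```

A caller constructing the default MatMulNBits program would then pass this value through to the constructor so the WGSL template can size its workgroup tiling accordingly.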

@qjia7 qjia7 marked this pull request as ready for review March 25, 2026 07:51
@qjia7 qjia7 requested a review from Copilot March 25, 2026 07:52
Copilot AI left a comment

Pull request overview

Updates the WebGPU MatMulNBits default shader variant to increase K-reduction parallelism on non-Intel GPUs by making tile_size_k_vec configurable and selecting a larger default for better throughput.

Changes:

  • Add a tile_size_k_vec parameter (default 16) to MatMulNBitsProgram so K-parallelism can be tuned per device.
  • Use tile_size_k_vec = 32 for non-Intel adapters and keep 16 for Intel adapters when constructing the default MatMulNBits program.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.h | Extends MatMulNBitsProgram to store a configurable tile_size_k_vec_ used during shader generation. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc | Plumbs tile_size_k_vec_ into the WGSL template parameters and selects 16 vs. 32 based on the adapter vendor. |

@qjia7 qjia7 force-pushed the webgpu-matmulnbits-step1-correct branch from 183b6e3 to 304383d Compare March 25, 2026 08:00
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Mar 25, 2026