Research
The science behind the benchmarks
Fusing the entire autoregressive decoding loop into a single GPU dispatch achieves a 66–458× speedup over unfused dispatch. The parallel kernel is 7.5–161× faster than PyTorch MPS at every tested size.
Gunaydin, A.B. (2026)
Single-Kernel Fusion for Autoregressive Transformer Decoding via WebGPU Compute Shaders
doi:10.5281/zenodo.19344277
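
To give a rough intuition for why a single dispatch can help so much, here is a minimal toy cost model. It is only a sketch and not the paper's method: the `DecodeCost` fields, the function names, and every number in `example` are hypothetical and are not measurements from the benchmarks. It assumes each dispatch pays a fixed overhead and that fusion leaves the compute time itself unchanged.

```typescript
// Toy cost model. All names and numbers here are hypothetical illustrations,
// not values taken from the paper or the benchmark results.
interface DecodeCost {
  steps: number;              // tokens generated by the decoding loop
  kernelsPerStep: number;     // separate dispatches per token when unfused
  dispatchOverheadUs: number; // assumed fixed cost per dispatch, in microseconds
  computePerKernelUs: number; // assumed GPU work per kernel, in microseconds
}

function unfusedTimeUs(c: DecodeCost): number {
  // Every kernel of every step pays the dispatch overhead.
  return c.steps * c.kernelsPerStep * (c.dispatchOverheadUs + c.computePerKernelUs);
}

function fusedTimeUs(c: DecodeCost): number {
  // One dispatch covers the whole loop; compute is assumed unchanged.
  return c.dispatchOverheadUs + c.steps * c.kernelsPerStep * c.computePerKernelUs;
}

function speedup(c: DecodeCost): number {
  return unfusedTimeUs(c) / fusedTimeUs(c);
}

const example: DecodeCost = {
  steps: 100,
  kernelsPerStep: 10,
  dispatchOverheadUs: 100,
  computePerKernelUs: 1,
};

console.log(speedup(example).toFixed(1)); // prints "91.8"
```

In this model the speedup grows as per-dispatch overhead dominates per-kernel compute, which is the regime small models tend to fall into. Real results also depend on effects the model ignores, such as occupancy and memory traffic.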