WebGPU Bench

Transformer Fusion Benchmark

Compares fused and unfused autoregressive decoding, in single-threaded and parallel (64-thread) variants, across 9 configurations plus a sequence-length scaling sweep.
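For illustration, the 9-configuration grid could be enumerated as a cross product of model widths and sequence lengths. The actual grid is defined by the benchmark; the parameter names and values below are assumptions, not the real settings:

```typescript
// Hypothetical benchmark grid: 3 model widths x 3 sequence lengths = 9 configs.
// The real benchmark's parameters may differ; this only illustrates the shape.
interface BenchConfig {
  dModel: number; // hidden width (assumed values)
  seqLen: number; // number of decode steps (assumed values)
}

const dModels = [64, 128, 256];
const seqLens = [32, 64, 128];

const configs: BenchConfig[] = dModels.flatMap((dModel) =>
  seqLens.map((seqLen) => ({ dModel, seqLen }))
);

console.log(configs.length); // 9 configurations
```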

Configurations

By clicking Run, you agree to anonymous GPU statistics and benchmark results being saved. See the privacy policy.

Research

The science behind the benchmarks

Fusing the entire autoregressive decoding loop into a single GPU dispatch achieves a 66–458× speedup over unfused, per-operation dispatch. The parallel kernel outperforms PyTorch MPS by 7.5–161× at all tested sizes.
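The intuition behind these speedups is that unfused decoding pays a fixed dispatch-launch cost for every operation of every token, while a fused kernel pays it once and keeps the whole loop on the GPU. A toy cost model sketches this; all numbers below are invented for illustration and are not the paper's measurements:

```typescript
// Toy latency model for T decode steps of K ops each (invented numbers).
const launchOverheadUs = 50; // fixed cost per GPU dispatch (assumed)
const computeUs = 2;         // useful work per op (assumed)
const steps = 128;           // tokens to decode
const opsPerStep = 12;       // matmuls, softmax, etc. per step

// Unfused: one dispatch per op per step, so overhead scales with steps * ops.
const unfused = steps * opsPerStep * (launchOverheadUs + computeUs);

// Fused: a single dispatch runs the entire decoding loop on-GPU.
const fused = launchOverheadUs + steps * opsPerStep * computeUs;

console.log((unfused / fused).toFixed(1)); // → "25.6" under these assumed costs
```

With these made-up constants the model is overhead-dominated, so the fused variant wins by roughly the ratio of launch cost to compute cost; the real gap measured in the paper also depends on memory traffic and kernel occupancy, which this sketch ignores.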

Gunaydin, A.B. (2026)

Single-Kernel Fusion for Autoregressive Transformer Decoding via WebGPU Compute Shaders

doi:10.5281/zenodo.19344277