The science behind the benchmarks

Fusing the entire autoregressive decoding loop into a single GPU dispatch achieves a 66–458× speedup over unfused dispatch. The parallel kernel beats PyTorch MPS by 7.5–161× at all tested sizes.

Gunaydin, A.B. (2026). Single-Kernel Fusion for Autoregressive Transformer Decoding via WebGPU Compute Shaders. doi:10.5281/zenodo.19344276

Companion projects

The research line and the end-to-end projects that build on it.

Theory

Research line

kernelfusion.dev

The research line. Two published preprints, one npm SDK, 92 unique devices across 7 GPU vendors. The theory that all the applied projects build on.

Electron track-structure simulation ported from the CNRS/IN2P3 Geant4-DNA toolkit to WebGPU. One thread per primary, full 10 keV history in a single for-loop. Radiolysis chemistry and DNA damage scoring live in a browser tab.

See the simulation →

LLM inference

zerotvm.com

Phi-3-mini (3.8B) running end-to-end via 10 kernel roles across 27 WGSL files, replacing the 85 TVM-autotuned shaders WebLLM needs. ~40 tok/s on M2 Pro, 22% slower than WebLLM.

Run it live →

Visualization

neuropulse.live

A real forward pass of Phi-3-mini visualised tensor-by-tensor. 3.8 billion parameters, your GPU, your browser — every glow is a live activation read back from WebGPU. Zero server, zero API key.

Watch it think →

Quantum

webgpu-q.vercel.app

Statevector + MPS quantum simulator running on commodity hardware via WebGPU compute. Six-level research ladder from bandwidth-bound statevector through MPS, kernel fusion, WebRTC swarm, IBM hardware cross-verify, to chemistry/VQE. No CUDA, no install.

Open the simulator →

Personal

barisgunaydin.com

Personal site and project hub.

About →

Transformer Fusion Benchmark

Configurations