The science behind the benchmarks
Fusing the entire autoregressive decoding loop into a single GPU dispatch achieves a 66–458× speedup over unfused per-kernel dispatch. The parallel kernel beats PyTorch MPS by 7.5–161× at all tested sizes.
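A toy latency model (a sketch with made-up numbers, not figures from the paper) shows where speedups of this magnitude come from: for small decode-time kernels, the fixed CPU-to-GPU submission cost per dispatch dominates the actual compute, so collapsing thousands of dispatches into one removes almost all of it.

```typescript
// Hypothetical latency model: total decode time as compute plus a fixed
// per-dispatch submission overhead. All parameter values are illustrative.
function decodeLatencyMs(
  tokens: number,
  kernelsPerToken: number,
  computePerKernelMs: number,
  dispatchOverheadMs: number,
  fused: boolean,
): number {
  const compute = tokens * kernelsPerToken * computePerKernelMs;
  // Unfused: every kernel is its own dispatch and pays the fixed overhead.
  // Fused: the whole decode loop is a single dispatch.
  const dispatches = fused ? 1 : tokens * kernelsPerToken;
  return compute + dispatches * dispatchOverheadMs;
}

const unfusedMs = decodeLatencyMs(128, 10, 0.02, 0.1, false);
const fusedMs = decodeLatencyMs(128, 10, 0.02, 0.1, true);
console.log((unfusedMs / fusedMs).toFixed(1));
```

With these placeholder numbers the model is overhead-dominated, so the ratio tracks the dispatch count; real speedups depend on kernel sizes and the browser's submission path.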
Gunaydin, A.B. (2026)
Single-Kernel Fusion for Autoregressive Transformer Decoding via WebGPU Compute Shaders
doi:10.5281/zenodo.19344276
The research line and the end-to-end projects that build on it.
kernelfusion.dev
The research line. Two published preprints, one npm SDK, 92 unique devices across 7 GPU vendors. The theory that all the applied projects build on.
webgpudna.com
Electron track-structure simulation ported from the CNRS/IN2P3 Geant4-DNA toolkit to WebGPU. One thread per primary, full 10 keV history in a single for-loop. Radiolysis chemistry and DNA damage scoring live in a browser tab.
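The "one thread per primary" pattern can be sketched as follows (hypothetical TypeScript stand-in, not the actual WGSL; the interaction physics here is a placeholder, not Geant4-DNA cross sections): each GPU thread owns one electron and walks its entire history in a single loop until the energy falls below a tracking cutoff.

```typescript
// Deterministic 32-bit LCG so the sketch is reproducible per "thread".
function lcg(seed: number): () => number {
  let s = seed >>> 0;
  return () => ((s = (s * 1664525 + 1013904223) >>> 0) / 2 ** 32);
}

// Track one 10 keV primary down to a 50 eV cutoff; returns the step count.
// In the GPU version this whole loop runs inside one compute-shader thread.
function trackPrimary(seed: number): number {
  const rand = lcg(seed);
  let energyEv = 10_000;
  let steps = 0;
  while (energyEv > 50) {
    // Placeholder interaction: lose 1-100 eV per collision.
    energyEv -= 1 + rand() * 99;
    steps++;
  }
  return steps;
}
```

Because every history is independent, mapping one primary to one thread parallelizes trivially; divergence between threads is the main GPU cost.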
See the simulation →
LLM inference
zerotvm.com
Phi-3-mini (3.8B) running end-to-end via 10 kernel roles across 27 WGSL files, replacing the 85 TVM-autotuned shaders WebLLM needs. ~40 tok/s on an M2 Pro, 22% behind WebLLM.
Run it live →
Visualization
neuropulse.live
A real forward pass of Phi-3-mini visualised tensor-by-tensor. 3.8 billion parameters, your GPU, your browser — every glow is a live activation read back from WebGPU. Zero server, zero API key.
Watch it think →
Quantum
webgpu-q.vercel.app
Statevector + MPS quantum simulator running on commodity hardware via WebGPU compute. A six-level research ladder: bandwidth-bound statevector, MPS, kernel fusion, WebRTC swarm, cross-verification against IBM hardware, and chemistry/VQE. No CUDA, no install.
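The statevector rung at the bottom of the ladder can be sketched on the CPU (an illustrative sketch, not the site's actual kernels; real amplitudes only for brevity, where a full simulator needs complex numbers): applying a single-qubit gate is a strided gather/scatter over the amplitude array, which is exactly the bandwidth-bound access pattern a WebGPU compute shader would run in parallel.

```typescript
// Apply a 2x2 single-qubit gate g to qubit q of an n-qubit statevector.
// Amplitudes paired by qubit q sit `stride` apart in the array.
function applyGate(state: Float64Array, q: number, g: number[][]): void {
  const stride = 1 << q;
  for (let i = 0; i < state.length; i += stride * 2) {
    for (let j = i; j < i + stride; j++) {
      const a = state[j];
      const b = state[j + stride];
      state[j] = g[0][0] * a + g[0][1] * b;
      state[j + stride] = g[1][0] * a + g[1][1] * b;
    }
  }
}

const s = Math.SQRT1_2;
const H = [
  [s, s],
  [s, -s],
]; // Hadamard gate

const state = new Float64Array([1, 0, 0, 0]); // 2 qubits, starts in |00>
applyGate(state, 0, H);
applyGate(state, 1, H); // uniform superposition: every amplitude is 0.5
```

Each amplitude pair is updated independently, so on the GPU one thread handles one pair; memory bandwidth, not arithmetic, is the bottleneck.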
Open the simulator →
Personal
barisgunaydin.com
Personal site and project hub.
About →