We proved that browsers can run GPU computation at near-native speed. No download. No account. No expensive hardware. Just open a page.
For decades, running computation on a graphics card meant navigating three barriers that kept most of the world locked out.
Vendor lock-in: CUDA only runs on NVIDIA GPUs. If you have a Mac, an AMD card, or an Intel integrated chip, you're locked out.
Setup complexity: Python environments, CUDA drivers, cuDNN versions, framework dependencies. One mismatch and nothing works.
Cost: Cloud GPU instances run $2–4/hour. A university lab without a GPU budget simply can't participate.
WebGPU is a new browser standard that gives web pages direct access to your graphics card. We showed it's fast enough for real scientific computation.
“For fitness functions with sequential dependencies — common in reinforcement learning, financial simulation, and control systems — custom compute shaders dramatically outperform framework-based GPU code.”
From the preprint, Section 4.2
This isn't about replacing data centers. It's about giving GPU access to the people who never had it.
A computer science student in Lagos, Bangalore, or rural anywhere can now run real GPU-accelerated experiments on their $400 laptop. Open a browser tab, not a grant application.
Instead of slides about parallel computing, show it running live in the classroom. Every student's laptop becomes a GPU workstation. No lab setup, no admin permissions.
“To reproduce our results, open this URL.” Not “install Python 3.10.12, CUDA 12.1, cuDNN 8.9.7, match the exact driver version.” Reproducibility should be one click.
A biotech team in Nairobi running molecular screening. A fintech in São Paulo backtesting strategies. Their scientists' laptops ARE the compute cluster. No AWS required.
The technical details are in the preprint. Here's the intuition.
The graphics chip in your laptop — whether it's Apple Silicon, AMD, Intel, or NVIDIA — is a parallel processor with thousands of cores. It spends most of its time idle.
WebGPU is a new web standard (shipping in Chrome since 2023) that lets web pages run computation directly on your GPU. Not just graphics — real number crunching.
Instead of sending thousands of small tasks to the GPU one by one (as PyTorch does), we fuse the entire computation into a single compute shader: one dispatch instead of 22,500. The browser adds only 48% overhead versus native, and the result is still faster than PyTorch.
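The dispatch arithmetic can be sketched with a toy cost model. The per-dispatch overhead and per-task work times below are illustrative assumptions for the sketch, not measurements from the paper:

```python
# Toy cost model: why one fused dispatch beats 22,500 small ones.
# All timing constants are illustrative assumptions, not measured values.

DISPATCH_OVERHEAD_US = 20.0   # assumed fixed cost per GPU dispatch (µs)
KERNEL_WORK_US = 0.5          # assumed useful compute per small task (µs)
N_TASKS = 22_500              # dispatch count from the unfused version

def unfused_time_us(n_tasks: int) -> float:
    """Every task pays the dispatch overhead separately."""
    return n_tasks * (DISPATCH_OVERHEAD_US + KERNEL_WORK_US)

def fused_time_us(n_tasks: int) -> float:
    """One dispatch pays the overhead once; the useful work is unchanged."""
    return DISPATCH_OVERHEAD_US + n_tasks * KERNEL_WORK_US

speedup = unfused_time_us(N_TASKS) / fused_time_us(N_TASKS)
print(f"fused is ~{speedup:.0f}x faster under these assumptions")
```

Under these made-up numbers the fused version comes out roughly 40× faster; the real speedups depend on the actual overhead and workload of each device, which is exactly what the benchmarks measure.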
30 independent runs per experiment. Statistical tests. Comparisons against 8 systems on 2 hardware platforms. Not hype — evidence.
This isn't theoretical. Here's what's different tomorrow.
“Reproduce our results” → Install Python 3.10.12 → Install CUDA 12.1 → Install cuDNN 8.9.7 → Match driver version → Debug for 3 hours → Maybe it works
Click this link. Results run in your browser. Verified in 30 seconds.
“I need GPU compute” → AWS account → GPU instance ($2–4/hr) → DevOps setup → $5,000/month bill → Need funding before building
Users' own GPUs do the work. Server cost: $0. Ship on day one.
“Today we'll learn parallel computing” → Show slides → Students nod → Nobody runs anything because the school has no GPU lab
“Open this URL on your laptop.” 30 students run GPU computation simultaneously. They see it, touch it, modify it.
“Run screening on patient data” → 6-month legal review to upload to cloud → HIPAA/GDPR audit → $200K contract → Finally start work
Data never leaves the laptop. GPU compute runs in the browser. Compliance by architecture, not by contract.
The paper measured 159–720× on two machines. Then 592 people ran it on their own devices, and the numbers got bigger.
In the paper, we tested four GPU backends on two hardware platforms. On a Tesla T4, speedups over the PyTorch baseline: hand-fused CUDA 720×, JAX lax.scan 172×, Triton 27×. On an M2 Pro: WebGPU 159× over PyTorch MPS. The pattern was consistent: fusion eliminates dispatch overhead.
Since publishing, 592 devices have confirmed this — and the real-world numbers are larger. Apple Silicon averages 2,865×. Qualcomm Adreno (the chip in most Android phones) averages 623×. NVIDIA desktops average 79×.
The paper measured on 2 machines: an Apple M2 Pro laptop and a Tesla T4 server. Both are fast GPUs with efficient command dispatching. That's why the speedups were “only” 159–720×.
Real-world devices include phones, tablets, Chromebooks, and laptops with integrated GPUs — hardware that was never designed for compute workloads. These GPUs have much worse dispatch overhead.
Kernel fusion eliminates dispatch overhead. So the worse a device is at dispatching, the more it benefits. NVIDIA desktop GPUs (good dispatching) see ~79×. Apple Silicon laptops see ~2,865×. Android phones see ~623×. The devices that need fusion most benefit from it most.
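That inverse relationship falls straight out of a toy cost model: the speedup from fusing N dispatches into one grows with the per-dispatch overhead. The overhead values and device labels below are hypothetical illustrations, not the paper's measurements:

```python
# Toy model: fusion speedup grows with per-dispatch overhead.
# Overheads and device labels are hypothetical, not measured values.

N = 22_500        # dispatches fused into a single one
WORK_US = 0.5     # assumed useful compute per dispatch (µs)

def fusion_speedup(overhead_us: float) -> float:
    unfused = N * (overhead_us + WORK_US)   # overhead paid N times
    fused = overhead_us + N * WORK_US       # overhead paid once
    return unfused / fused

# Hypothetical device classes, ordered by dispatch overhead:
for name, overhead_us in [("efficient desktop dispatcher", 2.0),
                          ("integrated laptop GPU", 20.0),
                          ("phone GPU", 200.0)]:
    print(f"{name}: ~{fusion_speedup(overhead_us):.0f}x from fusion")
```

The model is crude, but the monotonic trend it predicts — higher dispatch overhead, bigger win from fusion — matches the ordering seen across the real device classes.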
Run real GPU benchmarks on your hardware, right now, in your browser.
Every result from every device is public. No cherry-picking. Verify any claim yourself.