GPU Acceleration Deep-Dive
TLD2 leverages WebGPU for hardware-accelerated neural network inference, delivering near-instant text-to-speech synthesis. This guide provides a comprehensive technical breakdown of GPU acceleration, performance benchmarks, and troubleshooting for developers and hardware enthusiasts.
WebGPU vs. WASM: Architecture Overview
The Two Execution Paths
TLD2's TTS engine (StreamingKokoroJS / ONNX Runtime Web) supports two execution backends:
WebGPU Mode (Recommended)
Execution Environment: GPU shaders via WebGPU API
How it works:
- ONNX model operators compiled to GPU compute shaders
- Tensor operations parallelized across GPU cores
- Memory managed in GPU VRAM for fast matrix multiplications
- Typical speedup: 2-10x faster than WASM
WASM Mode (Fallback)
Execution Environment: CPU via WebAssembly
How it works:
- ONNX operators executed in compiled WebAssembly
- SIMD instructions for vectorized operations (if CPU supports)
- Multi-threading via Web Workers (limited parallelism)
- Typical performance: Functional but 2-10x slower than WebGPU
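The SIMD availability mentioned above can be feature-tested directly. A common probe (the byte array below is the tiny test module used by libraries such as wasm-feature-detect; it contains 128-bit vector opcodes) looks like this:

```javascript
// Feature-test for WebAssembly SIMD: WebAssembly.validate() returns true only
// if the engine accepts this minimal module, which uses v128 instructions.
const simdTestModule = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0,                   // "\0asm" magic + version 1
  1, 5, 1, 96, 0, 1, 123,                        // type section: () -> v128
  3, 2, 1, 0,                                    // function section: one function
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,  // body: i32.const 0; splat; vector op; end
]);
const hasSimd = WebAssembly.validate(simdTestModule);
console.log(hasSimd ? "WASM SIMD available" : "WASM SIMD not supported");
```

If this returns `false`, ONNX Runtime Web still works but falls back to scalar kernels, which widens the gap to WebGPU further.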
Selection Logic
TLD2 automatically detects GPU availability at runtime and selects the backend accordingly.
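A sketch of that runtime check (the function name and structure are illustrative, not TLD2's actual code; `navigator.gpu` and `requestAdapter()` are the standard WebGPU API):

```javascript
// Prefer WebGPU when a GPU adapter is obtainable, otherwise fall back to WASM.
async function selectBackend(gpu = globalThis.navigator?.gpu) {
  if (!gpu) return "wasm";                // browser does not expose WebGPU at all
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm";   // adapter can be null (blocked/broken GPU)
  } catch {
    return "wasm";                        // driver or permission failure
  }
}
```

In any environment without a usable GPU adapter this resolves to `"wasm"`, matching the fallback behavior described above.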
Performance Benchmarks
Real-World TTS Synthesis Times
Measured on the Kokoro 82M quantized model (q8f16, 86MB):
| Hardware Configuration | Backend | First Chunk Latency | Tokens/Second | Full Article (500 words) |
|---|---|---|---|---|
| NVIDIA RTX 3080 (Desktop) | WebGPU | 150-250ms | ~180 | 2-3 seconds |
| Apple M1 Pro (MacBook) | WebGPU | 200-350ms | ~140 | 3-4 seconds |
| Intel Iris Xe (Integrated) | WebGPU | 400-600ms | ~80 | 5-7 seconds |
| AMD Radeon RX 6700 XT | WebGPU | 180-300ms | ~150 | 3-4 seconds |
| Intel i7-11700 (CPU only) | WASM | 1.5-2.5s | ~30 | 15-20 seconds |
| Apple M1 Pro (CPU only) | WASM | 800ms-1.5s | ~50 | 10-12 seconds |
| AMD Ryzen 5 5600X (CPU only) | WASM | 1-2s | ~35 | 12-16 seconds |
Key Takeaway: WebGPU delivers 3-8x faster first-chunk latency and 2-5x higher throughput compared to WASM.
Hardware Requirements & Compatibility
WebGPU Browser Support
- Chrome 113+ (Stable, full support)
- Edge 113+ (Chromium-based, full support)
- Firefox 121+ (Experimental, enable via about:config)
- Safari 18+ (Partial support, macOS/iOS)
GPU Compatibility
NVIDIA GPUs:
- GTX 900 series and newer (Maxwell architecture+)
- RTX 20/30/40 series (optimal performance)
- Quadro/Tesla workstation cards
AMD GPUs:
- RX 400 series and newer (Polaris+)
- RX 5000/6000/7000 series (RDNA architecture)
- Integrated Ryzen APUs (Vega, RDNA2)
Intel GPUs:
- Iris Xe (11th gen and newer)
- Arc A-series (dedicated GPUs)
- UHD Graphics 600 series+ (limited performance)
Apple Silicon:
- M1, M1 Pro, M1 Max, M1 Ultra
- M2, M2 Pro, M2 Max
- M3 series (best performance)
Minimum System Requirements
For WebGPU Mode:
- OS: Windows 10+, macOS 11+, Linux (kernel 5.4+)
- RAM: 4GB minimum, 8GB recommended
- GPU: Any GPU from 2015 or later with up-to-date drivers
- VRAM: 2GB minimum for model storage
For WASM Mode (CPU-only):
- CPU: Modern multi-core (2019+)
- RAM: 4GB minimum
- SIMD: Automatically used if CPU supports (most modern CPUs do)
How to Check GPU Availability
Method 1: Browser Console
Open DevTools (F12) and paste this into the Console:
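A minimal probe using the standard `navigator.gpu` API (wrapped so it also runs outside a browser without throwing):

```javascript
// Paste into the DevTools console; `gpuStatus` resolves to a readable result.
const gpuStatus = (async () => {
  const gpu = globalThis.navigator?.gpu;   // standard WebGPU entry point
  if (!gpu) return "WebGPU API not exposed in this environment";
  const adapter = await gpu.requestAdapter();
  if (!adapter) return "No usable GPU adapter (driver or blocklist issue)";
  return "WebGPU available";
})();
gpuStatus.then(console.log);
```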
If a GPU is available, the adapter details are logged to the console; a missing or null adapter means WebGPU cannot use your hardware.
Method 2: Chrome Flags Check
- Navigate to chrome://gpu in your browser
- Search for "WebGPU" in the page
- Look for "WebGPU: Hardware accelerated"
- Check the "Graphics Feature Status" table
If WebGPU shows "Disabled" or "Software only", troubleshoot using the next section.
Troubleshooting GPU Detection
Issue: "WebGPU not available" despite having a GPU
Common Causes & Fixes:
1. Outdated GPU Drivers
- NVIDIA: Download latest drivers from nvidia.com/drivers
- AMD: Update via AMD Software Adrenalin
- Intel: Use Intel Driver & Support Assistant
2. WebGPU Disabled in Chrome Flags
- Go to chrome://flags
- Search for "WebGPU"
- Set "Unsafe WebGPU" to Enabled
- Restart Chrome
3. Hardware Acceleration Disabled
- Go to chrome://settings
- Search for "hardware acceleration"
- Enable "Use hardware acceleration when available"
- Restart Chrome
4. GPU Blacklisted by Chrome
- Check chrome://gpu for "Problems Detected"
- If your GPU is blacklisted, try chrome://flags/#ignore-gpu-blacklist → Enabled
- Warning: use at your own risk; this may cause instability
Issue: TLD2 Slow Despite GPU Available
Diagnostics:
- Check Task Manager (Windows) / Activity Monitor (Mac) during TTS synthesis
- GPU usage should spike to 20-60% during generation
- If CPU spikes instead, TLD2 may be falling back to WASM
Force WebGPU Logging:
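A sketch, assuming TLD2 exposes ONNX Runtime Web's global `ort` object (`ort.env.logLevel` and `ort.env.debug` are real onnxruntime-web settings; the fallback object is only so the snippet runs outside the extension):

```javascript
// Turn on verbose ONNX Runtime Web logging to see which backend was picked.
const ortEnv = globalThis.ort?.env ?? { logLevel: "warning", debug: false }; // mock when ort is absent
ortEnv.logLevel = "verbose"; // per-session / per-operator log detail
ortEnv.debug = true;         // extra backend diagnostics
```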
Check console for messages like "WebGPU backend initialized successfully."
Performance Optimization Tips
1. Use Quantized Models
TLD2 uses model_q8f16.onnx (86MB, mixed precision) by default, which provides an optimal size/quality trade-off:
- q8f16: 8-bit int + 16-bit float → 86MB, minimal quality loss
- fp32 (full precision): 300MB+ → Slower, no perceptible quality gain
2. Close Resource-Heavy Browser Tabs
WebGPU shares GPU resources with:
- Other browser tabs using WebGL/WebGPU
- Hardware-accelerated video playback
- GPU-intensive web apps (games, 3D renderers)
For maximum TTS performance, close unused tabs before synthesizing long articles.
3. Enable Power/Performance Mode
- Windows: Set power plan to "High Performance"
- Mac: Disable "Low Power Mode" (laptops)
- Linux: Use cpupower frequency-set -g performance
4. Monitor Thermal Throttling
If your laptop gets hot during TTS synthesis:
- GPU may thermal-throttle, reducing performance
- Use cooling pads or elevate laptop for better airflow
- Consider limiting summary length or using WASM mode to reduce heat
Technical Deep-Dive: ONNX Runtime Web
How TLD2 Uses ONNX
The Kokoro TTS model is exported from PyTorch to ONNX format for browser compatibility:
- Model Loading: ONNX file fetched from local cache (86MB)
- Session Creation: ONNX Runtime Web creates inference session
- Backend Selection: Runtime picks WebGPU or WASM based on availability
- Graph Optimization: Operator fusion, constant folding for speed
- Tensor Inference: Input text → phonemes → mel-spectrograms → audio waveform
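Steps 1–4 above can be sketched with ONNX Runtime Web's public API (`InferenceSession.create` and the `executionProviders` option are real onnxruntime-web APIs; the stand-in object only lets the sketch run where the library isn't loaded, and the model path is illustrative):

```javascript
// Load the model and let the runtime pick WebGPU or WASM, as described above.
const InferenceSession = globalThis.ort?.InferenceSession ?? {
  // Stand-in with the same call shape, used when onnxruntime-web is absent:
  create: async (_path, opts) => ({ providers: opts.executionProviders }),
};

async function createTtsSession() {
  return InferenceSession.create("model_q8f16.onnx", {
    executionProviders: ["webgpu", "wasm"], // preference order: GPU first, CPU fallback
    graphOptimizationLevel: "all",          // operator fusion, constant folding
  });
}
```

The provider list encodes the selection logic: the runtime walks it in order and uses the first backend it can initialize.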
WebGPU Shader Compilation
ONNX operators (matrix multiplication, convolution, etc.) are compiled to GPU compute shaders.
This parallelization across hundreds/thousands of GPU cores is why WebGPU is so much faster than sequential CPU execution.
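The scale of that parallelism is easy to quantify. For a naive matmul shader that computes one output element per invocation, the dispatch size works out as follows (the 16×16 workgroup dimensions are a hypothetical but common choice):

```javascript
// Number of workgroups needed to cover an M×N output with wx×wy workgroups.
function workgroupCount(M, N, wx = 16, wy = 16) {
  return [Math.ceil(M / wx), Math.ceil(N / wy)];
}

// A 512×512 matrix multiply with 16×16 workgroups dispatches 32 × 32 = 1024
// workgroups, each running 256 invocations concurrently on the GPU.
const [gx, gy] = workgroupCount(512, 512);
```

A CPU, by contrast, processes those same output elements a handful at a time, even with SIMD.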
Future Optimizations
Upcoming Features
- Model Caching: Persistent GPU memory for instant re-use
- Batch Processing: Synthesize multiple summaries in parallel
- Adaptive Quality: Auto-downgrade to faster model on slower GPUs
- WebNN Support: Alternative backend for older hardware
Want Even More Performance?
Check out the TTS Implementation Documentation for low-level details on optimizing chunking strategies, parallel synthesis pipelines, and memory management.