Advanced Technical Guide

GPU Acceleration Deep-Dive

Last updated: January 2025 • 8 min read • For developers & power users

TLD2 leverages WebGPU for hardware-accelerated neural network inference, delivering near-instant text-to-speech synthesis. This guide provides a comprehensive technical breakdown of GPU acceleration, performance benchmarks, and troubleshooting for developers and hardware enthusiasts.

WebGPU vs. WASM: Architecture Overview

The Two Execution Paths

TLD2's TTS engine (StreamingKokoroJS / ONNX Runtime Web) supports two execution backends:

WebGPU Mode (Recommended)

Execution Environment: GPU shaders via WebGPU API

How it works:

  • ONNX model operators compiled to GPU compute shaders
  • Tensor operations parallelized across GPU cores
  • Memory managed in GPU VRAM for fast matrix multiplications
  • Typical speedup: 2-10x faster than WASM

WASM Mode (Fallback)

Execution Environment: CPU via WebAssembly

How it works:

  • ONNX operators executed in compiled WebAssembly
  • SIMD instructions for vectorized operations (when the CPU supports them)
  • Multi-threading via Web Workers (limited parallelism)
  • Typical performance: Functional but 2-10x slower than WebGPU

Selection Logic

TLD2 automatically detects GPU availability at runtime:

// Simplified detection logic
if (navigator.gpu) {
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter) {
    // ✅ WebGPU available - use GPU acceleration
    device = 'webgpu';
  } else {
    // ⚠️ WebGPU API exists but no adapter - fall back to WASM
    device = 'wasm';
  }
} else {
  // ❌ WebGPU not supported - use WASM
  device = 'wasm';
}
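In ONNX Runtime Web, the detected device is typically expressed as an ordered list of execution providers passed to `ort.InferenceSession.create`. A minimal sketch of that choice, with the helper name ours (not part of TLD2):

```javascript
// Sketch: turn the GPU-adapter check into an executionProviders list for
// ONNX Runtime Web. The runtime tries providers in order, so 'wasm' stays
// last as the guaranteed CPU fallback.
function chooseExecutionProviders(hasGpuAdapter) {
  return hasGpuAdapter ? ['webgpu', 'wasm'] : ['wasm'];
}
```

In the browser, the result would be used as `ort.InferenceSession.create(modelUrl, { executionProviders: chooseExecutionProviders(adapter !== null) })`.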

Performance Benchmarks

Real-World TTS Synthesis Times

Measured on the Kokoro 82M quantized model (q8f16, 86MB):

Hardware Configuration         | Backend | First Chunk Latency | Tokens/Second | Full Article (500 words)
-------------------------------|---------|---------------------|---------------|-------------------------
NVIDIA RTX 3080 (Desktop)      | WebGPU  | 150-250 ms          | ~180          | 2-3 seconds
Apple M1 Pro (MacBook)         | WebGPU  | 200-350 ms          | ~140          | 3-4 seconds
Intel Iris Xe (Integrated)     | WebGPU  | 400-600 ms          | ~80           | 5-7 seconds
AMD Radeon RX 6700 XT          | WebGPU  | 180-300 ms          | ~150          | 3-4 seconds
Intel i7-11700 (CPU only)      | WASM    | 1.5-2.5 s           | ~30           | 15-20 seconds
Apple M1 Pro (CPU only)        | WASM    | 0.8-1.5 s           | ~50           | 10-12 seconds
AMD Ryzen 5 5600X (CPU only)   | WASM    | 1-2 s               | ~35           | 12-16 seconds

Key Takeaway: WebGPU delivers 3-8x faster first-chunk latency and 2-5x higher throughput compared to WASM.
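The throughput column makes quick estimates easy. A back-of-envelope helper, assuming roughly one token per word (an approximation that reproduces the full-article column in the table above):

```javascript
// Back-of-envelope synthesis-time estimate from the table's throughput
// numbers, under the rough assumption of one token per word.
function estimateSynthesisSeconds(words, tokensPerSecond) {
  return words / tokensPerSecond;
}
// 500 words at ~180 tok/s (RTX 3080) → ≈ 2.8 s, inside the 2-3 s range.
```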

Hardware Requirements & Compatibility

WebGPU Browser Support

  • Chrome 113+ (Stable, full support)
  • Edge 113+ (Chromium-based, full support)
  • Firefox 121+ (Experimental, enable via about:config)
  • Safari 18+ (Partial support, macOS/iOS)

GPU Compatibility

NVIDIA GPUs:

  • GTX 900 series and newer (Maxwell architecture+)
  • RTX 20/30/40 series (optimal performance)
  • Quadro/Tesla workstation cards

AMD GPUs:

  • RX 400 series and newer (Polaris+)
  • RX 5000/6000/7000 series (RDNA architecture)
  • Integrated Ryzen APUs (Vega, RDNA2)

Intel GPUs:

  • Iris Xe (11th gen and newer)
  • Arc A-series (dedicated GPUs)
  • UHD Graphics 600 series+ (limited performance)

Apple Silicon:

  • M1, M1 Pro, M1 Max, M1 Ultra
  • M2, M2 Pro, M2 Max
  • M3 series (best performance)

Minimum System Requirements

For WebGPU Mode:

  • OS: Windows 10+, macOS 11+, Linux (kernel 5.4+)
  • RAM: 4GB minimum, 8GB recommended
  • GPU: Any GPU from 2015 or later with up-to-date drivers
  • VRAM: 2GB minimum for model storage

For WASM Mode (CPU-only):

  • CPU: Modern multi-core (2019+)
  • RAM: 4GB minimum
  • SIMD: Used automatically when the CPU supports it (most modern CPUs do)
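SIMD support can be probed directly, using the same trick as the wasm-feature-detect library: ask the runtime to validate a tiny WebAssembly module containing a v128 instruction. The byte sequence below is the commonly circulated probe module; treat it as illustrative:

```javascript
// Feature-detect WASM SIMD: if the engine validates a module that uses a
// v128 instruction, SIMD is available. (Probe bytes follow the pattern used
// by wasm-feature-detect.)
function wasmSimdSupported() {
  return WebAssembly.validate(new Uint8Array([
    0, 97, 115, 109, 1, 0, 0, 0,                    // "\0asm" magic + version
    1, 5, 1, 96, 0, 1, 123,                         // type section: () -> v128
    3, 2, 1, 0,                                     // function section
    10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,  // body with i8x16.splat
  ]));
}
```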

How to Check GPU Availability

Method 1: Browser Console

Open DevTools (F12) and paste this into the Console:

// Check if WebGPU is available
if (navigator.gpu) {
  navigator.gpu.requestAdapter().then(adapter => {
    if (adapter) {
      console.log('✅ WebGPU Available');
      console.log('GPU:', adapter.info);
    } else {
      console.log('⚠️ WebGPU API present but no adapter found');
    }
  });
} else {
  console.log('❌ WebGPU not supported in this browser');
}

Expected output if GPU is available:

✅ WebGPU Available
GPU: {vendor: "nvidia", architecture: "ampere", device: "NVIDIA GeForce RTX 3080", ...}

Method 2: Chrome Flags Check

  1. Navigate to chrome://gpu in your browser
  2. Search for "WebGPU" in the page
  3. Look for "WebGPU: Hardware accelerated"
  4. Check "Graphics Feature Status" table

If WebGPU shows "Disabled" or "Software only", troubleshoot using the next section.

Troubleshooting GPU Detection

Issue: "WebGPU not available" despite having a GPU

Common Causes & Fixes:

1. Outdated GPU Drivers

  • NVIDIA: Download latest drivers from nvidia.com/drivers
  • AMD: Update via AMD Software Adrenalin
  • Intel: Use Intel Driver & Support Assistant

2. WebGPU Disabled in Chrome Flags

  1. Go to chrome://flags
  2. Search for "WebGPU"
  3. Set "Unsafe WebGPU" to Enabled
  4. Restart Chrome

3. Hardware Acceleration Disabled

  1. Go to chrome://settings
  2. Search for "hardware acceleration"
  3. Enable "Use hardware acceleration when available"
  4. Restart Chrome

4. GPU Blocklisted by Chrome

  • Check chrome://gpu for "Problems Detected"
  • If your GPU is on Chrome's software-rendering blocklist, try: chrome://flags/#ignore-gpu-blocklist → Enabled
  • Warning: Use at your own risk; overriding the blocklist may cause instability

Issue: TLD2 Slow Despite GPU Available

Diagnostics:

  • Check Task Manager (Windows) / Activity Monitor (Mac) during TTS synthesis
  • GPU usage should spike to 20-60% during generation
  • If CPU spikes instead, TLD2 may be falling back to WASM
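First-chunk latency alone is a useful tell: per the benchmark table, WebGPU stays well under a second while WASM rarely does. The helper and its 800 ms threshold below are illustrative heuristics, not part of TLD2:

```javascript
// Rough heuristic from the benchmark table: a first audio chunk arriving
// in well under a second suggests WebGPU; over a second suggests a WASM
// fallback. The 800 ms cut-off is illustrative.
function likelyBackend(firstChunkMs) {
  return firstChunkMs < 800 ? 'webgpu' : 'wasm';
}
```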

Force WebGPU Logging:

// In TLD2 source (if self-hosting / debugging)
env.backends.webgpu = { verbose: true };

Check console for messages like "WebGPU backend initialized successfully."

Performance Optimization Tips

1. Use Quantized Models

TLD2 uses model_q8f16.onnx (86MB, mixed precision) by default. This provides an optimal size/quality trade-off:

  • q8f16: 8-bit int + 16-bit float → 86MB, minimal quality loss
  • fp32 (full precision): 300MB+ → Slower, no perceptible quality gain

2. Close Resource-Heavy Browser Tabs

WebGPU shares GPU resources with:

  • Other browser tabs using WebGL/WebGPU
  • Hardware-accelerated video playback
  • GPU-intensive web apps (games, 3D renderers)

For maximum TTS performance, close unused tabs before synthesizing long articles.

3. Enable Power/Performance Mode

  • Windows: Set power plan to "High Performance"
  • Mac: Disable "Low Power Mode" (laptops)
  • Linux: Use cpupower frequency-set -g performance

4. Monitor Thermal Throttling

If your laptop gets hot during TTS synthesis:

  • GPU may thermal-throttle, reducing performance
  • Use cooling pads or elevate laptop for better airflow
  • Consider limiting summary length or using WASM mode to reduce heat

Technical Deep-Dive: ONNX Runtime Web

How TLD2 Uses ONNX

The Kokoro TTS model is exported from PyTorch to ONNX format for browser compatibility:

  1. Model Loading: ONNX file fetched from local cache (86MB)
  2. Session Creation: ONNX Runtime Web creates inference session
  3. Backend Selection: Runtime picks WebGPU or WASM based on availability
  4. Graph Optimization: Operator fusion, constant folding for speed
  5. Tensor Inference: Input text → phonemes → mel-spectrograms → audio waveform
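Step 4 is easy to illustrate in miniature. The toy folder below evaluates any node whose inputs are all constants once at load time, which is the essence of constant folding; the node shape and op names are invented for this example, while ONNX Runtime's real optimizer works on ONNX graphs:

```javascript
// Toy constant folding: collapse any subtree whose inputs are all constants
// into a single 'const' node, so it is computed once instead of per inference.
function foldConstants(node) {
  if (!node.inputs) return node; // leaves ('const', graph inputs) stay as-is
  const inputs = node.inputs.map(foldConstants);
  if (inputs.every(n => n.op === 'const')) {
    const value = node.op === 'add'
      ? inputs[0].value + inputs[1].value
      : inputs[0].value * inputs[1].value; // anything else treated as 'mul'
    return { op: 'const', value };
  }
  return { ...node, inputs };
}
```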

WebGPU Shader Compilation

ONNX operators (matrix multiply, convolution, etc.) are compiled to GPU compute shaders:

// Simplified example: Matrix multiplication shader
// (square matrices assumed; in a real shader, `size` would be supplied
// via a uniform buffer)
@group(0) @binding(0) var<storage, read> inputA: array<f32>;
@group(0) @binding(1) var<storage, read> inputB: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(16, 16)
fn matmul(@builtin(global_invocation_id) global_id: vec3<u32>) {
  // Parallel matrix multiplication across GPU cores
  let row = global_id.x;
  let col = global_id.y;
  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < size; k++) {
    sum += inputA[row * size + k] * inputB[k * size + col];
  }
  output[row * size + col] = sum;
}

This parallelization across hundreds/thousands of GPU cores is why WebGPU is so much faster than sequential CPU execution.
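For contrast, here is the same multiplication written as a sequential CPU loop: every output element is computed one after another, where the shader dispatches one GPU invocation per element:

```javascript
// Sequential CPU version of the shader's matrix multiply (square matrices,
// row-major, flat arrays). Three nested loops replace the GPU's parallel
// invocation grid.
function matmul(a, b, size) {
  const out = new Float32Array(size * size);
  for (let row = 0; row < size; row++) {
    for (let col = 0; col < size; col++) {
      let sum = 0;
      for (let k = 0; k < size; k++) {
        sum += a[row * size + k] * b[k * size + col];
      }
      out[row * size + col] = sum;
    }
  }
  return out;
}
```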

Future Optimizations

Upcoming Features

  • Model Caching: Persistent GPU memory for instant re-use
  • Batch Processing: Synthesize multiple summaries in parallel
  • Adaptive Quality: Auto-downgrade to faster model on slower GPUs
  • WebNN Support: Alternative backend for older hardware

Want Even More Performance?

Check out the TTS Implementation Documentation for low-level details on optimizing chunking strategies, parallel synthesis pipelines, and memory management.