Advanced Technical Guide

GPU Acceleration Deep-Dive

Last updated: January 2025 • 8 min read • For developers & power users

TLD2 leverages WebGPU for hardware-accelerated neural network inference, delivering near-instant text-to-speech synthesis. This guide provides a comprehensive technical breakdown of GPU acceleration, performance benchmarks, and troubleshooting for developers and hardware enthusiasts.

WebGPU vs. WASM: Architecture Overview

The Two Execution Paths

TLD2's TTS engine (StreamingKokoroJS / ONNX Runtime Web) supports two execution backends:

WebGPU Mode (Recommended)

Execution Environment: GPU shaders via WebGPU API

How it works:

  • ONNX model operators compiled to GPU compute shaders
  • Tensor operations parallelized across GPU cores
  • Memory managed in GPU VRAM for fast matrix multiplications
  • Typical speedup: 2-10x faster than WASM

WASM Mode (Fallback)

Execution Environment: CPU via WebAssembly

How it works:

  • ONNX operators executed in compiled WebAssembly
  • SIMD instructions for vectorized operations (when the CPU supports them)
  • Multi-threading via Web Workers (limited parallelism)
  • Typical performance: Functional but 2-10x slower than WebGPU

Selection Logic

TLD2 automatically detects GPU availability at runtime:

// Simplified detection logic
if (navigator.gpu) {
  const adapter = await navigator.gpu.requestAdapter();
  if (adapter) {
    // ✅ WebGPU available - use GPU acceleration
    device = 'webgpu';
  } else {
    // ⚠️ WebGPU API exists but no adapter - fall back to WASM
    device = 'wasm';
  }
} else {
  // ❌ WebGPU not supported - use WASM
  device = 'wasm';
}
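In ONNX Runtime Web, the detected device is typically expressed as an ordered list of execution providers passed to `ort.InferenceSession.create`. A minimal sketch of that choice, with the helper name ours (not part of TLD2):

```javascript
// Sketch: turn the GPU-adapter check into an executionProviders list for
// ONNX Runtime Web. The runtime tries providers in order, so 'wasm' stays
// last as the guaranteed CPU fallback.
function chooseExecutionProviders(hasGpuAdapter) {
  return hasGpuAdapter ? ['webgpu', 'wasm'] : ['wasm'];
}
```

In the browser, the result would be used as `ort.InferenceSession.create(modelUrl, { executionProviders: chooseExecutionProviders(adapter !== null) })`.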

Performance Benchmarks

Real-World TTS Synthesis Times

Measured on the Kokoro 82M quantized model (q8f16, 86MB):

Hardware Configuration         | Backend | First Chunk Latency | Tokens/Second | Full Article (500 words)
-------------------------------|---------|---------------------|---------------|-------------------------
NVIDIA RTX 3080 (Desktop)      | WebGPU  | 150-250 ms          | ~180          | 2-3 seconds
Apple M1 Pro (MacBook)         | WebGPU  | 200-350 ms          | ~140          | 3-4 seconds
Intel Iris Xe (Integrated)     | WebGPU  | 400-600 ms          | ~80           | 5-7 seconds
AMD Radeon RX 6700 XT          | WebGPU  | 180-300 ms          | ~150          | 3-4 seconds
Intel i7-11700 (CPU only)      | WASM    | 1.5-2.5 s           | ~30           | 15-20 seconds
Apple M1 Pro (CPU only)        | WASM    | 0.8-1.5 s           | ~50           | 10-12 seconds
AMD Ryzen 5 5600X (CPU only)   | WASM    | 1-2 s               | ~35           | 12-16 seconds

Key Takeaway: WebGPU delivers 3-8x faster first-chunk latency and 2-5x higher throughput compared to WASM.
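The throughput column makes quick estimates easy. A back-of-envelope helper, assuming roughly one token per word (an approximation that reproduces the full-article column in the table above):

```javascript
// Back-of-envelope synthesis-time estimate from the table's throughput
// numbers, under the rough assumption of one token per word.
function estimateSynthesisSeconds(words, tokensPerSecond) {
  return words / tokensPerSecond;
}
// 500 words at ~180 tok/s (RTX 3080) → ≈ 2.8 s, inside the 2-3 s range.
```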

Hardware Requirements & Compatibility

WebGPU Browser Support

  • Chrome 113+ (Stable, full support)
  • Edge 113+ (Chromium-based, full support)
  • Firefox 121+ (Experimental, enable via about:config)
  • Safari 18+ (Partial support, macOS/iOS)

GPU Compatibility

NVIDIA GPUs:

  • GTX 900 series and newer (Maxwell architecture+)
  • RTX 20/30/40 series (optimal performance)
  • Quadro/Tesla workstation cards

AMD GPUs:

  • RX 400 series and newer (Polaris+)
  • RX 5000/6000/7000 series (RDNA architecture)
  • Integrated Ryzen APUs (Vega, RDNA2)

Intel GPUs:

  • Iris Xe (11th gen and newer)
  • Arc A-series (dedicated GPUs)
  • UHD Graphics 600 series+ (limited performance)

Apple Silicon:

  • M1, M1 Pro, M1 Max, M1 Ultra
  • M2, M2 Pro, M2 Max
  • M3 series (best performance)

Minimum System Requirements

For WebGPU Mode:

  • OS: Windows 10+, macOS 11+, Linux (kernel 5.4+)
  • RAM: 4GB minimum, 8GB recommended
  • GPU: Any GPU from 2015 or later with up-to-date drivers
  • VRAM: 2GB minimum for model storage

For WASM Mode (CPU-only):

  • CPU: Modern multi-core (2019+)
  • RAM: 4GB minimum
  • SIMD: Used automatically when the CPU supports it (most modern CPUs do)
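SIMD support can be probed directly, using the same trick as the wasm-feature-detect library: ask the runtime to validate a tiny WebAssembly module containing a v128 instruction. The byte sequence below is the commonly circulated probe module; treat it as illustrative:

```javascript
// Feature-detect WASM SIMD: if the engine validates a module that uses a
// v128 instruction, SIMD is available. (Probe bytes follow the pattern used
// by wasm-feature-detect.)
function wasmSimdSupported() {
  return WebAssembly.validate(new Uint8Array([
    0, 97, 115, 109, 1, 0, 0, 0,                    // "\0asm" magic + version
    1, 5, 1, 96, 0, 1, 123,                         // type section: () -> v128
    3, 2, 1, 0,                                     // function section
    10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,  // body with i8x16.splat
  ]));
}
```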

How to Check GPU Availability

Method 1: Browser Console

Open DevTools (F12) and paste this into the Console:

// Check if WebGPU is available
if (navigator.gpu) {
  navigator.gpu.requestAdapter().then(adapter => {
    if (adapter) {
      console.log('✅ WebGPU Available');
      console.log('GPU:', adapter.info);
    } else {
      console.log('⚠️ WebGPU API present but no adapter found');
    }
  });
} else {
  console.log('❌ WebGPU not supported in this browser');
}

Expected output if GPU is available:

✅ WebGPU Available
GPU: {vendor: "nvidia", architecture: "ampere", device: "NVIDIA GeForce RTX 3080", ...}

Method 2: Chrome Flags Check

  1. Navigate to chrome://gpu in your browser
  2. Search for "WebGPU" in the page
  3. Look for "WebGPU: Hardware accelerated"
  4. Check "Graphics Feature Status" table

If WebGPU shows "Disabled" or "Software only", troubleshoot using the next section.

Troubleshooting GPU Detection

Issue: "WebGPU not available" despite having a GPU

Common Causes & Fixes:

1. Outdated GPU Drivers

  • NVIDIA: Download latest drivers from nvidia.com/drivers
  • AMD: Update via AMD Software Adrenalin
  • Intel: Use Intel Driver & Support Assistant

2. WebGPU Disabled in Chrome Flags

  1. Go to chrome://flags
  2. Search for "WebGPU"
  3. Set "Unsafe WebGPU" to Enabled
  4. Restart Chrome

3. Hardware Acceleration Disabled

  1. Go to chrome://settings
  2. Search for "hardware acceleration"
  3. Enable "Use hardware acceleration when available"
  4. Restart Chrome

4. GPU Blocklisted by Chrome

  • Check chrome://gpu for "Problems Detected"
  • If your GPU is on Chrome's software-rendering blocklist, try: chrome://flags/#ignore-gpu-blocklist → Enabled
  • Warning: Use at your own risk; overriding the blocklist may cause instability

Issue: TLD2 Slow Despite GPU Available

Diagnostics:

  • Check Task Manager (Windows) / Activity Monitor (Mac) during TTS synthesis
  • GPU usage should spike to 20-60% during generation
  • If CPU spikes instead, TLD2 may be falling back to WASM
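First-chunk latency alone is a useful tell: per the benchmark table, WebGPU stays well under a second while WASM rarely does. The helper and its 800 ms threshold below are illustrative heuristics, not part of TLD2:

```javascript
// Rough heuristic from the benchmark table: a first audio chunk arriving
// in well under a second suggests WebGPU; over a second suggests a WASM
// fallback. The 800 ms cut-off is illustrative.
function likelyBackend(firstChunkMs) {
  return firstChunkMs < 800 ? 'webgpu' : 'wasm';
}
```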

Force WebGPU Logging:

// In TLD2 source (if self-hosting / debugging)
env.backends.webgpu = { verbose: true };

Check console for messages like "WebGPU backend initialized successfully."

Performance Optimization Tips

1. Use Quantized Models

TLD2 uses model_q8f16.onnx (86MB, mixed precision) by default. This provides an optimal size/quality trade-off:

  • q8f16: 8-bit int + 16-bit float → 86MB, minimal quality loss
  • fp32 (full precision): 300MB+ → Slower, no perceptible quality gain

2. Close Resource-Heavy Browser Tabs

WebGPU shares GPU resources with:

  • Other browser tabs using WebGL/WebGPU
  • Hardware-accelerated video playback
  • GPU-intensive web apps (games, 3D renderers)

For maximum TTS performance, close unused tabs before synthesizing long articles.

3. Enable Power/Performance Mode

  • Windows: Set power plan to "High Performance"
  • Mac: Disable "Low Power Mode" (laptops)
  • Linux: Use cpupower frequency-set -g performance

4. Monitor Thermal Throttling

If your laptop gets hot during TTS synthesis:

  • GPU may thermal-throttle, reducing performance
  • Use cooling pads or elevate laptop for better airflow
  • Consider limiting summary length or using WASM mode to reduce heat

Technical Deep-Dive: ONNX Runtime Web

How TLD2 Uses ONNX

The Kokoro TTS model is exported from PyTorch to ONNX format for browser compatibility:

  1. Model Loading: ONNX file fetched from local cache (86MB)
  2. Session Creation: ONNX Runtime Web creates inference session
  3. Backend Selection: Runtime picks WebGPU or WASM based on availability
  4. Graph Optimization: Operator fusion, constant folding for speed
  5. Tensor Inference: Input text → phonemes → mel-spectrograms → audio waveform
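Step 4 is easy to illustrate in miniature. The toy folder below evaluates any node whose inputs are all constants once at load time, which is the essence of constant folding; the node shape and op names are invented for this example, while ONNX Runtime's real optimizer works on ONNX graphs:

```javascript
// Toy constant folding: collapse any subtree whose inputs are all constants
// into a single 'const' node, so it is computed once instead of per inference.
function foldConstants(node) {
  if (!node.inputs) return node; // leaves ('const', graph inputs) stay as-is
  const inputs = node.inputs.map(foldConstants);
  if (inputs.every(n => n.op === 'const')) {
    const value = node.op === 'add'
      ? inputs[0].value + inputs[1].value
      : inputs[0].value * inputs[1].value; // anything else treated as 'mul'
    return { op: 'const', value };
  }
  return { ...node, inputs };
}
```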

WebGPU Shader Compilation

ONNX operators (matrix multiply, convolution, etc.) are compiled to GPU compute shaders:

// Simplified example: Matrix multiplication shader
// (square matrices assumed; in a real shader, `size` would be supplied
// via a uniform buffer)
@group(0) @binding(0) var<storage, read> inputA: array<f32>;
@group(0) @binding(1) var<storage, read> inputB: array<f32>;
@group(0) @binding(2) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(16, 16)
fn matmul(@builtin(global_invocation_id) global_id: vec3<u32>) {
  // Parallel matrix multiplication across GPU cores
  let row = global_id.x;
  let col = global_id.y;
  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < size; k++) {
    sum += inputA[row * size + k] * inputB[k * size + col];
  }
  output[row * size + col] = sum;
}

This parallelization across hundreds/thousands of GPU cores is why WebGPU is so much faster than sequential CPU execution.
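For contrast, here is the same multiplication written as a sequential CPU loop: every output element is computed one after another, where the shader dispatches one GPU invocation per element:

```javascript
// Sequential CPU version of the shader's matrix multiply (square matrices,
// row-major, flat arrays). Three nested loops replace the GPU's parallel
// invocation grid.
function matmul(a, b, size) {
  const out = new Float32Array(size * size);
  for (let row = 0; row < size; row++) {
    for (let col = 0; col < size; col++) {
      let sum = 0;
      for (let k = 0; k < size; k++) {
        sum += a[row * size + k] * b[k * size + col];
      }
      out[row * size + col] = sum;
    }
  }
  return out;
}
```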

Future Optimizations

Upcoming Features

  • Model Caching: Persistent GPU memory for instant re-use
  • Batch Processing: Synthesize multiple summaries in parallel
  • Adaptive Quality: Auto-downgrade to faster model on slower GPUs
  • WebNN Support: Alternative backend for older hardware

Want Even More Performance?

Check out the TTS Implementation Documentation for low-level details on optimizing chunking strategies, parallel synthesis pipelines, and memory management.