StreamingKokoroJS Neural Text-to-Speech Implementation Guide
Overview
StreamingKokoroJS is a browser-based implementation of the Kokoro TTS (text-to-speech) system that performs high-quality neural voice synthesis on unlimited-length text entirely locally, with no server dependencies. Built on the lightweight 82-million-parameter Kokoro model, it delivers natural-sounding speech at a 24 kHz sample rate with real-time streaming capabilities.
Why StreamingKokoroJS?
- 100% Client-Side: No data sent to external servers—complete privacy
- High Quality: Neural vocoder produces human-like speech comparable to cloud TTS services
- Real-Time Streaming: Low-latency chunk processing for immediate playback
- Hardware Accelerated: WebGPU support for 2-10x speedup vs CPU-only
- Lightweight: 86MB quantized model (q8f16) with minimal quality loss
- Open Source: Apache 2.0 licensed model weights
Key Specifications
| Property | Value |
|---|---|
| Model Name | Kokoro-82M-v1.0-ONNX |
| Parameters | 82 million |
| Sample Rate | 24 kHz |
| Model Size (fp32) | ~300 MB |
| Model Size (q8f16) | 86 MB (recommended) |
| Model Size (quantized 8-bit) | 93 MB |
| License | Apache 2.0 |
| NPM Package | kokoro-js |
| Source Repository | Hugging Face: onnx-community/Kokoro-82M-v1.0-ONNX |
Model Architecture
Kokoro TTS follows a modern neural TTS architecture consisting of three primary stages:
1. Text-to-Phoneme Conversion
Converts input text into phonetic representations using a learned tokenizer. Handles pronunciation rules, stress patterns, and phoneme boundaries for natural prosody.
2. Acoustic Model (Phoneme-to-Mel)
Transformer-based acoustic model generates mel-spectrograms from phoneme sequences. This stage determines prosody, pitch, and timing characteristics of the output speech.
3. Neural Vocoder (Mel-to-Waveform)
Converts mel-spectrograms into raw audio waveforms at 24 kHz. Uses convolutional layers optimized for real-time synthesis with minimal artifacts.
Architecture Diagram
[Text Input]
↓
[Tokenizer] → Phoneme IDs
↓
[Acoustic Model] → Mel-Spectrograms (80 bins @ 24kHz)
↓
[Neural Vocoder] → Audio Waveform (Float32Array)
↓
[Web Audio API] → Playback
Model Weights
The Kokoro model is exported to ONNX format for cross-platform compatibility and optimized inference via ONNX Runtime Web. ONNX (Open Neural Network Exchange) provides standardized operators and efficient execution across WebGPU and WebAssembly backends.
Dependencies & Setup
Core Dependencies
| Package | Version | Purpose |
|---|---|---|
| kokoro-js | ^1.0.0 | Primary TTS library, model loading, inference orchestration |
| @xenova/transformers | ^2.17.2 | Underlying ONNX model execution, tokenization, WebGPU/WASM inference |
| onnxruntime-web | ^1.20.0 | ONNX Runtime for browser, WebGPU and WASM backend support |
Installation
# Install via NPM
npm install kokoro-js @xenova/transformers onnxruntime-web
# Or via Yarn
yarn add kokoro-js @xenova/transformers onnxruntime-web
Browser Requirements
- Chrome 120+ or Edge (Chromium-based) for full WebGPU support
- Chrome 113+ minimum for partial WebGPU (hardware dependent)
- Hardware acceleration enabled in browser flags (chrome://flags/#enable-unsafe-webgpu)
- 4GB RAM minimum, 8GB+ recommended for optimal performance
- WebGPU-compatible GPU: NVIDIA, AMD, Intel, or Apple Silicon
Hardware Acceleration Check
// Check WebGPU availability
if (navigator.gpu) {
const adapter = await navigator.gpu.requestAdapter();
if (adapter) {
console.log('✅ WebGPU Available');
console.log('GPU:', adapter.info);
} else {
console.log('⚠️ WebGPU adapter request failed - falling back to WASM');
}
} else {
console.log('❌ WebGPU not supported - using WASM');
}
Model Initialization
Loading the Kokoro model with proper configuration is critical for performance. Use quantized models for reduced memory footprint and faster loading.
Basic Initialization
import { KokoroTTS } from 'kokoro-js';
async function initTTS() {
try {
// Model ID: Hugging Face repository
const modelId = 'onnx-community/Kokoro-82M-v1.0-ONNX';
// Initialize with WebGPU detection
const tts = await KokoroTTS.from_pretrained(modelId, {
dtype: 'q8', // Quantized for speed/size (q8f16 recommended)
device: navigator.gpu ? 'webgpu' : 'wasm', // Auto-detect hardware
progress_callback: (data) => {
const progress = (data.loaded / data.total * 100).toFixed(1);
console.log(`Loading model: ${progress}%`);
updateProgressUI(progress); // Update your UI
}
});
console.log('✅ TTS model loaded successfully');
return tts;
} catch (error) {
console.error('❌ TTS initialization failed:', error);
throw error;
}
}
// Usage
const tts = await initTTS();
Local Model Path (Chrome Extension)
For Chrome extensions with Manifest V3 CSP compliance, bundle the model locally and override remote paths:
import { KokoroTTS, env } from 'kokoro-js';
// Configure local model paths
env.allowRemoteModels = false; // Block remote downloads
env.localModelPath = chrome.runtime.getURL('models/kokoro/');
async function initTTSLocal() {
const tts = await KokoroTTS.from_pretrained('kokoro-82m-q8f16', {
dtype: 'q8',
device: navigator.gpu ? 'webgpu' : 'wasm',
local_files_only: true, // Enforce local model loading
progress_callback: (data) => {
updateStatus(`Loading: ${(data.loaded / data.total * 100).toFixed(0)}%`);
}
});
return tts;
}
Lazy Loading Strategy
To reduce initial extension load time, defer model loading until first TTS request:
// Global singleton with lazy initialization
let ttsInstance = null;
async function getTTS() {
if (!ttsInstance) {
console.log('First TTS request - loading model...');
ttsInstance = await initTTS();
}
return ttsInstance;
}
// Usage in TTS generation
async function generateSpeech(text) {
const tts = await getTTS(); // Lazy load on first call
return await tts.generate(text, { voice: 'af_sky' });
}
WebGPU Acceleration
WebGPU provides 2-10x speedup over CPU-only WASM execution by offloading tensor operations to the GPU. Understanding WebGPU configuration is critical for optimal TTS performance.
How WebGPU Accelerates TTS
- Matrix Multiplication: Acoustic model and vocoder use large matrix ops—GPU parallelization dramatically reduces latency
- Convolution Layers: Neural vocoder relies on 1D convolutions for waveform synthesis—GPU excels at parallel convolution
- Tensor Operations: Element-wise ops (activation functions, normalization) execute in parallel on GPU cores
- Memory Bandwidth: GPU memory bandwidth (100-900 GB/s) far exceeds CPU (20-50 GB/s) for large model weights
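The memory-bandwidth point can be made concrete with back-of-envelope arithmetic (a sketch using the bandwidth ranges quoted above, not measured figures): the time to stream the full 86 MB q8f16 weight set once differs by two orders of magnitude between CPU and discrete GPU.

```javascript
// Back-of-envelope: time to read the full 86 MB q8f16 weight set once,
// at CPU-like vs GPU-like memory bandwidth (figures from the ranges above)
function weightSweepMs(modelBytes, bandwidthBytesPerSec) {
  return (modelBytes / bandwidthBytesPerSec) * 1000;
}

const MODEL_BYTES = 86 * 1024 * 1024;            // 86 MB q8f16 model
const cpuMs = weightSweepMs(MODEL_BYTES, 30e9);  // ~30 GB/s CPU
const gpuMs = weightSweepMs(MODEL_BYTES, 400e9); // ~400 GB/s discrete GPU

console.log(`CPU sweep: ${cpuMs.toFixed(2)} ms, GPU sweep: ${gpuMs.toFixed(3)} ms`);
```

Each inference step touches a large fraction of the weights, so this sweep time is a rough floor on per-step latency.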
WebGPU Adapter Selection
async function selectBestGPU() {
if (!navigator.gpu) {
return null; // WebGPU not available
}
try {
// Request default adapter
const adapter = await navigator.gpu.requestAdapter({
powerPreference: 'high-performance' // Prefer discrete GPU
});
if (!adapter) {
console.warn('No WebGPU adapter available');
return null;
}
// Log adapter info
console.log('WebGPU Adapter:', {
vendor: adapter.info.vendor,
architecture: adapter.info.architecture,
device: adapter.info.device,
description: adapter.info.description
});
// Create device
const device = await adapter.requestDevice();
console.log('✅ WebGPU device created');
return device;
} catch (error) {
console.error('WebGPU adapter request failed:', error);
return null;
}
}
Performance Comparison
| Hardware | Backend | First Chunk Latency | Speedup vs WASM |
|---|---|---|---|
| NVIDIA RTX 3080 | WebGPU | 150-250ms | 8-10x faster |
| Apple M1 Pro | WebGPU | 200-350ms | 5-7x faster |
| AMD RX 6800 | WebGPU | 180-300ms | 6-9x faster |
| Intel Iris Xe | WebGPU | 400-600ms | 3-5x faster |
| CPU-only (8-core) | WASM | 1500-2500ms | 1x (baseline) |
Measured with the Kokoro q8f16 model generating a 50-token sentence.
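Latency numbers like these can be reproduced with a small timing harness (a sketch; `tts` is assumed to be an already-initialized KokoroTTS instance):

```javascript
// Measure wall-clock latency of an async operation; returns [result, elapsedMs]
async function timed(label, fn) {
  const t0 = performance.now();
  const result = await fn();
  const elapsedMs = performance.now() - t0;
  console.log(`${label}: ${elapsedMs.toFixed(1)} ms`);
  return [result, elapsedMs];
}

// Usage (assumes an initialized `tts` instance):
// const [audio, ms] = await timed('first chunk',
//   () => tts.generate('A roughly fifty token test sentence goes here.', { voice: 'af_sky' }));
```

Run it a few times and discard the first call, which includes shader compilation on WebGPU.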
WebGPU Configuration in ONNX Runtime
import * as ort from 'onnxruntime-web';
// Configure ONNX Runtime for WebGPU
ort.env.wasm.numThreads = 4; // Multi-threading for WASM fallback
ort.env.wasm.simd = true; // Enable SIMD instructions
// Set execution provider preference
const executionProviders = ['webgpu', 'wasm']; // Fallback chain
// Create inference session
const session = await ort.InferenceSession.create(modelPath, {
executionProviders: executionProviders,
graphOptimizationLevel: 'all', // Enable all optimizations
enableCpuMemArena: true, // Reduce memory allocations
enableMemPattern: true // Optimize memory reuse
});
WASM Fallback
When WebGPU is unavailable, ONNX Runtime Web falls back to WebAssembly (WASM) for CPU-based inference. While slower than GPU execution, WASM provides universal compatibility across all modern browsers.
WASM Path Configuration (Chrome Extension)
Chrome extension CSP requires local bundling of WASM binaries. Override default paths:
import * as ort from 'onnxruntime-web';
import { env } from 'kokoro-js';
// Override WASM binary paths for Manifest V3 CSP compliance
// (exact file names vary by onnxruntime-web version; check node_modules/onnxruntime-web/dist)
ort.env.wasm.wasmPaths = {
'ort-wasm.wasm': chrome.runtime.getURL('lib/onnxruntime-web/ort-wasm.wasm'),
'ort-wasm-simd.wasm': chrome.runtime.getURL('lib/onnxruntime-web/ort-wasm-simd.wasm'),
'ort-wasm-threaded.wasm': chrome.runtime.getURL('lib/onnxruntime-web/ort-wasm-threaded.wasm')
};
// Disable remote model downloads (this flag lives on the kokoro-js/transformers.js env, not on ort.env)
env.allowRemoteModels = false;
console.log('✅ WASM paths configured for local execution');
WASM Performance Optimization
// Enable multi-threading (if available)
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;
// Enable SIMD for 2-4x speedup on compatible CPUs
ort.env.wasm.simd = true;
// Run WASM on the calling thread; setting proxy = true would move execution
// to a worker at the cost of message-passing overhead
ort.env.wasm.proxy = false;
console.log(`WASM configured: ${ort.env.wasm.numThreads} threads, SIMD: ${ort.env.wasm.simd}`);
Automatic Fallback Detection
async function initWithFallback() {
let device = 'wasm'; // Default to WASM
// Try WebGPU first
if (navigator.gpu) {
try {
const adapter = await navigator.gpu.requestAdapter();
if (adapter) {
device = 'webgpu';
console.log('✅ Using WebGPU acceleration');
} else {
console.warn('⚠️ WebGPU adapter unavailable - using WASM');
}
} catch (error) {
console.warn('⚠️ WebGPU request failed - using WASM:', error);
}
}
// Initialize TTS with detected device
const tts = await KokoroTTS.from_pretrained(modelId, {
dtype: 'q8',
device: device
});
return { tts, device };
}
Inference Pipeline
The Kokoro TTS inference pipeline executes through multiple stages powered by Transformers.js and ONNX Runtime Web:
Pipeline Stages
[Text Input] → [Tokenization] → [Phoneme Generation] → [Acoustic Model] → [Vocoder] → [Audio Output]
Stage 1: Tokenization
- Input: "Hello world"
- Output: [2341, 5672, 8901] (token IDs)
Stage 2: Phoneme Generation
- Input: Token IDs
- Output: ["HH", "AH", "L", "OW", "W", "ER", "L", "D"] (phoneme sequence)
Stage 3: Acoustic Model (Text-to-Mel)
- Input: Phoneme IDs + Voice Embedding
- Output: Mel-Spectrogram (80 bins × T frames @ 24kHz)
Stage 4: Neural Vocoder (Mel-to-Waveform)
- Input: Mel-Spectrogram
- Output: Float32Array audio waveform (24kHz sample rate)
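Because Stage 4 emits raw samples at a fixed 24 kHz, the audio duration falls straight out of the array length:

```javascript
const SAMPLE_RATE = 24000; // Kokoro's fixed output sample rate

// Duration in seconds of a Float32Array waveform produced by the vocoder
function waveformDurationSec(samples, sampleRate = SAMPLE_RATE) {
  return samples.length / sampleRate;
}

// e.g. a 60,000-sample chunk is 2.5 seconds of audio
console.log(waveformDurationSec(new Float32Array(60000))); // 2.5
```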
Voice Selection
Kokoro includes multiple pre-trained voice embeddings:
// List available voices
const voices = await tts.list_voices();
console.log('Available voices:', voices);
// Output: ["af_sky", "af_nicole", "bm_fable", "bm_lewis", ...]
// Generate with specific voice
const audio = await tts.generate("Hello world", {
voice: 'af_sky', // American Female - Sky
speed: 1.0,
pitch: 1.0
});
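The voice IDs appear to follow a `<accent><gender>_<name>` convention (`af_` = American Female, `bm_` = British Male, judging from the examples above). A small helper can turn an ID into a UI label; note the prefix mapping below is inferred from the published voice list, not an official kokoro-js API:

```javascript
// Decode a Kokoro voice ID like 'af_sky' into human-readable parts.
// Prefix convention inferred from the published voice list (not an official API):
// first letter = accent (a: American, b: British), second letter = gender (f/m).
const ACCENTS = { a: 'American', b: 'British' };
const GENDERS = { f: 'Female', m: 'Male' };

function describeVoice(voiceId) {
  const [prefix, ...nameParts] = voiceId.split('_');
  const accent = ACCENTS[prefix[0]] ?? 'Unknown';
  const gender = GENDERS[prefix[1]] ?? 'Unknown';
  const name = nameParts.join('_');
  return `${accent} ${gender} - ${name[0].toUpperCase()}${name.slice(1)}`;
}

console.log(describeVoice('af_sky'));   // "American Female - Sky"
console.log(describeVoice('bm_lewis')); // "British Male - Lewis"
```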
Internal Processing Flow
// Simplified internal pipeline (for understanding)
async function internalPipeline(text, voice) {
// 1. Tokenize text
const tokens = tokenizer.encode(text);
// 2. Generate phonemes
const phonemes = await phonemeModel.forward({ input_ids: tokens });
// 3. Generate mel-spectrogram
const voiceEmbedding = loadVoiceEmbedding(voice);
const mel = await acousticModel.forward({
phoneme_ids: phonemes,
voice_embedding: voiceEmbedding
});
// 4. Generate audio waveform
const waveform = await vocoder.forward({ mel_spectrogram: mel });
// 5. Convert to AudioBuffer
const audioBuffer = createAudioBuffer(waveform, 24000);
return audioBuffer;
}
Streaming Architecture
Real-time streaming is essential for responsive TTS. StreamingKokoroJS processes text in chunks, yielding audio incrementally for immediate playback.
Basic Streaming Example
import { KokoroTTS, TextSplitterStream } from 'kokoro-js';
async function streamTTS(tts, text, settings) {
// Create text splitter for sentence-based chunking
const splitter = new TextSplitterStream();
// Create streaming generator
const stream = tts.stream(splitter);
// Split text into sentences (TextSplitterStream segments pushed text itself;
// pre-splitting here just lets us pace the pushes below)
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
console.log(`Streaming ${sentences.length} sentences...`);
// Process stream asynchronously
(async () => {
let chunkIndex = 0;
for await (const { text, phonemes, audio } of stream) {
console.log(`Chunk ${chunkIndex}: "${text}"`);
console.log(`Phonemes: ${phonemes.join(' ')}`);
// Play audio chunk immediately
await playAudioChunk(audio, settings.speed);
chunkIndex++;
}
console.log('✅ Streaming complete');
})();
// Push sentences to stream with delay
for (let i = 0; i < sentences.length; i++) {
splitter.push(sentences[i]);
await new Promise(resolve => setTimeout(resolve, 100)); // Natural pacing
}
splitter.close(); // Signal end of stream
}
Advanced Streaming with Progress Tracking
async function streamWithProgress(tts, summaryText, callbacks) {
const splitter = new TextSplitterStream();
const stream = tts.stream(splitter);
// Chunk text into sentences
const sentences = summaryText.match(/[^.!?]+[.!?]+/g) || [summaryText];
const totalSentences = sentences.length;
let processedChunks = 0;
let isPlaying = false;
const audioQueue = [];
// Audio queue processor (Web Audio API scheduling)
async function processAudioQueue() {
if (isPlaying || audioQueue.length === 0) return;
isPlaying = true;
const audioData = audioQueue.shift();
await playAudioBuffer(audioData);
isPlaying = false;
processAudioQueue(); // Process next chunk
}
// Stream processor
(async () => {
for await (const { text, phonemes, audio } of stream) {
processedChunks++;
// Update progress
const progress = (processedChunks / totalSentences) * 100;
callbacks.onProgress?.(progress, text);
// Add to audio queue
audioQueue.push(audio);
processAudioQueue(); // Start playback if not already playing
console.log(`[${processedChunks}/${totalSentences}] Processed: "${text}"`);
}
callbacks.onComplete?.();
})();
// Feed sentences to stream
for (const sentence of sentences) {
splitter.push(sentence.trim());
await new Promise(resolve => setTimeout(resolve, 50));
}
splitter.close();
}
// Usage
streamWithProgress(tts, summaryText, {
onProgress: (percent, text) => {
updateProgressBar(percent);
updateStatusText(`Playing: ${text.substring(0, 50)}...`);
},
onComplete: () => {
console.log('✅ TTS playback complete');
enableControls();
}
});
Parallel Processing (Multi-Stream)
For even faster synthesis on powerful GPUs, process multiple chunks in parallel:
async function parallelStream(tts, text, settings) {
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
const batchSize = 3; // Process 3 sentences in parallel
const audioBuffers = [];
// Process in batches
for (let i = 0; i < sentences.length; i += batchSize) {
const batch = sentences.slice(i, i + batchSize);
// Generate all chunks in parallel
const promises = batch.map(sentence =>
tts.generate(sentence, { voice: settings.voice })
);
const results = await Promise.all(promises);
audioBuffers.push(...results);
console.log(`Processed batch ${Math.floor(i / batchSize) + 1}`);
}
// Play sequentially
for (const buffer of audioBuffers) {
await playAudioBuffer(buffer);
}
}
Web Audio API Integration
Kokoro generates Float32Array audio data that must be converted to AudioBuffer for Web Audio API playback.
Basic Playback
// Create global AudioContext
const audioContext = new (window.AudioContext || window.webkitAudioContext)();
async function playAudioBuffer(audioData) {
// Convert Kokoro output to AudioBuffer
const audioBuffer = audioContext.createBuffer(
1, // Mono
audioData.length, // Sample count
24000 // Sample rate (24kHz)
);
// Copy audio data to buffer
audioBuffer.getChannelData(0).set(audioData);
// Create source node
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
// Connect to destination (speakers)
source.connect(audioContext.destination);
// Play immediately
source.start();
// Wait for playback to complete
await new Promise(resolve => {
source.onended = resolve;
});
}
Speed Control
function playWithSpeed(audioBuffer, speed) {
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
// Set playback rate (0.5x - 2.0x)
source.playbackRate.value = speed;
source.connect(audioContext.destination);
source.start();
return source; // Return for stop control
}
Pitch Control
function playWithPitch(audioBuffer, pitch) {
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
// Pitch shifting via playback rate affects speed too
// For independent pitch control, use pitch shifter library
// or inverse speed adjustment
source.playbackRate.value = pitch;
source.connect(audioContext.destination);
source.start();
return source;
}
// Pitch correction: with a plain BufferSource, playbackRate changes speed and
// pitch together, so "correction" here means choosing whether the user's manual
// pitch setting stacks on top of the speed-induced shift
function playWithPitchCorrection(audioBuffer, speed, pitch, usePitchCorrection) {
const source = audioContext.createBufferSource();
source.buffer = audioBuffer;
if (usePitchCorrection) {
// Let speed affect pitch naturally (the TLD2 default); ignore manual pitch
source.playbackRate.value = speed;
} else {
// Apply the user's manual pitch on top of the speed setting
source.playbackRate.value = speed * pitch;
}
source.connect(audioContext.destination);
source.start();
return source;
}
Advanced Audio Pipeline with Gain Control
class TTSAudioPlayer {
constructor() {
this.audioContext = new AudioContext();
this.gainNode = this.audioContext.createGain();
this.gainNode.connect(this.audioContext.destination);
this.currentSource = null;
}
async play(audioData, options = {}) {
const {
speed = 1.0,
volume = 1.0,
onProgress = null,
onComplete = null
} = options;
// Stop any current playback
this.stop();
// Create audio buffer
const audioBuffer = this.audioContext.createBuffer(
1,
audioData.length,
24000
);
audioBuffer.getChannelData(0).set(audioData);
// Create source
this.currentSource = this.audioContext.createBufferSource();
this.currentSource.buffer = audioBuffer;
this.currentSource.playbackRate.value = speed;
// Set volume
this.gainNode.gain.value = volume;
// Connect pipeline
this.currentSource.connect(this.gainNode);
// Progress tracking
const duration = audioBuffer.duration / speed;
const startTime = this.audioContext.currentTime;
let progressTimer = null;
if (onProgress) {
progressTimer = setInterval(() => {
const elapsed = this.audioContext.currentTime - startTime;
const progress = Math.min((elapsed / duration) * 100, 100);
onProgress(progress);
}, 100);
}
// Start playback
this.currentSource.start();
return new Promise(resolve => {
// Single onended handler (assigning it twice would overwrite the first,
// leaking the progress timer and double-firing onComplete)
this.currentSource.onended = () => {
if (progressTimer) clearInterval(progressTimer);
onComplete?.();
resolve();
};
});
}
stop() {
if (this.currentSource) {
try {
this.currentSource.stop();
} catch (e) {
// Already stopped
}
this.currentSource = null;
}
}
setVolume(volume) {
this.gainNode.gain.value = volume;
}
}
// Usage
const player = new TTSAudioPlayer();
await player.play(audioData, {
speed: 1.2,
volume: 0.8,
onProgress: (percent) => updateProgressBar(percent),
onComplete: () => console.log('Playback finished')
});
Chunking Strategies
Effective text chunking is critical for perceived latency and natural prosody. Different strategies optimize for different use cases.
Sentence-Based Chunking (Recommended)
function chunkBySentence(text) {
// Split on sentence boundaries
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
return sentences.map(s => s.trim()).filter(s => s.length > 0);
}
const chunks = chunkBySentence("Hello world. How are you? I'm fine.");
// Output: ["Hello world.", "How are you?", "I'm fine."]
Token-Based Chunking (Fixed Size)
function chunkByTokens(text, tokensPerChunk = 50) {
const words = text.split(/\s+/);
const chunks = [];
for (let i = 0; i < words.length; i += tokensPerChunk) {
const chunk = words.slice(i, i + tokensPerChunk).join(' ');
chunks.push(chunk);
}
return chunks;
}
const chunks = chunkByTokens(longText, 40);
// Splits into ~40-word chunks
Intelligent Chunking (Sentence + Length)
function smartChunk(text, maxTokens = 60) {
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
const chunks = [];
let currentChunk = '';
for (const sentence of sentences) {
const sentenceTokens = sentence.split(/\s+/).length;
// ''.split(/\s+/) yields [''], so guard against counting an empty chunk as 1 token
const currentTokens = currentChunk ? currentChunk.split(/\s+/).length : 0;
if (currentTokens + sentenceTokens <= maxTokens) {
// Add to current chunk
currentChunk += (currentChunk ? ' ' : '') + sentence;
} else {
// Start new chunk
if (currentChunk) chunks.push(currentChunk.trim());
currentChunk = sentence;
}
}
if (currentChunk) chunks.push(currentChunk.trim());
return chunks;
}
// Combines sentences into chunks up to 60 tokens
const chunks = smartChunk(article, 60);
Performance Comparison
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Sentence-based | Natural prosody, good pacing | Variable latency (long sentences) | General articles, summaries |
| Token-based (fixed) | Predictable latency, consistent chunks | May break mid-sentence (poor prosody) | Real-time streaming, chat |
| Intelligent (hybrid) | Balances naturalness and latency | More complex logic | Long-form content, optimal UX |
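The trade-offs in the table can be folded into a simple selector (an illustrative heuristic, not part of kokoro-js):

```javascript
// Pick a chunking strategy per the trade-offs above (illustrative heuristic):
// real-time/chat favors fixed-size chunks for predictable latency; short texts
// can stay sentence-based; long-form content benefits from the hybrid approach.
function pickChunkStrategy({ wordCount, realtime }) {
  if (realtime) return 'token';
  if (wordCount > 500) return 'hybrid';
  return 'sentence';
}

console.log(pickChunkStrategy({ wordCount: 120, realtime: false }));  // "sentence"
console.log(pickChunkStrategy({ wordCount: 2000, realtime: false })); // "hybrid"
```

The 500-word threshold is arbitrary; tune it against your own first-chunk latency measurements.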
Performance Optimization
Model Caching
// Transformers.js caches downloaded model files via the browser Cache Storage API
// Check cache status
async function checkCache() {
const cacheKeys = await caches.keys();
const hasKokoroCache = cacheKeys.some(key => key.includes('kokoro'));
console.log('Kokoro model cached:', hasKokoroCache);
}
// Clear cache (for debugging)
async function clearModelCache() {
const cacheKeys = await caches.keys();
for (const key of cacheKeys) {
if (key.includes('kokoro') || key.includes('transformers')) {
await caches.delete(key);
console.log('Cleared cache:', key);
}
}
}
Web Worker Offloading
// tts-worker.js
import { KokoroTTS } from 'kokoro-js';
let ttsInstance = null;
self.addEventListener('message', async (event) => {
const { action, data } = event.data;
switch (action) {
case 'init':
ttsInstance = await KokoroTTS.from_pretrained(data.modelId, {
dtype: data.dtype,
device: data.device
});
self.postMessage({ action: 'init', status: 'ready' });
break;
case 'generate': {
const result = await ttsInstance.generate(data.text, {
voice: data.voice
});
// kokoro-js returns a RawAudio ({ audio: Float32Array, sampling_rate });
// transfer the underlying ArrayBuffer to avoid copying across threads
self.postMessage({
action: 'audio',
audio: result.audio,
samplingRate: result.sampling_rate
}, [result.audio.buffer]); // Transfer ownership
break;
}
}
});
// main.js
const ttsWorker = new Worker('tts-worker.js', { type: 'module' });
ttsWorker.postMessage({
action: 'init',
data: { modelId: 'onnx-community/Kokoro-82M-v1.0-ONNX', dtype: 'q8', device: 'webgpu' }
});
ttsWorker.addEventListener('message', (event) => {
if (event.data.action === 'audio') {
playAudioBuffer(event.data.audio);
}
});
Parallel Chunk Processing
async function processBatch(tts, sentences, voice) {
const batchSize = 3;
const results = [];
for (let i = 0; i < sentences.length; i += batchSize) {
const batch = sentences.slice(i, i + batchSize);
// Process batch in parallel
const promises = batch.map(sentence =>
tts.generate(sentence, { voice })
);
const batchResults = await Promise.all(promises);
results.push(...batchResults);
console.log(`Batch ${Math.floor(i / batchSize) + 1} complete`);
}
return results;
}
Memory Management
// Monitor memory usage
function monitorMemory() {
if (performance.memory) {
const used = (performance.memory.usedJSHeapSize / 1048576).toFixed(2);
const limit = (performance.memory.jsHeapSizeLimit / 1048576).toFixed(2);
console.log(`Memory: ${used} MB / ${limit} MB`);
}
}
// Cleanup after TTS generation
function cleanup(audioBuffers) {
audioBuffers.length = 0; // Clear array references
// Suggest garbage collection (nonstandard; `gc` is exposed only when the engine
// is launched with a flag such as --expose-gc, and `global` is Node-only, so
// use globalThis in the browser)
if (globalThis.gc) {
globalThis.gc();
}
}
Model Quantization
Quantization reduces model size and increases inference speed by using lower-precision weights. Kokoro offers multiple quantization levels.
Quantization Comparison
| Format | Size | Quality | Speed | Recommended |
|---|---|---|---|---|
| fp32 (Full precision) | ~300 MB | Highest | Slower | ❌ Too large |
| q8f16 (Mixed precision) | 86 MB | Near-identical | Fast | ✅ Best choice |
| quantized (8-bit) | 93 MB | Minimal loss | Fast | ✅ Alternative |
Loading Quantized Models
// Load q8f16 model (recommended)
const tts = await KokoroTTS.from_pretrained('onnx-community/Kokoro-82M-v1.0-ONNX', {
dtype: 'q8', // Uses q8f16 variant
device: 'webgpu'
});
// Or specify quantized variant explicitly
const ttsQuantized = await KokoroTTS.from_pretrained('onnx-community/Kokoro-82M-v1.0-ONNX', {
dtype: 'quantized',
device: 'webgpu'
});
Quality Assessment
For most use cases, q8f16 is indistinguishable from fp32 in blind listening tests. Quantization primarily affects:
- Subtle pitch variations (< 1% difference)
- Very quiet consonants (minimal impact)
- Long-duration synthesis (> 5 minutes) may show minor artifacts
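The size figures above work out to roughly a 71% download reduction for q8f16 versus fp32 (simple arithmetic on the table values):

```javascript
// Percentage size reduction between two model variants (values from the table above)
function sizeReductionPct(fullMB, quantizedMB) {
  return ((fullMB - quantizedMB) / fullMB) * 100;
}

console.log(sizeReductionPct(300, 86).toFixed(1) + '%'); // "71.3%" for q8f16
console.log(sizeReductionPct(300, 93).toFixed(1) + '%'); // "69.0%" for 8-bit
```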
Chrome Extension Integration
Integrating StreamingKokoroJS into Chrome extensions requires careful handling of Manifest V3 CSP restrictions and bundle size limits.
Manifest V3 Configuration
{
"manifest_version": 3,
"name": "TLD2 Extension",
"version": "1.0.0",
"permissions": [
"storage",
"activeTab",
"scripting"
],
"background": {
"service_worker": "background/service-worker.js",
"type": "module"
},
"content_security_policy": {
"extension_pages": "script-src 'self' 'wasm-unsafe-eval'; object-src 'self'"
},
"web_accessible_resources": [
{
"resources": [
"models/kokoro/*",
"lib/onnxruntime-web/*"
],
"matches": ["<all_urls>"]
}
]
}
Static Bundling with esbuild
// build.js
import esbuild from 'esbuild';
esbuild.build({
entryPoints: ['sidebar/sidebar.js'],
bundle: true,
outfile: 'dist/sidebar/sidebar.js',
format: 'esm',
platform: 'browser',
target: 'chrome120',
external: [], // Bundle everything
loader: {
'.wasm': 'file' // Copy WASM files
},
define: {
'process.env.NODE_ENV': '"production"'
}
}).catch(() => process.exit(1));
Background Service Worker Initialization
// background/service-worker.js
import { KokoroTTS, env } from 'kokoro-js';
import * as ort from 'onnxruntime-web';
// Configure local paths
env.allowRemoteModels = false;
env.localModelPath = chrome.runtime.getURL('models/kokoro/');
ort.env.wasm.wasmPaths = {
'ort-wasm.wasm': chrome.runtime.getURL('lib/onnxruntime-web/ort-wasm.wasm'),
'ort-wasm-simd.wasm': chrome.runtime.getURL('lib/onnxruntime-web/ort-wasm-simd.wasm')
};
let ttsInstance = null;
async function initTTS() {
if (ttsInstance) return ttsInstance;
console.log('Initializing TTS model...');
ttsInstance = await KokoroTTS.from_pretrained('kokoro-82m-q8f16', {
dtype: 'q8',
device: navigator.gpu ? 'webgpu' : 'wasm',
local_files_only: true
});
console.log('✅ TTS ready');
return ttsInstance;
}
// Message handler
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
if (message.action === 'generateTTS') {
(async () => {
const tts = await initTTS();
const audio = await tts.generate(message.text, {
voice: message.voice || 'af_sky'
});
sendResponse({ audio: audio });
})();
return true; // Async response
}
});
Storage Management
// Check extension storage limits
async function checkStorage() {
const estimate = await navigator.storage.estimate();
const usedMB = (estimate.usage / 1048576).toFixed(2);
const quotaMB = (estimate.quota / 1048576).toFixed(2);
console.log(`Storage: ${usedMB} MB / ${quotaMB} MB`);
if (estimate.usage / estimate.quota > 0.9) {
console.warn('⚠️ Storage nearly full - consider clearing cache');
}
}
// Clear old cached models
async function clearOldModels() {
const cacheKeys = await caches.keys();
for (const key of cacheKeys) {
if (key.includes('old-version')) {
await caches.delete(key);
}
}
}
Troubleshooting
Common Issues
Model fails to load
Symptoms: "Failed to fetch model" or timeout errors
Solutions:
- Check internet connection (first load requires download)
- Verify `env.localModelPath` is correct for bundled models
- Clear browser cache and IndexedDB: chrome://settings/clearBrowserData
- Ensure sufficient storage space (check `navigator.storage.estimate()`)
WebGPU not detected
Symptoms: Falls back to WASM despite having compatible GPU
Solutions:
- Enable WebGPU flag: chrome://flags/#enable-unsafe-webgpu
- Update GPU drivers to latest version
- Enable hardware acceleration: chrome://settings/system
- Check GPU compatibility: `navigator.gpu.requestAdapter()`
Slow synthesis (> 3 seconds per sentence)
Symptoms: Long wait times before audio plays
Solutions:
- Verify WebGPU is active (check console logs)
- Use quantized model (`dtype: 'q8'`)
- Enable WASM SIMD: `ort.env.wasm.simd = true`
- Increase WASM threads: `ort.env.wasm.numThreads = 4`
- Use smaller chunk sizes for streaming
CSP violations in Chrome extension
Symptoms: "Refused to load script" or "Refused to execute inline script"
Solutions:
- Add `'wasm-unsafe-eval'` to extension CSP
- Bundle all dependencies with esbuild/webpack
- Override WASM paths to use `chrome.runtime.getURL()`
- Set `env.allowRemoteModels = false`
Debug Logging
// Enable verbose logging
import { env } from 'kokoro-js';
env.logging.level = 'debug';
// Check TTS configuration
async function debugTTS(tts) {
console.log('TTS Configuration:', {
device: tts.device,
dtype: tts.dtype,
modelId: tts.modelId,
voices: await tts.list_voices()
});
// Test generation
const testAudio = await tts.generate('Hello world', { voice: 'af_sky' });
console.log('Test audio generated:', testAudio.length, 'samples');
}
Performance Profiling
async function profileTTS(tts, text) {
console.time('TTS Generation');
const startMem = performance.memory?.usedJSHeapSize || 0;
const audio = await tts.generate(text, { voice: 'af_sky' });
const endMem = performance.memory?.usedJSHeapSize || 0;
const memDelta = ((endMem - startMem) / 1048576).toFixed(2);
console.timeEnd('TTS Generation');
console.log('Memory delta:', memDelta, 'MB');
console.log('Audio samples:', audio.length);
console.log('Duration:', (audio.length / 24000).toFixed(2), 'seconds');
}