Beginner Guide

Text-to-Speech Setup Tutorial

Last updated: January 2025 • 4 min read

Configure TLD2's neural text-to-speech engine for optimal listening quality. This guide covers voice selection, speed and pitch adjustment, auto-play settings, and troubleshooting common audio issues.

Accessing TTS Settings

Open the Settings Panel

Click the settings icon (gear symbol) in the TLD2 sidebar. The settings panel will slide down from the top, displaying all configuration options.

The TTS settings are grouped in the "Voice & Audio" section of the panel.

Voice Selection

TLD2 uses StreamingKokoroJS neural voices, which provide high-quality, natural-sounding speech synthesis. Each voice has distinct characteristics:

Available Voices

Voice Name	Gender	Characteristics	Best For
`af_sky`	Female	Clear, neutral, professional	News, articles, general content
`af_nicole`	Female	Warm, conversational	Blog posts, casual reading
`bm_fable`	Male	Deep, storytelling tone	Long-form content, narratives
`bm_lewis`	Male	Authoritative, clear	Technical docs, research papers
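
The "Best For" column can be read as a simple lookup from content type to voice. The sketch below is illustrative only: the voice IDs come from the table, but the `ContentKind` categories and the `suggestVoice` helper are hypothetical names introduced here, not part of TLD2.

```typescript
// Voice IDs copied from the table above.
type VoiceId = "af_sky" | "af_nicole" | "bm_fable" | "bm_lewis";

// Hypothetical content categories, invented for this sketch.
type ContentKind = "news" | "blog" | "narrative" | "technical";

// One suggested voice per category, following the "Best For" column.
const SUGGESTED_VOICE: Record<ContentKind, VoiceId> = {
  news: "af_sky",
  blog: "af_nicole",
  narrative: "bm_fable",
  technical: "bm_lewis",
};

// Returns the suggested voice for a given kind of content.
function suggestVoice(kind: ContentKind): VoiceId {
  return SUGGESTED_VOICE[kind];
}
```

For example, `suggestVoice("technical")` picks `bm_lewis`. Treat this as a starting point; personal preference still matters more than the mapping.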

Select Your Preferred Voice

In the settings panel, find the "Kokoro Voice" dropdown menu. Click it to view all available voices.

Select a voice to hear a preview (if preview is enabled), or simply choose one and click "Save" to test it with your next summary.

Pro Tip: Finding Your Perfect Voice

Spend 5 minutes testing each voice with the same article. Voice preference is highly personal—what sounds natural to one person may sound robotic to another. Your ideal voice depends on your listening environment and content type.

Speed Control

Adjust playback speed to match your listening preference and comprehension pace.

Set Playback Speed

Use the "Speaker Speed" slider to adjust from 0.5x (half speed) to 2.0x (double speed).

0.5x - 0.8x: Slower than natural speech. Ideal for language learners or complex technical content.
0.9x - 1.1x: Natural speech pace. Default is 1.0x. Most comfortable for casual listening.
1.2x - 1.5x: Faster comprehension. Good for familiar content or time-saving.
1.6x - 2.0x: Very fast. Requires focused attention but saves significant time.
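
The speed ranges above can be sketched as a clamp plus a band lookup. The function names and band labels are assumptions made for illustration. The guide's bands leave small gaps (for example between 0.8x and 0.9x), so the boundary handling here is a choice of this sketch, not documented TLD2 behavior.

```typescript
// Slider limits stated in the guide.
const MIN_SPEED = 0.5;
const MAX_SPEED = 2.0;

// Hypothetical labels for the four bands listed above.
type SpeedBand = "slow" | "natural" | "fast" | "very fast";

// Keeps a requested speed inside the slider range; non-finite input falls back to 1.0x.
function clampSpeed(speed: number): number {
  if (!isFinite(speed)) return 1.0;
  return Math.min(MAX_SPEED, Math.max(MIN_SPEED, speed));
}

// Maps a speed to one of the bands. Values in the gaps between the
// listed ranges are assigned by the thresholds below (a choice of this sketch).
function speedBand(speed: number): SpeedBand {
  const s = clampSpeed(speed);
  if (s < 0.9) return "slow";
  if (s <= 1.1) return "natural";
  if (s <= 1.5) return "fast";
  return "very fast";
}
```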

Important: Pitch Correction

At speeds above 1.3x or below 0.8x, voices can sound unnatural: high-pitched and chipmunk-like when fast, sluggish when slow. Enable "Pitch Correction" to automatically adjust pitch inversely to speed, maintaining natural intonation at any playback rate.

Pitch Adjustment

Fine-tune voice pitch for comfort and naturalness.

Manual Pitch Control

The "Speaker Pitch" slider allows manual pitch adjustment from 0.5x to 2.0x.

Default Behavior: When "Pitch Correction" is enabled, pitch automatically adjusts inversely to speed (e.g., 1.5x speed = 0.67x pitch).

Manual Override: Disable pitch correction and manually set both speed and pitch for custom effects.

Pitch Correction Toggle

The "Pitch Correction" checkbox (enabled by default) maintains natural voice quality across all speeds by automatically compensating for speed changes.

Speed	Pitch (Auto)	Result
2.0x (double)	0.5x (half)	Fast but natural-sounding
1.0x (normal)	1.0x (normal)	Default natural speech
0.5x (half)	2.0x (double)	Slow but not sluggish
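
The table follows a simple inverse rule: with correction on, pitch is 1 divided by speed. A minimal sketch of that rule, assuming both values are limited to the 0.5x to 2.0x slider range; `effectivePitch` is a hypothetical helper name, not a TLD2 function.

```typescript
// Limits both sliders share according to the guide.
const MIN_RATE = 0.5;
const MAX_RATE = 2.0;

// Keeps a value inside the slider range.
function clampRate(value: number): number {
  return Math.min(MAX_RATE, Math.max(MIN_RATE, value));
}

// With pitch correction on, pitch is the inverse of speed.
// With it off, the manually chosen pitch is used instead.
function effectivePitch(
  speed: number,
  pitchCorrection: boolean,
  manualPitch: number
): number {
  const pitch = pitchCorrection ? 1 / clampRate(speed) : manualPitch;
  return clampRate(pitch);
}
```

At 1.5x speed this gives 1 / 1.5, which is about 0.67x, matching the example under "Default Behavior".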

Auto-play Settings

Configure Automatic Playback

The "Autoplay" checkbox (enabled by default) controls whether TTS starts automatically when a summary finishes generating.

Enabled: TTS begins immediately after the summary streams in. Ideal for hands-free listening.
Disabled: You must manually click the play button to start audio. Useful when you prefer to read first.

Advanced Audio Settings

Info Printout Toggle

Enable "Info Printout" to display real-time status text during TTS processing:

"Generating summary..."
"Processing TTS chunk 1/5"
"Buffering audio..."
"Playing chunk 3/5"

Useful for debugging or understanding what TLD2 is doing behind the scenes.

Playback Controls

The playbar offers fine-grained control during playback:

Play/Pause: Standard playback control
Progress Bar: Visual timeline with buffering indicator. Click to jump to any position.
Shuttle Controls: Skip forward/backward by 15 seconds (configurable).
Duration Display: Shows current position and total length (e.g., 0:45 / 2:30).
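
The duration display and shuttle behavior come down to a little arithmetic. The sketch below is an illustration under assumptions: the function names are hypothetical, positions are in seconds, and the 15-second step is the default mentioned above.

```typescript
// Formats a number of seconds as m:ss, e.g. 45 -> "0:45".
function formatTime(totalSeconds: number): string {
  const whole = Math.max(0, Math.floor(totalSeconds));
  const minutes = Math.floor(whole / 60);
  const seconds = whole % 60;
  return `${minutes}:${seconds < 10 ? "0" : ""}${seconds}`;
}

// Builds the "position / total" label shown in the playbar.
function durationLabel(position: number, duration: number): string {
  return `${formatTime(position)} / ${formatTime(duration)}`;
}

// Moves the position by delta seconds (negative to skip backward),
// never going before the start or past the end.
function shuttle(position: number, duration: number, delta: number = 15): number {
  return Math.min(duration, Math.max(0, position + delta));
}
```

Clamping in `shuttle` means that skipping back from the first few seconds lands on 0:00 rather than a negative position.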

Optimizing TTS Performance

GPU Acceleration

TLD2 uses WebGPU for hardware-accelerated TTS synthesis when available, providing near-instant audio generation.

Performance Expectations

With GPU (WebGPU): First audio chunk in 200-500ms. Streaming synthesis keeps pace with playback.

Without GPU (WASM fallback): First chunk in 1-3 seconds. Synthesis runs slower than playback, so TLD2 pre-buffers audio to avoid interruptions.

Open the browser's developer console (F12) and look for "WebGPU enabled" or "WASM fallback" to see which mode you're using.
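
For readers curious how a page can tell the two modes apart, browsers that support WebGPU expose it as `navigator.gpu`, and `requestAdapter()` resolves to `null` when no suitable GPU is found. The sketch below shows that general pattern. It is not TLD2's actual detection code; `pickBackend`, `backendLabel`, and the `GpuLike` interface are hypothetical names, and the GPU object is passed in as a parameter so the logic can run outside a browser.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal shape of the WebGPU entry point used here (assumption:
// only requestAdapter is needed for detection).
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}

// Chooses WebGPU when an adapter is available, otherwise WASM.
async function pickBackend(gpu: GpuLike | undefined): Promise<Backend> {
  if (!gpu) return "wasm"; // browser has no WebGPU support
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null means no usable GPU
  } catch {
    return "wasm"; // treat any failure as "no GPU"
  }
}

// Console messages quoted in the guide.
function backendLabel(backend: Backend): string {
  return backend === "webgpu" ? "WebGPU enabled" : "WASM fallback";
}

// In a browser this could be called as:
//   pickBackend((navigator as any).gpu).then((b) => console.log(backendLabel(b)));
```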

When to Adjust Settings

Slow Synthesis: Usually means no GPU is available and TLD2 is running in WASM fallback mode. Expect longer buffering, or use hardware with WebGPU support.
Choppy Playback: Reduce speed slightly or close other browser tabs that are consuming resources.
Unnatural Voice: Enable pitch correction and keep speed in the 0.9x-1.2x range.

Saving Your Settings

Click "Save" at the bottom of the settings panel to persist your configuration. Settings are stored locally in Chrome and sync across devices if Chrome sync is enabled.
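
As a rough picture of what gets saved, the settings in this guide could be modeled as one object merged over defaults. This is a sketch, not TLD2's real schema: the field names are hypothetical, and defaults the guide does not state (voice, manual pitch, info printout) are assumptions marked in the comments.

```typescript
// Hypothetical shape for the settings described in this guide.
interface TtsSettings {
  voice: string;
  speed: number;
  pitch: number;
  pitchCorrection: boolean;
  autoplay: boolean;
  infoPrintout: boolean;
}

const DEFAULTS: TtsSettings = {
  voice: "af_sky",       // assumption: the guide does not name a default voice
  speed: 1.0,            // stated default
  pitch: 1.0,            // assumption
  pitchCorrection: true, // stated default
  autoplay: true,        // stated default
  infoPrintout: false,   // assumption
};

// Fills in any setting that has not been saved yet.
function withDefaults(saved: Partial<TtsSettings>): TtsSettings {
  return { ...DEFAULTS, ...saved };
}
```

In a Chrome extension, an object like this is commonly persisted with `chrome.storage.sync`, which would produce the cross-device syncing described above. Whether TLD2 uses that API is not stated in this guide.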

Click "Close" or the X button to dismiss the settings panel without saving changes.

Frequently Asked Questions

Why does the voice sound robotic or unnatural?

This usually occurs when pitch correction is disabled and speed is set very high or low. Enable "Pitch Correction" in settings, which automatically adjusts pitch inversely to speed to maintain natural intonation.

Which voice sounds most natural?

Voice preference is subjective, but popular choices include af_sky (female, clear) and bm_fable (male, deep). Experiment with all available voices to find your favorite. Article type also matters: technical or professional content may benefit from bm_lewis, while casual blogs work well with af_nicole.

Can I use TTS without a GPU?

Yes! TLD2 automatically falls back to WASM (CPU-only) mode if GPU/WebGPU isn't available. Synthesis will be 2-5x slower, but fully functional. You'll notice longer buffering times, but playback remains smooth once audio is generated.

Why is there a delay before audio starts playing?

TLD2 uses streaming synthesis, generating audio in chunks as the summary appears. The first chunk takes 200 ms to 3 s depending on GPU availability. Subsequent chunks are generated while earlier ones play, which keeps playback smooth. The short delay is intentional: audio starts as soon as the first chunk is ready instead of waiting for the entire summary to be synthesized.
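
To make the idea of chunks concrete, here is one way text could be split before synthesis. It is an assumption-laden sketch: the guide does not say how TLD2 chooses chunk boundaries, and `splitIntoChunks` with its `maxChars` limit is a hypothetical helper that groups whole sentences up to a size cap.

```typescript
// Splits text into chunks of whole sentences, each at most roughly
// maxChars long (a single long sentence may exceed the cap).
function splitIntoChunks(text: string, maxChars: number = 200): string[] {
  // A "sentence" here is a run of text ending in ., ! or ? plus trailing spaces.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    // Start a new chunk when adding this sentence would exceed the cap.
    if (current.length > 0 && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim().length > 0) chunks.push(current.trim());
  return chunks;
}
```

Smaller chunks shorten the wait for the first audio, at the cost of more synthesis calls. That trade-off is the reason a streaming design splits text at all.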

Can I change settings during playback?

Yes, but changes won't apply to currently playing audio. Speed and pitch adjustments affect future chunks. Pause, change settings, save, then replay to hear the difference.