KittenTTS + espeak-ng + phonemizer

Text-to-speech stack for podcast audio generation.

Goals

Convert the podcast essay script to audio: a complete MP3 file suitable for hosting and distribution. The requirement was a local, royalty-free TTS pipeline that produces acceptable quality without per-character API costs.

Effectiveness

Adequate. Audio quality is functional and the pipeline produces a complete MP3. The setup is fragile: three layers of dependency (KittenTTS → phonemizer → espeak-ng) each have their own installation quirks, and the path plumbing between them is not robust to environmental variation.

What made it effective

Bonus utility

espeak-ng can be used standalone for phoneme debugging: espeak-ng --ipa "word" prints the IPA transcription (add -q to suppress audio playback) and confirms that the espeak-ng install and its data are reachable before running the full TTS pipeline.
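
The same check can run as a preflight step before synthesis. A minimal sketch, assuming the espeak-ng binary is on PATH; the helper name espeak_ipa is hypothetical, and the only flags used are -q and --ipa:

```python
import shutil
import subprocess
from typing import Optional


def espeak_ipa(word: str, binary: str = "espeak-ng") -> Optional[str]:
    """Return the IPA transcription of `word`, or None if espeak-ng is unusable."""
    exe = shutil.which(binary)
    if exe is None:
        return None  # binary is not on PATH
    try:
        result = subprocess.run(
            [exe, "-q", "--ipa", word],  # -q: no audio playback, --ipa: print IPA
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # An empty transcription is treated as a failure as well.
    return result.stdout.strip() or None
```

A None result from espeak_ipa("word") is a reasonable signal to stop before the slow synthesis step. It only exercises the command-line tool, so it does not prove that phonemizer will load the same library.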

Friction / pain points / surprises

Three layers of path plumbing, all hardcoded. The chain is: phonemizer loads espeak-ng from PHONEMIZER_ESPEAK_LIBRARY (path to the .so), espeak-ng reads its data from ESPEAK_DATA_PATH, and the calling process needs LD_LIBRARY_PATH to find the shared library. All three must be set and must point to the same installation. In the Docker image, system packages install to /usr/lib/x86_64-linux-gnu/ and the env vars point there correctly. On the host, a custom extracted espeak-ng install lives at /home/node/.local/espeak-ng/. Code that hardcodes either path fails in the other environment. Fix: set the env vars conditionally, based on which path exists.
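
The conditional fix could look roughly like the sketch below. The two install roots come from the notes above; the library file name, the lib/ and share/ subdirectories on the host, and the data directory names are assumptions to check against the real installs. The function name resolve_espeak_env is hypothetical.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional

LIB_NAME = "libespeak-ng.so.1"  # ASSUMED file name of the shared library

# Candidate installations, checked in order. Subdirectory layout is ASSUMED.
CANDIDATES: List[Dict[str, str]] = [
    {  # Docker image (system packages)
        "lib_dir": "/usr/lib/x86_64-linux-gnu",
        "data_dir": "/usr/lib/x86_64-linux-gnu/espeak-ng-data",
    },
    {  # host (custom extracted install)
        "lib_dir": "/home/node/.local/espeak-ng/lib",
        "data_dir": "/home/node/.local/espeak-ng/share/espeak-ng-data",
    },
]


def resolve_espeak_env(
    candidates: List[Dict[str, str]] = CANDIDATES,
    exists: Callable[[str], bool] = os.path.exists,
    base_env: Optional[Mapping[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Return the three env vars for the first installation that exists."""
    env = os.environ if base_env is None else base_env
    for cand in candidates:
        lib = f"{cand['lib_dir']}/{LIB_NAME}"
        if exists(lib) and exists(cand["data_dir"]):
            old = env.get("LD_LIBRARY_PATH", "")
            ld_path = cand["lib_dir"] if not old else f"{cand['lib_dir']}:{old}"
            return {
                "PHONEMIZER_ESPEAK_LIBRARY": lib,
                # Whether this should name espeak-ng-data itself or its parent
                # is worth verifying against the installed espeak-ng version.
                "ESPEAK_DATA_PATH": cand["data_dir"],
                "LD_LIBRARY_PATH": ld_path,
            }
    return None  # no known installation found
```

The result would be applied with os.environ.update(...) before phonemizer is imported. One caveat: the dynamic loader reads LD_LIBRARY_PATH at process start, so changing it inside a running Python process only affects child processes. It is more reliable to export it from the launcher script.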

KittenTTS model and voice identifiers are not documented in any obvious place. We hit failures from using "KittenML/kitten-tts-mini-0.8" (wrong model ID) and "Jasper" (wrong voice name). The correct instantiation is KittenTTS() (no model arg) with voices like "expr-voice-3-m". The only reliable source for this was the package source code.
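
For reference, the working call shape as a small wrapper. This is a sketch: KittenTTS() with no argument and the voice name are taken from the notes above, the generate(text, voice=...) call is our reading of the package and should be checked against the installed version, and the model_factory parameter is a hypothetical hook added so the wrapper can be exercised without the model.

```python
from typing import Any, Callable, Optional

KNOWN_GOOD_VOICE = "expr-voice-3-m"  # voice name from the notes above


def synthesize(
    text: str,
    voice: str = KNOWN_GOOD_VOICE,
    model_factory: Optional[Callable[[], Any]] = None,
) -> Any:
    """Synthesize `text` and return whatever the model's generate() returns."""
    if model_factory is None:
        # Imported lazily so this module loads even without kittentts installed.
        from kittentts import KittenTTS

        model_factory = KittenTTS
    model = model_factory()  # no model argument, per the notes above
    # ASSUMPTION: generate() accepts the text and a `voice` keyword.
    return model.generate(text, voice=voice)
```

Keeping the voice name in one constant means a wrong identifier fails in a single, obvious place instead of somewhere in the middle of a long run.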

phonemizer's api.py doesn't pass ESPEAK_DATA_PATH to espeak_Initialize. The phonemizer library calls the espeak-ng C API without forwarding the data path from the environment. On non-standard installations espeak-ng initializes without finding its data directory and phonemizes incorrectly (or crashes). Fix: patch api.py to pass the env var value to espeak_Initialize. ensure-deps.sh applies this patch automatically after install.
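
To show what the patch amounts to, here is a standalone ctypes sketch of the same idea. It is not phonemizer's code and not the patch itself; it only illustrates forwarding ESPEAK_DATA_PATH as the path argument of espeak_Initialize. The helper names are hypothetical, and the argument layout follows the espeak-ng C header (output mode, buffer length, path, options).

```python
import ctypes
import os
from typing import Mapping, Optional

AUDIO_OUTPUT_SYNCHRONOUS = 0x02  # espeak_AUDIO_OUTPUT value from speak_lib.h


def data_path_arg(env: Mapping[str, str] = os.environ) -> Optional[bytes]:
    """Value for the `path` parameter: the env var as bytes, or None (NULL)."""
    value = env.get("ESPEAK_DATA_PATH")
    return os.fsencode(value) if value else None


def init_espeak(library_path: str, env: Mapping[str, str] = os.environ) -> int:
    """Load the shared library and initialize it with an explicit data path."""
    lib = ctypes.CDLL(library_path)
    lib.espeak_Initialize.argtypes = [
        ctypes.c_int,     # output mode
        ctypes.c_int,     # buffer length (0 selects the default)
        ctypes.c_char_p,  # data path (NULL means the built-in default location)
        ctypes.c_int,     # options
    ]
    lib.espeak_Initialize.restype = ctypes.c_int
    rate = lib.espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, data_path_arg(env), 0)
    if rate <= 0:  # the call returns the sample rate in Hz, or -1 on error
        raise RuntimeError("espeak_Initialize failed; check ESPEAK_DATA_PATH")
    return rate
```

Calling init_espeak(os.environ["PHONEMIZER_ESPEAK_LIBRARY"]) on its own is also a quick way to confirm that the library and data directory agree before phonemizer is involved.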

Audio artifacts (clicks, silences) appear at section boundaries. KittenTTS generates audio per section, and ffmpeg concatenates the results. Section breaks are audible as brief artifacts. Possible mitigations: a slight overlap at the joins, or a short crossfade. Neither has been implemented yet.
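
The crossfade option can be sketched in a few lines. This is an untested idea, not something the pipeline does today. It assumes each section is available as a sequence of mono float samples at the same sample rate; plain lists are used here to keep the example dependency-free, and the function name is hypothetical.

```python
from typing import List, Sequence


def crossfade(a: Sequence[float], b: Sequence[float], overlap: int) -> List[float]:
    """Join two sample sequences, linearly blending `overlap` samples at the seam."""
    n = max(0, min(overlap, len(a), len(b)))
    if n == 0:
        return list(a) + list(b)  # nothing to blend: plain concatenation
    head = list(a[: len(a) - n])  # part of `a` before the blend region
    tail = a[len(a) - n:]         # last n samples of `a`, fading out
    start = b[:n]                 # first n samples of `b`, fading in
    rest = list(b[n:])
    mixed = []
    for i in range(n):
        w = (i + 1) / (n + 1)     # weight of `b` rises from near 0 to near 1
        mixed.append(tail[i] * (1.0 - w) + start[i] * w)
    return head + mixed + rest
```

The joined result is shorter than a plain concatenation by the overlap length. An overlap of a few tens of milliseconds of samples is a sensible starting point to experiment with. ffmpeg's acrossfade filter is an alternative if the join stays in ffmpeg.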

No progress output during generation. KittenTTS is silent during synthesis. A 30-minute episode can take 10+ minutes to generate with no indication of progress. The pipeline logs nothing between "generating audio" and "upload complete".
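
Since audio is already generated per section, progress can be reported from the calling loop without any support from KittenTTS. A minimal sketch; synth stands in for whatever function turns one section of text into audio, and all names here are hypothetical.

```python
import time
from typing import Callable, List, Sequence, TypeVar

Audio = TypeVar("Audio")


def synthesize_with_progress(
    sections: Sequence[str],
    synth: Callable[[str], Audio],
    report: Callable[[str], None] = print,
) -> List[Audio]:
    """Run `synth` on each section and report timing after each one."""
    total = len(sections)
    started = time.monotonic()
    results: List[Audio] = []
    for index, text in enumerate(sections, start=1):
        section_start = time.monotonic()
        results.append(synth(text))
        now = time.monotonic()
        report(
            f"section {index}/{total} done in {now - section_start:.1f}s "
            f"(elapsed {now - started:.1f}s, {len(text)} chars)"
        )
    return results
```

Passing a logger method such as logging.getLogger("tts").info as report would put these lines in the pipeline log between "generating audio" and "upload complete".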