KittenTTS + espeak-ng + phonemizer
Text-to-speech stack for podcast audio generation.
Goals
Convert the podcast essay script to audio — a complete MP3 file suitable for hosting and distribution. The requirement was a local, royalty-free TTS pipeline that produces acceptable quality without per-character API costs.
Effectiveness
Adequate. Audio quality is functional and the pipeline produces a complete MP3. The setup is fragile: three layers of dependency (KittenTTS → phonemizer → espeak-ng) each have their own installation quirks, and the path plumbing between them is not robust to environmental variation.
What made it effective
- KittenTTS() with no arguments loads the default nano model, which is the correct instantiation. Specifying a model or voice by the wrong identifier fails with only a generic import error that gives no hint about which identifier was wrong.
- ffmpeg handles audio concatenation across sections cleanly once the per-section WAV files are produced.
- The model is small enough (nano) to run on CPU without a GPU, so audio generation doesn't compete with Ollama for VRAM.
Bonus utility
espeak-ng can be used standalone for phoneme debugging: espeak-ng --ipa "word" dumps the IPA representation and confirms the library is reachable before running the full TTS pipeline.
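The standalone check above can also be wrapped as a preflight step in Python. This is a minimal sketch, not part of the existing pipeline: the helper name ipa_for is hypothetical, and the -q flag (suppress audio playback) is an addition to the command quoted above.

```python
import shutil
import subprocess


def ipa_for(word, binary="espeak-ng"):
    """Return espeak-ng's IPA output for `word`, or None if the binary is missing or fails."""
    exe = shutil.which(binary)
    if exe is None:
        # Binary is not on PATH, so the rest of the TTS stack cannot work either.
        return None
    result = subprocess.run(
        [exe, "-q", "--ipa", word],  # -q: do not play audio, only print phonemes
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

Calling ipa_for("word") before synthesis and aborting on None would surface a broken espeak-ng install early instead of partway through a long generation run.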
Friction / pain points / surprises
Three layers of path plumbing, all hardcoded. The chain is: phonemizer calls espeak-ng via PHONEMIZER_ESPEAK_LIBRARY (path to the .so), espeak-ng reads its data via ESPEAK_DATA_PATH, and the calling process needs LD_LIBRARY_PATH to find the shared library. All three must be set and must point to the same installation. In the Docker image, system packages install to /usr/lib/x86_64-linux-gnu/ and the env vars point there correctly. On the host, a custom extracted espeak install lives at /home/node/.local/espeak-ng/. Code that hardcodes either path fails in the other environment. Fix: conditionally set env vars based on which path exists.
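The conditional fix can be sketched as a small lookup over candidate installs. The two base directories are the ones named above. The library file names, the lib/ and share/ subdirectories, and the helper name pick_espeak_env are assumptions for illustration and would need to match the real layout.

```python
import os

# Candidate installs, checked in order. Base directories come from the notes;
# file names and subdirectory layout are ASSUMED, adjust to the real install.
CANDIDATES = [
    {
        "lib_dir": "/usr/lib/x86_64-linux-gnu",
        "lib_file": "libespeak-ng.so.1",  # assumed file name
        "data_dir": "/usr/lib/x86_64-linux-gnu/espeak-ng-data",  # assumed
    },
    {
        "lib_dir": "/home/node/.local/espeak-ng/lib",  # assumed layout
        "lib_file": "libespeak-ng.so",  # assumed file name
        "data_dir": "/home/node/.local/espeak-ng/share/espeak-ng-data",  # assumed
    },
]


def pick_espeak_env(candidates=CANDIDATES, exists=os.path.exists):
    """Return the three env vars for the first candidate whose library file exists."""
    for c in candidates:
        lib_path = os.path.join(c["lib_dir"], c["lib_file"])
        if exists(lib_path):
            # All three values are derived from the same candidate, so they
            # cannot point at different installations.
            return {
                "PHONEMIZER_ESPEAK_LIBRARY": lib_path,
                "ESPEAK_DATA_PATH": c["data_dir"],
                "LD_LIBRARY_PATH": c["lib_dir"],
            }
    raise FileNotFoundError("no espeak-ng installation found in any candidate path")
```

One way to use it is os.environ.update(pick_espeak_env()) before importing phonemizer. Note that the dynamic loader generally reads LD_LIBRARY_PATH at process start, so that variable is likely to matter only for child processes or when exported before Python launches.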
KittenTTS model and voice identifiers are not documented in any obvious place. We hit failures from using "KittenML/kitten-tts-mini-0.8" (wrong model ID) and "Jasper" (wrong voice name). The correct instantiation is KittenTTS() (no model arg) with voices like "expr-voice-3-m". The only reliable source for this was the package source code.
phonemizer's api.py doesn't pass ESPEAK_DATA_PATH to espeak_Initialize. The phonemizer library calls the espeak-ng C API without forwarding the data path from the environment. On non-standard installations espeak-ng initializes without finding its data directory and phonemizes incorrectly (or crashes). Fix: patch api.py to pass the env var value to espeak_Initialize. ensure-deps.sh applies this patch automatically after install.
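To illustrate the idea behind the patch, here is a sketch of forwarding the environment value into the C call. It is not phonemizer's actual code and not the patch that ensure-deps.sh applies. The function name init_espeak is hypothetical, and lib stands for any ctypes-loaded espeak-ng library. The only library fact relied on is the C signature espeak_Initialize(output, buflength, path, options), where a NULL path selects the default location.

```python
import os

AUDIO_OUTPUT_SYNCHRONOUS = 0x02  # enum value from espeak-ng's speak_lib.h


def init_espeak(lib, env=os.environ):
    """Call espeak_Initialize on `lib`, forwarding ESPEAK_DATA_PATH if it is set."""
    raw = env.get("ESPEAK_DATA_PATH")
    # ctypes passes bytes as const char* and None as NULL (default data location).
    data_path = raw.encode("utf-8") if raw else None
    sample_rate = lib.espeak_Initialize(AUDIO_OUTPUT_SYNCHRONOUS, 0, data_path, 0)
    if sample_rate <= 0:
        # The call returns the sample rate on success and a negative value on error.
        raise RuntimeError("espeak_Initialize failed (data path: %r)" % raw)
    return sample_rate
```

Raising on a non-positive return turns the "phonemizes incorrectly" case into an immediate, visible failure when the data directory cannot be found.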
Audio artifacts (clicks, silences) appear at section boundaries. KittenTTS generates audio per-section, and ffmpeg concatenates them. Section breaks are audible as brief artifacts. Mitigations: slight overlap in concatenation, or crossfade. Not yet addressed.
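The crossfade mitigation could look roughly like the following, which builds an ffmpeg command that chains the acrossfade filter between neighbouring sections. This is an untested sketch of an option the notes list as not yet addressed. The function name crossfade_command and the 50 ms default are assumptions, and it may soften clicks without removing them.

```python
def crossfade_command(wav_paths, output_path, fade_seconds=0.05):
    """Build an ffmpeg argv that joins the inputs with acrossfade between neighbours."""
    if len(wav_paths) < 2:
        raise ValueError("need at least two sections to crossfade")
    cmd = ["ffmpeg", "-y"]
    for path in wav_paths:
        cmd += ["-i", path]
    steps = []
    previous = "[0:a]"  # audio stream of the first input
    for i in range(1, len(wav_paths)):
        label = f"[x{i}]"
        # Fade the running result into input i, then name the output for the next step.
        steps.append(f"{previous}[{i}:a]acrossfade=d={fade_seconds}{label}")
        previous = label
    cmd += ["-filter_complex", ";".join(steps), "-map", previous, output_path]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Each section has to be longer than the fade duration, and the total length shrinks by one fade per boundary.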
No progress output during generation. KittenTTS is silent during synthesis. A 30-minute episode can take 10+ minutes to generate with no indication of progress. The pipeline logs nothing between "generating audio" and "upload complete".
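Since synthesis already runs per section, progress reporting can live in the loop around it rather than inside KittenTTS. A minimal sketch follows. The function name synthesize_with_progress is hypothetical, and synthesize is a placeholder for whatever callable wraps the model's generation call in the real pipeline.

```python
import time


def synthesize_with_progress(sections, synthesize, log=print):
    """Run `synthesize` on each section, logging position, size and elapsed time."""
    results = []
    total = len(sections)
    started = time.monotonic()
    for index, text in enumerate(sections, start=1):
        log(f"[{index}/{total}] synthesizing {len(text)} chars")
        section_start = time.monotonic()
        results.append(synthesize(text))
        now = time.monotonic()
        log(f"[{index}/{total}] done in {now - section_start:.1f}s "
            f"(total {now - started:.1f}s)")
    return results
```

Passing a logger method as log (for example logging.getLogger("tts").info) would put these lines between the existing "generating audio" and "upload complete" messages.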