KittenTTS + espeak-ng + phonemizer

Text-to-speech stack for podcast audio generation.

Goals

Convert the podcast essay script into audio: a complete MP3 file suitable for hosting and distribution. The requirement was a local, royalty-free TTS pipeline that produces acceptable quality without per-character API costs.

Effectiveness

Adequate. Audio quality is functional and the pipeline produces a complete MP3. The setup is fragile: three layers of dependency (KittenTTS → phonemizer → espeak-ng) each have their own installation quirks, and the path plumbing between them is not robust to environmental variation.

What made it effective

KittenTTS() with no arguments loads the default nano model, which is the correct instantiation. Passing a wrong model or voice identifier fails with only a generic import error that does not name the bad identifier.
ffmpeg handles audio concatenation across sections cleanly once the per-section WAV files are produced.
The nano model is small enough to run on CPU, so audio generation doesn't compete with Ollama for VRAM.

Bonus utility

espeak-ng can be used standalone for phoneme debugging: espeak-ng --ipa "word" dumps the IPA representation and confirms the library is reachable before running the full TTS pipeline.
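
The standalone check can be wrapped as a preflight step before synthesis starts. This is a minimal sketch, not part of the existing pipeline: the function name espeak_ipa is hypothetical, and the -q flag (suppress audio playback) is an addition to the command quoted above. It only confirms that the command-line binary runs, which is a reasonable but not complete proxy for the shared library being loadable.

```python
import shutil
import subprocess


def espeak_ipa(word, binary="espeak-ng"):
    """Return the IPA string espeak-ng prints for `word`.

    Returns None when the binary is not on PATH or exits with an error,
    so callers can abort before starting a long synthesis run.
    """
    exe = shutil.which(binary)
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "-q", "--ipa", word],  # -q: do not play audio, just print phonemes
        capture_output=True,
        text=True,
        timeout=10,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    ipa = espeak_ipa("word")
    print("espeak-ng not usable" if ipa is None else ipa)
```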

Friction / pain points / surprises

Three layers of path plumbing, all hardcoded. The chain is: phonemizer calls espeak-ng via PHONEMIZER_ESPEAK_LIBRARY (path to the .so), espeak-ng reads its data via ESPEAK_DATA_PATH, and the calling process needs LD_LIBRARY_PATH to find the shared library. All three must be set and must point to the same installation. In the Docker image, system packages install to /usr/lib/x86_64-linux-gnu/ and the env vars point there correctly. On the host, a custom extracted espeak install lives at /home/node/.local/espeak-ng/. Code that hardcodes either path fails in the other environment. Fix: conditionally set env vars based on which path exists.
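
The conditional fix can be sketched as below. This is an illustration under assumptions, not the code we run: the function name configure_espeak_env is hypothetical, and the library file name (libespeak-ng.so.1) and the data directory locations under each prefix are guesses that must be checked against the real installs. Only the two prefixes come from the notes above.

```python
import os
from pathlib import Path

# ASSUMPTION: file names and subdirectories are illustrative guesses.
# Only the two install prefixes are taken from the notes above.
CANDIDATES = [
    {   # Docker image: system packages
        "library": "/usr/lib/x86_64-linux-gnu/libespeak-ng.so.1",
        "data": "/usr/lib/x86_64-linux-gnu/espeak-ng-data",
    },
    {   # Host: custom extracted install
        "library": "/home/node/.local/espeak-ng/lib/libespeak-ng.so.1",
        "data": "/home/node/.local/espeak-ng/share/espeak-ng-data",
    },
]


def configure_espeak_env(candidates=CANDIDATES, env=None):
    """Point all three variables at the first install that exists on disk."""
    env = os.environ if env is None else env
    for candidate in candidates:
        lib = Path(candidate["library"])
        data = Path(candidate["data"])
        if not (lib.is_file() and data.is_dir()):
            continue
        env["PHONEMIZER_ESPEAK_LIBRARY"] = str(lib)
        env["ESPEAK_DATA_PATH"] = str(data)
        # Prepend the library directory, keeping any existing entries.
        parts = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
        lib_dir = str(lib.parent)
        if lib_dir not in parts:
            parts.insert(0, lib_dir)
        env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
        return candidate
    raise FileNotFoundError("no espeak-ng installation found in candidates")
```

Call it before importing phonemizer or KittenTTS. One caveat: the dynamic loader reads LD_LIBRARY_PATH when a process starts, so setting it here only affects child processes; for the current process it has to be exported by whatever launches Python.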

KittenTTS model and voice identifiers are not documented in any obvious place. We hit failures from using "KittenML/kitten-tts-mini-0.8" (wrong model ID) and "Jasper" (wrong voice name). The correct instantiation is KittenTTS() (no model arg) with voices like "expr-voice-3-m". The only reliable source for this was the package source code.

phonemizer's api.py doesn't pass ESPEAK_DATA_PATH to espeak_Initialize. The phonemizer library calls the espeak-ng C API without forwarding the data path from the environment. On non-standard installations espeak-ng initializes without finding its data directory and phonemizes incorrectly (or crashes). Fix: patch api.py to pass the env var value to espeak_Initialize. ensure-deps.sh applies this patch automatically after install.
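
To show what forwarding the data path means at the C API level, here is a standalone ctypes sketch. It is neither phonemizer's code nor the patch that ensure-deps.sh applies; the helper names are hypothetical. It relies only on the espeak_Initialize signature from espeak-ng's speak_lib.h (output mode, buffer length, data path, options), where a NULL path means "use the default location".

```python
import ctypes
import os

AUDIO_OUTPUT_SYNCHRONOUS = 0x02  # enum value from espeak-ng's speak_lib.h


def data_path_argument(env=None):
    """Return ESPEAK_DATA_PATH as bytes for the C call, or None if unset."""
    env = os.environ if env is None else env
    value = env.get("ESPEAK_DATA_PATH")
    return value.encode("utf-8") if value else None


def initialize_espeak(library_path, env=None):
    """Load the shared library and initialize it with an explicit data path."""
    lib = ctypes.CDLL(library_path)
    lib.espeak_Initialize.argtypes = [
        ctypes.c_int,     # output mode
        ctypes.c_int,     # buffer length in ms (0 = default)
        ctypes.c_char_p,  # data path, or NULL for the default location
        ctypes.c_int,     # options
    ]
    lib.espeak_Initialize.restype = ctypes.c_int
    sample_rate = lib.espeak_Initialize(
        AUDIO_OUTPUT_SYNCHRONOUS, 0, data_path_argument(env), 0
    )
    if sample_rate <= 0:
        raise RuntimeError("espeak_Initialize failed; check ESPEAK_DATA_PATH")
    return sample_rate
```

Running initialize_espeak against the path in PHONEMIZER_ESPEAK_LIBRARY is also a quick way to test whether a given data path is valid, independent of phonemizer.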

Audio artifacts (clicks, silences) appear at section boundaries. KittenTTS generates audio per section and ffmpeg concatenates the results, so each section break is audible as a brief artifact. Possible mitigations: a slight overlap during concatenation, or a crossfade. Not yet addressed.
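
One way the crossfade mitigation could look is to chain ffmpeg's acrossfade filter across the per-section WAV files. This is an untested sketch: the function name build_crossfade_command and the 50 ms default are assumptions, and we have not verified that it removes the clicks. It only builds the argument list; run it with subprocess.run(cmd, check=True).

```python
def build_crossfade_command(wav_paths, output_path, fade_seconds=0.05):
    """Build an ffmpeg command that joins WAVs with a short crossfade.

    Each pair of neighbours is blended with the acrossfade filter, and the
    result of one blend feeds the next, so N inputs need N-1 filter steps.
    """
    if not wav_paths:
        raise ValueError("need at least one input file")
    cmd = ["ffmpeg", "-y"]
    for path in wav_paths:
        cmd += ["-i", str(path)]
    if len(wav_paths) == 1:
        return cmd + [str(output_path)]  # nothing to crossfade

    steps = []
    previous = "[0:a]"
    for index in range(1, len(wav_paths)):
        label = f"[a{index}]"
        steps.append(f"{previous}[{index}:a]acrossfade=d={fade_seconds}{label}")
        previous = label
    return cmd + ["-filter_complex", ";".join(steps), "-map", previous, str(output_path)]


if __name__ == "__main__":
    print(build_crossfade_command(["s1.wav", "s2.wav", "s3.wav"], "episode.mp3"))
```

Two trade-offs to keep in mind: each crossfade shortens the episode by the fade duration, and every section must be longer than the fade.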

No progress output during generation. KittenTTS is silent during synthesis. A 30-minute episode can take 10+ minutes to generate with no indication of progress. The pipeline logs nothing between "generating audio" and "upload complete".
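
Since synthesis already runs per section, progress can be logged from the calling loop without any support from KittenTTS. A minimal sketch, assuming the pipeline can hand over a list of section texts and a callable that synthesizes one section; synthesize_with_progress and the synthesize parameter are hypothetical names, not existing pipeline code.

```python
import logging
import time

log = logging.getLogger("tts")


def synthesize_with_progress(sections, synthesize, clock=time.monotonic):
    """Run `synthesize` on each section and log timing after every one.

    `synthesize` is any callable taking the section text and returning audio;
    wrap whatever KittenTTS call the pipeline already makes.
    """
    results = []
    total = len(sections)
    started = clock()
    for index, text in enumerate(sections, start=1):
        section_started = clock()
        results.append(synthesize(text))
        now = clock()
        log.info(
            "section %d/%d done in %.1fs (elapsed %.1fs)",
            index, total, now - section_started, now - started,
        )
    return results


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Stand-in synthesizer for demonstration; replace with the real call.
    synthesize_with_progress(["first section", "second section"], str.upper)
```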