KittenTTS converts text to audio through a multi-stage pipeline. Understanding each stage helps you tune performance and choose the right configuration for your use case.
Inference pipeline
Text preprocessing (optional)
When `clean_text=True`, the `TextPreprocessor` normalizes raw text before it enters the model. It expands numbers, currencies, units, abbreviations, and other non-spoken forms into words the phonemizer can handle. See Text preprocessing for the full pipeline.
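As a rough illustration of the kind of expansion the preprocessor performs (a toy sketch only; the real `TextPreprocessor` covers far more cases and uses proper number and currency rules):

```python
# Toy illustration of text normalization. The mappings below are made up for
# demonstration; TextPreprocessor handles numbers, currencies, units, and
# abbreviations generically rather than via a lookup table.
REPLACEMENTS = {
    "$5": "five dollars",
    "10km": "ten kilometers",
    "Dr.": "Doctor",
}

def toy_clean(text: str) -> str:
    for raw, spoken in REPLACEMENTS.items():
        text = text.replace(raw, spoken)
    return text

print(toy_clean("Dr. Smith ran 10km and paid $5."))
# -> "Doctor Smith ran ten kilometers and paid five dollars."
```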
Chunking
Long texts are split into chunks of up to 400 characters using `chunk_text()`. Splits happen at sentence boundaries (`.`, `!`, `?`). If a single sentence exceeds 400 characters, it is further split word by word. This keeps memory use predictable and allows the model to run efficiently without loading an entire document into a single inference pass.
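A minimal sketch of this strategy (not KittenTTS's actual `chunk_text()` implementation) looks roughly like this:

```python
import re

MAX_CHARS = 400  # matches the chunk limit described above

def toy_chunk(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    # Split on sentence-ending punctuation, keeping the punctuation attached.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) > max_chars:
            # Oversized sentence: fall back to packing word by word.
            for word in sentence.split():
                if current and len(current) + len(word) > max_chars:
                    chunks.append(current.strip())
                    current = ""
                current += word + " "
        elif len(current) + len(sentence) > max_chars:
            chunks.append(current.strip())
            current = sentence + " "
        else:
            current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```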
Phonemization
Each chunk is converted to phonemes using eSpeak-ng via the `EspeakBackend` from the phonemizer library. KittenTTS uses English (`en-us`) with stress markers and punctuation preserved.
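A minimal sketch of an equivalent call, assuming eSpeak-ng and the `phonemizer` package are installed (the settings mirror those described above):

```python
from phonemizer.backend import EspeakBackend

# en-us with stress markers and punctuation preserved, as KittenTTS configures it.
backend = EspeakBackend(
    language="en-us",
    preserve_punctuation=True,
    with_stress=True,
)

phonemes = backend.phonemize(["KittenTTS runs entirely on CPU."])
print(phonemes[0])
```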
Tokenization
The `TextCleaner` maps each phoneme character to an integer ID. Start and end tokens (0) are added around the sequence. The resulting `input_ids` tensor is one of the three inputs to the ONNX model.
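A hypothetical illustration of this step (the real symbol table lives inside KittenTTS's `TextCleaner`; `SYMBOLS` below is a made-up stand-in):

```python
import numpy as np

# Made-up symbol inventory for demonstration; ID 0 is reserved for the
# start/end token, so real symbols start at 1.
SYMBOLS = "abcdefghijklmnopqrstuvwxyz ˈˌːɑɛɪʊ.,!?"
SYMBOL_TO_ID = {s: i + 1 for i, s in enumerate(SYMBOLS)}

def toy_tokenize(phonemes: str) -> np.ndarray:
    ids = [SYMBOL_TO_ID[ch] for ch in phonemes if ch in SYMBOL_TO_ID]
    ids = [0] + ids + [0]                      # start and end tokens
    return np.asarray([ids], dtype=np.int64)   # batch dimension for ONNX

input_ids = toy_tokenize("hˈɛlɛʊ wˈɛld")
```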
Style embedding lookup
KittenTTS uses a style embedding to control voice characteristics. The embedding is selected from a preloaded NPZ file (`voices.npz`) based on the length of the input token sequence; longer texts draw from different style vectors than shorter ones. The `voice` parameter (e.g. `"Bella"` or `"expr-voice-2-f"`) determines which embedding slice is used.
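A sketch of what this lookup could look like, assuming `voices.npz` maps a voice name to a 2-D array of style vectors indexed by input length (the exact key names and indexing scheme are KittenTTS internals and may differ):

```python
import numpy as np

voices = np.load("voices.npz")

def toy_style(voice: str, num_tokens: int) -> np.ndarray:
    table = voices[voice]                      # assumed shape: (max_len, style_dim)
    row = min(num_tokens, table.shape[0] - 1)  # pick a style vector by sequence length
    return table[row : row + 1]                # keep a batch dimension

style = toy_style("expr-voice-2-f", 42)
```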
ONNX inference
The ONNX Runtime session runs inference with three inputs:
| Input | Description |
|---|---|
| `input_ids` | Tokenized phoneme sequence |
| `style` | Voice style embedding |
| `speed` | Playback speed multiplier (default `1.0`) |

Inference runs entirely on CPU. No GPU is required.
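A minimal sketch of the inference call. The input names mirror the table above, but the model filename, tensor shapes, and dtypes are placeholders:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("kitten_tts.onnx", providers=["CPUExecutionProvider"])

input_ids = np.zeros((1, 64), dtype=np.int64)   # tokenized phoneme sequence
style = np.zeros((1, 256), dtype=np.float32)    # voice style embedding (dim is a placeholder)
speed = np.array([1.0], dtype=np.float32)       # playback speed multiplier

audio = session.run(None, {"input_ids": input_ids, "style": style, "speed": speed})[0]
```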
Architecture
Model types
KittenTTS supports two ONNX model variants, specified in the model's `config.json`:
| Type | Description |
|---|---|
| ONNX1 | Standard precision model (e.g. `kitten-tts-mini`, `kitten-tts-micro`) |
| ONNX2 | Quantized int8 model for the smallest footprint (e.g. `kitten-tts-nano-int8`) |
The `config.json` for each checkpoint also defines the `speed_priors` and `voice_aliases` for that checkpoint.
Model auto-download
KittenTTS fetches models from the Hugging Face Hub on first use. The download includes:
- `config.json`: the model type, voice aliases, and speed priors
- The ONNX model file
- The `voices.npz` file containing voice style embeddings
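The library handles this automatically. If you want to fetch the files manually (for example, to pre-populate a cache in an offline environment), the `huggingface_hub` client does the same thing; the repo id below is a placeholder for whichever KittenTTS checkpoint you use:

```python
from huggingface_hub import hf_hub_download

repo_id = "KittenML/kitten-tts-nano"  # placeholder checkpoint name

config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
voices_path = hf_hub_download(repo_id=repo_id, filename="voices.npz")
```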
CPU efficiency
The ONNX Runtime is optimized for CPU inference. KittenTTS does not require CUDA, ROCm, or any GPU driver. This makes it well suited for:
- Edge devices and embedded systems
- Cloud functions with CPU-only instances
- Local development without GPU hardware
- Batch processing pipelines on standard servers
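If you want to tune CPU usage explicitly, ONNX Runtime's session options let you pin the CPU execution provider and cap thread counts, which helps on shared or batch servers. A minimal sketch (the model path is a placeholder, and the thread count is an example value):

```python
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 4  # limit parallelism on shared machines

session = ort.InferenceSession(
    "kitten_tts.onnx",
    sess_options=opts,
    providers=["CPUExecutionProvider"],
)
```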