
KittenTTS converts text to audio through a multi-stage pipeline. Understanding each stage helps you tune performance and choose the right configuration for your use case.

Inference pipeline

1. Text preprocessing (optional)

When clean_text=True, the TextPreprocessor normalizes raw text before it enters the model. It expands numbers, currencies, units, abbreviations, and other non-spoken forms into words the phonemizer can handle.
# "$99 off, 1st order" becomes "ninety-nine dollars off, first order"
model.generate("$99 off, 1st order", voice="Bella", clean_text=True)
See Text preprocessing for the full pipeline.
2. Chunking

Long texts are split into chunks of up to 400 characters using chunk_text(). Splits happen at sentence boundaries (., !, ?); if a single sentence exceeds 400 characters, it is split further word by word. This keeps memory use predictable and lets the model synthesize a long document as a series of small inference passes rather than one large one.
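The exact implementation of chunk_text() is internal to KittenTTS, but a minimal sketch of the strategy described above (sentence-boundary splits with a word-by-word fallback, using the 400-character limit stated on this page) could look like this:

import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    # Split at sentence boundaries (., !, ?), keeping each sentence intact.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Fall back to word-by-word pieces for a single over-long sentence.
        pieces = [sentence] if len(sentence) <= max_chars else sentence.split()
        for piece in pieces:
            if current and len(current) + 1 + len(piece) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks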
3. Phonemization

Each chunk is converted to phonemes using eSpeak-ng via the EspeakBackend from the phonemizer library. KittenTTS uses English (en-us) with stress markers and punctuation preserved.
"Hello, world." → "həloʊ, wɜːld."
4. Tokenization

The TextCleaner maps each phoneme character to an integer ID. Start and end tokens (0) are added around the sequence:
[0, id₁, id₂, ..., idₙ, 10, 0]
The resulting input_ids tensor is one of the three inputs to the ONNX model.
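The real TextCleaner ships its own phoneme symbol table; the following is a minimal sketch of the mapping step with a toy vocabulary, showing only the start/end padding:

import numpy as np

# Toy vocabulary for illustration; the real symbol table is part of KittenTTS.
symbol_to_id = {ch: i + 1 for i, ch in enumerate("həloʊwɜːd,. ")}

def tokenize(phonemes: str) -> np.ndarray:
    ids = [symbol_to_id[ch] for ch in phonemes if ch in symbol_to_id]
    # Add start and end tokens (0) around the sequence.
    return np.array([[0, *ids, 0]], dtype=np.int64)

input_ids = tokenize("həloʊ, wɜːld.")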
5. Style embedding lookup

KittenTTS uses a style embedding to control voice characteristics. The embedding is selected from a preloaded NPZ file (voices.npz) based on the length of the input token sequence, so longer texts draw from different style vectors than shorter ones. The voice parameter (e.g. "Bella" or "expr-voice-2-f") determines which embedding slice is used.
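A minimal sketch of such a lookup, assuming the NPZ maps each voice name to an array of style vectors indexed by input length (the exact layout and indexing scheme are internal to KittenTTS):

import numpy as np

token_count = 42  # length of the tokenized phoneme sequence from step 4

voices = np.load("voices.npz")  # downloaded alongside the model
styles = voices["Bella"]        # assumed: one array of style vectors per voice
style = styles[min(token_count, len(styles) - 1)]  # assumed length-based indexing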
6. ONNX inference

The ONNX Runtime session runs inference with three inputs:
Input       Description
input_ids   Tokenized phoneme sequence
style       Voice style embedding
speed       Playback speed multiplier (default 1.0)
Inference runs entirely on CPU. No GPU is required.
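Running the session directly with onnxruntime looks roughly like this; the input names come from the table above, while the model filename, tensor shapes, and dtypes here are placeholder assumptions:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("kitten_tts.onnx", providers=["CPUExecutionProvider"])

input_ids = np.zeros((1, 16), dtype=np.int64)  # placeholder; real ids from step 4
style = np.zeros((1, 256), dtype=np.float32)   # placeholder; real embedding from step 5
speed = np.array([1.0], dtype=np.float32)      # default playback speed

audio = session.run(None, {"input_ids": input_ids, "style": style, "speed": speed})[0]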
7. Audio concatenation

Audio arrays from each chunk are concatenated into a single NumPy array. The output sample rate is 24 kHz. You can pass this array directly to soundfile or any audio processing library:
import soundfile as sf

audio = model.generate("Hello, world.", voice="Bella")
sf.write("output.wav", audio, 24000)

Architecture

Model types

KittenTTS supports two ONNX model variants specified in the model’s config.json:
Type    Description
ONNX1   Standard precision model (e.g. kitten-tts-mini, kitten-tts-micro)
ONNX2   Quantized int8 model for smallest footprint (e.g. kitten-tts-nano-int8)
Both types use the same inference pipeline. The config also specifies speed_priors and voice_aliases for that checkpoint.

Model auto-download

KittenTTS fetches models from the Hugging Face Hub on first use. The download includes:
  • config.json — model type, voice aliases, and speed priors
  • The ONNX model file
  • The voices.npz file containing voice style embeddings
Files are cached locally so subsequent runs do not re-download.
from kittentts import KittenTTS

# Downloads on first run, uses cache afterward
model = KittenTTS("KittenML/kitten-tts-mini-0.8")
Set cache_dir to control where models are stored:
model = KittenTTS("KittenML/kitten-tts-mini-0.8", cache_dir="/data/models")

CPU efficiency

The ONNX Runtime is optimized for CPU inference. KittenTTS does not require CUDA, ROCm, or any GPU driver. This makes it well-suited for:
  • Edge devices and embedded systems
  • Cloud functions with CPU-only instances
  • Local development without GPU hardware
  • Batch processing pipelines on standard servers