
KittenTTS converts text to audio through a multi-stage pipeline. Understanding each stage helps you tune performance and choose the right configuration for your use case.

Inference pipeline

1. Text preprocessing (optional)

When clean_text=True, the TextPreprocessor normalizes raw text before it enters the model. It expands numbers, currencies, units, abbreviations, and other non-spoken forms into words the phonemizer can handle.
# "$99 off, 1st order" becomes "ninety-nine dollars off, first order"
model.generate("$99 off, 1st order", voice="Bella", clean_text=True)
See Text preprocessing for the full pipeline.
2. Chunking

Long texts are split into chunks of up to 400 characters using chunk_text(). Splits happen at sentence boundaries (., !, ?); if a single sentence exceeds 400 characters, it is split further word by word. This keeps memory use predictable and lets the model synthesize a long document as a series of small inference passes rather than one large one.
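The exact implementation of chunk_text() is internal to KittenTTS, but a minimal sketch of the strategy described above (sentence-boundary splits with a word-by-word fallback, using the 400-character limit stated on this page) could look like this:

import re

def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    # Split at sentence boundaries (., !, ?), keeping each sentence intact.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Fall back to word-by-word pieces for a single over-long sentence.
        pieces = [sentence] if len(sentence) <= max_chars else sentence.split()
        for piece in pieces:
            if current and len(current) + 1 + len(piece) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks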
3. Phonemization

Each chunk is converted to phonemes using eSpeak-ng via the EspeakBackend from the phonemizer library. KittenTTS uses English (en-us) with stress markers and punctuation preserved.
"Hello, world." → "həloʊ, wɜːld."
4. Tokenization

The TextCleaner maps each phoneme character to an integer ID. Start and end tokens (0) are added around the sequence:
[0, id₁, id₂, ..., idₙ, 10, 0]
The resulting input_ids tensor is one of the three inputs to the ONNX model.
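The real TextCleaner ships its own phoneme symbol table; the following is a minimal sketch of the mapping step with a toy vocabulary, showing only the start/end padding:

import numpy as np

# Toy vocabulary for illustration; the real symbol table is part of KittenTTS.
symbol_to_id = {ch: i + 1 for i, ch in enumerate("həloʊwɜːd,. ")}

def tokenize(phonemes: str) -> np.ndarray:
    ids = [symbol_to_id[ch] for ch in phonemes if ch in symbol_to_id]
    # Add start and end tokens (0) around the sequence.
    return np.array([[0, *ids, 0]], dtype=np.int64)

input_ids = tokenize("həloʊ, wɜːld.")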
5. Style embedding lookup

KittenTTS uses a style embedding to control voice characteristics. The embedding is selected from a preloaded NPZ file (voices.npz) based on the length of the input token sequence, so longer texts draw from different style vectors than shorter ones. The voice parameter (e.g. "Bella" or "expr-voice-2-f") determines which embedding slice is used.
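A minimal sketch of such a lookup, assuming the NPZ maps each voice name to an array of style vectors indexed by input length (the exact layout and indexing scheme are internal to KittenTTS):

import numpy as np

token_count = 42  # length of the tokenized phoneme sequence from step 4

voices = np.load("voices.npz")  # downloaded alongside the model
styles = voices["Bella"]        # assumed: one array of style vectors per voice
style = styles[min(token_count, len(styles) - 1)]  # assumed length-based indexing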
6. ONNX inference

The ONNX Runtime session runs inference with three inputs:
Input       Description
input_ids   Tokenized phoneme sequence
style       Voice style embedding
speed       Playback speed multiplier (default 1.0)
Inference runs entirely on CPU. No GPU is required.
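Running the session directly with onnxruntime looks roughly like this; the input names come from the table above, while the model filename, tensor shapes, and dtypes here are placeholder assumptions:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("kitten_tts.onnx", providers=["CPUExecutionProvider"])

input_ids = np.zeros((1, 16), dtype=np.int64)  # placeholder; real ids from step 4
style = np.zeros((1, 256), dtype=np.float32)   # placeholder; real embedding from step 5
speed = np.array([1.0], dtype=np.float32)      # default playback speed

audio = session.run(None, {"input_ids": input_ids, "style": style, "speed": speed})[0]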
7. Audio concatenation

Audio arrays from each chunk are concatenated into a single NumPy array. The output sample rate is 24 kHz. You can pass this array directly to soundfile or any audio processing library:
import soundfile as sf

audio = model.generate("Hello, world.", voice="Bella")
sf.write("output.wav", audio, 24000)

Architecture

Model types

KittenTTS supports two ONNX model variants specified in the model’s config.json:
Type    Description
ONNX1   Standard precision model (e.g. kitten-tts-mini, kitten-tts-micro)
ONNX2   Quantized int8 model for smallest footprint (e.g. kitten-tts-nano-int8)
Both types use the same inference pipeline. The config also specifies speed_priors and voice_aliases for that checkpoint.

Model auto-download

KittenTTS fetches models from the Hugging Face Hub on first use. The download includes:
  • config.json — model type, voice aliases, and speed priors
  • The ONNX model file
  • The voices.npz file containing voice style embeddings
Files are cached locally so subsequent runs do not re-download.
from kittentts import KittenTTS

# Downloads on first run, uses cache afterward
model = KittenTTS("KittenML/kitten-tts-mini-0.8")
Set cache_dir to control where models are stored:
model = KittenTTS("KittenML/kitten-tts-mini-0.8", cache_dir="/data/models")

CPU efficiency

The ONNX Runtime is optimized for CPU inference. KittenTTS does not require CUDA, ROCm, or any GPU driver. This makes it well-suited for:
  • Edge devices and embedded systems
  • Cloud functions with CPU-only instances
  • Local development without GPU hardware
  • Batch processing pipelines on standard servers