Holler 0.6B
An open-source text-to-speech model with 6 American English voices and a fast inference engine for Apple Silicon. Fine-tuned from Qwen3-TTS-0.6B and optimized for local AI assistant use cases. ~200ms time-to-first-audio, fully on-device.
This is the bf16 full-precision variant (2.3 GB). For the smaller, faster quantized version, see sentiuminc/holler-0.6b-6bit.
GitHub: github.com/sentiuminc/holler (inference engine, training pipeline, and full documentation).
Voices
| Voice | Gender | Style | Description |
|---|---|---|---|
| Nora | Female | Polished, articulate, warm | Bright, expressive |
| Tessa | Female | Bright, sharp, articulate | Dry humor, quick |
| Kit | Neutral | Steady, analytical, professional | Clear, warm |
| Dakota | Male | Crisp, outdoorsy, steady | Grounded, natural |
| Joe | Male | Bright, punchy, high energy | Casual, enthusiastic |
| Oliver | Male | Deep, confident, clear | Measured, deliberate |
Quick Start
CLI (recommended)
git clone https://github.com/sentiuminc/holler.git && cd holler
./build.sh
./holler --text 'Hello world' --talk
The CLI auto-downloads this model from Hugging Face on first run. Use --voice to pick a voice, --6bit for the faster quantized model, and --session to simulate streaming tokens from an LLM.
./holler --text 'Your reservation is confirmed for Saturday.' --voice nora --talk
./holler --6bit --text 'Hello world' --talk
./holler --session --text 'Sure. Let me check that for you. I think the answer is forty two.'
./holler --benchmark
build.sh requires Xcode 16+ (not just the Command Line Tools) because MLX needs compiled Metal shaders.
Python Server
git clone https://github.com/sentiuminc/holler.git && cd holler
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python3 inference/server.py
# → http://localhost:8100
The server auto-downloads this model on first run. Open http://localhost:8100 in your browser for the test UI, or:
curl "http://localhost:8100/tts?text=Hello+world&voice=kit" -o hello.wav
API:
| Endpoint | Description |
|---|---|
| `POST /speak` | Streaming float32 PCM (24 kHz), chunked transfer encoding |
| `GET /tts?text=...` | Complete WAV file download |
| `GET /health` | Status, model name, available voices |
| `GET /benchmark` | TTFA/RTF benchmark |
| `GET /` | Browser test UI |
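The endpoints above can be driven from plain Python. The sketch below uses only the stdlib; the endpoint paths and the float32 PCM format come from the table, but the form-encoded request body for `POST /speak` is an assumption — check the repo for the actual request schema.

```python
import struct
import urllib.parse
import urllib.request

BASE = "http://localhost:8100"  # default server address

def pcm_f32_to_int16(raw: bytes) -> bytes:
    """Convert little-endian float32 PCM (what /speak streams) to 16-bit PCM."""
    n = len(raw) // 4  # assumes chunks arrive 4-byte aligned
    floats = struct.unpack(f"<{n}f", raw[: n * 4])
    ints = (int(max(-1.0, min(1.0, s)) * 32767) for s in floats)
    return struct.pack(f"<{n}h", *ints)

def fetch_wav(text: str, voice: str = "kit", path: str = "out.wav") -> None:
    """Download a complete WAV via GET /tts."""
    qs = urllib.parse.urlencode({"text": text, "voice": voice})
    with urllib.request.urlopen(f"{BASE}/tts?{qs}") as resp, open(path, "wb") as f:
        f.write(resp.read())

def stream_speak(text: str, voice: str = "kit", chunk_size: int = 4096):
    """Yield 16-bit PCM chunks from POST /speak as they arrive.
    The form-encoded body is an assumption, not a documented schema."""
    body = urllib.parse.urlencode({"text": text, "voice": voice}).encode()
    req = urllib.request.Request(f"{BASE}/speak", data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(chunk_size):
            yield pcm_f32_to_int16(chunk)
```

`pcm_f32_to_int16` is only needed if your audio sink wants 16-bit samples; anything that accepts float32 directly can consume the raw chunks.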
HollerKit (Swift Package)
For integrating directly into macOS apps:
import HollerKit
let model = try await HollerModel.load()
// Simple
let audio = try await model.synthesize("Hello world", voice: "kit")
// Streaming
for try await chunk in model.stream("Hello world", voice: "kit") {
player.scheduleBuffer(chunk.samples)
}
// LLM integration: feed tokens, get audio
let session = model.makeSession(voice: "kit")
Task {
for try await chunk in session.audio {
player.scheduleBuffer(chunk.samples)
}
}
for await token in llmStream {
session.feed(token)
}
await session.finish()
HollerKit handles sentence buffering, KV cache carryover for natural prosody across sentences, silence trimming, retry logic, and streaming AGC for consistent loudness.
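Of the steps HollerKit handles, streaming AGC is the easiest to illustrate: track a running loudness estimate per chunk and nudge the gain toward a target level. This is a generic sketch of the technique, not Holler's implementation; the target RMS, smoothing factor, and gain cap are made-up values.

```python
import math

class StreamingAGC:
    """Chunk-wise automatic gain control with a smoothed RMS estimate."""

    def __init__(self, target_rms: float = 0.1, smoothing: float = 0.9,
                 max_gain: float = 8.0):
        self.target_rms = target_rms
        self.smoothing = smoothing  # exponential averaging of the RMS estimate
        self.max_gain = max_gain    # cap so silence isn't amplified into noise
        self._rms = target_rms      # running estimate, seeded at the target

    def process(self, samples: list[float]) -> list[float]:
        if not samples:
            return samples
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        # Smooth the estimate so the gain changes gradually between chunks.
        self._rms = self.smoothing * self._rms + (1 - self.smoothing) * rms
        gain = min(self.target_rms / max(self._rms, 1e-6), self.max_gain)
        return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

The smoothing is what makes it "streaming": each chunk's gain depends on the history, so loudness converges instead of jumping per chunk.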
Using with stock mlx-audio
The weights are standard Qwen3-TTS checkpoints, so you can load them directly with mlx-audio without the Holler inference engine:
from mlx_audio.tts import load
model = load("sentiuminc/holler-0.6b")
audio = model.generate("Hello world", speaker="kit")
This works, but Holler's own inference pipeline is ~2x faster and handles the model's quirks (codec warmup silence, occasional empty generations, stochastic EOS cutoffs). For production use, we recommend the CLI, Python server, or HollerKit.
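If you do use stock mlx-audio, the "occasional empty generations" quirk can be papered over with a small retry wrapper. Everything below is illustrative glue code, not part of either API; `generate` stands in for any callable wrapping `model.generate`.

```python
def generate_with_retry(generate, text, max_attempts=3, min_samples=240):
    """Re-run generation when it comes back empty or implausibly short.

    `generate` maps text to a sequence of audio samples, e.g.
    lambda t: model.generate(t, speaker="kit"). The 240-sample floor
    (10 ms at 24 kHz) is an arbitrary illustrative threshold.
    """
    for _ in range(max_attempts):
        audio = generate(text)
        if audio is not None and len(audio) >= min_samples:
            return audio
        # Empty or too short: likely a failed generation, try again.
    raise RuntimeError(f"no usable audio after {max_attempts} attempts")
```

Holler's engine bundles this kind of retry logic (plus warmup-silence trimming) for you, which is why the README recommends it for production.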
Model Variants
| | bf16 (this) | 6-bit |
|---|---|---|
| Size | 2.3 GB | 1.7 GB |
| Use case | Best quality | Streaming, real-time |
Performance (M1 Pro, 16GB)
| Metric | bf16 / 16 codebooks | 6-bit / 16 codebooks | 6-bit / 12 codebooks |
|---|---|---|---|
| TTFA (median) | ~200ms | ~170ms | ~147ms |
| Real-time factor | 0.68 | 0.54 | 0.47 |
| Speed | 1.5x real-time | 1.8x real-time | 2.1x real-time |
| Metal RAM | ~2.4 GB | ~1.7 GB | ~1.7 GB |
About TTFA: This is time to first audible speech, not first audio chunk. Qwen3-TTS (like most codec language models) produces 80-800ms of near-silence at the start of each generation while the codec decoder warms up. Holler detects and trims this automatically, so the TTFA here is when you actually hear the voice start speaking.
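The trimming itself is conceptually simple: scan fixed-size frames from the start of the generation and drop everything before the first frame whose energy crosses a threshold. A minimal sketch of the idea; the frame size and threshold here are illustrative, not Holler's actual values.

```python
import math

def trim_leading_silence(samples: list[float], sample_rate: int = 24000,
                         frame_ms: int = 20, threshold: float = 0.01) -> list[float]:
    """Drop near-silent frames from the start of a clip, so 'first audio'
    means first audible speech rather than codec warmup silence."""
    frame = sample_rate * frame_ms // 1000
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= threshold:  # first audible frame: keep from here on
            return samples[start:]
    return []  # the whole clip was silence
```

In a streaming setting the same check runs on chunks as they arrive, and playback simply doesn't start until the first audible frame is seen.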
Codebooks
This model has 16 codebooks. Use all 16 for best quality (the default), or 12 for faster streaming (~18% speed gain with negligible quality loss). Codebooks are configurable at inference time, not baked into the weights.
How This Model Was Made
Supervised fine-tune of Qwen3-TTS-12Hz-0.6B-Base with synthetic voice data:
- Voice design – character descriptions – reference audio via Qwen3-TTS VoiceDesign (1.7B)
- Data generation – 500 training clips per voice via 1.7B voice cloning, locally on Apple Silicon
- Enhancement – DeepFilterNet3 noise removal, LUFS normalization, de-essing
- Curation – manual listening pass, typically 300-400 of 500 clips kept
- Training – SFT on CUDA GPU. lr=5e-7, cosine warmup, 2 epochs, embedding normalization (L2-norm to 10.0)
- Quantization – 6-bit affine g64 for the fast variant
Full training pipeline, scripts, and lessons learned: github.com/sentiuminc/holler
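The embedding-normalization step in the training recipe can be read as clamping each embedding row's L2 norm to 10.0. A pure-Python sketch under that reading; the actual training code (and framework tensors) live in the repo.

```python
import math

def renorm_embeddings(rows: list[list[float]], max_norm: float = 10.0) -> list[list[float]]:
    """Scale any embedding row whose L2 norm exceeds max_norm back down
    to exactly max_norm; rows already within the bound are left alone."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        if norm > max_norm:
            row = [x * (max_norm / norm) for x in row]
        out.append(row)
    return out
```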
Attribution
Fine-tune of Qwen3-TTS by the Qwen team at Alibaba Cloud (Apache 2.0). All credit for the underlying architecture goes to them.