Holler 0.6B

An open-source text-to-speech model with six American English voices and a fast inference engine for Apple Silicon. Fine-tuned from Qwen3-TTS-0.6B and optimized for local AI assistant use cases: ~200 ms time-to-first-audio, fully on-device.

This is the bf16 full-precision variant (2.3 GB). For the smaller, faster quantized version, see sentiuminc/holler-0.6b-6bit.

GitHub: github.com/sentiuminc/holler (inference engine, training pipeline, and full documentation).

Voices

Voice    Gender   Style                              Description
Nora     Female   Polished, articulate, warm         Bright, expressive
Tessa    Female   Bright, sharp, articulate          Dry humor, quick
Kit      Neutral  Steady, analytical, professional   Clear, warm
Dakota   Male     Crisp, outdoorsy, steady           Grounded, natural
Joe      Male     Bright, punchy, high energy        Casual, enthusiastic
Oliver   Male     Deep, confident, clear             Measured, deliberate

Audio samples for each voice are available on the model page.

Quick Start

CLI (recommended)

git clone https://github.com/sentiuminc/holler.git && cd holler
./build.sh
./holler --text 'Hello world' --talk

The CLI auto-downloads this model from HuggingFace on first run. Use --voice to pick a voice, --6bit for the faster quantized model, and --session for LLM streaming simulation.

./holler --text 'Your reservation is confirmed for Saturday.' --voice nora --talk
./holler --6bit --text 'Hello world' --talk
./holler --session --text 'Sure. Let me check that for you. I think the answer is forty two.'
./holler --benchmark

build.sh requires Xcode 16+ (not just Command Line Tools) because MLX needs compiled Metal shaders.

Python Server

git clone https://github.com/sentiuminc/holler.git && cd holler
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

python3 inference/server.py
# serves http://localhost:8100

The server auto-downloads this model on first run. Open http://localhost:8100 in your browser for the test UI, or:

curl "http://localhost:8100/tts?text=Hello+world&voice=kit" -o hello.wav

API:

Endpoint            Description
POST /speak         Streaming float32 PCM (24 kHz), chunked transfer encoding
GET /tts?text=...   Complete WAV file download
GET /health         Status, model name, available voices
GET /benchmark      TTFA/RTF benchmark
GET /               Browser test UI
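Since POST /speak streams raw float32 PCM at 24 kHz, a client has to assemble the chunks and convert them before saving or playback. A minimal sketch of that conversion (a hypothetical helper, not part of Holler; the synthetic tone stands in for streamed chunks):

```python
import io
import math
import struct
import wave

def float32_pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Convert raw little-endian float32 PCM (as streamed by POST /speak)
    into a mono 16-bit WAV file, returned as bytes."""
    samples = struct.unpack(f"<{len(pcm) // 4}f", pcm)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 16-bit output
        w.setframerate(sample_rate)
        w.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))
    return buf.getvalue()

# Stand-in for streamed chunks: 100 ms of a 440 Hz tone at 24 kHz.
tone = struct.pack(
    "<2400f",
    *(0.5 * math.sin(2 * math.pi * 440 * i / 24000) for i in range(2400)),
)
wav_bytes = float32_pcm_to_wav(tone)
```

In a real client you would accumulate the chunked response body into the byte buffer as it arrives, which is what makes the streaming endpoint lower-latency than waiting for the complete WAV from /tts.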

HollerKit (Swift Package)

For integrating directly into macOS apps:

import HollerKit

let model = try await HollerModel.load()

// Simple
let audio = try await model.synthesize("Hello world", voice: "kit")

// Streaming
for try await chunk in model.stream("Hello world", voice: "kit") {
    player.scheduleBuffer(chunk.samples)
}

// LLM integration: feed tokens, get audio
let session = model.makeSession(voice: "kit")
Task {
    for try await chunk in session.audio {
        player.scheduleBuffer(chunk.samples)
    }
}
for await token in llmStream {
    session.feed(token)
}
await session.finish()

HollerKit handles sentence buffering, KV cache carryover for natural prosody across sentences, silence trimming, retry logic, and streaming AGC for consistent loudness.
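The streaming AGC idea can be sketched as follows. This is an illustrative toy in Python, not HollerKit's implementation: it tracks a smoothed RMS loudness estimate across chunks and scales each chunk toward a target level.

```python
import math

class StreamingAGC:
    """Toy streaming automatic gain control: maintain a smoothed RMS
    estimate and scale each chunk toward a target loudness."""

    def __init__(self, target_rms: float = 0.1, smoothing: float = 0.9):
        self.target_rms = target_rms
        self.smoothing = smoothing
        self.rms = target_rms  # running loudness estimate

    def process(self, chunk):
        rms = math.sqrt(sum(s * s for s in chunk) / max(len(chunk), 1))
        # Exponential smoothing keeps the gain from pumping chunk-to-chunk.
        self.rms = self.smoothing * self.rms + (1 - self.smoothing) * max(rms, 1e-6)
        gain = self.target_rms / self.rms
        # Clamp to [-1, 1] so a sudden gain spike cannot clip downstream.
        return [max(-1.0, min(1.0, s * gain)) for s in chunk]
```

Because the estimate adapts gradually, a voice that starts quiet is brought up over a few chunks rather than jumping in level mid-word.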

Using with stock mlx-audio

The weights are standard Qwen3-TTS checkpoints; you can load them directly with mlx-audio without the Holler inference engine:

from mlx_audio.tts import load

model = load("sentiuminc/holler-0.6b")
audio = model.generate("Hello world", speaker="kit")

This works, but Holler's own inference pipeline is ~2x faster and handles the model's quirks (codec warmup silence, occasional empty generations, stochastic EOS cutoffs). For production use, we recommend the CLI, Python server, or HollerKit.

Model Variants

           bf16 (this)    6-bit
Size       2.3 GB         1.7 GB
Use case   Best quality   Streaming, real-time

Performance (M1 Pro, 16GB)

Metric             bf16 / 16 cb     6-bit / 16 cb    6-bit / 12 cb
TTFA (median)      ~200 ms          ~170 ms          ~147 ms
Real-time factor   0.68             0.54             0.47
Speed              1.5x real-time   1.8x real-time   2.1x real-time
Metal RAM          ~2.4 GB          ~1.7 GB          ~1.7 GB

(cb = codec codebooks; Speed is the reciprocal of the real-time factor.)

About TTFA: This is time to first audible speech, not first audio chunk. Qwen3-TTS (like most codec language models) produces 80-800ms of near-silence at the start of each generation while the codec decoder warms up. Holler detects and trims this automatically, so the TTFA here is when you actually hear the voice start speaking.
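Such warmup trimming can be approximated with a windowed RMS scan. A sketch of the general idea (my assumption about the approach, not Holler's actual code): drop everything before the first short window whose energy exceeds a threshold.

```python
import math

def trim_leading_silence(samples, sample_rate=24000, window_ms=20, threshold=0.01):
    """Drop everything before the first window_ms window whose RMS exceeds
    the threshold; returns [] if the whole clip is near-silent."""
    win = max(1, sample_rate * window_ms // 1000)
    for start in range(0, len(samples), win):
        window = samples[start:start + win]
        rms = math.sqrt(sum(s * s for s in window) / len(window))
        if rms > threshold:
            return samples[start:]
    return []
```

In a streaming setting the same scan runs on chunks as they arrive, so playback simply does not start until the first voiced window appears.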

Codebooks

This model has 16 codec codebooks. Use all 16 for best quality (the default), or 12 for faster streaming (~18% speed gain with negligible quality loss). The codebook count is configurable at inference time, not baked into the weights; audio comparisons are available on the model page.

How This Model Was Made

Supervised fine-tune of Qwen3-TTS-12Hz-0.6B-Base with synthetic voice data:

  1. Voice design: character descriptions turned into reference audio via Qwen3-TTS VoiceDesign (1.7B)
  2. Data generation: 500 training clips per voice via 1.7B voice cloning, run locally on Apple Silicon
  3. Enhancement: DeepFilterNet3 noise removal, LUFS normalization, de-essing
  4. Curation: manual listening pass; typically 300-400 of 500 clips kept
  5. Training: SFT on a CUDA GPU; lr=5e-7, cosine warmup, 2 epochs, embedding normalization (L2-norm to 10.0)
  6. Quantization: 6-bit affine quantization, group size 64, for the fast variant
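Step 5's "embedding normalization (L2-norm to 10.0)" can be read as clamping each embedding row's L2 norm to at most 10.0. A dependency-free sketch of that interpretation (not the project's actual training code):

```python
import math

def clamp_embedding_norms(rows, max_norm=10.0):
    """Rescale any embedding row whose L2 norm exceeds max_norm back down
    to max_norm; rows already within the bound are left unchanged."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max_norm / norm if norm > max_norm else 1.0
        out.append([x * scale for x in row])
    return out
```

Bounding embedding norms like this keeps a handful of tokens from dominating the logits after fine-tuning on a small synthetic dataset.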

Full training pipeline, scripts, and lessons learned: github.com/sentiuminc/holler

Attribution

Fine-tune of Qwen3-TTS by the Qwen team at Alibaba Cloud (Apache 2.0). All credit for the underlying architecture goes to them.
