Holler 0.6B
An open-source text-to-speech model with 6 American English voices and a fast inference engine for Apple Silicon. Fine-tuned from Qwen3-TTS-0.6B and optimized for local AI assistant use cases. ~200ms time-to-first-audio, fully on-device.
This is the bf16 full-precision variant (2.3 GB). For the smaller, faster quantized version, see sentiuminc/holler-0.6b-6bit.
GitHub: github.com/sentiuminc/holler (inference engine, training pipeline, and full documentation).
Voices
| Voice | Gender | Style | Description |
|---|---|---|---|
| Nora | Female | Polished, articulate, warm | Bright, expressive |
| Tessa | Female | Bright, sharp, articulate | Dry humor, quick |
| Kit | Neutral | Steady, analytical, professional | Clear, warm |
| Dakota | Male | Crisp, outdoorsy, steady | Grounded, natural |
| Joe | Male | Bright, punchy, high energy | Casual, enthusiastic |
| Oliver | Male | Deep, confident, clear | Measured, deliberate |
Quick Start
CLI (recommended)
git clone https://github.com/sentiuminc/holler.git && cd holler
./build.sh
./holler --text 'Hello world' --talk
The CLI auto-downloads this model from Hugging Face on first run. Use --voice to pick a voice, --6bit for the faster quantized model, and --session to simulate streaming tokens from an LLM.
./holler --text 'Your reservation is confirmed for Saturday.' --voice nora --talk
./holler --6bit --text 'Hello world' --talk
./holler --session --text 'Sure. Let me check that for you. I think the answer is forty two.'
./holler --benchmark
build.sh requires Xcode 16+ (not just the Command Line Tools) because MLX needs compiled Metal shaders.
Python Server
git clone https://github.com/sentiuminc/holler.git && cd holler
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python3 inference/server.py
# → http://localhost:8100
The server auto-downloads this model on first run. Open http://localhost:8100 in your browser for the test UI, or:
curl "http://localhost:8100/tts?text=Hello+world&voice=kit" -o hello.wav
API:
| Endpoint | Description |
|---|---|
| `POST /speak` | Streaming float32 PCM (24 kHz), chunked transfer encoding |
| `GET /tts?text=...` | Complete WAV file download |
| `GET /health` | Status, model name, available voices |
| `GET /benchmark` | TTFA/RTF benchmark |
| `GET /` | Browser test UI |
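The endpoints above can be driven from plain Python. The sketch below uses only the stdlib; the endpoint paths and the float32 PCM format come from the table, but the form-encoded request body for `POST /speak` is an assumption — check the repo for the actual request schema.

```python
import struct
import urllib.parse
import urllib.request

BASE = "http://localhost:8100"  # default server address

def pcm_f32_to_int16(raw: bytes) -> bytes:
    """Convert little-endian float32 PCM (what /speak streams) to 16-bit PCM."""
    n = len(raw) // 4  # assumes chunks arrive 4-byte aligned
    floats = struct.unpack(f"<{n}f", raw[: n * 4])
    ints = (int(max(-1.0, min(1.0, s)) * 32767) for s in floats)
    return struct.pack(f"<{n}h", *ints)

def fetch_wav(text: str, voice: str = "kit", path: str = "out.wav") -> None:
    """Download a complete WAV via GET /tts."""
    qs = urllib.parse.urlencode({"text": text, "voice": voice})
    with urllib.request.urlopen(f"{BASE}/tts?{qs}") as resp, open(path, "wb") as f:
        f.write(resp.read())

def stream_speak(text: str, voice: str = "kit", chunk_size: int = 4096):
    """Yield 16-bit PCM chunks from POST /speak as they arrive.
    The form-encoded body is an assumption, not a documented schema."""
    body = urllib.parse.urlencode({"text": text, "voice": voice}).encode()
    req = urllib.request.Request(f"{BASE}/speak", data=body, method="POST")
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(chunk_size):
            yield pcm_f32_to_int16(chunk)
```

`pcm_f32_to_int16` is only needed if your audio sink wants 16-bit samples; anything that accepts float32 directly can consume the raw chunks.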
HollerKit (Swift Package)
For integrating directly into macOS apps:
import HollerKit
let model = try await HollerModel.load()
// Simple
let audio = try await model.synthesize("Hello world", voice: "kit")
// Streaming
for try await chunk in model.stream("Hello world", voice: "kit") {
player.scheduleBuffer(chunk.samples)
}
// LLM integration: feed tokens, get audio
let session = model.makeSession(voice: "kit")
Task {
for try await chunk in session.audio {
player.scheduleBuffer(chunk.samples)
}
}
for await token in llmStream {
session.feed(token)
}
await session.finish()
HollerKit handles sentence buffering, KV cache carryover for natural prosody across sentences, silence trimming, retry logic, and streaming AGC for consistent loudness.
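Of the steps HollerKit handles, streaming AGC is the easiest to illustrate: track a running loudness estimate per chunk and nudge the gain toward a target level. This is a generic sketch of the technique, not Holler's implementation; the target RMS, smoothing factor, and gain cap are made-up values.

```python
import math

class StreamingAGC:
    """Chunk-wise automatic gain control with a smoothed RMS estimate."""

    def __init__(self, target_rms: float = 0.1, smoothing: float = 0.9,
                 max_gain: float = 8.0):
        self.target_rms = target_rms
        self.smoothing = smoothing  # exponential averaging of the RMS estimate
        self.max_gain = max_gain    # cap so silence isn't amplified into noise
        self._rms = target_rms      # running estimate, seeded at the target

    def process(self, samples: list[float]) -> list[float]:
        if not samples:
            return samples
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        # Smooth the estimate so the gain changes gradually between chunks.
        self._rms = self.smoothing * self._rms + (1 - self.smoothing) * rms
        gain = min(self.target_rms / max(self._rms, 1e-6), self.max_gain)
        return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

The smoothing is what makes it "streaming": each chunk's gain depends on the history, so loudness converges instead of jumping per chunk.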
Using with stock mlx-audio
The weights are standard Qwen3-TTS checkpoints, so you can load them directly with mlx-audio without the Holler inference engine:
from mlx_audio.tts import load
model = load("sentiuminc/holler-0.6b")
audio = model.generate("Hello world", speaker="kit")
This works, but Holler's own inference pipeline is ~2x faster and handles the model's quirks (codec warmup silence, occasional empty generations, stochastic EOS cutoffs). For production use, we recommend the CLI, Python server, or HollerKit.
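If you do use stock mlx-audio, the "occasional empty generations" quirk can be papered over with a small retry wrapper. Everything below is illustrative glue code, not part of either API; `generate` stands in for any callable wrapping `model.generate`.

```python
def generate_with_retry(generate, text, max_attempts=3, min_samples=240):
    """Re-run generation when it comes back empty or implausibly short.

    `generate` maps text to a sequence of audio samples, e.g.
    lambda t: model.generate(t, speaker="kit"). The 240-sample floor
    (10 ms at 24 kHz) is an arbitrary illustrative threshold.
    """
    for _ in range(max_attempts):
        audio = generate(text)
        if audio is not None and len(audio) >= min_samples:
            return audio
        # Empty or too short: likely a failed generation, try again.
    raise RuntimeError(f"no usable audio after {max_attempts} attempts")
```

Holler's engine bundles this kind of retry logic (plus warmup-silence trimming) for you, which is why the README recommends it for production.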
Model Variants
| | bf16 (this) | 6-bit |
|---|---|---|
| Size | 2.3 GB | 1.7 GB |
| Use case | Best quality | Streaming, real-time |
Performance (M1 Pro, 16GB)
| Metric | bf16 / 16 codebooks | 6-bit / 16 codebooks | 6-bit / 12 codebooks |
|---|---|---|---|
| TTFA (median) | ~200ms | ~170ms | ~147ms |
| Real-time factor | 0.68 | 0.54 | 0.47 |
| Speed | 1.5x real-time | 1.8x real-time | 2.1x real-time |
| Metal RAM | ~2.4 GB | ~1.7 GB | ~1.7 GB |
About TTFA: This is time to first audible speech, not first audio chunk. Qwen3-TTS (like most codec language models) produces 80-800ms of near-silence at the start of each generation while the codec decoder warms up. Holler detects and trims this automatically, so the TTFA here is when you actually hear the voice start speaking.
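The trimming itself is conceptually simple: scan fixed-size frames from the start of the generation and drop everything before the first frame whose energy crosses a threshold. A minimal sketch of the idea; the frame size and threshold here are illustrative, not Holler's actual values.

```python
import math

def trim_leading_silence(samples: list[float], sample_rate: int = 24000,
                         frame_ms: int = 20, threshold: float = 0.01) -> list[float]:
    """Drop near-silent frames from the start of a clip, so 'first audio'
    means first audible speech rather than codec warmup silence."""
    frame = sample_rate * frame_ms // 1000
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        if rms >= threshold:  # first audible frame: keep from here on
            return samples[start:]
    return []  # the whole clip was silence
```

In a streaming setting the same check runs on chunks as they arrive, and playback simply doesn't start until the first audible frame is seen.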
Codebooks
This model has 16 codebooks. Use all 16 for best quality (the default), or 12 for faster streaming (~18% speed gain with negligible quality loss). Codebooks are configurable at inference time, not baked into the weights.
How This Model Was Made
Supervised fine-tune of Qwen3-TTS-12Hz-0.6B-Base with synthetic voice data:
- Voice design – character descriptions – reference audio via Qwen3-TTS VoiceDesign (1.7B)
- Data generation – 500 training clips per voice via 1.7B voice cloning, locally on Apple Silicon
- Enhancement – DeepFilterNet3 noise removal, LUFS normalization, de-essing
- Curation – manual listening pass, typically 300-400 of 500 clips kept
- Training – SFT on CUDA GPU. lr=5e-7, cosine warmup, 2 epochs, embedding normalization (L2-norm to 10.0)
- Quantization – 6-bit affine g64 for the fast variant
Full training pipeline, scripts, and lessons learned: github.com/sentiuminc/holler
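The embedding-normalization step in the training recipe can be read as clamping each embedding row's L2 norm to 10.0. A pure-Python sketch under that reading; the actual training code (and framework tensors) live in the repo.

```python
import math

def renorm_embeddings(rows: list[list[float]], max_norm: float = 10.0) -> list[list[float]]:
    """Scale any embedding row whose L2 norm exceeds max_norm back down
    to exactly max_norm; rows already within the bound are left alone."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        if norm > max_norm:
            row = [x * (max_norm / norm) for x in row]
        out.append(row)
    return out
```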
Attribution
Fine-tune of Qwen3-TTS by the Qwen team at Alibaba Cloud (Apache 2.0). All credit for the underlying architecture goes to them.