Holler 0.6B (6-bit)
6-bit quantized version of sentiuminc/holler-0.6b. Smaller download (1.7 GB vs 2.3 GB), lower RAM, faster inference. This is what ivi uses for real-time voice responses.
For full documentation, all voice samples, and the training pipeline, see the bf16 model card or the GitHub repo.
Compared to bf16
| Metric | bf16 | 6-bit (this) |
|---|---|---|
| Download | 2.3 GB | 1.7 GB |
| Metal RAM | ~2.4 GB | ~1.7 GB |
| RTF (16cb) | 0.68 | 0.54 |
| RTF (12cb) | n/a | 0.47 |
| TTFA (16cb) | ~200ms | ~170ms |
| TTFA (12cb) | n/a | ~147ms |
| Quality | Best | Very close |
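To read the RTF rows, here is a minimal sketch that assumes RTF is the usual real-time factor (synthesis time divided by audio duration, so values below 1 are faster than real time). The model card does not define the term, so treat that definition as an assumption. The helper names are hypothetical and the RTF values are copied from the table.

```python
def synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time to synthesize `audio_seconds` of speech at a given RTF.

    Assumes RTF = synthesis time / audio duration (not stated in the card).
    """
    return rtf * audio_seconds


def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after` (0.25 means 25% lower)."""
    return (before - after) / before


# RTF values copied from the comparison table.
RTF = {"bf16, 16cb": 0.68, "6-bit, 16cb": 0.54, "6-bit, 12cb": 0.47}

if __name__ == "__main__":
    for name, rtf in RTF.items():
        seconds = synthesis_seconds(rtf, 10.0)
        print(f"{name}: 10 s of audio takes about {seconds:.1f} s to synthesize")
    drop = relative_reduction(RTF["bf16, 16cb"], RTF["6-bit, 16cb"])
    print(f"6-bit vs bf16 at 16 codebooks: {drop:.0%} lower RTF")
```

Under that definition, every configuration in the table generates audio faster than it plays back, which is the property that matters for streaming voice responses.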
Codebook Comparison
Same text, same 6-bit model: 16 codebooks (best quality) vs 12 codebooks (fastest). Listen and decide.
Audio samples at 16 and 12 codebooks are provided for five voices: Kit, Dakota, Nora, Oliver, and Tessa.
Quick Start
CLI
git clone https://github.com/sentiuminc/holler.git && cd holler
./build.sh
./holler --6bit --text 'Hello world' --talk
Python Server
git clone https://github.com/sentiuminc/holler.git && cd holler
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python3 inference/server.py -c sentiuminc/holler-0.6b-6bit
# → http://localhost:8100
HollerKit (Swift)
import HollerKit
let model = try await HollerModel.load(repo: "sentiuminc/holler-0.6b-6bit")
let audio = try await model.synthesize("Hello world", voice: "kit")
Stock mlx-audio
from mlx_audio.tts import load
model = load("sentiuminc/holler-0.6b-6bit")
audio = model.generate("Hello world", speaker="kit")
Codebooks
Use 16 codebooks for best quality (default), or 12 for maximum speed (~18% faster, negligible quality loss). This is configurable at inference time, not baked into the weights.
Quantization
6-bit affine quantization with group size 64, applied via mlx_audio.convert. The speech_tokenizer/ is shared with the bf16 variant and is not quantized.
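As an illustration of what group-wise affine quantization means, the sketch below quantizes one group of 64 weights to 6-bit codes using a plain min/max scheme and reconstructs them as `code * scale + bias`. It is a pure-Python approximation of the idea, not the kernel that `mlx_audio.convert` or MLX actually runs. The function names are hypothetical, and the 16-bit size assumed for each scale and bias is also an assumption.

```python
import random

BITS = 6
GROUP_SIZE = 64
LEVELS = 2**BITS - 1  # 63, the highest integer code


def quantize_group(weights):
    """Map one group of floats to integer codes in [0, LEVELS] plus (scale, bias)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant group
    codes = [min(LEVELS, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Affine reconstruction: w_hat = code * scale + bias."""
    return [c * scale + bias for c in codes]


def effective_bits_per_weight(bits=BITS, group_size=GROUP_SIZE, param_bits=16):
    """Code bits plus one scale and one bias per group (assumed 16 bits each)."""
    return bits + 2 * param_bits / group_size


if __name__ == "__main__":
    random.seed(0)
    group = [random.gauss(0.0, 0.02) for _ in range(GROUP_SIZE)]
    codes, scale, bias = quantize_group(group)
    restored = dequantize_group(codes, scale, bias)
    worst = max(abs(a - b) for a, b in zip(group, restored))
    print(f"max reconstruction error: {worst:.6f} (half a step is {scale / 2:.6f})")
    print(f"effective bits per weight: {effective_bits_per_weight()}")
```

Because each group stores its own scale and bias, the per-weight cost is slightly above 6 bits. Together with the unquantized `speech_tokenizer/`, that is one reason the download does not shrink in direct proportion to the bit width.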
Attribution
Fine-tune of Qwen3-TTS by the Qwen team at Alibaba Cloud (Apache 2.0). All credit for the underlying architecture goes to them.