dots.tts-soar-mlx
An MLX conversion of rednote-hilab/dots.tts-soar for Apple Silicon.
dots.tts is a ~2B-parameter fully continuous, end-to-end autoregressive text-to-speech system. The LLM backbone is initialised from Qwen2.5-1.5B-Base, consumes BPE text directly (no phonemes), and emits one hidden state per audio step, which an autoregressive flow-matching acoustic head renders through a 48 kHz AudioVAE. There are no discrete codec tokens anywhere in the pipeline. The soar variant adds Self-corrective Alignment (SCA), a reward-free flow-matching post-training stage, and is the upstream-recommended default for zero-shot voice cloning.
Variants
Three self-contained layouts ship in this repo. Point your loader at the repo root or at a subfolder.
| Variant | Path | Footprint trade-off | Size |
|---|---|---|---|
| Default | repo root | Max fidelity: only backbone quantised, full F32 everywhere else | ~4 GB |
| Full int4 | 4bit/ | Smallest: 4-bit compute path + reduced-precision acoustics | ~1.6 GB |
| Full int8 | 8bit/ | Balanced: 8-bit compute path + reduced-precision acoustics | ~2.6 GB |
Each variant carries its own complete set of components (backbone, dit, patch_encoder, vocoder, speaker, audiovae_encoder, heads) plus shared config.json, llm_config.json, latent_stats.* and resampler_48k_16k.safetensors, so any one loads standalone.
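Because each layout is self-contained, it should be possible to fetch just one of them. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `download_filters` and `download_variant` are hypothetical and not part of this repository.

```python
from pathlib import Path

REPO_ID = "smcleod/dots.tts-soar-mlx"
# Variant name -> subfolder in the repo (None means the repo root).
VARIANT_SUBFOLDERS = {"default": None, "4bit": "4bit", "8bit": "8bit"}


def download_filters(variant: str) -> dict:
    """Return snapshot_download filter kwargs that fetch one variant only."""
    subfolder = VARIANT_SUBFOLDERS[variant]
    if subfolder is None:
        # Repo root: skip the other two self-contained layouts.
        return {"ignore_patterns": ["4bit/*", "8bit/*"]}
    return {"allow_patterns": [f"{subfolder}/*"]}


def download_variant(variant: str, local_dir: str = "dots.tts-soar-mlx") -> Path:
    """Download one variant and return the directory to hand to a loader."""
    # Imported lazily so the filter helper works without the package installed.
    from huggingface_hub import snapshot_download

    root = Path(
        snapshot_download(REPO_ID, local_dir=local_dir, **download_filters(variant))
    )
    subfolder = VARIANT_SUBFOLDERS[variant]
    return root / subfolder if subfolder else root
```

Calling `download_variant("8bit")` would return the path of the `8bit/` subfolder, which is what a loader would be pointed at.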
The 4bit/ and 8bit/ subfolders quantise the compute-heavy path (backbone + flow-matching dit + patch_encoder) and store the acoustic path at reduced precision (vocoder fp16, audiovae_encoder fp16). A/B listening across multiple voices found int8 marginally cleaner than int4, and the reduced-precision acoustics indistinguishable from F32. The top-level default keeps the full F32 acoustic path as a max-fidelity reference.
Per-component precision
| Component | Default (root) | 4bit/ | 8bit/ |
|---|---|---|---|
| backbone (Qwen2 LLM) | 4-bit | 4-bit | 8-bit |
| dit (flow-matching) | F32 | 4-bit | 8-bit |
| patch_encoder | F32 | 4-bit | 8-bit |
| vocoder | F32 | fp16 | fp16 |
| audiovae_encoder | F32 | fp16 | fp16 |
| speaker | F32 | F32 | F32 |
| heads | F32 | F32 | F32 |
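The fp16 entries in this table trade exponent range for mantissa bits relative to bf16. The sketch below illustrates that trade-off on synthetic values only; it is not the vocoder bench described in this card, and the sample values are an assumption chosen for illustration. It rounds numbers to each format using the standard library and checks where `exp(x)` stops fitting in fp16.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to IEEE half precision; raises OverflowError when x is too large."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round to bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, then drop the low 16 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mean_abs_error(values, cast) -> float:
    """Average absolute rounding error of `cast` over `values`."""
    return sum(abs(cast(v) - v) for v in values) / len(values)


def fp16_overflows(x: float) -> bool:
    """True when x cannot be represented as a finite fp16 value."""
    try:
        to_fp16(x)
        return False
    except OverflowError:
        return True


# Synthetic values in [-1, 1]; real weights and latents will differ.
samples = [math.sin(0.37 * i) for i in range(1, 201)]
fp16_err = mean_abs_error(samples, to_fp16)
bf16_err = mean_abs_error(samples, to_bf16)
print(f"fp16 mean error {fp16_err:.2e}, bf16 mean error {bf16_err:.2e}")
print("exp(11) overflows fp16:", fp16_overflows(math.exp(11)))
print("exp(12) overflows fp16:", fp16_overflows(math.exp(12)))
```

fp16 keeps 10 mantissa bits against 7 for bf16, so its rounding error on moderate-sized values is smaller, while its largest finite value (65504) is reached by `exp(x)` for x just above 11.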
Both reduced-precision acoustic components use fp16. A vocoder precision bench (F32 vs bf16 vs fp16 on real denormalised speech latents) found fp16 roughly 4x closer to the F32 output than bf16 at identical decode speed and memory. The SnakeBeta exp(alpha) / 1/(exp(beta)+eps) overflow that fp16's narrower exponent range could in theory cause never materialised on real latents.

Reduced precision gives no decode-speed or peak-memory win over F32, because peak memory is activation-dominated; the only benefit is halving the on-disk weights (vocoder 521 -> 261 MB). The repo root therefore keeps the acoustic path in F32 as the max-fidelity reference.

Integer quantisation is group-size-64 affine, recorded in the quantization block of each quantised component's config.json:
```json
{
  "group_size": 64,
  "bits": 4,
  "mode": "affine"
}
```
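To make the `group_size`, `bits` and `mode` fields concrete, here is a minimal pure-Python sketch of affine quantisation for a single group, where each weight is approximated as `scale * q + bias`. It illustrates the general technique only: the min/max choice of scale and bias, the rounding, and the lowest-bits-first packing order are assumptions, and MLX's own kernels may differ in those details.

```python
import math


def quantise_group(weights, bits=4):
    """Affine-quantise one group: w ~= scale * q + bias, q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantise_group(codes, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * q + bias for q in codes]


def pack_u32(codes, bits=4):
    """Pack 32 // bits codes into each unsigned 32-bit word (assumed bit order)."""
    per_word = 32 // bits
    words = []
    for start in range(0, len(codes), per_word):
        word = 0
        for i, q in enumerate(codes[start:start + per_word]):
            word |= q << (bits * i)
        words.append(word)
    return words


# One group of 64 synthetic weights, matching "group_size": 64.
group = [math.sin(i) for i in range(64)]
codes, scale, bias = quantise_group(group, bits=4)
restored = dequantise_group(codes, scale, bias)
max_error = max(abs(a - b) for a, b in zip(group, restored))
packed = pack_u32(codes, bits=4)
print(f"max error {max_error:.4f} (half a step is {scale / 2:.4f})")
print(f"{len(codes)} codes packed into {len(packed)} 32-bit words")
```

With 4-bit codes, eight values share one 32-bit word, so a group of 64 weights occupies eight words plus its scale and bias.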
Weights are packed as U32 with scales and biases in MLX layout. The unquantised backbone is bf16 (~3 GB); 4-bit brings it to ~829 MB, 8-bit to ~1.6 GB.
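The sizes quoted above can be sanity-checked with a little arithmetic. This sketch assumes each group of 64 weights stores one 16-bit scale and one 16-bit bias, and that every backbone weight is quantised; neither detail is stated in this card, so treat the result as an estimate.

```python
def bits_per_weight(bits: int, group_size: int = 64,
                    scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective storage per weight, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def quantised_size_gb(bf16_size_gb: float, bits: int, group_size: int = 64) -> float:
    """Scale a bf16 checkpoint size (16 bits per weight) to the quantised size."""
    return bf16_size_gb * bits_per_weight(bits, group_size) / 16


for bits in (4, 8):
    print(f"{bits}-bit: {bits_per_weight(bits)} bits/weight, "
          f"~{quantised_size_gb(3.0, bits):.2f} GB from a ~3 GB bf16 backbone")
```

Under these assumptions, 4-bit costs 4.5 bits per weight and 8-bit costs 8.5, which lands close to the ~829 MB and ~1.6 GB figures quoted for the backbone.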
Requirements
- Apple Silicon (M-series) running macOS
- An MLX runtime and a dots.tts inference implementation with MLX backbone support
This repository contains weights and configs only. Use it with an MLX dots.tts inference implementation; each component reads its own config.json quantization block, so quantised and F32 components load through the same code path.
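A loader can decide how to treat a component from its config alone. The sketch below assumes each component lives in its own directory containing a `config.json`, and that unquantised components simply omit the `quantization` block; both are assumptions about the layout, and `read_quantization` and `describe` are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_quantization(component_dir) -> Optional[dict]:
    """Return the component's quantization block, or None if it has none."""
    config = json.loads((Path(component_dir) / "config.json").read_text())
    return config.get("quantization")


def describe(component_dir) -> str:
    """Summarise how a component's weights are stored."""
    q = read_quantization(component_dir)
    if q is None:
        return "float weights"
    return f"{q['bits']}-bit {q['mode']}, group size {q['group_size']}"


# Demo on throwaway directories standing in for two components.
with tempfile.TemporaryDirectory() as tmp:
    quantised = Path(tmp) / "backbone"
    plain = Path(tmp) / "speaker"
    quantised.mkdir()
    plain.mkdir()
    (quantised / "config.json").write_text(json.dumps(
        {"quantization": {"group_size": 64, "bits": 4, "mode": "affine"}}))
    (plain / "config.json").write_text(json.dumps({}))
    summary = {d.name: describe(d) for d in (quantised, plain)}

print(summary)
```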
Quantisation notes and caveats
- Weight quantisation trades a small amount of fidelity for lower memory pressure and faster bandwidth-bound autoregressive decode. The backbone is bandwidth-bound, so quantising it is the main speed win; quantising the dit/patch_encoder is mostly a footprint reduction.
- The acoustic output paths (vocoder, audiovae_encoder) are never integer-quantised: they are F32 at the repo root and fp16 in 4bit/ and 8bit/. Raw audio fidelity is therefore governed mainly by those components; quantisation effects show up more in prosody and token-level decisions.
- No calibration dataset was used; this is direct affine weight quantisation (group size 64).
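The bandwidth argument in the notes above can be put in rough numbers. If every decode step has to read all backbone weights once, the step rate is capped at memory bandwidth divided by weight bytes. The sketch below uses the backbone sizes quoted in this card and a purely hypothetical bandwidth figure; it ignores the KV cache, activations, compute time and the other components, so only the ratios are meaningful.

```python
def decode_steps_per_second_ceiling(weight_bytes: float,
                                    bandwidth_bytes_per_s: float) -> float:
    """Upper bound on steps/s when each step streams all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical memory bandwidth, for illustration only (not a measurement).
ASSUMED_BANDWIDTH = 400e9

# Backbone sizes as quoted in this card.
BACKBONE_BYTES = {"bf16": 3.0e9, "8-bit": 1.6e9, "4-bit": 0.829e9}

ceilings = {
    name: decode_steps_per_second_ceiling(size, ASSUMED_BANDWIDTH)
    for name, size in BACKBONE_BYTES.items()
}
speedup_4bit = ceilings["4-bit"] / ceilings["bf16"]

for name, ceiling in ceilings.items():
    print(f"{name}: at most ~{ceiling:.0f} steps/s")
print(f"4-bit vs bf16 ceiling ratio: {speedup_4bit:.2f}x")
```

The ratio between ceilings depends only on the weight sizes, so it holds whatever bandwidth is assumed.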
Licence
Apache 2.0, inherited from the base model. See the base model card for full terms.
Ethical use
This model can produce highly realistic synthetic speech via zero-shot voice cloning. It is intended for research and authorised deployment only. Do not use it for impersonation, fraud, or disinformation, or to clone a voice without the speaker's consent.
Attribution
- Base model: rednote-hilab/dots.tts-soar
- MLX conversion and 4-bit backbone quantisation: smcleod