dots.tts-soar-mlx

An MLX conversion of rednote-hilab/dots.tts-soar for Apple Silicon.

dots.tts is a ~2B-parameter fully continuous, end-to-end autoregressive text-to-speech system. The LLM backbone is initialised from Qwen2.5-1.5B-Base, consumes BPE text directly (no phonemes), and emits one hidden state per audio step, which an autoregressive flow-matching acoustic head renders through a 48 kHz AudioVAE. There are no discrete codec tokens anywhere in the pipeline. The soar variant adds Self-corrective Alignment (SCA), a reward-free flow-matching post-training stage, and is the upstream-recommended default for zero-shot voice cloning.

Variants

Three self-contained layouts ship in this repo. Point your loader at the repo root or at a subfolder.

Variant	Path	Footprint trade-off	Size
Default	repo root	Max fidelity: only `backbone` quantised, full F32 everywhere else	~4 GB
Full int4	`4bit/`	Smallest: 4-bit compute path + reduced-precision acoustics	~1.6 GB
Full int8	`8bit/`	Balanced: 8-bit compute path + reduced-precision acoustics	~2.6 GB

Each variant carries its own complete set of components (backbone, dit, patch_encoder, vocoder, speaker, audiovae_encoder, heads) plus shared config.json, llm_config.json, latent_stats.* and resampler_48k_16k.safetensors, so any one loads standalone.
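As a rough illustration of pointing a loader at one variant, the sketch below resolves the variant directory and checks for the shared files named above. The helper names `variant_root` and `missing_shared_files` are hypothetical and not part of any dots.tts implementation; the sketch assumes the repo has already been downloaded to a local directory.

```python
from pathlib import Path

# Subfolder per variant, from the table above; "" means the repo root.
VARIANT_SUBDIRS = {"default": "", "4bit": "4bit", "8bit": "8bit"}

# Shared files this README says every variant carries.
SHARED_FILES = ("config.json", "llm_config.json", "resampler_48k_16k.safetensors")


def variant_root(repo_dir, variant="default"):
    """Return the directory a loader should be pointed at (hypothetical helper)."""
    if variant not in VARIANT_SUBDIRS:
        raise ValueError(f"unknown variant {variant!r}")
    return Path(repo_dir) / VARIANT_SUBDIRS[variant]


def missing_shared_files(root):
    """List the shared files that are absent from a variant directory."""
    root = Path(root)
    missing = [name for name in SHARED_FILES if not (root / name).is_file()]
    # latent_stats.* has an unspecified extension, so match it by prefix.
    if not any(root.glob("latent_stats.*")):
        missing.append("latent_stats.*")
    return missing
```

An empty list from `missing_shared_files(variant_root(repo_dir, "8bit"))` would suggest the 8-bit layout is complete enough to load on its own.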

The 4bit/ and 8bit/ subfolders quantise the compute-heavy path (backbone + flow-matching dit + patch_encoder) and store the acoustic path at reduced precision (vocoder fp16, audiovae_encoder fp16). A/B listening across multiple voices found int8 marginally cleaner than int4, and the reduced-precision acoustics indistinguishable from F32. The top-level default keeps the full F32 acoustic path as a max-fidelity reference.

Per-component precision

Component	Default (root)	`4bit/`	`8bit/`
`backbone` (Qwen2 LLM)	4-bit	4-bit	8-bit
`dit` (flow-matching)	F32	4-bit	8-bit
`patch_encoder`	F32	4-bit	8-bit
`vocoder`	F32	fp16	fp16
`audiovae_encoder`	F32	fp16	fp16
`speaker`	F32	F32	F32
`heads`	F32	F32	F32
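The fp16 entries in the table trade exponent range for mantissa bits relative to bf16. The sketch below is a generic illustration of that trade-off using only the Python standard library, not a reproduction of any vocoder measurement: it rounds plain floats to each format and compares the rounding error. The function names and sample values are illustrative assumptions.

```python
import struct


def round_to_fp16(x):
    """Round a float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x):
    """Round a float to bfloat16 (the top 16 bits of float32) and back."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits being discarded.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def mean_relative_error(round_fn, values):
    """Average relative rounding error of round_fn over non-zero values."""
    return sum(abs(round_fn(v) - v) / abs(v) for v in values) / len(values)


# Illustrative values in a moderate range, not real speech latents.
samples = [0.1 + 0.037 * i for i in range(100)]
fp16_error = mean_relative_error(round_to_fp16, samples)
bf16_error = mean_relative_error(round_to_bf16, samples)
```

On moderate magnitudes `fp16_error` comes out smaller than `bf16_error` because fp16 stores 10 mantissa bits against bf16's 7. The cost is range: `round_to_fp16(70000.0)` raises `OverflowError`, while bf16 represents that magnitude without trouble.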

Both reduced-precision acoustic components use fp16. A vocoder precision bench (F32 vs bf16 vs fp16 on real denormalised speech latents) found fp16 roughly 4x closer to the F32 output than bf16 at identical decode speed and memory. The SnakeBeta exp(alpha) and 1/(exp(beta)+eps) terms could in theory overflow fp16's narrower exponent range, but this never materialises on real latents. Reduced precision gives no decode-speed or peak-memory win over F32, because peak memory is activation-dominated; its only benefit is halving the on-disk weights (vocoder 521 -> 261 MB). The root variant therefore keeps these components at F32 as the max-fidelity reference.

Integer quantisation is group-size-64 affine, recorded in each quantised component's config.json quantization block:

{
  "group_size": 64,
  "bits": 4,
  "mode": "affine"
}

Weights are packed as U32 with scales and biases in MLX layout. The unquantised backbone is bf16 (~3 GB); 4-bit brings it to ~829 MB, 8-bit to ~1.6 GB.
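A back-of-envelope check on those sizes: with group-wise affine quantisation each group stores one scale and one bias alongside its packed weights. The sketch below assumes both are 16-bit values, which this README does not state, and ignores any tensors left unquantised.

```python
def effective_bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Packed bits per weight plus the per-group scale and bias overhead.

    scale_bits and bias_bits are assumptions, not values read from the repo.
    """
    return bits + (scale_bits + bias_bits) / group_size


def size_ratio_vs_bf16(bits, group_size=64):
    """Approximate on-disk size relative to the same weights stored as bf16."""
    return effective_bits_per_weight(bits, group_size) / 16


ratio_4bit = size_ratio_vs_bf16(4)  # 4.5 / 16
ratio_8bit = size_ratio_vs_bf16(8)  # 8.5 / 16
```

Applied to a ~3 GB bf16 backbone, these ratios give roughly 0.84 GB and 1.6 GB, in the same ballpark as the figures above. An exact match is not expected, since the estimate skips unquantised tensors and file overhead.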

Requirements

Apple Silicon (M-series) running macOS
An MLX runtime and a dots.tts inference implementation with MLX backbone support

This repository contains weights and configs only. Use it with an MLX dots.tts inference implementation; each component reads its own config.json quantization block, so quantised and F32 components load through the same code path.
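A minimal sketch of how a loader might branch on that block follows. It assumes the block sits under a top-level "quantization" key in each component's config.json; the function names are hypothetical and do not come from any particular dots.tts implementation.

```python
import json


def quantisation_params(config_text):
    """Return (bits, group_size, mode) for a quantised component, else None.

    Assumes the block lives under a top-level "quantization" key.
    """
    config = json.loads(config_text)
    block = config.get("quantization")
    if block is None:
        return None  # component stored at F32 or fp16, nothing to unpack
    return block["bits"], block["group_size"], block["mode"]


def describe(component, config_text):
    """Summarise how one component would be loaded."""
    params = quantisation_params(config_text)
    if params is None:
        return f"{component}: unquantised"
    bits, group_size, mode = params
    return f"{component}: {bits}-bit {mode}, group size {group_size}"
```

In an MLX implementation the returned values would plausibly be handed to a call such as `mlx.nn.quantize` before the weights are loaded; that wiring depends on the implementation and is left out here.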

Quantisation notes and caveats

Weight quantisation trades a small amount of fidelity for lower memory pressure and faster bandwidth-bound autoregressive decode. The backbone is bandwidth-bound, so quantising it is the main speed win; quantising the dit/patch_encoder is mostly a footprint reduction.
The acoustic output paths (vocoder, audiovae_encoder) are never integer-quantised: they stay F32 in the default variant and fp16 in 4bit/ and 8bit/. Raw audio fidelity is therefore governed mainly by those components; quantisation effects show up more in prosody and token-level decisions.
No calibration dataset was used; this is direct affine weight quantisation (group size 64).
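To make "direct affine weight quantisation" concrete, here is a pure-Python sketch of the general technique: each group of weights is mapped to integer codes using only that group's own minimum and maximum, so no calibration data is involved. It illustrates the idea only and is not MLX's kernel, which also packs the codes into U32 words and may differ in rounding details.

```python
def quantise_group(weights, bits=4):
    """Affine-quantise one group to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    if scale == 0.0:
        scale = 1.0  # constant group: every code is 0 and the bias carries the value
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo  # lo plays the role of the per-group bias


def dequantise_group(codes, scale, bias):
    """Reconstruct approximate weights as code * scale + bias."""
    return [c * scale + bias for c in codes]


def quantise(weights, bits=4, group_size=64):
    """Quantise a flat list of weights in consecutive groups."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantise_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]
```

Each reconstructed weight lands within half a quantisation step (`scale / 2`) of the original, which is the fidelity cost that the notes above weigh against footprint.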

Licence

Apache 2.0, inherited from the base model. See the base model card for full terms.

Ethical use

This model can produce highly realistic synthetic speech via zero-shot voice cloning. It is intended for research and authorised deployment only. Do not use it for impersonation, fraud, or disinformation, or to clone a voice without the speaker's consent.

Attribution

Base model: rednote-hilab/dots.tts-soar
MLX conversion and 4-bit backbone quantisation: smcleod
