dots-tts-mlx — quantized MLX weights (Apple Silicon)

Ready-to-run MLX weights for rednote-hilab/dots.tts-soar, a 2B continuous-autoregressive, flow-matching TTS model with multilingual (24 languages, same as upstream) zero-shot voice cloning, quantized for Apple Silicon. Download and run with the dots-tts-mlx runtime: no PyTorch and no conversion step.

Languages: same as upstream dots.tts — all 24 (Arabic, Cantonese, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Spanish, Thai, Turkish, Ukrainian, Vietnamese). The 5-language check in Quality is only the quantization spot-check, not the supported set.

These are converted + LLM-quantized MLX safetensors, not PyTorch. They load only with the dots-tts-mlx runtime on Apple Silicon (Metal). For the original PyTorch model, see rednote-hilab/dots.tts-soar.

⚡ Two decoders, one voice. int4/int8 are the standard soar build (reference quality). mf-int4/mf-int8 are MeanFlow, a distilled few-step decoder (rednote-hilab/dots.tts-mf) that runs the acoustic DiT at NFE=4 with no classifier-free guidance: ~2× faster than the 10-step path with no measurable quality loss. Both builds offer the same 24-language cloning, so pick the folder that fits your speed budget. For cloning, reference mode (audio + transcript) is the recommended quality path.

Variants

Subfolder	Decoder	Download	Speed	Use
`int4/` ⭐	soar — 10-step + CFG	~2.4 GB	baseline	default, best quality
`int8/`	soar — 10-step + CFG	~3.1 GB	baseline	conservative fallback
`mf-int4/` ⚡	MeanFlow — NFE-4, no CFG	~2.4 GB	~2× faster	latency-sensitive
`mf-int8/`	MeanFlow — NFE-4, no CFG	~3.1 GB	~2× faster	meanflow + more LLM precision

Only the Qwen2.5-1.5B LLM trunk (≈70% of the weights) is quantized (group-wise affine, group size 64); the precision-sensitive flow-matching DiT, the BigVGAN vocoder, and the CAM++ speaker encoder stay bf16.
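To make "group-wise affine, group size 64" concrete, here is a minimal pure-Python sketch of the idea: each group of weights gets its own scale and bias, and each weight is stored as a small integer code. It is a simplified illustration, not the runtime's kernel. The function names are hypothetical, and the 16-bit scale and bias assumed in `bits_per_weight` are an assumption about storage rather than something stated above.

```python
# Simplified group-wise affine quantization (illustrative only).
# All function names here are hypothetical, not part of dots-tts-mlx.

def quantize_group(weights, bits=4):
    """Map one group of floats to integer codes plus a (scale, bias) pair."""
    levels = (1 << bits) - 1                # 15 for int4, 255 for int8
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0       # fall back to 1.0 for constant groups
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                 # bias is the group minimum


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w_hat = scale * code + bias."""
    return [scale * q + bias for q in codes]


def quantize(weights, bits=4, group_size=64):
    """Split a flat weight list into groups and quantize each independently."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_group(g, bits) for g in groups]


def bits_per_weight(bits, group_size=64, meta_bits=16):
    """Effective storage cost, assuming one scale and one bias of meta_bits each per group."""
    return bits + 2 * meta_bits / group_size
```

Under these assumptions the quantized trunk costs 4.5 bits per weight at int4 and 8.5 at int8, while the bf16 components are unaffected. For a trunk of roughly 1.5B parameters, that 4-bit difference works out to about 0.75 GB, in line with the ~0.7 GB gap between the int4 and int8 downloads.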

MeanFlow (mf-*) is the distilled rednote-hilab/dots.tts-mf checkpoint: the same architecture plus a small duration embedder. The runtime detects it from config.json, so no flag is needed; point it at an mf-* folder and it uses the few-step solver. On reference cloning (EN/HI/ZH) it measured 1.9–2.2× faster than the 10-step path. Because it drops classifier-free guidance, --guidance-scale is ignored. Use reference cloning (audio + transcript) for best quality.
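The structural difference between the two decoders can be sketched on a toy one-dimensional problem. This is not the runtime's solver: the function names, the time convention (t runs from 0 to 1), the default guidance scale, and the toy velocity field are all assumptions chosen so the exact answer is known. In the real model the average velocity is learned by distillation; here it is computed analytically.

```python
import math

def euler_cfg_sample(v_cond, v_uncond, x, steps=10, guidance_scale=1.5):
    """Fixed-step Euler solve with classifier-free guidance.

    Each step evaluates the field twice (conditional and unconditional).
    """
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        vc, vu = v_cond(x, t), v_uncond(x, t)
        x = x + dt * (vu + guidance_scale * (vc - vu))
    return x


def mean_flow_sample(u_mean, x, steps=4):
    """Few-step solve: each step jumps by the average velocity over [t, r].

    One field evaluation per step and no guidance term.
    """
    dt = 1.0 / steps
    for i in range(steps):
        t, r = i * dt, (i + 1) * dt
        x = x + (r - t) * u_mean(x, t, r)
    return x


# Toy field dx/dt = TARGET - x, whose exact solution is known in closed form.
TARGET = 2.0

def velocity(x, t):
    return TARGET - x

def mean_velocity(x, t, r):
    # Exact average of the velocity along the true trajectory from t to r.
    return (TARGET - x) * (1.0 - math.exp(-(r - t))) / (r - t)


exact = TARGET + (0.0 - TARGET) * math.exp(-1.0)
euler = euler_cfg_sample(velocity, velocity, 0.0, steps=10)
few_step = mean_flow_sample(mean_velocity, 0.0, steps=4)
print(f"exact={exact:.4f} euler10={euler:.4f} meanflow4={few_step:.4f}")
```

With an exact average velocity, four jumps land on the true solution while ten Euler steps still carry discretization error. How closely a distilled network approaches that ideal, and how much wall-clock time it saves, depends on the model and the rest of the pipeline.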

Quality

In a small multilingual acceptance check (EN/DE/ES/FR + Hindi), int8 and int4 showed no transcription-accuracy or voice-similarity regression vs the bf16 MLX build. Quantization is lossy by construction, so read this as a sanity check rather than a dataset-scale benchmark, and evaluate on your own content.

Correctness of the port itself is gated stage by stage against the original PyTorch model (AudioVAE PSNR ≈ 56 dB; cosine similarity ≥ 0.9999 for attention, DiT, LLM, and semantic-encoder outputs); see the runtime repo.
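For readers who want to run a similar comparison on their own outputs, the two metrics named above can be computed as follows. This is a sketch using the standard definitions, not the runtime's test harness. The function names are hypothetical, and the `peak=1.0` default assumes audio samples normalized to [-1, 1].

```python
import math

def psnr_db(reference, candidate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length sequences."""
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0.0:
        return math.inf                     # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def cosine_similarity(a, b):
    """Cosine of the angle between two flattened activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def passes_gate(reference, candidate, min_cosine=0.9999):
    """True when the candidate stage output matches the reference closely enough."""
    return cosine_similarity(reference, candidate) >= min_cosine
```

As a reading aid, a uniform error of 0.001 on a signal with peak 1.0 corresponds to 60 dB under this definition.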

Usage

# 1. install the quant-aware runtime (>= v0.2.0)
pip install "git+https://github.com/sb1992/dots-tts-mlx.git@v0.2.0"

# 2. download the variant you want  (use "mf-int4/*" for the faster MeanFlow decoder)
hf download shraey/dots-tts-mlx --include "int4/*" --local-dir ./dots-tts-mlx-weights

# 3. run (files land in ./dots-tts-mlx-weights/int4/)
dots-tts --model ./dots-tts-mlx-weights/int4 \
    --text "Hello from MLX." --ref-audio reference.wav --language EN \
    --out-path out --out-prefix clone

The runtime auto-detects the quantization block in config.json, so nothing changes at the CLI/API level vs an unquantized directory. Python API and the full flag set: see the runtime repo.
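To check what a downloaded folder contains before running it, you can read config.json directly. The sketch below is an assumption-laden helper, not part of the runtime: the key names `quantization`, `bits`, and `group_size` follow a common MLX convention but are not confirmed for this repo, and `read_quantization` is a hypothetical name.

```python
import json
from pathlib import Path

def read_quantization(model_dir):
    """Return the quantization settings found in config.json, or None if absent.

    The key names below are assumed, not taken from the dots-tts-mlx source.
    """
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    quant = config.get("quantization")      # assumed key name
    if not isinstance(quant, dict):
        return None
    return {"bits": quant.get("bits"), "group_size": quant.get("group_size")}
```

For example, `read_quantization("./dots-tts-mlx-weights/int4")` would return a small dict if the block is present under those names, and `None` otherwise.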

Memory: peak RAM scales with generation + reference length, from roughly 6 GB for a short clip up to ~13 GB for a ~30 s clip (int4); resident weights are ~2.4 GB. The render peak is activation-bound (the bf16 DiT + vocoder working set), so it is the same for soar and MeanFlow and is not reduced by quantization. MLX's allocator may cache up to its memory limit, but that cache is releasable.

Requires: Apple Silicon (MLX is Metal-only), Python ≥ 3.10.

Attribution & licenses

Derivative quantized weights of rednote-hilab/dots.tts-soar (Apache-2.0) and, for the mf-* folders, rednote-hilab/dots.tts-mf; you must comply with the upstream license. Components:

dots.tts — model · code — Apache-2.0, © the dots.tts team at rednote-hilab.
Qwen2.5-1.5B-Base (LLM backbone) — Apache-2.0.
CAM++ / 3D-Speaker (speaker x-vector encoder) — Apache-2.0.
BigVGAN (vocoder/decoder architecture style) — MIT, © NVIDIA.

MLX port + quantization code: github.com/sb1992/dots-tts-mlx (Apache-2.0).

Responsible use

This performs zero-shot voice cloning — it can reproduce a person's voice from a few seconds of audio. Only clone voices you own or for which you have explicit, informed consent; do not use it for impersonation, fraud, or deception; and disclose AI-generated audio wherever it's shared. See the upstream risks guidance.
