dots-tts-mlx: quantized MLX weights (Apple Silicon)
Ready-to-run MLX weights for rednote-hilab/dots.tts-soar, a 2B continuous-AR flow-matching, multilingual (24 languages, same as upstream), zero-shot voice-clone TTS model, quantized for Apple Silicon. Download and run with the dots-tts-mlx runtime: no PyTorch and no conversion step.
Languages: same as upstream dots.tts, all 24 (Arabic, Cantonese, Chinese, Czech, Dutch, English, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Spanish, Thai, Turkish, Ukrainian, Vietnamese). The 5-language check in Quality is only the quantization spot-check, not the supported set.
These are converted + LLM-quantized MLX safetensors, not PyTorch. They load only with the dots-tts-mlx runtime on Apple Silicon (Metal). For the original PyTorch model, see rednote-hilab/dots.tts-soar.
Variants
| Subfolder | Download | vs original (~9 GB) | Use |
|---|---|---|---|
| int4/ | ~2.4 GB | −73% | recommended |
| int8/ | ~3.1 GB | −65% | conservative fallback |
Only the Qwen2.5-1.5B LLM trunk (≈70% of the weights) is quantized (group-wise affine, group size 64); the precision-sensitive flow-matching DiT, the BigVGAN vocoder, and the CAM++ speaker encoder stay bf16.
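To make "group-wise affine, group size 64" concrete, here is a minimal pure-Python sketch of the general scheme: each group of 64 weights gets its own scale and bias, and every weight is stored as a small integer code. This is an illustration only, not the runtime's or MLX's actual implementation; the function names are hypothetical, and real kernels also pack the codes into machine words.

```python
import random
from typing import List, Tuple

Group = Tuple[List[int], float, float]  # (integer codes, scale, bias)


def quantize_group(values: List[float], bits: int) -> Group:
    """Affine-quantize one group so that value ~= scale * code + bias."""
    levels = (1 << bits) - 1  # highest integer code, e.g. 15 for int4
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0  # constant group: any scale works
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(group: Group) -> List[float]:
    """Reconstruct approximate weights from codes, scale, and bias."""
    codes, scale, bias = group
    return [scale * c + bias for c in codes]


def quantize_row(row: List[float], bits: int = 4, group_size: int = 64) -> List[Group]:
    """Split a weight row into fixed-size groups and quantize each independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def max_abs_error(row: List[float], groups: List[Group]) -> float:
    """Largest absolute difference between the original and restored weights."""
    restored = [v for g in groups for v in dequantize_group(g)]
    return max(abs(a - b) for a, b in zip(row, restored))


if __name__ == "__main__":
    rng = random.Random(0)
    row = [rng.gauss(0.0, 0.02) for _ in range(128)]  # toy weights, not model data
    for bits in (4, 8):
        err = max_abs_error(row, quantize_row(row, bits=bits))
        print(f"int{bits}: max abs reconstruction error = {err:.6f}")
```

Because each group carries its own scale, the worst-case rounding error is half a quantization step of that group, which is why smaller groups and more bits both reduce error at the cost of size.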
Quality
Quantization is validated as effectively lossless relative to the full-precision MLX build on a small multilingual acceptance check (EN/DE/ES/FR + Hindi): int8 and int4 showed no transcription-accuracy or voice-similarity regression vs bf16. This is a sanity check, not a dataset-scale benchmark, so evaluate on your own content.
Correctness of the port itself is gated per-stage against the original PyTorch model (AudioVAE PSNR ≈ 56 dB; attention / DiT / LLM / semantic-encoder cosine ≥ 0.9999); see the runtime repo.
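For readers unfamiliar with the two metrics named above, the following sketch shows how cosine similarity and PSNR are commonly computed between a reference output and a ported output. The helper names, the `peak=1.0` convention, and the use of 0.9999 as a pass threshold are assumptions for illustration; the actual gating code lives in the runtime repo.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened tensors of equal length."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def psnr_db(reference: Sequence[float], candidate: Sequence[float],
            peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; `peak` is the assumed full-scale amplitude."""
    if len(reference) != len(candidate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def passes_cosine_gate(reference: Sequence[float], candidate: Sequence[float],
                       min_cosine: float = 0.9999) -> bool:
    """Hypothetical per-stage gate: pass when the outputs are nearly parallel."""
    return cosine_similarity(reference, candidate) >= min_cosine
```

Cosine similarity ignores overall magnitude, so it suits intermediate activations, while PSNR measures absolute error against a peak level and is the more natural fit for reconstructed audio.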
Usage
# 1. install the quant-aware runtime (>= v0.2.0)
pip install "git+https://github.com/sb1992/dots-tts-mlx.git@v0.2.0"
# 2. download the variant you want
hf download shraey/dots-tts-mlx --include "int4/*" --local-dir ./dots-tts-mlx-weights
# 3. run (files land in ./dots-tts-mlx-weights/int4/)
dots-tts --model ./dots-tts-mlx-weights/int4 \
--text "Hello from MLX." --ref-audio reference.wav --language EN \
--out-path out --out-prefix clone
The runtime auto-detects the quantization block in config.json, so nothing changes at the CLI/API level vs an unquantized directory. Python API and the full flag set: see the runtime repo.
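As a rough picture of what auto-detection from config.json can look like, the sketch below reads a model directory's config and returns its quantization settings if present. The key names (`quantization`, `bits`, `group_size`) are assumptions borrowed from the convention used by other MLX model repos, and the helper names are hypothetical; check the runtime repo for the schema it actually reads.

```python
import json
from pathlib import Path
from typing import Optional


def quantization_settings(config: dict) -> Optional[dict]:
    """Return the quantization block of a parsed config, or None if absent.

    Assumes the block is stored under a top-level "quantization" key.
    """
    quant = config.get("quantization")
    return quant if isinstance(quant, dict) else None


def load_quantization(model_dir: str) -> Optional[dict]:
    """Read <model_dir>/config.json and extract its quantization block, if any."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return quantization_settings(config)


def describe(model_dir: str) -> str:
    """Human-readable summary of how a weights directory would be treated."""
    quant = load_quantization(model_dir)
    if quant is None:
        return "unquantized weights"
    # "bits" and "group_size" are assumed field names, hence the .get() calls.
    return f"quantized: bits={quant.get('bits')}, group_size={quant.get('group_size')}"
```

For example, `describe("./dots-tts-mlx-weights/int4")` would report the settings stored with the int4 variant, provided the config follows the assumed layout.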
- Memory: runs in ~6 GB with a short (2–3 s) reference; the in-context prompt-prefill scales with reference length, so a longer reference raises the peak.
- Requires: Apple Silicon (MLX is Metal-only), Python ≥ 3.10.
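If you want to script the CLI call from the Usage steps, for example to synthesize several sentences with one reference clip, a thin subprocess wrapper is enough. This sketch uses only the flags shown in the command above; the function names and defaults are hypothetical, and it is not the runtime's Python API (see the runtime repo for that).

```python
import subprocess
from typing import List


def build_command(model_dir: str, text: str, ref_audio: str, language: str,
                  out_path: str = "out", out_prefix: str = "clone") -> List[str]:
    """Assemble the dots-tts argument list using the flags from the Usage section."""
    return [
        "dots-tts",
        "--model", model_dir,
        "--text", text,
        "--ref-audio", ref_audio,
        "--language", language,
        "--out-path", out_path,
        "--out-prefix", out_prefix,
    ]


def synthesize(model_dir: str, text: str, ref_audio: str, language: str,
               out_path: str = "out", out_prefix: str = "clone") -> None:
    """Run the CLI once; raises CalledProcessError if the command fails."""
    command = build_command(model_dir, text, ref_audio, language, out_path, out_prefix)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    sentences = ["Hello from MLX.", "This is a second sentence."]
    for index, sentence in enumerate(sentences):
        synthesize(
            "./dots-tts-mlx-weights/int4",
            sentence,
            "reference.wav",
            "EN",
            out_prefix=f"clone_{index}",  # distinct prefix per sentence
        )
```

Passing the arguments as a list avoids shell quoting problems with punctuation in the text. Each call reloads the model, so for large batches the runtime's own Python API is likely the better route.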
Attribution & licenses
Derivative quantized weights of rednote-hilab/dots.tts-soar (Apache-2.0); you must comply with the upstream license. Components:
- dots.tts (model · code): Apache-2.0, © the dots.tts team at rednote-hilab.
- Qwen2.5-1.5B-Base (LLM backbone): Apache-2.0.
- CAM++ / 3D-Speaker (speaker x-vector encoder): Apache-2.0.
- BigVGAN (vocoder/decoder architecture style): MIT, © NVIDIA.
MLX port + quantization code: github.com/sb1992/dots-tts-mlx (Apache-2.0).
Responsible use
This performs zero-shot voice cloning: it can reproduce a person's voice from a few seconds of audio. Only clone voices you own or for which you have explicit, informed consent; do not use it for impersonation, fraud, or deception; and disclose AI-generated audio wherever it's shared. See the upstream risks guidance.