Tags: Text-to-Speech, MLX, Safetensors, English, dots_tts, tts, quantized, 4-bit precision, 8-bit precision, apple-silicon, dots.tts
How to use smcleod/dots.tts-soar-mlx with MLX:

    # Download the model from the Hub
    pip install "huggingface_hub[hf_xet]"
    huggingface-cli download --local-dir dots.tts-soar-mlx smcleod/dots.tts-soar-mlx
# Conversion scripts

These scripts convert the upstream [rednote-hilab/dots.tts-soar](https://huggingface.co/rednote-hilab/dots.tts-soar) PyTorch weights into the MLX layout used in this repo, and quantise individual components.

They are research scripts: the source snapshot path and output paths are hardcoded near the top of each file (look for `/Users/.../models--rednote-hilab--dots.tts-soar/...`). Edit those to point at your local upstream snapshot and a destination directory before running. All of them require `mlx` and run on Apple Silicon (Metal); `convert_backbone_dit.py` also uses `mlx_lm`.

## Pipeline

| Script | Produces | Notes |
|---|---|---|
| `extract_backbone.py` | Qwen2 backbone in HF layout | Strips the `llm.` prefix; no `lm_head` (tied embeddings) |
| `convert_backbone_dit.py` | `backbone/` (MLX, 4-bit g64) and `dit/` (F32) | Backbone via `mlx_lm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| `convert_vocoder.py` | `vocoder/` | BigVGAN/AudioVAE decoder; Conv weights transposed to MLX OKI layout |
| `convert_speaker.py` | `speaker/` | CAM++ x-vector encoder |
| `convert_refpath.py` | `patch_encoder/` and `audiovae_encoder/` | Reference-audio conditioning path |
| `convert_heads.py` | `heads/` | Coordinate / hidden / latent / xvec / EOS projection heads |
| `quantize_component.py` | quantised component dir | Generic per-component quantiser (below) |
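
Two of the notes in the table are easier to see on toy data. The following is a minimal sketch, not the code from `extract_backbone.py` or `convert_vocoder.py`: the function names, the toy key names, and the choice to drop an `lm_head.*` entry when one is present are assumptions made for illustration. Strings and nested lists stand in for real tensors.

```python
def extract_backbone_keys(state_dict, prefix="llm."):
    """Keep only backbone tensors and strip the wrapper prefix from their names."""
    out = {}
    for name, tensor in state_dict.items():
        if not name.startswith(prefix):
            continue  # not part of the backbone (DiT, heads, vocoder, ...)
        stripped = name[len(prefix):]
        if stripped.startswith("lm_head."):
            continue  # assumption: embeddings are tied, so no separate lm_head is kept
        out[stripped] = tensor
    return out


def conv1d_oik_to_oki(weight):
    """Reorder a nested-list Conv1d weight from (out, in, kernel) to (out, kernel, in)."""
    # zip(*rows) transposes each per-output-channel (in, kernel) matrix.
    return [[list(col) for col in zip(*out_channel)] for out_channel in weight]


# Toy checkpoint with hypothetical key names; strings stand in for tensors.
toy = {
    "llm.model.embed_tokens.weight": "E",
    "llm.model.layers.0.self_attn.q_proj.weight": "Q",
    "llm.lm_head.weight": "H",
    "dit.blocks.0.attn.qkv.weight": "D",
}
backbone = extract_backbone_keys(toy)

# Toy conv weight with out=1, in=2, kernel=3 (PyTorch Conv1d order).
w_oik = [[[1, 2, 3], [4, 5, 6]]]
w_oki = conv1d_oik_to_oki(w_oik)
```

On real arrays the same reordering is an axis permutation such as `w.transpose(0, 2, 1)`; the nested-list version above only makes the index movement visible.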

## Per-component quantisation

`quantize_component.py <src_dir> <dst_dir> <bits> <group_size>` quantises every 2D `.weight` whose in-features are divisible by the group size (matching MLX's `Linear` eligibility), writing `.weight` (packed), `.scales`, `.biases`, plus a `config.json` `quantization` block. Norms (1D), conv (3D) and biases are left in full precision.
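
The eligibility rule described above can be sketched on tensor shapes alone. This is an illustration under assumptions, not the logic of `quantize_component.py` itself: the helper names and toy tensor names are hypothetical, the in-features are taken to be the last axis of a `(out, in)` weight, and the `group_size`/`bits` keys in the config block follow the convention used by `mlx_lm` rather than anything stated in this README.

```python
def is_quantisable(name, shape, group_size):
    """True for 2D `.weight` tensors whose in-features divide evenly into groups."""
    return (
        name.endswith(".weight")
        and len(shape) == 2
        and shape[1] % group_size == 0  # shape is (out_features, in_features)
    )


def plan_quantisation(shapes, bits, group_size):
    """Split tensor names into quantised and full-precision sets, plus a config block."""
    quantised, kept = [], []
    for name, shape in shapes.items():
        (quantised if is_quantisable(name, shape, group_size) else kept).append(name)
    # Assumed key names for the config.json block (mlx_lm convention).
    config_block = {"quantization": {"group_size": group_size, "bits": bits}}
    return sorted(quantised), sorted(kept), config_block


# Hypothetical tensor names and shapes for one component.
shapes = {
    "blocks.0.attn.qkv.weight": (3072, 1024),  # 2D, 1024 % 64 == 0 -> quantised
    "blocks.0.attn.qkv.bias": (3072,),         # bias -> full precision
    "blocks.0.norm.weight": (1024,),           # 1D norm -> full precision
    "in_conv.weight": (512, 7, 80),            # 3D conv -> full precision
    "odd_proj.weight": (256, 100),             # 100 % 64 != 0 -> full precision
}
quantised, kept, config_block = plan_quantisation(shapes, bits=4, group_size=64)
```

Each name in `quantised` would then be written as three tensors (`.weight`, `.scales`, `.biases`), while everything in `kept` is copied through unchanged.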

The `4bit/` and `8bit/` variants in this repo were built by running it over `dit/` and `patch_encoder/`:

```sh
python quantize_component.py dit dit-int4 4 64
python quantize_component.py dit dit-int8 8 64
python quantize_component.py patch_encoder patch_encoder-int4 4 64
python quantize_component.py patch_encoder patch_encoder-int8 8 64
```
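
As a rough guide to what the `<bits> <group_size>` arguments cost in storage: grouped affine quantisation keeps one scale and one bias per group in addition to the packed weights. The sketch below works out the effective bits per quantised weight. The 16-bit scale and bias widths are an assumption (they depend on the dtype the script writes), and the function name is hypothetical.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Packed bits plus the per-group scale and bias overhead, per weight."""
    return bits + (scale_bits + bias_bits) / group_size


int4_g64 = effective_bits_per_weight(4, 64)  # 4 + 32/64 = 4.5
int8_g64 = effective_bits_per_weight(8, 64)  # 8 + 32/64 = 8.5

# If scales and biases were stored as 32-bit floats instead:
int4_g64_f32_meta = effective_bits_per_weight(4, 64, scale_bits=32, bias_bits=32)  # 5.0
```

This only covers the tensors that are actually quantised; norms, conv weights and biases stay in full precision and are not included in the figure.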

The backbone is quantised by `convert_backbone_dit.py` (4-bit) or `quantize_component.py` (8-bit). Each variant subfolder is then assembled from the quantised `backbone`/`dit`/`patch_encoder` plus the shared F32 `vocoder`/`speaker`/`audiovae_encoder`/`heads` and the top-level config files, so it loads standalone.
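
The assembly step is not scripted in this README, so the following is only a sketch of how a variant folder could be put together. The `assemble_variant` helper, the `model.safetensors` placeholder files, and the use of a single `config.json` as the top-level config are all assumptions; the component directory names come from the commands and table above.

```python
import shutil
import tempfile
from pathlib import Path

SHARED = ["vocoder", "speaker", "audiovae_encoder", "heads"]  # shared F32 components


def assemble_variant(root, variant, quantised_sources, config_files):
    """Copy quantised and shared components into root/variant so it loads standalone."""
    root = Path(root)
    dest = root / variant
    dest.mkdir(parents=True, exist_ok=True)
    # Quantised components are copied under their canonical names (e.g. dit-int4 -> dit).
    for canonical, source in quantised_sources.items():
        shutil.copytree(root / source, dest / canonical, dirs_exist_ok=True)
    for name in SHARED:
        shutil.copytree(root / name, dest / name, dirs_exist_ok=True)
    for name in config_files:
        shutil.copy2(root / name, dest / name)
    return dest


# Demo on a throwaway tree with empty placeholder files.
root = Path(tempfile.mkdtemp())
for d in ["backbone", "dit-int4", "patch_encoder-int4", *SHARED]:
    (root / d).mkdir()
    (root / d / "model.safetensors").write_bytes(b"")
(root / "config.json").write_text("{}")

variant_dir = assemble_variant(
    root,
    "4bit",
    {"backbone": "backbone", "dit": "dit-int4", "patch_encoder": "patch_encoder-int4"},
    ["config.json"],
)
```

Copying rather than symlinking keeps each variant self-contained when only that subfolder is downloaded, at the cost of duplicating the shared F32 components.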