# Conversion scripts

These scripts convert the upstream [rednote-hilab/dots.tts-soar](https://huggingface.co/rednote-hilab/dots.tts-soar) PyTorch weights into the MLX layout used in this repo and quantise individual components.

They are research scripts: the source snapshot path and the output paths are hardcoded near the top of each file (look for `/Users/.../models--rednote-hilab--dots.tts-soar/...`). Edit them to point at your local upstream snapshot and a destination directory before running. All scripts require `mlx` and run on Apple Silicon (Metal); `convert_backbone_dit.py` also requires `mlx_lm`.

## Pipeline

| Script | Produces | Notes |
|---|---|---|
| `extract_backbone.py` | Qwen2 backbone in HF layout | Strips the `llm.` prefix; no `lm_head` (tied embeddings) |
| `convert_backbone_dit.py` | `backbone/` (MLX, 4-bit g64) and `dit/` (F32) | Backbone via `mlx_lm.convert(quantize=True, q_bits=4, q_group_size=64)` |
| `convert_vocoder.py` | `vocoder/` | BigVGAN/AudioVAE decoder; conv weights transposed to MLX's OKI (out, kernel, in) layout |
| `convert_speaker.py` | `speaker/` | CAM++ x-vector encoder |
| `convert_refpath.py` | `patch_encoder/` and `audiovae_encoder/` | Reference-audio conditioning path |
| `convert_heads.py` | `heads/` | Coordinate / hidden / latent / xvec / EOS projection heads |
| `quantize_component.py` | quantised component dir | Generic per-component quantiser (below) |
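
As an illustration of the `extract_backbone.py` row, the key renaming can be sketched as below. This is a minimal sketch, not the script itself: `strip_prefix` is a hypothetical helper, the example key names are made up for the demonstration, and it assumes that any `lm_head` entry can simply be dropped because the embeddings are tied.

```python
def strip_prefix(state_dict, prefix="llm."):
    """Keep only keys under `prefix`, remove the prefix, and drop any lm_head."""
    out = {}
    for key, value in state_dict.items():
        if not key.startswith(prefix):
            continue  # belongs to another component, not the backbone
        new_key = key[len(prefix):]
        if new_key.startswith("lm_head."):
            continue  # tied embeddings: the output head reuses the input embedding
        out[new_key] = value
    return out


# Hypothetical key names; the values stand in for tensors.
example = {
    "llm.model.embed_tokens.weight": "E",
    "llm.model.layers.0.self_attn.q_proj.weight": "Q",
    "llm.lm_head.weight": "H",
    "dit.blocks.0.attn.qkv.weight": "D",
}
backbone = strip_prefix(example)
print(sorted(backbone))
```

Only the two `model.*` keys survive, which is the shape of checkpoint a Hugging Face style Qwen2 loader with tied embeddings would expect.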

## Per-component quantisation

`quantize_component.py <src_dir> <dst_dir> <bits> <group_size>` quantises every 2D `.weight` whose in-features are divisible by the group size (matching MLX's `Linear` eligibility). For each quantised tensor it writes `.weight` (packed), `.scales` and `.biases` (the quantisation offsets, not the layer bias), and it adds a `quantization` block to `config.json`. 1D norm weights, 3D conv weights and layer biases are left in full precision.
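
The eligibility rule can be sketched on tensor shapes alone. This is an illustrative sketch, not the script's code: `is_quantisable` and `quantised_shapes` are hypothetical helpers, the tensor names are invented, and the packed shapes assume MLX's usual layout in which quantised values are packed into 32-bit words with one scale and one bias per group.

```python
def is_quantisable(name, shape, group_size):
    """True for 2D `.weight` tensors whose in-features split evenly into groups."""
    return (
        name.endswith(".weight")
        and len(shape) == 2             # Linear-style (out_features, in_features)
        and shape[1] % group_size == 0  # in-features divisible by the group size
    )


def quantised_shapes(shape, bits, group_size):
    """Expected shapes after quantisation, assuming packing into 32-bit words."""
    out_features, in_features = shape
    groups = in_features // group_size
    return {
        "weight": (out_features, in_features * bits // 32),
        "scales": (out_features, groups),
        "biases": (out_features, groups),
    }


# Hypothetical tensors covering each case named in the text.
tensors = {
    "blocks.0.attn.qkv.weight": (3072, 1024),  # 2D, divisible: quantised
    "blocks.0.attn.qkv.bias": (3072,),         # layer bias: kept
    "blocks.0.norm.weight": (1024,),           # 1D norm: kept
    "conv_in.weight": (512, 7, 64),            # 3D conv: kept
    "odd_proj.weight": (256, 100),             # 100 % 64 != 0: kept
}
for name, shape in tensors.items():
    if is_quantisable(name, shape, 64):
        print(name, quantised_shapes(shape, 4, 64))
    else:
        print(name, "kept in full precision")
```

Under these assumptions a `(3072, 1024)` weight at 4 bits and group size 64 packs into `(3072, 128)`, with scales and biases of shape `(3072, 16)`.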

The `4bit/` and `8bit/` variants in this repo were built by running it over `dit/` and `patch_encoder/`:

```sh
python quantize_component.py dit           dit-int4           4 64
python quantize_component.py dit           dit-int8           8 64
python quantize_component.py patch_encoder patch_encoder-int4 4 64
python quantize_component.py patch_encoder patch_encoder-int8 8 64
```
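
If you rebuild the variants often, the same four invocations can be generated rather than typed. The following is a small dry-run sketch; `build_commands` is a hypothetical helper, and it assumes the script is run from the directory that contains `dit/` and `patch_encoder/`.

```python
import itertools
import shlex

COMPONENTS = ["dit", "patch_encoder"]
BITS = [4, 8]
GROUP_SIZE = 64


def build_commands():
    """Return one argv list per (component, bits) pair."""
    commands = []
    for component, bits in itertools.product(COMPONENTS, BITS):
        commands.append([
            "python", "quantize_component.py",
            component, f"{component}-int{bits}",
            str(bits), str(GROUP_SIZE),
        ])
    return commands


for cmd in build_commands():
    # Dry run: print only. To execute, pass cmd to subprocess.run(cmd, check=True).
    print(shlex.join(cmd))
```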

The backbone is quantised by `convert_backbone_dit.py` (4-bit) or `quantize_component.py` (8-bit). Each variant subfolder is then assembled from the quantised `backbone`/`dit`/`patch_encoder` plus the shared F32 `vocoder`/`speaker`/`audiovae_encoder`/`heads` and the top-level config files, so it loads standalone.
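
The assembly step is plain directory copying and can be sketched as follows. This is a hedged sketch rather than the procedure actually used: `assemble_variant` is a hypothetical helper, the source directory names passed in `quantised` are up to you, and the list of top-level config files is left as a parameter because their names are not listed here.

```python
import shutil
from pathlib import Path

# Components shared in F32 across every variant.
SHARED = ["vocoder", "speaker", "audiovae_encoder", "heads"]


def assemble_variant(root, variant, quantised, config_files=()):
    """Build `root/variant` from quantised and shared component directories.

    `quantised` maps a component name (e.g. "dit") to the directory, relative
    to `root`, that holds its quantised build (e.g. "dit-int4").
    """
    root = Path(root)
    dst = root / variant
    dst.mkdir(parents=True, exist_ok=True)
    for name, src in quantised.items():
        shutil.copytree(root / src, dst / name, dirs_exist_ok=True)
    for name in SHARED:
        shutil.copytree(root / name, dst / name, dirs_exist_ok=True)
    for name in config_files:
        shutil.copy2(root / name, dst / name)
    return dst
```

A call such as `assemble_variant(".", "4bit", {"backbone": "backbone", "dit": "dit-int4", "patch_encoder": "patch_encoder-int4"})` would then produce a self-contained `4bit/` folder, assuming those source directories exist.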