Conversion scripts

These scripts convert the upstream rednote-hilab/dots.tts-soar PyTorch weights into the MLX layout used in this repo and quantise individual components.

They are research scripts: the source snapshot path and output paths are hardcoded near the top of each file (look for /Users/.../models--rednote-hilab--dots.tts-soar/...). Edit them to point at your local upstream snapshot and a destination directory before running. All scripts require mlx and run on Apple Silicon (Metal); convert_backbone_dit.py also requires mlx_lm.

Pipeline

Script	Produces	Notes
`extract_backbone.py`	Qwen2 backbone in HF layout	Strips the `llm.` prefix; no `lm_head` (tied embeddings)
`convert_backbone_dit.py`	`backbone/` (MLX, 4-bit g64) and `dit/` (F32)	Backbone via `mlx_lm.convert(quantize=True, q_bits=4, q_group_size=64)`
`convert_vocoder.py`	`vocoder/`	BigVGAN/AudioVAE decoder; Conv weights transposed to MLX OKI layout
`convert_speaker.py`	`speaker/`	CAM++ x-vector encoder
`convert_refpath.py`	`patch_encoder/` and `audiovae_encoder/`	Reference-audio conditioning path
`convert_heads.py`	`heads/`	Coordinate / hidden / latent / xvec / EOS projection heads
`quantize_component.py`	quantised component dir	Generic per-component quantiser (below)
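The key renaming and layout change noted in the table can be sketched in a few lines. This is an illustration only, not the scripts' actual code: the function names and the example key names are hypothetical, and the tensors are stand-in strings so the snippet runs without mlx or torch.

```python
def extract_backbone_keys(state, prefix="llm."):
    """Keep only backbone tensors, strip the prefix, and drop lm_head.

    `state` maps parameter names to tensors (any object works here).
    """
    out = {}
    for name, tensor in state.items():
        if not name.startswith(prefix):
            continue  # belongs to another component
        new_name = name[len(prefix):]
        if new_name.startswith("lm_head."):
            continue  # tied embeddings: the output projection reuses the embedding
        out[new_name] = tensor
    return out


def conv1d_torch_to_mlx_shape(shape):
    """PyTorch Conv1d weights are (O, I, K); MLX expects (O, K, I)."""
    o, i, k = shape
    return (o, k, i)


# Hypothetical key names, for illustration only.
example = {
    "llm.model.embed_tokens.weight": "E",
    "llm.lm_head.weight": "E",
    "dit.blocks.0.attn.weight": "W",
}
backbone = extract_backbone_keys(example)
```

On real arrays the layout change is a transpose of the last two axes of each Conv1d weight, for example `w.transpose(0, 2, 1)`.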

Per-component quantisation

Usage: quantize_component.py <src_dir> <dst_dir> <bits> <group_size>. The script quantises every 2D .weight whose in-features are divisible by the group size, which is the same condition MLX applies when deciding whether a Linear layer can be quantised. For each such tensor it writes .weight (packed), .scales and .biases, and it adds a quantization block to config.json. Norms (1D), conv weights (3D) and bias vectors are left in full precision.
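The selection rule can be written as a shape-only check. The sketch below is not taken from quantize_component.py; the helper names are hypothetical, and the output shapes assume MLX's affine quantisation, where weights are (out, in), packed values are stored in 32-bit words, and there is one scale and one bias per group.

```python
def is_quantisable(name, shape, group_size):
    """True for 2D .weight tensors whose in-features divide by group_size."""
    return (
        name.endswith(".weight")
        and len(shape) == 2
        and shape[1] % group_size == 0
    )


def quantised_entries(name, shape, bits, group_size):
    """Names and shapes of the tensors that replace one quantised weight."""
    out_f, in_f = shape
    base = name[: -len(".weight")]
    return {
        f"{base}.weight": (out_f, in_f * bits // 32),   # packed into uint32
        f"{base}.scales": (out_f, in_f // group_size),  # one scale per group
        f"{base}.biases": (out_f, in_f // group_size),  # one offset per group
    }


# Hypothetical parameter names, for illustration only.
params = {
    "proj.weight": (1024, 512),    # 2D, 512 % 64 == 0: quantised
    "norm.weight": (1024,),        # 1D norm: kept in full precision
    "conv.weight": (256, 7, 128),  # 3D conv: kept in full precision
    "proj.bias": (1024,),          # bias: kept in full precision
}
selected = [n for n, s in params.items() if is_quantisable(n, s, 64)]
```

With 4 bits and a group size of 64, a (1024, 512) weight becomes a (1024, 64) packed tensor plus (1024, 8) scales and biases.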

The 4bit/ and 8bit/ variants in this repo were built by running it over dit/ and patch_encoder/:

python quantize_component.py dit           dit-int4           4 64
python quantize_component.py dit           dit-int8           8 64
python quantize_component.py patch_encoder patch_encoder-int4 4 64
python quantize_component.py patch_encoder patch_encoder-int8 8 64

The backbone is quantised by convert_backbone_dit.py (4-bit) or quantize_component.py (8-bit). Each variant subfolder is then assembled from the quantised backbone/dit/patch_encoder plus the shared F32 vocoder/speaker/audiovae_encoder/heads and the top-level config files, so it loads standalone.
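The assembly step is a set of directory copies. The helper below is a hypothetical sketch and not a script in this repo: `assemble_variant`, the `config_glob` default, and the assumption that all component directories sit side by side under one root are illustrative choices.

```python
import shutil
from pathlib import Path

# Components shared unchanged (F32) across every variant.
SHARED = ("vocoder", "speaker", "audiovae_encoder", "heads")


def assemble_variant(root, variant, quantised, config_glob="*.json"):
    """Build root/variant from quantised and shared component dirs.

    `quantised` maps the name a component takes inside the variant
    to the source directory holding its quantised weights.
    """
    root = Path(root)
    dst = root / variant
    dst.mkdir(parents=True, exist_ok=True)
    for component, src_name in quantised.items():
        shutil.copytree(root / src_name, dst / component, dirs_exist_ok=True)
    for component in SHARED:
        shutil.copytree(root / component, dst / component, dirs_exist_ok=True)
    # Assumption: the top-level config files are the JSON files in root.
    for cfg in root.glob(config_glob):
        if cfg.is_file():
            shutil.copy2(cfg, dst / cfg.name)
    return dst
```

For the 4-bit variant, a call might look like `assemble_variant(".", "4bit", {"backbone": "backbone", "dit": "dit-int4", "patch_encoder": "patch_encoder-int4"})`.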