Conversion scripts

These scripts convert the upstream rednote-hilab/dots.tts-soar PyTorch weights into the MLX layout used in this repo, and quantise individual components.

They are research scripts: the upstream snapshot path and the output paths are hardcoded near the top of each file (look for /Users/.../models--rednote-hilab--dots.tts-soar/...). Point them at your local upstream snapshot and a destination directory before running. All scripts require mlx and run on Apple Silicon (Metal); convert_backbone_dit.py also requires mlx_lm.

Pipeline

| Script | Produces | Notes |
|---|---|---|
| extract_backbone.py | Qwen2 backbone in HF layout | Strips the llm. prefix; no lm_head (tied embeddings) |
| convert_backbone_dit.py | backbone/ (MLX, 4-bit g64) and dit/ (F32) | Backbone via mlx_lm.convert(quantize=True, q_bits=4, q_group_size=64) |
| convert_vocoder.py | vocoder/ | BigVGAN/AudioVAE decoder; Conv weights transposed to MLX OKI layout |
| convert_speaker.py | speaker/ | CAM++ x-vector encoder |
| convert_refpath.py | patch_encoder/ and audiovae_encoder/ | Reference-audio conditioning path |
| convert_heads.py | heads/ | Coordinate / hidden / latent / xvec / EOS projection heads |
| quantize_component.py | quantised component dir | Generic per-component quantiser (below) |
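As an illustration of the extract_backbone.py row, the prefix stripping can be sketched on a plain dictionary of tensors. This is a minimal sketch and not the script itself: the function name extract_backbone_keys and the example key names are hypothetical, and it assumes the upstream checkpoint stores backbone tensors under the llm. prefix as the table states. It drops any lm_head entry it finds, since tied embeddings make that tensor redundant.

```python
def extract_backbone_keys(state_dict, prefix="llm.", drop=("lm_head.",)):
    """Keep only keys under `prefix`, strip the prefix, and skip tied output heads.

    `state_dict` is any mapping of parameter name -> tensor. The names used in
    this sketch are illustrative assumptions, not the real checkpoint keys.
    """
    out = {}
    for name, tensor in state_dict.items():
        if not name.startswith(prefix):
            continue  # belongs to another component (DiT, vocoder, heads, ...)
        stripped = name[len(prefix):]
        if stripped.startswith(drop):
            continue  # tied embeddings: the output head reuses embed_tokens
        out[stripped] = tensor
    return out
```

The result uses the key names a standard HF Qwen2 loader expects only if the upstream names under llm. already follow that convention, which this sketch assumes rather than verifies.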

Per-component quantisation

quantize_component.py <src_dir> <dst_dir> <bits> <group_size> quantises every 2D .weight tensor whose in-features are divisible by the group size, matching the eligibility rule MLX applies to Linear layers. For each quantised tensor it writes .weight (packed), .scales and .biases, and it adds a quantization block to config.json. Norms (1D), conv weights (3D) and layer biases are left in full precision.
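The eligibility rule above can be sketched in plain Python, working on tensor shapes only. This is a hedged illustration, not the code in quantize_component.py: the function names and example parameter names are hypothetical. The packed shapes assume the layout mx.quantize is understood to produce (weights packed into 32-bit words along the last axis, one scale and one bias per group).

```python
def is_quantisable(name, shape, group_size):
    """True for 2D `.weight` tensors whose in-features divide evenly into groups."""
    return (
        name.endswith(".weight")
        and len(shape) == 2                # Linear-style (out_features, in_features)
        and shape[-1] % group_size == 0    # in-features is the last axis
    )


def split_by_eligibility(shapes, group_size):
    """Partition parameter names into (quantised, kept in full precision)."""
    quantised, kept = [], []
    for name, shape in shapes.items():
        (quantised if is_quantisable(name, shape, group_size) else kept).append(name)
    return quantised, kept


def packed_shapes(shape, bits, group_size):
    """Expected output shapes for one quantised weight (assumed packing layout)."""
    out_features, in_features = shape
    return {
        "weight": (out_features, in_features * bits // 32),  # packed 32-bit words
        "scales": (out_features, in_features // group_size),
        "biases": (out_features, in_features // group_size),
    }
```

For example, a 1024 x 1024 weight at 4 bits with group size 64 would pack to 1024 x 128 words with 1024 x 16 scales and biases, while a 1D norm weight or a 3D conv weight falls into the kept list.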

The 4bit/ and 8bit/ variants in this repo were built by running quantize_component.py over dit/ and patch_encoder/:

python quantize_component.py dit         dit-int4         4 64
python quantize_component.py dit         dit-int8         8 64
python quantize_component.py patch_encoder patch_encoder-int4 4 64
python quantize_component.py patch_encoder patch_encoder-int8 8 64

The backbone is quantised by convert_backbone_dit.py (4-bit) or quantize_component.py (8-bit). Each variant subfolder is then assembled from the quantised backbone/dit/patch_encoder plus the shared F32 vocoder/speaker/audiovae_encoder/heads and the top-level config files, so it loads standalone.
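The assembly step described above could look roughly like the following. This is a sketch under stated assumptions, not a script shipped in this repo: assemble_variant is a hypothetical helper, and the mapping from quantised output directories (such as dit-int4) to names inside the variant folder, as well as the config file names, are supplied by the caller rather than taken from the source.

```python
import shutil
from pathlib import Path


def assemble_variant(root, variant, quantised, shared, config_files):
    """Build a standalone variant folder under `root`.

    quantised    : {name inside variant: source dir}, e.g. {"dit": "dit-int4"}
    shared       : component dirs copied unchanged (the F32 components)
    config_files : top-level files copied next to the components
    All names are passed in by the caller; none are hardcoded here.
    """
    root = Path(root)
    dst = root / variant
    dst.mkdir(parents=True, exist_ok=True)
    for name, src in quantised.items():
        # Quantised component lands under its plain component name.
        shutil.copytree(root / src, dst / name, dirs_exist_ok=True)
    for name in shared:
        shutil.copytree(root / name, dst / name, dirs_exist_ok=True)
    for fname in config_files:
        shutil.copy2(root / fname, dst / fname)
    return dst
```

A 4-bit variant might then be built with something like assemble_variant(".", "4bit", {"backbone": "backbone", "dit": "dit-int4", "patch_encoder": "patch_encoder-int4"}, ["vocoder", "speaker", "audiovae_encoder", "heads"], ["config.json"]), where the directory and file names are illustrative. Copying rather than symlinking keeps each variant loadable on its own, at the cost of duplicating the shared F32 components.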