
	# Conversion scripts

	These scripts convert the upstream [rednote-hilab/dots.tts-soar](https://huggingface.co/rednote-hilab/dots.tts-soar) PyTorch weights into the MLX layout used in this repo and quantise individual components.

	They are research scripts: the source snapshot path and the output paths are hardcoded near the top of each file (look for `/Users/.../models--rednote-hilab--dots.tts-soar/...`). Before running, edit them to point at your local upstream snapshot and a destination directory. All scripts require `mlx` and run on Apple Silicon (Metal); `convert_backbone_dit.py` also uses `mlx_lm`.

	## Pipeline

	| Script | Produces | Notes |
	|---|---|---|
	| `extract_backbone.py` | Qwen2 backbone in HF layout | Strips the `llm.` prefix; no `lm_head` (tied embeddings) |
	| `convert_backbone_dit.py` | `backbone/` (MLX, 4-bit g64) and `dit/` (F32) | Backbone via `mlx_lm.convert(quantize=True, q_bits=4, q_group_size=64)` |
	| `convert_vocoder.py` | `vocoder/` | BigVGAN/AudioVAE decoder; Conv weights transposed to MLX OKI layout |
	| `convert_speaker.py` | `speaker/` | CAM++ x-vector encoder |
	| `convert_refpath.py` | `patch_encoder/` and `audiovae_encoder/` | Reference-audio conditioning path |
	| `convert_heads.py` | `heads/` | Coordinate / hidden / latent / xvec / EOS projection heads |
	| `quantize_component.py` | quantised component dir | Generic per-component quantiser (below) |
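As a rough illustration of two of the transformations listed in the table, the sketch below strips the `llm.` prefix from checkpoint keys and reorders a PyTorch `Conv1d` weight from `(out, in, kernel)` to the `(out, kernel, in)` order that MLX convolutions expect. NumPy arrays stand in for real tensors. The function names, the example key names and the exact rule for dropping `lm_head` keys are assumptions made for illustration, not the code in `extract_backbone.py` or `convert_vocoder.py`.

```python
import numpy as np


def strip_llm_prefix(state, prefix="llm."):
    """Keep keys under `prefix`, drop the prefix, and skip any lm_head entries.

    Hypothetical helper: with tied embeddings the output projection reuses
    the embedding matrix, so no separate lm_head weight needs to be kept.
    """
    out = {}
    for key, value in state.items():
        if not key.startswith(prefix):
            continue  # not part of the backbone
        new_key = key[len(prefix):]
        if new_key.startswith("lm_head."):
            continue  # tied embeddings: nothing to export
        out[new_key] = value
    return out


def conv1d_to_mlx(weight):
    """Reorder a Conv1d weight from (out, in, kernel) to (out, kernel, in)."""
    if weight.ndim != 3:
        raise ValueError(f"expected a 3D conv weight, got shape {weight.shape}")
    return np.ascontiguousarray(weight.transpose(0, 2, 1))


# Example key names below are made up for the demonstration.
example_state = {
    "llm.model.embed_tokens.weight": np.zeros((8, 4), dtype=np.float32),
    "llm.lm_head.weight": np.zeros((8, 4), dtype=np.float32),
    "dit.proj.weight": np.zeros((4, 4), dtype=np.float32),
}
backbone_state = strip_llm_prefix(example_state)
mlx_conv_weight = conv1d_to_mlx(np.zeros((6, 3, 7), dtype=np.float32))
```

Here `backbone_state` keeps only `model.embed_tokens.weight`, and `mlx_conv_weight` has shape `(6, 7, 3)`.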

	## Per-component quantisation

	`quantize_component.py <src_dir> <dst_dir> <bits> <group_size>` quantises every 2D `.weight` whose in-features (last dimension) are divisible by the group size (matching MLX's `Linear` eligibility), writing `.weight` (packed), `.scales` and `.biases`, plus a `quantization` block in `config.json`. Norm weights (1D), conv weights (3D) and biases are left in full precision.
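The selection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the script's actual code: `should_quantize` and `quantized_shapes` are hypothetical names, and the shape arithmetic assumes the usual MLX packing in which quantised values are stored in 32-bit words with one scale and one bias per group.

```python
def should_quantize(name, shape, group_size):
    """True for 2D `.weight` tensors whose in-features divide by group_size."""
    return (
        name.endswith(".weight")
        and len(shape) == 2
        and shape[-1] % group_size == 0
    )


def quantized_shapes(shape, bits, group_size):
    """Expected shapes of the packed weight, scales and biases (assumed layout)."""
    out_features, in_features = shape
    return {
        "weight": (out_features, in_features * bits // 32),  # packed uint32 words
        "scales": (out_features, in_features // group_size),
        "biases": (out_features, in_features // group_size),
    }


# Tensor names and shapes here are invented for the example.
candidates = {
    "blocks.0.attn.q_proj.weight": (1024, 1536),  # 2D, 1536 % 64 == 0
    "blocks.0.norm.weight": (1536,),              # 1D norm
    "conv_in.weight": (6, 7, 3),                  # 3D conv
    "blocks.0.attn.q_proj.bias": (1024,),         # bias
    "odd_proj.weight": (1024, 100),               # 100 % 64 != 0
}
selected = [n for n, s in candidates.items() if should_quantize(n, s, 64)]
```

With these inputs only `blocks.0.attn.q_proj.weight` is selected; at 4 bits and group size 64 its packed weight would have shape `(1024, 192)` with `(1024, 24)` scales and biases.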

	The `4bit/` and `8bit/` variants in this repo were built by running it over `dit/` and `patch_encoder/`:

	```sh
	python quantize_component.py dit dit-int4 4 64
	python quantize_component.py dit dit-int8 8 64
	python quantize_component.py patch_encoder patch_encoder-int4 4 64
	python quantize_component.py patch_encoder patch_encoder-int8 8 64
	```

	The backbone is quantised by `convert_backbone_dit.py` (4-bit) or `quantize_component.py` (8-bit). Each variant subfolder is then assembled from the quantised `backbone`/`dit`/`patch_encoder` plus the shared F32 `vocoder`/`speaker`/`audiovae_encoder`/`heads` and the top-level config files, so it loads standalone.
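The assembly step is not covered by a script in the table, so the following is only a sketch of one way to do it with the standard library. The function name, its arguments and the list of config files are all hypothetical; pass whichever top-level config files your copy of the repo actually contains.

```python
import shutil
from pathlib import Path

# Components shared in full precision across variants, as listed above.
SHARED = ("vocoder", "speaker", "audiovae_encoder", "heads")


def assemble_variant(variant_dir, quantised, shared_root, config_files):
    """Copy quantised and shared components into one standalone folder.

    variant_dir:  destination, e.g. a `4bit/` directory
    quantised:    mapping of component name -> quantised source directory
    shared_root:  directory holding the shared components and config files
    config_files: names of top-level config files to copy (caller-supplied)
    """
    variant_dir = Path(variant_dir)
    shared_root = Path(shared_root)
    variant_dir.mkdir(parents=True, exist_ok=True)

    for name, src in quantised.items():
        shutil.copytree(src, variant_dir / name, dirs_exist_ok=True)
    for name in SHARED:
        shutil.copytree(shared_root / name, variant_dir / name, dirs_exist_ok=True)
    for cfg in config_files:
        shutil.copy2(shared_root / cfg, variant_dir / cfg)
    return variant_dir
```

A call might map `backbone`, `dit` and `patch_encoder` to directories such as `dit-int4` produced by the commands above, so that the resulting folder contains every component under its plain name.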