# mlx-indextts2-standard-8bit
This is an IndexTTS2 model converted to MLX for Apple Silicon inference with solar2ain/mlx-indextts.
It was prepared for the local /Users/vanch/index-tts IndexTTS2 optimization project, where the goal was stable Vietnamese and multilingual TTS on an M3 Max Mac without PyTorch MPS memory crashes.
## Variant
- Profile: Standard multilingual
- Precision / quantization: 8bit
- Approximate local size: 2.8 GB
- Source checkpoint directory during conversion: `/Users/vanch/index-tts/checkpoints`
- Conversion detail: converted with `mlx-indextts convert --quantize 8`. In the current upstream implementation this quantizes only the GPT; S2Mel and BigVGAN remain fp32.
## Expected Files
The repository root is a ready-to-use MLX IndexTTS2 model directory:
- `gpt.safetensors`
- `s2mel.safetensors`
- `bigvgan.safetensors`
- `vq2emb.safetensors`
- `tokenizer.model`
- `config.yaml`
- `config.json`
- `feat1.pt`
- `feat2.pt`
- `wav2vec2bert_stats.pt`
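After downloading, the layout can be sanity-checked with a short script. This is a minimal sketch; `missing_files` is an illustrative helper, not part of mlx-indextts, and only the file names come from the list above.

```python
from pathlib import Path

# Expected files in the converted model directory (from the list above).
EXPECTED_FILES = [
    "gpt.safetensors", "s2mel.safetensors", "bigvgan.safetensors",
    "vq2emb.safetensors", "tokenizer.model", "config.yaml", "config.json",
    "feat1.pt", "feat2.pt", "wav2vec2bert_stats.pt",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the expected files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty return value means the directory is ready for the usage commands below.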
## Usage
Install and use mlx-indextts:
```bash
git clone https://github.com/solar2ain/mlx-indextts.git
cd mlx-indextts
uv sync --extra convert --extra v2

huggingface-cli download vanch007/mlx-indextts2-standard-8bit \
  --local-dir models/mlx-indextts2-standard-8bit \
  --local-dir-use-symlinks False

uv run mlx-indextts generate \
  -m models/mlx-indextts2-standard-8bit \
  -r /path/to/reference_or_speaker.npz \
  -t "Your text here" \
  -o output.wav \
  --memory-limit 24 \
  --diffusion-steps 16
```
For repeated generation, precompute speaker conditioning first:
```bash
uv run mlx-indextts speaker \
  -m models/mlx-indextts2-standard-8bit \
  -r /path/to/reference.wav \
  -o speaker.npz \
  --memory-limit 24
```
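When generating many clips from one precomputed `speaker.npz`, the CLI invocation above can be wrapped in a small driver. This is a sketch: `build_generate_cmd` and `batch_generate` are illustrative helpers, not part of mlx-indextts; only the flags shown in the commands above are taken from this card.

```python
import subprocess

def build_generate_cmd(model_dir: str, speaker_npz: str, text: str, out_wav: str,
                       memory_limit: int = 24, diffusion_steps: int = 16) -> list[str]:
    # Mirrors the `uv run mlx-indextts generate` invocation shown above.
    return [
        "uv", "run", "mlx-indextts", "generate",
        "-m", model_dir,
        "-r", speaker_npz,
        "-t", text,
        "-o", out_wav,
        "--memory-limit", str(memory_limit),
        "--diffusion-steps", str(diffusion_steps),
    ]

def batch_generate(model_dir: str, speaker_npz: str, texts: list[str]) -> None:
    # Reuses the same speaker conditioning for every text, avoiding
    # repeated reference-audio processing.
    for i, text in enumerate(texts):
        cmd = build_generate_cmd(model_dir, speaker_npz, text, f"out_{i:03d}.wav")
        subprocess.run(cmd, check=True)
```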
## Benchmark
Benchmarked on a 128GB unified-memory M3 Max Mac using:

- `mlx-indextts` from solar2ain/mlx-indextts
- precomputed `.npz` speaker conditioning
- `memory_limit=24GB`, `diffusion_steps=16`
- `emotion=calm`, `emo_alpha=0.6`
- the same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS
RTF is the real-time factor (synthesis time divided by audio duration); lower is faster:
| Case | fp32 MLX RTF | fp16 MLX RTF | 8bit MLX RTF | PyTorch MPS RTF |
|---|---|---|---|---|
| zh short | 1.127 | 1.538 | 0.966 | 1.446 |
| zh long | 1.232 | 1.584 | 1.035 | 1.699 |
| en short | 1.157 | 1.462 | 0.914 | 2.192 |
| en long | 1.193 | 1.511 | 0.956 | 1.783 |
Summary from the local comparison:
- 8bit was the fastest MLX route in this test set.
- fp16 saved space but was slower than fp32 for the standard profile.
- Vietnamese fp16 was slightly faster than Vietnamese fp32, but Vietnamese 8bit was fastest.
## ASR Validation
ASR validation with local mlx_whisper + whisper-large-v3-turbo found no empty audio, wrong-language output, or obvious missing sentences. Chinese long-form ASR showed a minor 她/他 homophone difference; English long-form 8-bit ASR showed a minor tense difference.
ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.
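A coarse check of this kind (flagging empty audio, wrong-language output, or missing sentences while tolerating homophone and tense differences) can be approximated by a normalized similarity between the input text and the ASR transcript. A sketch under stated assumptions: `rough_match` is an illustrative helper, not the validation script actually used, and the 0.8 threshold is an arbitrary choice.

```python
import difflib
import re

def rough_match(expected: str, transcript: str, threshold: float = 0.8) -> bool:
    """Coarse ASR sanity check: compare normalized text to the transcript.

    Catches gross failures (empty audio, wrong language); deliberately
    ignores case and punctuation, so minor homophone or tense differences
    still pass. The threshold is an assumption, not a tuned value.
    """
    def norm(s: str) -> str:
        # Lowercase and strip punctuation/whitespace before comparing.
        return re.sub(r"[\W_]+", "", s.lower())
    ratio = difflib.SequenceMatcher(None, norm(expected), norm(transcript)).ratio()
    return ratio >= threshold
```

An empty transcript scores 0.0 and fails, while a transcript differing only in punctuation and case passes.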
## Provenance and Scope
This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is solar2ain/mlx-indextts.
The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.