---
library_name: mlx
pipeline_tag: text-to-speech
base_model: stepfun-ai/Step-Audio-EditX
base_model_relation: quantized
license: apache-2.0
language:
- en
- zh
- ja
- ko
tags:
- mlx
- tts
- speech
- voice-cloning
- audio-editing
- step-audio
- step-audio-editx
- stepfun
- quantized
- int8
- apple-silicon
- bundled-components
---
# Step-Audio-EditX – MLX 8-bit
This repository contains a self-contained, pure-MLX int8 conversion of
Step-Audio-EditX for local voice cloning and expressive audio editing on
Apple Silicon. All pipeline components are stored as `.safetensors`; no
PyTorch, ONNX, or NumPy files are required at inference time.
## Model Details
- Developed by: AppAutomaton
- Upstream model: [`stepfun-ai/Step-Audio-EditX`](https://huggingface.co/stepfun-ai/Step-Audio-EditX)
- Task: zero-shot voice cloning, expressive audio editing
- Runtime: MLX on Apple Silicon
- Precision: int8 for Step1 LM, Flow model, and VQ02 tokenizer; bf16 for the rest
- Total size: ~4.1 GB (down from ~7.7 GB upstream)
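The size reduction follows directly from the bit width of the largest component. A minimal arithmetic sketch (assuming 1 GB = 1e9 bytes and ignoring the per-group scale/bias overhead that MLX quantization adds on top of the raw weights):

```python
def weight_gb(params: float, bits: int) -> float:
    """Approximate on-disk weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9

# The 3.5B-parameter Step1 LM dominates the bundle:
bf16_gb = weight_gb(3.5e9, 16)  # ~7.0 GB at bf16
int8_gb = weight_gb(3.5e9, 8)   # ~3.5 GB at int8
```

This accounts for most of the ~7.7 GB to ~4.1 GB drop; the remaining components are small by comparison.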
## Bundle Contents
This bundle is self-contained: all weights are packaged in one repository.
| File | Component | Format | Size |
| --- | --- | --- | --- |
| `model.safetensors` | Step1 LM (3.5B params) | int8 | 3.5 GB |
| `flow-model.safetensors` | Flow model (DiT + conformer) | int8 | 181 MB |
| `vq02.safetensors` | VQ02 audio tokenizer | int8 | 162 MB |
| `vq06.safetensors` | VQ06 audio tokenizer | bf16 | 249 MB |
| `hift.safetensors` | HiFT vocoder | bf16 | 40 MB |
| `campplus.safetensors` | CampPlus speaker embedding | bf16 | 13 MB |
| `flow-conditioner.safetensors` | Flow conditioner | bf16 | 2.5 MB |
| `config.json` | Step1 LM config + quantization | JSON | – |
| `flow-model-config.json` | Flow model config | JSON | – |
| `vq02-config.json`, `vq06-config.json` | Tokenizer configs | JSON | – |
| `hift-config.json`, `campplus-config.json`, `flow-conditioner-config.json` | Component configs | JSON | – |
| `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json` | Step1 tokenizer | JSON | – |
## How to Get Started
Download the bundle:
```bash
hf download appautomaton/step-audio-editx-8bit-mlx \
--local-dir models/stepfun/step_audio_editx/mlx-int8
```
**Voice cloning:**
```bash
python scripts/generate/step_audio_editx.py \
--prompt-audio reference.wav \
--prompt-text "Transcript of reference audio." \
-o cloned.wav \
clone --target-text "New speech in the cloned voice."
```
**Audio editing (change emotion):**
```bash
python scripts/generate/step_audio_editx.py \
--prompt-audio input.wav \
--prompt-text "Transcript of input audio." \
-o happy.wav \
edit --edit-type emotion --edit-info happy
```
## Supported Edit Types
| Edit type | Description | `--edit-info` examples |
| --- | --- | --- |
| `emotion` | Change the emotion of speech | `happy`, `sad`, `angry`, `surprised` |
| `style` | Change speaking style | `whispering`, `broadcasting`, `formal` |
| `speed` | Change speaking speed | `fast`, `slow` |
| `denoise` | Remove noise from audio | not used |
| `vad` | Remove silences from audio | not used |
| `paralinguistic` | Add non-verbal sounds | requires `--target-text` |
## Architecture
Five-stage pipeline, all running pure MLX with bf16 activations:
1. **Step1 LM** (3.5B params, int8): autoregressive dual-codebook token generation
2. **CampPlus** (bf16): speaker embedding extraction from reference audio
3. **Flow conditioner** (bf16): conditions generation on the speaker embedding
4. **Flow model** (int8): flow-matching mel-spectrogram generation
5. **HiFT vocoder** (bf16): mel spectrogram to waveform
The VQ02 and VQ06 tokenizers encode reference audio into dual codebook tokens
consumed by Step1.
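The data flow through the five stages can be sketched with stub functions that just record each call, making the ordering explicit. The names are illustrative, not the actual module API:

```python
# Toy stubs standing in for the real MLX modules; each wraps its input
# in a tag so the end-to-end composition is visible in the output.
def campplus(audio):           return f"spk_emb({audio})"
def step1_lm(prompt):          return f"tokens({prompt})"
def flow_conditioner(spk):     return f"cond({spk})"
def flow_model(tokens, cond):  return f"mel({tokens},{cond})"
def hift(mel):                 return f"wav({mel})"

def run_pipeline(prompt, reference_audio):
    spk_emb = campplus(reference_audio)   # stage 2: speaker embedding
    tokens = step1_lm(prompt)             # stage 1: dual-codebook tokens
    cond = flow_conditioner(spk_emb)      # stage 3: conditioning
    mel = flow_model(tokens, cond)        # stage 4: flow-matching mel
    return hift(mel)                      # stage 5: waveform
```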
## Performance
On Apple Silicon with int8 weights and bf16 activations, voice cloning runs at a
real-time factor (RTF) of approximately 1.46x, i.e. roughly 1.46 seconds of
audio are generated per second of compute.
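Under the convention used here (seconds of audio per second of compute, so values above 1.0 are faster than real time; some papers define RTF as the inverse ratio), the 1.46x figure corresponds to arithmetic like this:

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute.

    Values above 1.0 mean faster than real time under this convention.
    """
    return audio_seconds / wall_seconds

# e.g. 10 s of cloned speech synthesized in ~6.85 s of wall time:
rtf = realtime_factor(10.0, 6.85)  # ≈ 1.46
```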
## Links
- Source code: [`mlx-speech`](https://github.com/appautomaton/mlx-speech)
- Upstream model: [`stepfun-ai/Step-Audio-EditX`](https://huggingface.co/stepfun-ai/Step-Audio-EditX)
- Technical report: [arXiv:2511.03601](https://arxiv.org/abs/2511.03601)
- More examples: [AppAutomaton](https://github.com/appautomaton)
## License
Apache 2.0, following the upstream license published with
[`stepfun-ai/Step-Audio-EditX`](https://huggingface.co/stepfun-ai/Step-Audio-EditX).