---
library_name: mlx
pipeline_tag: text-to-speech
base_model: stepfun-ai/Step-Audio-EditX
base_model_relation: quantized
license: apache-2.0
language:
- en
- zh
- ja
- ko
tags:
- mlx
- tts
- speech
- voice-cloning
- audio-editing
- step-audio
- step-audio-editx
- stepfun
- quantized
- int8
- apple-silicon
- bundled-components
---

# Step-Audio-EditX: MLX 8-bit

This repository contains a self-contained pure-MLX int8 conversion of
Step-Audio-EditX for local voice cloning and expressive audio editing on
Apple Silicon. All pipeline components are stored as `.safetensors`, so no
PyTorch, ONNX, or NumPy files are required at inference time.

## Model Details

- Developed by: AppAutomaton
- Upstream model: [`stepfun-ai/Step-Audio-EditX`](https://huggingface.co/stepfun-ai/Step-Audio-EditX)
- Task: zero-shot voice cloning, expressive audio editing
- Runtime: MLX on Apple Silicon
- Precision: int8 for Step1 LM, Flow model, and VQ02 tokenizer; bf16 for the rest
- Total size: ~4.1 GB (down from ~7.7 GB upstream)

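The 3.5B-parameter Step1 LM accounts for most of the bundle, and its size follows from parameter count times bits per weight. The sketch below is a back-of-the-envelope check, not part of `mlx-speech`; the helper name `weight_size_gb` is hypothetical, and the estimate ignores the per-group scales and biases that quantized formats typically store alongside the weights.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Step1 LM parameter count as stated in this card.
STEP1_PARAMS = 3.5e9

int8_gb = weight_size_gb(STEP1_PARAMS, 8)   # one byte per weight
bf16_gb = weight_size_gb(STEP1_PARAMS, 16)  # two bytes per weight
print(f"Step1 LM: int8 ~{int8_gb:.1f} GB, bf16 ~{bf16_gb:.1f} GB")
```
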
## Bundle Contents

This bundle is self-contained: all weights are packaged in one repository.

| File | Component | Format | Size |
| --- | --- | --- | --- |
| `model.safetensors` | Step1 LM (3.5B params) | int8 | 3.5 GB |
| `flow-model.safetensors` | Flow model (DiT + conformer) | int8 | 181 MB |
| `vq02.safetensors` | VQ02 audio tokenizer | int8 | 162 MB |
| `vq06.safetensors` | VQ06 audio tokenizer | bf16 | 249 MB |
| `hift.safetensors` | HiFT vocoder | bf16 | 40 MB |
| `campplus.safetensors` | CampPlus speaker embedding | bf16 | 13 MB |
| `flow-conditioner.safetensors` | Flow conditioner | bf16 | 2.5 MB |
| `config.json` | Step1 LM config + quantization | JSON | - |
| `flow-model-config.json` | Flow model config | JSON | - |
| `vq02-config.json`, `vq06-config.json` | Tokenizer configs | JSON | - |
| `hift-config.json`, `campplus-config.json`, `flow-conditioner-config.json` | Component configs | JSON | - |
| `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json` | Step1 tokenizer | JSON | - |

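Because inference depends on every file in the table above, it can help to confirm that a download is complete before running anything. The following is a minimal sketch using only the standard library; the file names are copied from the table, while the helper `missing_bundle_files` is a hypothetical name and not part of `mlx-speech`.

```python
from pathlib import Path

# File names copied from the Bundle Contents table.
EXPECTED_FILES = [
    "model.safetensors",
    "flow-model.safetensors",
    "vq02.safetensors",
    "vq06.safetensors",
    "hift.safetensors",
    "campplus.safetensors",
    "flow-conditioner.safetensors",
    "config.json",
    "flow-model-config.json",
    "vq02-config.json",
    "vq06-config.json",
    "hift-config.json",
    "campplus-config.json",
    "flow-conditioner-config.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]


def missing_bundle_files(bundle_dir):
    """Return the expected file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example: check the download location used later in this card.
missing = missing_bundle_files("models/stepfun/step_audio_editx/mlx-int8")
print("missing files:", missing if missing else "none")
```
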
## How to Get Started

Download the bundle:

```bash
hf download appautomaton/step-audio-editx-8bit-mlx \
  --local-dir models/stepfun/step_audio_editx/mlx-int8
```
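
The same download can likely be done from Python with `snapshot_download` from the `huggingface_hub` package. This is a sketch assuming that package is installed; the wrapper name `download_bundle` is hypothetical, and the repository ID and target directory are taken from the command above.

```python
REPO_ID = "appautomaton/step-audio-editx-8bit-mlx"
LOCAL_DIR = "models/stepfun/step_audio_editx/mlx-int8"


def download_bundle(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR) -> str:
    """Download every file in the repository and return the local folder path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (requires network access):
#   path = download_bundle()
```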

**Voice cloning:**

```bash
python scripts/generate/step_audio_editx.py \
  --prompt-audio reference.wav \
  --prompt-text "Transcript of reference audio." \
  -o cloned.wav \
  clone --target-text "New speech in the cloned voice."
```
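
To drive the cloning command from another Python program, one option is to build the argument list and hand it to `subprocess`. The sketch below only reuses the flags shown above; the helper names `clone_command` and `run_clone` are hypothetical, and it assumes the script is invoked from the root of the `mlx-speech` checkout.

```python
import subprocess
import sys

SCRIPT = "scripts/generate/step_audio_editx.py"


def clone_command(prompt_audio, prompt_text, target_text, output):
    """Build the argv list for the voice cloning command shown above."""
    return [
        sys.executable, SCRIPT,
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "-o", output,
        "clone", "--target-text", target_text,
    ]


def run_clone(prompt_audio, prompt_text, target_text, output):
    """Run the command and raise CalledProcessError if the script fails."""
    subprocess.run(
        clone_command(prompt_audio, prompt_text, target_text, output),
        check=True,
    )
```

Passing the arguments as a list avoids shell quoting problems when the transcript contains quotes or spaces.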

**Audio editing (change emotion):**

```bash
python scripts/generate/step_audio_editx.py \
  --prompt-audio input.wav \
  --prompt-text "Transcript of input audio." \
  -o happy.wav \
  edit --edit-type emotion --edit-info happy
```

## Supported Edit Types

| Edit type | Description | `--edit-info` examples |
| --- | --- | --- |
| `emotion` | Change the emotion of speech | `happy`, `sad`, `angry`, `surprised` |
| `style` | Change speaking style | `whispering`, `broadcasting`, `formal` |
| `speed` | Change speaking speed | `fast`, `slow` |
| `denoise` | Remove noise from audio | not used |
| `vad` | Remove silences from audio | not used |
| `paralinguistic` | Add non-verbal sounds | requires `--target-text` |

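The table can be read as a small validation rule: some edit types take an `--edit-info` value, some take nothing, and `paralinguistic` needs `--target-text`. The sketch below encodes that reading. It is an illustration, not the script's actual argument handling: the helper `edit_args` is hypothetical, treating `--edit-info` as mandatory for `emotion`, `style`, and `speed` is an assumption, and so is placing `--target-text` after the `edit` subcommand.

```python
# Which extra argument each edit type expects, per the table above.
# None means the edit type takes no extra value.
EDIT_TYPES = {
    "emotion": "edit_info",
    "style": "edit_info",
    "speed": "edit_info",
    "denoise": None,
    "vad": None,
    "paralinguistic": "target_text",
}


def edit_args(edit_type, edit_info=None, target_text=None):
    """Return the `edit` subcommand arguments, validating them against the table."""
    if edit_type not in EDIT_TYPES:
        raise ValueError(f"unknown edit type: {edit_type!r}")
    args = ["edit", "--edit-type", edit_type]
    needs = EDIT_TYPES[edit_type]
    if needs == "edit_info":
        if not edit_info:
            raise ValueError(f"{edit_type} needs an --edit-info value")
        args += ["--edit-info", edit_info]
    elif needs == "target_text":
        if not target_text:
            raise ValueError(f"{edit_type} needs --target-text")
        # Assumed flag placement; check the script's --help output.
        args += ["--target-text", target_text]
    return args


print(edit_args("emotion", edit_info="happy"))
```
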
## Architecture

Five-stage pipeline, all running pure MLX with bf16 activations:

1. **Step1 LM** (3.5B params, int8): autoregressive dual-codebook token generation
2. **CampPlus** (bf16): speaker embedding extraction from reference audio
3. **Flow conditioner** (bf16): conditions generation on speaker embedding
4. **Flow model** (int8): flow-matching mel spectrogram generation
5. **HiFT vocoder** (bf16): mel spectrogram to waveform

The VQ02 and VQ06 tokenizers encode reference audio into dual codebook tokens
consumed by Step1.

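For tooling that needs to reason about the pipeline, the stage order, precision, and weight file of each component can be written down as plain data. The sketch below only restates facts from this card; the `Stage` dataclass and the helper `stages_with_precision` are hypothetical names and do not reflect how `mlx-speech` organizes its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    precision: str
    weights: str
    role: str


# Stages in the order listed above, paired with files from Bundle Contents.
PIPELINE = [
    Stage("Step1 LM", "int8", "model.safetensors",
          "autoregressive dual-codebook token generation"),
    Stage("CampPlus", "bf16", "campplus.safetensors",
          "speaker embedding extraction from reference audio"),
    Stage("Flow conditioner", "bf16", "flow-conditioner.safetensors",
          "conditions generation on speaker embedding"),
    Stage("Flow model", "int8", "flow-model.safetensors",
          "flow-matching mel spectrogram generation"),
    Stage("HiFT vocoder", "bf16", "hift.safetensors",
          "mel spectrogram to waveform"),
]


def stages_with_precision(precision):
    """Return the names of the stages stored at the given precision."""
    return [stage.name for stage in PIPELINE if stage.precision == precision]


print(stages_with_precision("int8"))  # ['Step1 LM', 'Flow model']
```
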
## Performance

On Apple Silicon with int8 weights and bf16 activations, the real-time factor
(RTF) is approximately 1.46x for voice cloning, that is, faster than real time.

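To make the figure concrete, the sketch below assumes it is a speed multiple, meaning seconds of audio produced per second of compute. The card does not define the ratio explicitly, so this reading is an assumption; the helper names and the 10-second clip are hypothetical.

```python
def speed_multiple(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio produced per second of compute (above 1.0 beats real time)."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, multiple: float = 1.46) -> float:
    """Estimated compute time for a clip at the given speed multiple."""
    return audio_seconds / multiple


# A hypothetical 10-second clip at 1.46x takes roughly 6.85 s to generate.
print(f"{processing_time(10.0):.2f} s")
```
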
## Links

- Source code: [`mlx-speech`](https://github.com/appautomaton/mlx-speech)
- Upstream model: [`stepfun-ai/Step-Audio-EditX`](https://huggingface.co/stepfun-ai/Step-Audio-EditX)
- Technical report: [arXiv:2511.03601](https://arxiv.org/abs/2511.03601)
- More examples: [AppAutomaton](https://github.com/appautomaton)

## License

Apache 2.0, following the upstream license published with
[`stepfun-ai/Step-Audio-EditX`](https://huggingface.co/stepfun-ai/Step-Audio-EditX).