library_name: mlx
pipeline_tag: text-to-speech
base_model: stepfun-ai/Step-Audio-EditX
base_model_relation: quantized
license: apache-2.0
language:
  - en
  - zh
  - ja
  - ko
tags:
  - mlx
  - tts
  - speech
  - voice-cloning
  - audio-editing
  - step-audio
  - step-audio-editx
  - stepfun
  - quantized
  - int8
  - apple-silicon
  - bundled-components

Step-Audio-EditX — MLX 8-bit

This repository contains a self-contained pure-MLX int8 conversion of Step-Audio-EditX for local voice cloning and expressive audio editing on Apple Silicon. All pipeline components are stored as .safetensors — no PyTorch, ONNX, or NumPy files are required at inference time.

Model Details

Developed by: AppAutomaton
Upstream model: stepfun-ai/Step-Audio-EditX
Task: zero-shot voice cloning, expressive audio editing
Runtime: MLX on Apple Silicon
Precision: int8 for Step1 LM, Flow model, and VQ02 tokenizer; bf16 for the rest
Total size: ~4.1 GB (down from ~7.7 GB upstream)

Bundle Contents

This bundle is self-contained — all weights are packaged in one repository.

File	Component	Format	Size
`model.safetensors`	Step1 LM (3.5B params)	int8	3.5 GB
`flow-model.safetensors`	Flow model (DiT + conformer)	int8	181 MB
`vq02.safetensors`	VQ02 audio tokenizer	int8	162 MB
`vq06.safetensors`	VQ06 audio tokenizer	bf16	249 MB
`hift.safetensors`	HiFT vocoder	bf16	40 MB
`campplus.safetensors`	CampPlus speaker embedding	bf16	13 MB
`flow-conditioner.safetensors`	Flow conditioner	bf16	2.5 MB
`config.json`	Step1 LM config + quantization	JSON	—
`flow-model-config.json`	Flow model config	JSON	—
`vq02-config.json`, `vq06-config.json`	Tokenizer configs	JSON	—
`hift-config.json`, `campplus-config.json`, `flow-conditioner-config.json`	Component configs	JSON	—
`tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`	Step1 tokenizer	JSON	—
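As a quick sanity check, the weight sizes in the table can be summed and compared with the stated total of about 4.1 GB. The sketch below is plain arithmetic on the figures in this card. It assumes 1 GB = 1000 MB (the card does not say which convention it uses) and ignores the config and tokenizer files, whose sizes are not listed.

```python
# Sizes in MB, copied from the Bundle Contents table.
# Assumption: 1 GB = 1000 MB; the card does not state the unit convention.
WEIGHT_SIZES_MB = {
    "model.safetensors": 3500.0,
    "flow-model.safetensors": 181.0,
    "vq02.safetensors": 162.0,
    "vq06.safetensors": 249.0,
    "hift.safetensors": 40.0,
    "campplus.safetensors": 13.0,
    "flow-conditioner.safetensors": 2.5,
}

UPSTREAM_GB = 7.7  # approximate upstream size quoted in Model Details


def total_size_gb(sizes_mb):
    """Sum per-file sizes given in MB and return the total in GB."""
    return sum(sizes_mb.values()) / 1000.0


total_gb = total_size_gb(WEIGHT_SIZES_MB)
print(f"total weights: {total_gb:.2f} GB")
print(f"fraction of upstream size: {total_gb / UPSTREAM_GB:.2f}")
```

The sum lands between 4.1 and 4.2 GB, in line with the stated total, and works out to a little over half of the upstream size.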

How to Get Started

Download the bundle:

hf download appautomaton/step-audio-editx-8bit-mlx \
  --local-dir models/stepfun/step_audio_editx/mlx-int8
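Once the download finishes, it can be useful to confirm that every file from the Bundle Contents table is present. The following is a minimal sketch, not part of the repository's scripts. The helper name `missing_files` is hypothetical, and it assumes all files sit at the top level of the download directory.

```python
from pathlib import Path

# File names copied from the Bundle Contents table.
EXPECTED_FILES = [
    "model.safetensors",
    "flow-model.safetensors",
    "vq02.safetensors",
    "vq06.safetensors",
    "hift.safetensors",
    "campplus.safetensors",
    "flow-conditioner.safetensors",
    "config.json",
    "flow-model-config.json",
    "vq02-config.json",
    "vq06-config.json",
    "hift-config.json",
    "campplus-config.json",
    "flow-conditioner-config.json",
    "tokenizer.json",
    "tokenizer.model",
    "tokenizer_config.json",
]


def missing_files(bundle_dir):
    """Return the expected file names that are not present in bundle_dir.

    Assumption: the bundle uses a flat layout (no subdirectories).
    """
    root = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("models/stepfun/step_audio_editx/mlx-int8")
    print("bundle complete" if not missing else f"missing: {missing}")
```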

Voice cloning:

python scripts/generate/step_audio_editx.py \
  --prompt-audio reference.wav \
  --prompt-text "Transcript of reference audio." \
  -o cloned.wav \
  clone --target-text "New speech in the cloned voice."

Audio editing (change emotion):

python scripts/generate/step_audio_editx.py \
  --prompt-audio input.wav \
  --prompt-text "Transcript of input audio." \
  -o happy.wav \
  edit --edit-type emotion --edit-info happy

Supported Edit Types

Edit type	Description	`--edit-info` examples
`emotion`	Change the emotion of speech	`happy`, `sad`, `angry`, `surprised`
`style`	Change speaking style	`whispering`, `broadcasting`, `formal`
`speed`	Change speaking speed	`fast`, `slow`
`denoise`	Remove noise from audio	not used
`vad`	Remove silences from audio	not used
`paralinguistic`	Add non-verbal sounds	requires `--target-text`
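When driving the script from Python, the rules in this table can be checked before the process is launched. The sketch below builds the argument list for an `edit` run using only the flags shown in this card. The helper name `build_edit_command` is hypothetical. Two points are assumptions rather than documented behavior: that `--edit-info` is mandatory for `emotion`, `style`, and `speed`, and that `--target-text` is placed after the `edit` subcommand in the same way it follows `clone`.

```python
# Edit types from the Supported Edit Types table, grouped by the extra
# argument each one uses.
NEEDS_EDIT_INFO = {"emotion", "style", "speed"}
NO_EDIT_INFO = {"denoise", "vad"}
NEEDS_TARGET_TEXT = {"paralinguistic"}


def build_edit_command(prompt_audio, prompt_text, output, edit_type,
                       edit_info=None, target_text=None):
    """Build the argv list for an `edit` run of step_audio_editx.py."""
    known = NEEDS_EDIT_INFO | NO_EDIT_INFO | NEEDS_TARGET_TEXT
    if edit_type not in known:
        raise ValueError(f"unknown edit type: {edit_type!r}")
    if edit_type in NEEDS_EDIT_INFO and not edit_info:
        raise ValueError(f"{edit_type!r} needs an --edit-info value")
    if edit_type in NEEDS_TARGET_TEXT and not target_text:
        raise ValueError(f"{edit_type!r} needs --target-text")

    cmd = [
        "python", "scripts/generate/step_audio_editx.py",
        "--prompt-audio", prompt_audio,
        "--prompt-text", prompt_text,
        "-o", output,
        "edit", "--edit-type", edit_type,
    ]
    if edit_type in NEEDS_EDIT_INFO:
        cmd += ["--edit-info", edit_info]
    if edit_type in NEEDS_TARGET_TEXT:
        # Assumption: flag placement mirrors `clone --target-text ...`.
        cmd += ["--target-text", target_text]
    return cmd


cmd = build_edit_command("input.wav", "Transcript of input audio.",
                         "happy.wav", "emotion", edit_info="happy")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with transcripts that contain spaces or quotes.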

Architecture

Five-stage pipeline, all running pure MLX with bf16 activations:

1. Step1 LM (3.5B params, int8): autoregressive dual-codebook token generation
2. CampPlus (bf16): speaker embedding extraction from reference audio
3. Flow conditioner (bf16): conditions generation on speaker embedding
4. Flow model (int8): flow-matching mel spectrogram generation
5. HiFT vocoder (bf16): mel spectrogram to waveform

The VQ02 and VQ06 tokenizers encode reference audio into dual codebook tokens consumed by Step1.
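The data flow described above can be written down as a small dependency check. This is only an illustration of the stage order, not the actual implementation. The intermediate value names (`prompt_tokens`, `speech_tokens`, `conditioning`, and so on) are made-up labels, and the inputs assigned to each stage are one reading of the descriptions in this section.

```python
# (stage name, inputs, output). All value names are illustrative labels,
# not identifiers from the real code base.
PRE_STAGE = ("VQ02 + VQ06 tokenizers", ["reference_audio"], "prompt_tokens")
STAGES = [
    ("Step1 LM", ["prompt_tokens", "target_text"], "speech_tokens"),
    ("CampPlus", ["reference_audio"], "speaker_embedding"),
    ("Flow conditioner", ["speaker_embedding"], "conditioning"),
    ("Flow model", ["speech_tokens", "conditioning"], "mel"),
    ("HiFT vocoder", ["mel"], "waveform"),
]


def run_order_check(stages, available):
    """Walk the stages in order and confirm each input already exists."""
    available = set(available)
    for name, inputs, output in stages:
        missing = [value for value in inputs if value not in available]
        if missing:
            raise ValueError(f"{name} is missing inputs: {missing}")
        available.add(output)
    return available


final = run_order_check([PRE_STAGE] + STAGES, {"reference_audio", "target_text"})
print("waveform produced:", "waveform" in final)
```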

Performance

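On Apple Silicon with int8 weights and bf16 activations, the real-time factor (RTF) for voice cloning is approximately 1.46x. The figure is expressed as a speed multiplier (seconds of audio generated per second of compute), so a value above 1 means faster than real time.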
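The term RTF is used in two opposite senses, so it helps to make the arithmetic explicit. The sketch below assumes the 1.46x figure is a speed multiplier, which is the reading consistent with "faster than real time", and also shows the other common convention (processing time divided by audio duration, where values below 1 are faster than real time). The timing numbers are hypothetical and chosen only to reproduce the quoted figure.

```python
def speed_multiplier(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def conventional_rtf(audio_seconds, wall_seconds):
    """Wall-clock time divided by audio duration (below 1 is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


# Hypothetical measurement: 14.6 s of audio generated in 10 s of compute.
audio_s, wall_s = 14.6, 10.0
print(f"speed multiplier: {speed_multiplier(audio_s, wall_s):.2f}x")
print(f"conventional RTF: {conventional_rtf(audio_s, wall_s):.2f}")
```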

License

Apache 2.0 — following the upstream license published with stepfun-ai/Step-Audio-EditX.
