metadata
license: apache-2.0
base_model: openbmb/VoxCPM2
tags:
  - mlx
  - tts
  - text-to-speech
  - voice-cloning
  - voice-design
  - multilingual
library_name: mlx-audio
pipeline_tag: text-to-speech
language:
  - en
  - zh
  - id
  - ja
  - ko
  - multilingual

VoxCPM2 - 4-bit quantized

MLX port of openbmb/VoxCPM2 — a 2B-parameter multilingual TTS model with 48kHz studio-quality output, voice cloning, and voice design.

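Only the language-model (LM) layers are quantized to 4 bits; the VAE and DiT remain at full precision. Of the variants listed under Performance, this one is the smallest (2.30 GB) and the fastest (0.90x RTF), with minimal quality loss.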
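
To make "4-bit quantized" concrete, here is a minimal pure-Python sketch of group-wise affine quantization: each group of weights is stored as 4-bit integer codes plus one scale and one offset. It illustrates the general technique only. The group size of 64, the function names, and the rounding details are assumptions for illustration, not necessarily the exact scheme used to produce this checkpoint.

```python
import math


def quantize_group(weights, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1                 # 15 distinct steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo                  # lo acts as the per-group offset


def quantize(weights, group_size=64, bits=4):
    """Split the weights into groups and quantize each group independently."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize(groups):
    """Rebuild approximate float weights from (codes, scale, offset) groups."""
    restored = []
    for codes, scale, offset in groups:
        restored.extend(code * scale + offset for code in codes)
    return restored


# Toy "weights": 128 values, so two groups of 64 (assumed group size).
weights = [math.sin(i) for i in range(128)]
groups = quantize(weights)
restored = dequantize(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(groups)}, max abs error: {max_error:.4f}")
```

In this sketch each restored weight differs from the original by at most half a quantization step (`scale / 2`), so the error shrinks as the value range inside a group gets narrower.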

Features

  • 30 languages — including English, Chinese, Indonesian, Japanese, Korean, and more
  • 48kHz output — studio-quality audio
  • Voice Design — create voices from text descriptions (no reference audio needed)
  • Voice Cloning — clone any voice from a short audio reference
  • 4 generation modes — zero-shot, continuation, reference cloning, combined

Usage

pip install mlx-audio

# Zero-shot
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit --text "Hello world" --verbose

# Voice design
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --instruct "A young woman, gentle and sweet voice"

# Voice cloning
python -m mlx_audio.tts.generate --model mlx-community/VoxCPM2-4bit \
  --text "Hello world" \
  --ref_audio speaker.wav --ref_text "reference text"
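
The three commands above differ only in their optional flags, so they can be assembled from one helper when scripting batch jobs. The following is a sketch: `build_tts_command` is a hypothetical helper name, and it uses only the flags shown above (`--model`, `--text`, `--instruct`, `--ref_audio`, `--ref_text`, `--verbose`). It builds the argument list without running it.

```python
import shlex
import sys


def build_tts_command(
    text,
    model="mlx-community/VoxCPM2-4bit",
    instruct=None,
    ref_audio=None,
    ref_text=None,
    verbose=False,
):
    """Return the argv list for `python -m mlx_audio.tts.generate` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "mlx_audio.tts.generate", "--model", model, "--text", text]
    if instruct is not None:      # voice design: describe the voice in text
        cmd += ["--instruct", instruct]
    if ref_audio is not None:     # voice cloning: reference recording
        cmd += ["--ref_audio", ref_audio]
    if ref_text is not None:      # transcript of the reference recording
        cmd += ["--ref_text", ref_text]
    if verbose:
        cmd.append("--verbose")
    return cmd


design_cmd = build_tts_command("Hello world", instruct="A young woman, gentle and sweet voice")
clone_cmd = build_tts_command("Hello world", ref_audio="speaker.wav", ref_text="reference text")
print(shlex.join(design_cmd))
print(shlex.join(clone_cmd))
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` in an environment where mlx-audio is installed.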

Python API

from mlx_audio.tts import load_model

model = load_model("mlx-community/VoxCPM2-4bit")

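# Generate speech and iterate over the results yielded by generate()
for result in model.generate(
    text="Hello, this is VoxCPM2 on Apple Silicon.",
    inference_timesteps=7,  # diffusion steps; the Performance table below uses 7
    cfg_value=2.0,  # guidance scale
):
    print(f"Duration: {result.audio_duration}")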

Performance (Apple Silicon)

| Variant | Size | RTF (7 timesteps) |
|---------|---------|-------|
| bf16 | 4.96 GB | 0.48x |
| 8-bit | 3.23 GB | 0.85x |
| 4-bit | 2.30 GB | 0.90x |

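RTF = real-time factor, read here as seconds of audio produced per second of generation time. Higher is faster, and values above 1.0 are faster than realtime.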
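
As a worked example of reading these numbers, the sketch below assumes RTF means audio duration divided by generation time, which is the reading consistent with ">1.0 = faster than realtime". The function names are illustrative; the sizes and RTF values are copied from the table above.

```python
def real_time_factor(audio_seconds, wall_seconds):
    """RTF under the assumed convention: audio produced per second of compute."""
    return audio_seconds / wall_seconds


def generation_time(audio_seconds, rtf):
    """Estimated wall-clock seconds needed to synthesize `audio_seconds` of audio."""
    return audio_seconds / rtf


# (size in GB, RTF at 7 timesteps), taken from the table above.
VARIANTS = {
    "bf16": (4.96, 0.48),
    "8-bit": (3.23, 0.85),
    "4-bit": (2.30, 0.90),
}

bf16_size = VARIANTS["bf16"][0]
for name, (size_gb, rtf) in VARIANTS.items():
    seconds = generation_time(10.0, rtf)
    print(
        f"{name}: ~{seconds:.1f} s to synthesize 10 s of audio, "
        f"{size_gb / bf16_size:.0%} of the bf16 size"
    )
```

Under this reading, the 4-bit variant needs roughly 11 seconds to produce 10 seconds of audio, against roughly 21 seconds for bf16, at a little under half the bf16 size.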

Original Model

openbmb/VoxCPM2, converted to MLX format with mlx-audio.