# Chatterbox TTS CoreML

CoreML port of Chatterbox TTS by
Resemble AI. The full pipeline (T3 transformer → s3 speech tokenizer → s3gen
flow + flow estimator → mel-to-wav vocoder, plus the voice encoder and
CAMPPlus speaker encoder used for one-shot voice cloning) is compiled to
`.mlpackage` for on-device inference on Apple Silicon Macs.
## Intended use

Drop-in for Apple platform apps that want offline, on-device TTS with optional voice cloning. Built specifically to back the narration feature in Screen Cut Pro, but the files are the model only; nothing here is app-specific.
## Files

| File | Role |
|---|---|
| `t3_text_emb.mlpackage` | T3 text token embedding |
| `t3_text_pos_emb.mlpackage` | T3 text positional embedding |
| `t3_speech_emb.mlpackage` | T3 speech-token embedding |
| `t3_speech_pos_emb.mlpackage` | T3 speech-token positional embedding |
| `t3_cond_enc.mlpackage` | T3 conditioning encoder (speaker + emotion + prompt) |
| `t3_tfmr.mlpackage` | T3 Llama-style transformer (autoregressive backbone) |
| `t3_speech_head.mlpackage` | T3 speech-token output projection |
| `s3_tokenizer.mlpackage` | s3 speech tokenizer (mel → speech tokens) |
| `voice_encoder.mlpackage` | T3 speaker encoder (256-d embedding) |
| `campplus.mlpackage` | CAMPPlus speaker encoder for s3gen (192-d embedding) |
| `flow_encoder.mlpackage` | s3gen flow encoder |
| `flow_estimator.mlpackage` | s3gen flow estimator (diffusion) |
| `mel2wav.mlpackage` | Vocoder (mel → 24 kHz waveform) |
| `default_t3_speaker_emb.bin` | Bundled default voice: T3 speaker embedding |
| `default_t3_cond_prompt_tokens.bin` | Bundled default voice: T3 conditioning prompt tokens |
| `default_flow_prompt_token.bin` | Bundled default voice: s3gen prompt tokens |
| `default_flow_prompt_feat.bin` | Bundled default voice: s3gen prompt features |
| `default_flow_speaker_embedding.bin` | Bundled default voice: s3gen speaker embedding |
## Conversion notes

Converted from the upstream PyTorch checkpoints with coremltools. A few
non-obvious patches were required to make the ONNX→CoreML path work:
- s3 tokenizer RoPE: the upstream uses `complex64` rotary embeddings, which CoreML does not support. Replaced with split real `(cos, sin)` tensors at trace time.
- Fixed-length traces: CoreML traces shapes statically; the flow encoder is traced at 400 prompt tokens / 1024 mel frames. Inputs longer than that must be chunked by the host application.
- Mel features: the upstream uses `librosa.filters.mel` (slaney) for voice encoding and `kaldi.fbank` (htk) for CAMPPlus. The conversion script bakes those windowing assumptions into the model where possible; the host app reproduces them where not. See `Tools/compare_mels.py` in the source app for verification.
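The RoPE patch above amounts to replacing a complex multiply with its real-valued expansion. A NumPy sketch of the equivalence (not the actual trace code, which operates on PyTorch tensors):

```python
import numpy as np

def rope_complex(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Rotary embedding via a complex64-style multiply (the form CoreML
    rejects). x has even last dim; pairs (x[2i], x[2i+1]) act as complex."""
    xc = x[..., 0::2] + 1j * x[..., 1::2]
    rot = xc * np.exp(1j * theta)
    out = np.empty_like(x)
    out[..., 0::2] = rot.real
    out[..., 1::2] = rot.imag
    return out

def rope_real(x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """The same rotation with split real (cos, sin) tensors, as substituted
    at trace time."""
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

Both paths produce identical outputs; only the second uses ops CoreML can compile.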
The conversion script is in the source app's `Tools/` directory:
`convert_chatterbox_to_coreml.py`.
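Because the flow encoder is traced at a fixed 1024 mel frames, a host app has to window longer inputs itself. A minimal sketch of one chunking policy (non-overlapping windows; this is an assumption for illustration, not the Screen Cut Pro implementation, which could overlap and cross-fade):

```python
from typing import List

MAX_MEL_FRAMES = 1024  # flow encoder trace length (see conversion notes)

def chunk_frames(n_frames: int, max_len: int = MAX_MEL_FRAMES) -> List[range]:
    """Split a mel sequence of n_frames into contiguous windows that each
    fit the fixed-shape trace. Each window is at most max_len frames."""
    return [range(start, min(start + max_len, n_frames))
            for start in range(0, n_frames, max_len)]

# chunk_frames(2500) -> windows [0, 1024), [1024, 2048), [2048, 2500)
```

Each window would be run through the flow encoder separately, padding the final (shorter) window up to the traced shape if the model requires it.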
## License

MIT, inherited from upstream Chatterbox. See `LICENSE` for the
full text and `NOTICE` for attribution and a summary of
modifications.
## Citation

If you use this in research, cite the upstream model:

```bibtex
@misc{chatterbox2024,
  title  = {Chatterbox},
  author = {Resemble AI},
  year   = {2024},
  url    = {https://github.com/resemble-ai/chatterbox}
}
```