Chatterbox TTS — CoreML

A CoreML port of Chatterbox TTS by Resemble AI. The full pipeline (T3 transformer → s3 speech tokenizer → s3gen flow encoder + flow estimator → mel-to-wav vocoder, plus the voice encoder and CAMPPlus speaker encoder used for one-shot voice cloning) is compiled to .mlpackage for on-device inference on Apple Silicon Macs.

Intended use

A drop-in for Apple platform apps that want offline, on-device TTS with optional voice cloning. Built specifically to back the narration feature in Screen Cut Pro, but the files are the model only; nothing here is app-specific.

Files

| File | Role |
|------|------|
| t3_text_emb.mlpackage | T3 text token embedding |
| t3_text_pos_emb.mlpackage | T3 text positional embedding |
| t3_speech_emb.mlpackage | T3 speech-token embedding |
| t3_speech_pos_emb.mlpackage | T3 speech-token positional embedding |
| t3_cond_enc.mlpackage | T3 conditioning encoder (speaker + emotion + prompt) |
| t3_tfmr.mlpackage | T3 Llama-style transformer (autoregressive backbone) |
| t3_speech_head.mlpackage | T3 speech-token output projection |
| s3_tokenizer.mlpackage | s3 speech tokenizer (mel → speech tokens) |
| voice_encoder.mlpackage | T3 speaker encoder (256-d embedding) |
| campplus.mlpackage | CAMPPlus speaker encoder for s3gen (192-d embedding) |
| flow_encoder.mlpackage | s3gen flow encoder |
| flow_estimator.mlpackage | s3gen flow estimator (diffusion) |
| mel2wav.mlpackage | Vocoder (mel → 24 kHz waveform) |
| default_t3_speaker_emb.bin | Bundled default voice — T3 speaker embedding |
| default_t3_cond_prompt_tokens.bin | Bundled default voice — T3 conditioning prompt tokens |
| default_flow_prompt_token.bin | Bundled default voice — s3gen prompt tokens |
| default_flow_prompt_feat.bin | Bundled default voice — s3gen prompt features |
| default_flow_speaker_embedding.bin | Bundled default voice — s3gen speaker embedding |

Conversion notes

Converted from the upstream PyTorch checkpoints with coremltools. A few non-obvious patches were required to make the PyTorch→CoreML conversion work:

  • s3 tokenizer RoPE — the upstream uses complex64 rotary embeddings, which CoreML does not support. Replaced with split real (cos, sin) tensors at trace time.
  • Fixed-length traces — CoreML traces shapes statically; the flow encoder is traced at 400 prompt tokens / 1024 mel frames. Inputs longer than that must be chunked by the host application.
  • Mel features — the upstream uses librosa.filters.mel (slaney) for voice encoding and kaldi.fbank (htk) for CAMPPlus. The conversion script bakes those windowing assumptions into the model where possible; the host app reproduces them where not. See Tools/compare_mels.py in the source app for verification.
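The RoPE substitution in the first bullet can be sanity-checked numerically. A minimal numpy sketch (the pairing convention and shapes here are illustrative, not the upstream layout) showing that a rotation built from split real (cos, sin) tensors matches the complex64 formulation:

```python
import numpy as np

def rope_complex(x, freqs):
    # x: (seq, dim), dim even; pair adjacent channels as complex numbers
    xc = x[:, 0::2] + 1j * x[:, 1::2]
    rotated = xc * np.exp(1j * freqs)      # complex64-style rotation
    out = np.empty_like(x)
    out[:, 0::2] = rotated.real
    out[:, 1::2] = rotated.imag
    return out

def rope_split(x, freqs):
    # Same rotation using only real tensors, which CoreML can represent
    cos, sin = np.cos(freqs), np.sin(freqs)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8)).astype(np.float32)
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 8, 2) / 8))
freqs = np.arange(5)[:, None] * inv_freq   # (seq, dim/2)
assert np.allclose(rope_complex(x, freqs), rope_split(x, freqs), atol=1e-5)
```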
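Because the flow encoder is traced at fixed lengths, the host has to window longer inputs itself. A hypothetical chunking helper showing the shape of that logic (the app's actual code may differ, e.g. in how it pads the final short chunk or overlaps windows):

```python
def chunk_frames(frames, max_len=1024):
    """Split a sequence into windows of at most max_len items for a
    statically traced model. The 400-token / 1024-frame limits come from
    the trace; padding the last short chunk is left to the caller."""
    return [frames[i:i + max_len] for i in range(0, len(frames), max_len)]

chunks = chunk_frames(list(range(2500)), max_len=1024)
assert [len(c) for c in chunks] == [1024, 1024, 452]
```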
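The slaney and htk mel scales genuinely disagree above roughly 1 kHz, which is why the two filterbanks cannot be swapped and the assumptions have to be baked in per model. A small sketch of the two Hz-to-mel formulas, written from the librosa and HTK conventions (not copied from the conversion script):

```python
import numpy as np

def hz_to_mel_htk(f):
    # HTK convention, as used by kaldi.fbank
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_mel_slaney(f):
    # librosa's default (slaney): linear below 1 kHz, logarithmic above
    f = np.asarray(f, dtype=np.float64)
    mel = f / (200.0 / 3.0)                         # linear region
    logstep = np.log(6.4) / 27.0
    log_mel = 15.0 + np.log(np.maximum(f, 1000.0) / 1000.0) / logstep
    return np.where(f >= 1000.0, log_mel, mel)

# The scales diverge, so filter center frequencies differ between the two
assert abs(float(hz_to_mel_slaney(500.0)) - 7.5) < 1e-9
assert abs(hz_to_mel_htk(700.0) - 2595.0 * np.log10(2.0)) < 1e-9
```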

The conversion script is in the source app's Tools/ directory: convert_chatterbox_to_coreml.py.

License

MIT, inherited from upstream Chatterbox. See LICENSE for the full text and NOTICE for attribution and a summary of modifications.

Citation

If you use this in research, cite the upstream model:

@misc{chatterbox2024,
  title  = {Chatterbox},
  author = {Resemble AI},
  year   = {2024},
  url    = {https://github.com/resemble-ai/chatterbox}
}