Chatterbox TTS — CoreML

A CoreML port of Chatterbox TTS by Resemble AI. The full pipeline (T3 transformer → s3 speech tokenizer → s3gen flow encoder + flow estimator → mel-to-wav vocoder, plus the voice encoder and CAMPPlus speaker encoder used for one-shot voice cloning) is compiled to .mlpackage for on-device inference on Apple Silicon Macs.

Intended use

A drop-in for Apple platform apps that want offline, on-device TTS with optional voice cloning. Built specifically to back the narration feature in Screen Cut Pro, but the files are the model only; nothing here is app-specific.

Files

| File | Role |
|------|------|
| t3_text_emb.mlpackage | T3 text token embedding |
| t3_text_pos_emb.mlpackage | T3 text positional embedding |
| t3_speech_emb.mlpackage | T3 speech-token embedding |
| t3_speech_pos_emb.mlpackage | T3 speech-token positional embedding |
| t3_cond_enc.mlpackage | T3 conditioning encoder (speaker + emotion + prompt) |
| t3_tfmr.mlpackage | T3 Llama-style transformer (autoregressive backbone) |
| t3_speech_head.mlpackage | T3 speech-token output projection |
| s3_tokenizer.mlpackage | s3 speech tokenizer (mel → speech tokens) |
| voice_encoder.mlpackage | T3 speaker encoder (256-d embedding) |
| campplus.mlpackage | CAMPPlus speaker encoder for s3gen (192-d embedding) |
| flow_encoder.mlpackage | s3gen flow encoder |
| flow_estimator.mlpackage | s3gen flow estimator (diffusion) |
| mel2wav.mlpackage | Vocoder (mel → 24 kHz waveform) |
| default_t3_speaker_emb.bin | Bundled default voice — T3 speaker embedding |
| default_t3_cond_prompt_tokens.bin | Bundled default voice — T3 conditioning prompt tokens |
| default_flow_prompt_token.bin | Bundled default voice — s3gen prompt tokens |
| default_flow_prompt_feat.bin | Bundled default voice — s3gen prompt features |
| default_flow_speaker_embedding.bin | Bundled default voice — s3gen speaker embedding |

Conversion notes

Converted from the upstream PyTorch checkpoints with coremltools. A few non-obvious patches were required to make the PyTorch→CoreML conversion work:

  • s3 tokenizer RoPE — the upstream uses complex64 rotary embeddings, which CoreML does not support. Replaced with split real (cos, sin) tensors at trace time.
  • Fixed-length traces — CoreML traces shapes statically; the flow encoder is traced at 400 prompt tokens / 1024 mel frames. Inputs longer than that must be chunked by the host application.
  • Mel features — the upstream uses librosa.filters.mel (slaney) for voice encoding and kaldi.fbank (htk) for CAMPPlus. The conversion script bakes those windowing assumptions into the model where possible; the host app reproduces them where not. See Tools/compare_mels.py in the source app for verification.
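The RoPE substitution in the first bullet can be sanity-checked numerically. A minimal numpy sketch (the pairing convention and shapes here are illustrative, not the upstream layout) showing that a rotation built from split real (cos, sin) tensors matches the complex64 formulation:

```python
import numpy as np

def rope_complex(x, freqs):
    # x: (seq, dim), dim even; pair adjacent channels as complex numbers
    xc = x[:, 0::2] + 1j * x[:, 1::2]
    rotated = xc * np.exp(1j * freqs)      # complex64-style rotation
    out = np.empty_like(x)
    out[:, 0::2] = rotated.real
    out[:, 1::2] = rotated.imag
    return out

def rope_split(x, freqs):
    # Same rotation using only real tensors, which CoreML can represent
    cos, sin = np.cos(freqs), np.sin(freqs)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8)).astype(np.float32)
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 8, 2) / 8))
freqs = np.arange(5)[:, None] * inv_freq   # (seq, dim/2)
assert np.allclose(rope_complex(x, freqs), rope_split(x, freqs), atol=1e-5)
```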
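Because the flow encoder is traced at fixed lengths, the host has to window longer inputs itself. A hypothetical chunking helper showing the shape of that logic (the app's actual code may differ, e.g. in how it pads the final short chunk or overlaps windows):

```python
def chunk_frames(frames, max_len=1024):
    """Split a sequence into windows of at most max_len items for a
    statically traced model. The 400-token / 1024-frame limits come from
    the trace; padding the last short chunk is left to the caller."""
    return [frames[i:i + max_len] for i in range(0, len(frames), max_len)]

chunks = chunk_frames(list(range(2500)), max_len=1024)
assert [len(c) for c in chunks] == [1024, 1024, 452]
```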
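The slaney and htk mel scales genuinely disagree above roughly 1 kHz, which is why the two filterbanks cannot be swapped and the assumptions have to be baked in per model. A small sketch of the two Hz-to-mel formulas, written from the librosa and HTK conventions (not copied from the conversion script):

```python
import numpy as np

def hz_to_mel_htk(f):
    # HTK convention, as used by kaldi.fbank
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_mel_slaney(f):
    # librosa's default (slaney): linear below 1 kHz, logarithmic above
    f = np.asarray(f, dtype=np.float64)
    mel = f / (200.0 / 3.0)                         # linear region
    logstep = np.log(6.4) / 27.0
    log_mel = 15.0 + np.log(np.maximum(f, 1000.0) / 1000.0) / logstep
    return np.where(f >= 1000.0, log_mel, mel)

# The scales diverge, so filter center frequencies differ between the two
assert abs(float(hz_to_mel_slaney(500.0)) - 7.5) < 1e-9
assert abs(hz_to_mel_htk(700.0) - 2595.0 * np.log10(2.0)) < 1e-9
```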

The conversion script is in the source app's Tools/ directory: convert_chatterbox_to_coreml.py.

License

MIT, inherited from upstream Chatterbox. See LICENSE for the full text and NOTICE for attribution and a summary of modifications.

Citation

If you use this in research, cite the upstream model:

@misc{chatterbox2024,
  title  = {Chatterbox},
  author = {Resemble AI},
  year   = {2024},
  url    = {https://github.com/resemble-ai/chatterbox}
}