Supertonic-3 - CoreML (fp16, ANE-ready)
CoreML conversion of Supertone/supertonic-3, a 99M-parameter multilingual TTS model. All 4 components run on the Apple Neural Engine (1.8-3.7× faster than CPU on M-series chips).
| Component | Size | Role |
|---|---|---|
| fp16/duration_predictor.mlpackage | 15 MB | text -> frame count |
| fp16/text_encoder.mlpackage | 71 MB | text -> conditioning latent |
| fp16/vector_estimator.mlpackage | 135 MB | flow-matching denoiser (8 steps) |
| fp16/vocoder.mlpackage | 51 MB | latent -> 44.1 kHz waveform |
| Total | 272 MB | (originals: ~400 MB ONNX) |
Quickstart
pip install coremltools soundfile numpy supertonic
git clone https://huggingface.co/Reza2kn/supertonic-3-coreml
cd supertonic-3-coreml
# Short prompt
python inference.py --text "Hello, world." --voice F1 --lang en --out hello.wav
# Long prompt: use --auto-pad for full content rendering
python inference.py \
--text "A gentle breeze moved through the open window while everyone listened to the story. The narrator paused, took a slow breath, and continued in a softer tone. Outside, the city carried on, unaware of the quiet moment unfolding inside." \
--voice F5 --lang en --auto-pad --out long.wav
10 voice styles ship in voice_styles/: F1-F5 (female), M1-M5 (male).
31 languages supported via unicode_indexer.json.
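To audition every bundled voice on the same sentence, the quickstart command can be scripted. The sketch below uses only the flags shown above (`--text`, `--voice`, `--lang`, `--out`, `--auto-pad`); the helper names `build_command` and `render_all_voices` and the output file naming are illustrative assumptions, not part of the repo.

```python
import subprocess
import sys

# Voice names as listed in voice_styles/: F1-F5 (female), M1-M5 (male).
VOICES = [f"{g}{i}" for g in ("F", "M") for i in range(1, 6)]


def build_command(text, voice, lang="en", auto_pad=False, out=None):
    """Build the argv list for one inference.py call (hypothetical helper)."""
    cmd = [sys.executable, "inference.py",
           "--text", text, "--voice", voice, "--lang", lang,
           "--out", out or f"{voice}.wav"]
    if auto_pad:
        cmd.append("--auto-pad")
    return cmd


def render_all_voices(text, lang="en", auto_pad=False):
    """Run inference.py once per voice; call this from the repo root."""
    for voice in VOICES:
        subprocess.run(build_command(text, voice, lang, auto_pad), check=True)
```

Calling `render_all_voices("Hello, world.")` from the cloned repo would write `F1.wav` through `M5.wav` into the current directory.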
The auto-pad trick (why --auto-pad matters)
The supertonic-3 model has a soft cap on how much speech it renders per utterance. For long inputs (more than ~13 s of natural speech) the model truncates the prompt and emits a low-amplitude filler tone for the rest of the budget. The CoreML conversion's static bucket (T=L=320) extends this cap by ~3 s, because the bucket's padded positions leak into the real positions through ConvNeXt's dilated convolutions. That is why CoreML inference sounds more natural than the original ONNX library (proper word separation, intonation), but it still cuts off mid-sentence on long prompts.
--auto-pad is a two-pass workaround:
- Pass 1 synthesizes the prompt alone at full bucket length to find where the model's content naturally stops (t_orig).
- Pass 2 appends a long filler sentence (" And with that, the gentle silence wrapped itself around the room.") that gives the model extra frames to fully render the original prompt, then renders the filler sentence, then drops into the filler tone.
- The longest clean-silence gap after t_orig is the boundary between the original prompt and the appended filler. The pipeline trims there and tail-pads with 0.5 s of true silence.
Cost: ~2× synthesis time. Worth it for any prompt over ~5 s.
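The trimming step described above can be sketched as follows. This is a minimal illustration, not the code in inference.py: the function names, the amplitude threshold used to call a sample "silent", and the choice to cut at the middle of the gap are all assumptions.

```python
def longest_silence_after(samples, start, threshold=1e-3):
    """Return (begin, end) of the longest run with |x| < threshold at or after start.

    `end` is exclusive. Returns None if no such run exists.
    """
    best = None
    run_begin = None
    # Iterate one past the end so a run that reaches the last sample is closed.
    for i in range(start, len(samples) + 1):
        silent = i < len(samples) and abs(samples[i]) < threshold
        if silent and run_begin is None:
            run_begin = i
        elif not silent and run_begin is not None:
            if best is None or i - run_begin > best[1] - best[0]:
                best = (run_begin, i)
            run_begin = None
    return best


def trim_at_boundary(samples, t_orig, sample_rate=44100, tail_s=0.5):
    """Cut at the longest silent gap after t_orig (seconds), then append silence."""
    gap = longest_silence_after(samples, int(t_orig * sample_rate))
    # Assumed policy: cut in the middle of the gap; keep everything if no gap.
    cut = len(samples) if gap is None else (gap[0] + gap[1]) // 2
    return list(samples[:cut]) + [0.0] * int(tail_s * sample_rate)
```

Here `samples` is the pass-2 waveform as a flat sequence of floats and `t_orig` is the stop time measured in pass 1.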
ANE engagement
All 4 components compile to ANE-resident programs when loaded with
compute_units=ALL (default). Measured speedups on M2 Pro vs CPU:
| Component | ANE speedup |
|---|---|
| duration_predictor | 1.9× |
| text_encoder | 2.8× |
| vector_estimator | 2.4× (per step; 8 steps total) |
| vocoder | 3.7× |
Verify ANE engagement with:
xctrace record --template "Core ML" --output trace.trace --launch -- python inference.py --text "test"
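A comparison like the one in the speedup table could be reproduced with a small timing harness. The sketch below is an assumption about methodology (median over repeated calls after warm-up), not a description of how the numbers above were measured. `ct.ComputeUnit.ALL` lets Core ML schedule work on the ANE but does not by itself prove the ANE was used, so the trace above remains the way to confirm engagement.

```python
import statistics
import time


def load_model(path, cpu_only=False):
    """Load an .mlpackage with coremltools (imported lazily; macOS only)."""
    import coremltools as ct
    units = ct.ComputeUnit.CPU_ONLY if cpu_only else ct.ComputeUnit.ALL
    return ct.models.MLModel(path, compute_units=units)


def median_latency(fn, repeats=20, warmup=3):
    """Median wall-clock seconds of fn() after a few warm-up calls."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.median(times)


def speedup(cpu_fn, ane_fn, repeats=20):
    """Ratio of CPU-only latency to ALL-compute-units latency."""
    return median_latency(cpu_fn, repeats) / median_latency(ane_fn, repeats)
```

With `inputs` a dict of arrays matching a component's input names (not listed in this README; they can be inspected via `model.get_spec().description.input`), the two callables would be `lambda: cpu_model.predict(inputs)` and `lambda: ane_model.predict(inputs)`.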
Conversion notes
- Static bucket: T=320 (text length), L=320 (latent length). Inputs are zero-padded on the right and masked. Bucket = 22.3 s of audio.
- duration_predictor, text_encoder, vocoder are hand-reimplemented in PyTorch from the ONNX initializers, then traced to CoreML. Per-component float parity vs ONNX: max-abs 2e-5 (dp), cos 0.998 (text_encoder), cos 0.9998 (vocoder).
- vector_estimator (the heavy diffusion model) goes through onnxsim.simplify(T=L=320) -> onnx2torch.convert -> torch.jit.trace -> coremltools. Cos 0.998 vs ONNX per diffusion step.
- The diffusion sampler stays host-side (8 Euler steps over the single step graph). All 4 components are individually quantizable.
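A host-side Euler sampler of the kind mentioned in the last note is a fixed-step integration of the flow-matching ODE, x_{k+1} = x_k + Δt · v(x_k, t_k). The sketch below assumes time runs from 0 to 1 in equal steps and that the step graph returns a velocity. Neither the repo's actual time schedule nor the vector_estimator's input signature is stated here, so `velocity_fn` is a hypothetical stand-in for one CoreML call.

```python
def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps.

    velocity_fn(x, t) stands in for one call to the single-step graph and
    must return a sequence the same length as x.
    """
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * dt  # assumed uniform schedule on [0, 1)
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

Keeping this loop on the host means the CoreML graph only has to represent a single step, and the step count can be changed without reconverting the model.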
License
This conversion follows the original Supertone/supertonic-3 license
(OpenRAIL). See LICENSE (or the upstream model card).
Why fp16 and not INT4?
We attempted to ship an INT4 variant. After exhaustive testing (INT4 sweep notes below), we found that the vocoder and vector_estimator in the supertonic-3 architecture cannot go below INT8:
| Component | INT4 result | Why |
|---|---|---|
| vocoder | cos ≈ 0 across all 6 palette configs (kmeans/uniform, pgc with group_size 8/16/32/64). Mixed-precision (head-only carve-out) also broken. | HiFi-GAN-style upsampling is uniformly sensitive; INT8 (cos 0.99) is the floor. |
| vector_estimator | per-step cos 0.82-0.999; over 8 diffusion steps this compounds to garbled / "backwards-sounding" audio. | Flow-matching diffusion accumulates error. INT8 works (per-step cos 0.96-0.998 except short_ja_F3 which is architecturally cos 0.82 even at INT8). |
| duration_predictor | smallest drift was 0.11 s with pt_uniform, but enough to shift L_real -> bucket-leak boundary moves -> pacing perceptibly breaks. | dp output sets the diffusion frame budget; any drift propagates. |
| text_encoder | cos 0.97 at pgc_g32 (works alone). | Conditioning quality compounds with VE drift. |
The best achievable mixed config (only voc INT8, others fp16) saves ~25 MB out of 272 MB, which is not worth a separate variant. The fp16 build shipped here is the final deliverable.
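The `cos` and max-abs figures quoted in this README read as cosine similarity and maximum absolute difference between a reference output and a converted or quantized one. That is an interpretation, since the README does not define them. A minimal sketch of both metrics on flattened outputs, with hypothetical function names:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length flat sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero inputs
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A per-step cosine close to 1 does not bound the end-to-end error of the vector_estimator, because each of the 8 steps feeds its output into the next. Judging a quantized build therefore also needs a comparison of the final waveform.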
Companion build
The cross-platform LiteRT version is at Reza2kn/supertonic-3-litert. LiteRT ships INT4/INT8 + an INT8 quantized ONNX vector_estimator (65 MB instead of 256 MB), but LiteRT can't reproduce the CoreML-only "bucket-leak" extension, so long prompts sound rushed on LiteRT. Use CoreML for full quality on Apple platforms.
Credits
- Original model: Supertone/supertonic-3
- CoreML conversion + auto-pad workflow: this repo
- Companion LiteRT build: Reza2kn/supertonic-3-litert