phonetic-whisper-mlx-broad-multi

Whisper-large-v3 decoder fine-tuned for broad International Phonetic Alphabet (IPA) transcription across 8 languages, trained on a single Apple Silicon machine with MLX.

Companion variant: phonetic-whisper-mlx-narrow-en is trained on TIMIT narrow English alone and emits TIMIT-style narrow phonetic detail. Use this broad-multi variant for cross-lingual broad IPA; use narrow-en for English narrow IPA.

Code: barathanaslan/phonetic-whisper-mlx

Model description

phonetic-whisper-mlx-broad-multi is a decoder-only fine-tune of mlx-community/whisper-large-v3-mlx. The encoder is frozen during training; only the decoder weights are updated. The model takes 16 kHz audio and emits broad-phonemic IPA strings (no diacritics, merged allophones).

Output convention. Broad IPA, NFC-normalized, with the TIMIT-style closures (bcl, dcl, gcl, pcl, tcl, kcl) and silences (pau, epi, h#) dropped, allophonic glottal stops suppressed, syllabic diacritics stripped (m̩→m, n̩→n, l̩→l), and close variants merged into a single symbol (ɨ→ɪ, ʉ→u, ɦ→h).
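The convention above can be sketched in a few lines of standard-library Python. This is an illustrative approximation, not the repository's simplify_timit_ipa.py: the function names are hypothetical, the merge table contains only the mappings listed above, and glottal-stop suppression is left out because it depends on context.

```python
import unicodedata

# TIMIT closure and silence labels that the convention drops.
DROPPED_TIMIT_LABELS = {"bcl", "dcl", "gcl", "pcl", "tcl", "kcl", "pau", "epi", "h#"}

# Symbol merges listed in the convention; applied after mark stripping.
SYMBOL_MERGES = {"ɨ": "ɪ", "ʉ": "u", "ɦ": "h"}


def drop_nonspeech(labels):
    """Remove closure and silence labels from a TIMIT label sequence."""
    return [label for label in labels if label not in DROPPED_TIMIT_LABELS]


def to_broad_ipa(ipa):
    """Strip standalone combining marks, merge symbols, return NFC text."""
    composed = unicodedata.normalize("NFC", ipa)
    # Drop combining marks that remain separate code points after NFC
    # (for example U+0329, the syllabic mark). Precomposed letters such
    # as ç are left untouched by this simple filter.
    stripped = "".join(ch for ch in composed if unicodedata.combining(ch) == 0)
    merged = "".join(SYMBOL_MERGES.get(ch, ch) for ch in stripped)
    return unicodedata.normalize("NFC", merged)


print(drop_nonspeech(["h#", "bcl", "b", "iy", "pau"]))
print(to_broad_ipa("b\u028ctn\u0329"))
```

Normalizing to NFC before filtering keeps letters with a precomposed form intact, so only marks with no precomposed form, such as the syllabic mark, are removed.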

Intended use

  • Research on multilingual phonetic recognition under a uniform broad-IPA output convention.
  • Linguistic-resource construction for the 8 trained languages (English, Japanese, Polish, Maltese, Hungarian, Finnish, Greek, Tamil).
  • Cross-lingual zero-shot phonetic transcription as a baseline; expect degraded quality on languages outside the training set.

Out of scope: narrow phonetic transcription (use the companion narrow-en for English narrow); orthographic ASR (this model emits IPA, not text); commercial deployment without complying with the upstream LDC TIMIT non-commercial licensing terms.

How to use

MLX (Apple Silicon)

from huggingface_hub import snapshot_download
import mlx.core as mx
from mlx_whisper.load_models import load_model
from mlx_whisper.audio import load_audio, pad_or_trim, log_mel_spectrogram
from mlx_whisper.decoding import DecodingOptions, decode
from mlx.utils import tree_flatten, tree_unflatten

# Download checkpoint weights from HF.
ckpt = snapshot_download("barathanasln/phonetic-whisper-mlx-broad-multi")

# Load Whisper-large-v3 architecture and overlay our decoder weights.
model = load_model("mlx-community/whisper-large-v3-mlx")
model.set_dtype(mx.float32)
trained = mx.load(f"{ckpt}/model.safetensors")
decoder_weights = {k: v for k, v in trained.items() if k.startswith("decoder.")}
params = dict(tree_flatten(model.parameters()))
for k, v in decoder_weights.items():
    if k in params:
        params[k] = v
model.update(tree_unflatten(list(params.items())))

# Inference. ALWAYS pass language="en" — see Training-time language token.
audio = load_audio("your-audio.wav")
mel = log_mel_spectrogram(pad_or_trim(audio), n_mels=128)
mel = mx.expand_dims(mel, 0).astype(mx.float32)
features = model.encoder(mel)
result = decode(model, features, DecodingOptions(language="en", without_timestamps=True))
print(result[0].text.strip())

For training reproduction, see the GitHub repository.

Training data

| Source | Samples | Convention |
|---|---|---|
| TIMIT broad (English, derived from prepare_timit_dataset.py + simplify_timit_ipa.py) | 4,158 | Broad |
| CommonVoice broad, 7 languages (ja, pl, mt, hu, fi, el, ta), Epitran-based G2P | 6,538 | Broad |
| Total | 10,696 | Broad |

Approximately 30 hours of audio. Held-out validation: 924 utterances (stratified 50/50 TIMIT/CommonVoice, seed=42).
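A 50/50 stratified hold-out with a fixed seed could be drawn roughly as follows. This is a sketch only: the function name, the ID format, and the pool sizes are hypothetical, and the actual split logic lives in the GitHub repository.

```python
import random


def stratified_holdout(timit_ids, cv_ids, n_holdout=924, seed=42):
    """Pick n_holdout utterance IDs, half from each source, reproducibly."""
    rng = random.Random(seed)
    per_source = n_holdout // 2
    # Sort first so the draw does not depend on the input order.
    held_timit = rng.sample(sorted(timit_ids), per_source)
    held_cv = rng.sample(sorted(cv_ids), per_source)
    return held_timit + held_cv


# Hypothetical utterance IDs, for illustration only.
timit_pool = [f"timit_{i:05d}" for i in range(5000)]
cv_pool = [f"cv_{i:05d}" for i in range(7000)]

heldout = stratified_holdout(timit_pool, cv_pool)
print(len(heldout))
```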

TIMIT (LDC93S1) is licensed for non-commercial research only. The trained weights are distributed under CC BY-NC 4.0 in accordance with this restriction; see License.

Training procedure

Decoder-only fine-tune, encoder frozen, AdamW with linear warmup and cosine decay, fp32, on a single Apple M3 Ultra with MLX. Full hyperparameters, launchers, and reproduction commands are in the GitHub repository.
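For readers unfamiliar with the schedule shape, linear warmup followed by cosine decay can be written as a plain function of the step index. The peak rate, warmup length, and step count below are placeholders chosen for illustration; they are not the values used for this model.

```python
import math


def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder hyperparameters, not the ones used in training.
PEAK_LR, WARMUP, TOTAL = 1e-5, 100, 1000
for step in (0, 99, 550, 1000):
    print(step, learning_rate(step, PEAK_LR, WARMUP, TOTAL))
```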

Training-time language token

All training samples use <|en|> as the start-of-transcript prefix regardless of source-audio language; the token is overloaded as "emit IPA". This is intentional — phonetic transcription is meant to be language-agnostic, so the decoder is trained without a per-language signal. Pass language="en" at inference.

Evaluation

PFER (Phonetic Feature Error Rate) is an edit distance in which a substitution costs the Hamming distance between the two phones' 24 PanPhon articulatory features, divided by 24, and each insertion or deletion costs 1 (Taguchi 2023 §4.2 / POWSM Table 4 rescoring convention).
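The metric can be illustrated with a small feature-weighted edit distance. The sketch below makes two assumptions that are not stated in this card: the total cost is divided by the reference length, and a tiny hand-made feature table (TOY_FEATURES, hypothetical) stands in for PanPhon's real vectors.

```python
N_FEATURES = 24


def toy_vector(*plus_indices):
    """Build a 24-dimensional +1/-1 vector (toy stand-in for PanPhon)."""
    return tuple(1 if i in plus_indices else -1 for i in range(N_FEATURES))


# Hypothetical feature table, for illustration only.
TOY_FEATURES = {
    "p": toy_vector(0),
    "b": toy_vector(0, 1),     # differs from "p" in one feature
    "m": toy_vector(0, 1, 2),  # differs from "p" in two features
    "a": toy_vector(3, 4, 5, 6),
}


def substitution_cost(ref_phone, hyp_phone):
    """Hamming distance between feature vectors, divided by 24."""
    ref_vec, hyp_vec = TOY_FEATURES[ref_phone], TOY_FEATURES[hyp_phone]
    return sum(r != h for r, h in zip(ref_vec, hyp_vec)) / N_FEATURES


def pfer(ref, hyp):
    """Feature-weighted edit distance divided by reference length (assumed)."""
    if not ref:
        raise ValueError("reference must contain at least one phone")
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # deletions cost 1 each
    for j in range(1, cols):
        dist[0][j] = float(j)  # insertions cost 1 each
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,
                dist[i][j - 1] + 1.0,
                dist[i - 1][j - 1] + substitution_cost(ref[i - 1], hyp[j - 1]),
            )
    return dist[-1][-1] / len(ref)


print(pfer(["p", "a"], ["b", "a"]))  # one single-feature substitution
print(pfer(["p"], ["p", "a", "a"]))  # two insertions on a one-phone reference
```

Under the reference-length normalization assumed here, a hypothesis with many inserted phones can score above 100%.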

| Benchmark | n | PFER (%) | Convention notes |
|---|---|---|---|
| Combined broad held-out validation (in-distribution) | 924 | 3.19 | TIMIT+CV stratified 50/50 |
| TIMIT broad core test (in-distribution) | 1,680 | 4.70 | Broad-on-broad |
| MultIPA zero-shot (Taguchi 2023) | | 20.78 | Same test set as Taguchi 2023 (21.2 reported) |
| Tusom2021 (Tibeto-Burman, zero-shot) | 447 | 23.05 | Same convention as Wav2Vec2Phoneme rescored by POWSM Table 4 (31.92) |
| L2-ARCTIC PRiSM-cut | 3,599 | 14.22 | Convention-mismatched (broad model on narrow refs) |
| VoxAngeles (95 langs) | 5,446 | 19.42 | Convention-mismatched; cross-lingual stress |
| DoReCo subset (8 langs) | 3,898 | 25.18 | Convention-mismatched; cross-lingual stress |

Cross-lingual narrow benchmarks (L2-ARCTIC, VoxAngeles, DoReCo) are not direct quality comparisons — they pair our broad-IPA output against narrow human references, so the numbers reflect a known convention penalty in addition to recognition difficulty.

Limitations

  • Cross-lingual narrow generalization. This model loses to encoder-CTC speech-to-IPA models trained on much larger corpora (POWSM, ZIPA, PhoneticXEUS, HuPER). The gap is structural: those models see roughly 1000× more training data, and this model's uniform broad output convention is scored against their language-specific narrow inventories.
  • AR-decoder repetition. Whisper's autoregressive decoder occasionally produces severe repetition hallucinations on out-of-distribution languages with short utterances (e.g., Bengali on VoxAngeles, PFER ≈ 151%, n=40, contributing ~1 absolute point to the aggregate VoxAngeles PFER).
  • Language coverage. Trained on 8 languages. Performance on any language outside that set is zero-shot; expect convention and inventory penalties.
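The Bengali figure in the repetition bullet above can be sanity-checked with a little arithmetic. The calculation assumes the aggregate VoxAngeles PFER is an utterance-weighted mean of per-utterance scores, which this card does not state, and it treats the approximate 151% as exact.

```python
# Figures quoted in this card.
bengali_utterances = 40
voxangeles_utterances = 5446
bengali_pfer = 151.0    # percent, approximate
aggregate_pfer = 19.42  # percent

# Share of the test set that is Bengali, and its raw contribution
# to an utterance-weighted mean (assumed aggregation).
share = bengali_utterances / voxangeles_utterances
contribution = share * bengali_pfer

# Mean over the remaining utterances, and how far Bengali lifts the
# aggregate above that level.
rest_pfer = (aggregate_pfer - contribution) / (1.0 - share)
excess = aggregate_pfer - rest_pfer

print(f"contribution: {contribution:.2f} points")
print(f"aggregate without Bengali: {rest_pfer:.2f}%")
print(f"excess due to Bengali: {excess:.2f} points")
```

Both the raw contribution and the excess land near one percentage point, which is consistent with the "~1 absolute point" quoted above.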

Citation

@software{aslan2026phonetic_whisper_mlx,
  author       = {Aslan, Barathan},
  title        = {phonetic-whisper-mlx: Whisper-decoder fine-tunes for IPA transcription on Apple Silicon},
  year         = {2026},
  url          = {https://github.com/barathanaslan/phonetic-whisper-mlx},
  version      = {0.1.0},
  license      = {MIT (code), CC BY-NC 4.0 (weights)}
}

For training data:

Garofolo, J. S., et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web download. Philadelphia: Linguistic Data Consortium, 1993.

Ardila, R., Branson, M., Davis, K., et al. Common Voice: A Massively-Multilingual Speech Corpus. LREC 2020.

For the per-phone Hamming/24 PFER convention:

Taguchi, C. Universal Automatic Phonetic Transcription into the IPA. arXiv:2308.03917, 2023.

Lu et al. POWSM: A Phonetic Open Whisper-Style Speech Foundation Model. arXiv:2510.24992, 2025.

License

Trained model weights: CC BY-NC 4.0. The non-commercial restriction reflects the TIMIT (LDC93S1) data terms inherited via training data. Commercial deployment of derivative products may require obtaining a TIMIT For-Profit Membership from LDC; compliance with upstream training-data licenses is the deployer's responsibility.

Source code: MIT, distributed via the GitHub repository.
