whisper-tiny-french

French-only fine-tune of openai/whisper-tiny, built to validate on-device TTS output inside VocaRead. 39 M parameters, all of them dedicated to French.

This is the PyTorch checkpoint. For the iOS-ready CoreML INT8 bundle see eborges78/whisper-tiny-fr-coreml-slim.

Why this exists

openai/whisper-tiny is the only Whisper size that fits in RAM on iPad 6 / iPhone 8 / SE2. But its 39 M parameters span 99 languages — French gets a fraction of that capacity, and the WER on Common Voice FR sits around 50 %.

A French-only fine-tune of the same architecture concentrates all of that capacity on French while staying inside the same memory envelope. The target was a WER under 20 %; the figures measured so far (43 % on FLEURS-FR, 32 % on MLS-FR) are reported under Performance. It is a drop-in replacement for the multilingual tiny when the input language is known to be French.

Quick start

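import torch
from transformers import pipeline

# Use the GPU when one is available; the model also runs on CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="eborges78/whisper-tiny-french",
    chunk_length_s=30,  # split longer audio into 30-second chunks
    device=device,
    # Force French transcription (no language detection, no translation).
    generate_kwargs={"language": "fr", "task": "transcribe"},
)

# The pipeline accepts a path to an audio file and returns a dict with a "text" key.
transcript = asr("path/to/french-audio.wav")
print(transcript["text"])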

Performance

Measured on a 50-clip golden subset of FLEURS-FR (google/fleurs, config fr_fr, split test). WER computed with jiwer + Whisper-style normalization (lowercase, strip punctuation, collapse whitespace).

| Model | Params | WER (FR) | Δ vs baseline |
|---|---|---|---|
| openai/whisper-tiny (multilingual) | 39 M | ~50 % | baseline |
| eborges78/whisper-tiny-french (this model) | 39 M | 43 % (measured) | −7 pts |
| openai/whisper-base (multilingual) | 74 M | ~25 % | for comparison |
| eborges78/whisper-base-french (planned) | 74 M | ~8-10 % | sister model |

Honest assessment: the FLEURS WER of 43 % is above the original 20 % quality-gate target documented in configs/tiny-fr.yaml. The model was trained for 4 epochs on 30 h of MLS-FR, and FLEURS is a substantially harder out-of-distribution eval: short news-style prompts containing English loanwords and names such as "springboks" or "u.s. corps of engineers", whereas MLS consists of 19th-century French audiobooks.

Internal eval on 300 MLS-FR test clips (in-distribution) lands at WER 32 % — closer to but still above target. The model is published under the dev branch revision rather than main to flag this gap.

For the target VocaRead use case (validating clean French TTS output of public-domain literature), the in-distribution performance is what matters — and the model improves noticeably over the multilingual tiny baseline (which hallucinates English phrases on FR audio).

Per-clip behaviour examples (from the 50-clip FLEURS golden set)

Best cases (clean French, no loanwords, WER 7-11 %):

REF: ainsi le crayon était un bon ami pour beaucoup de gens lorsqu'il est sorti
HYP: si le crayon était un bon ami pour beaucoup de gens lorsqu'il est sorti
WER: 0.07

Worst cases (English loanwords + proper nouns, WER around 75 % and up):

REF: pour les springboks ce fut la fin d'une série de cinq défaites
HYP: pour l'esprit de boxe se fût la fin d'une cerine de sang du défaite
WER: 0.75

The model maps unseen English words phonetically — expected since MLS is a 19th-century French literature corpus.
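The per-clip scores above can be re-derived by hand. The sketch below is a pure-Python stand-in for the jiwer computation, not the project's actual eval script: the `normalize` and `wer` helpers are hypothetical names, and removing apostrophes outright (rather than replacing them with a space) is an assumption about the normalization.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, collapse whitespace, split into words."""
    text = text.lower()
    # Assumption: punctuation is removed outright, so "d'une" becomes "dune".
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


best = wer(
    "ainsi le crayon était un bon ami pour beaucoup de gens lorsqu'il est sorti",
    "si le crayon était un bon ami pour beaucoup de gens lorsqu'il est sorti",
)
worst = wer(
    "pour les springboks ce fut la fin d'une série de cinq défaites",
    "pour l'esprit de boxe se fût la fin d'une cerine de sang du défaite",
)
print(f"{best:.2f}")   # 0.07 (1 error over 14 reference words)
print(f"{worst:.2f}")  # 0.75 (9 errors over 12 reference words)
```

Note that accents are kept, so "fut" vs "fût" counts as an error under this normalization.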

Training

| Item | Value |
|---|---|
| Base model | openai/whisper-tiny |
| Fine-tune corpus | facebook/multilingual_librispeech, config french, split train |
| Training hours used | 30 h (~9 000 clips, capped) |
| Epochs | 4 |
| Steps | 1 128 |
| Batch size | 32 |
| Learning rate | 1.0e-5, linear, 500 warmup steps |
| Hardware | 1× RTX 3090 24 GB (Vast.ai) |
| Wall-clock | ~6 h total (4h29 dataset mel-mapping CPU, 1h26 training GPU) |
| Cost | ~€2 actual (single-instance, sub-optimal; see repo docs/adding-a-language.md for the CPU+GPU split that saves ~50 %) |
| In-training eval | MLS test (capped 300 clips) every 500 steps |
| Final training loss | 0.44 (started at 1.32) |
| In-training eval WER (MLS test) | 31.96 % |
| FLEURS-FR bench WER (50 clips) | 42.96 % |
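The step count in the table follows from the clip count, batch size and epoch count. The arithmetic below is a quick consistency check rather than output from the training script; it assumes exactly 9 000 clips (the table says "~9 000"), no gradient accumulation, a single device, and that the last partial batch of each epoch is kept.

```python
import math

clips = 9_000        # "~9 000 clips" in the table, taken as exact here
batch_size = 32
epochs = 4
hours = 30
warmup_steps = 500
eval_every = 500

steps_per_epoch = math.ceil(clips / batch_size)   # 282
total_steps = steps_per_epoch * epochs            # 1128, matching the table
avg_clip_seconds = hours * 3600 / clips           # 12.0

# Share of the run spent in learning-rate warmup.
warmup_fraction = warmup_steps / total_steps      # about 0.44

# Steps at which the periodic in-training eval fires.
eval_steps = list(range(eval_every, total_steps + 1, eval_every))  # [500, 1000]

print(steps_per_epoch, total_steps, avg_clip_seconds)
print(f"{warmup_fraction:.2f}", eval_steps)
```

Under these assumptions the 500 warmup steps cover roughly 44 % of the run, and the periodic eval fires only twice (steps 500 and 1 000) before training ends.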

Training pipeline and full reproduction recipe: github.com/eborges78/whisper-fr-coreml-slim.

Known training limitations

  • 4 epochs probably insufficient. Training loss was still descending in epoch 4 (0.50 → 0.44). A retraining at 8-10 epochs would likely move WER lower.
  • MLS-only corpus. Adding FLEURS train data, VoxPopuli FR, or a custom news corpus to the training mix would help bridge the FLEURS eval gap.
  • No FR-specific augmentation. Current spec_augment is conservative (time_mask_param: 30, freq_mask_param: 27). More aggressive masking might help.
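To make the two masking parameters above concrete, here is a minimal pure-Python sketch of SpecAugment-style masking on a log-mel spectrogram. It assumes time_mask_param and freq_mask_param are upper bounds on the width of one masked time span and one masked frequency band, as in the usual SpecAugment formulation; the `spec_augment` function is a hypothetical illustration, not the code used in the training pipeline, and the zero mask value is an arbitrary choice.

```python
import random


def spec_augment(mel, time_mask_param=30, freq_mask_param=27, rng=None):
    """Return a copy of `mel` (n_mels rows x n_frames columns) with one
    frequency band and one time span set to zero."""
    rng = rng or random.Random()
    n_mels, n_frames = len(mel), len(mel[0])
    out = [row[:] for row in mel]  # copy so the input stays untouched

    # Band height f in [0, freq_mask_param], starting at a random mel bin.
    f = rng.randrange(0, min(freq_mask_param, n_mels) + 1)
    f0 = rng.randrange(0, n_mels - f + 1)
    # Span width t in [0, time_mask_param], starting at a random frame.
    t = rng.randrange(0, min(time_mask_param, n_frames) + 1)
    t0 = rng.randrange(0, n_frames - t + 1)

    for m in range(f0, f0 + f):
        out[m] = [0.0] * n_frames      # mask the frequency band
    for row in out:
        row[t0:t0 + t] = [0.0] * t     # mask the time span
    return out


# Toy input: 80 mel bins x 300 frames, all ones.
mel = [[1.0] * 300 for _ in range(80)]
masked = spec_augment(mel, rng=random.Random(0))
```

With a 10 ms frame hop (an assumption here), a 30-frame cap hides at most about 0.3 s of audio per clip, which gives a sense of why the current setting is described as conservative.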

Limitations

This model is calibrated for one specific downstream task: validating read-aloud TTS output. It will run on the inputs below, but sub-optimally:

  • Far-field noisy speech: the model was trained on clean audiobook reads and will degrade on phone-call quality audio. Use whisper-base-french or whisper-small-french for noisier inputs.
  • Code-switching: capacity is 100 % FR. Sentences mixing French and English will be transcribed entirely in French (the English chunks get phonetically mapped). For mixed-language input, stay on multilingual whisper-base or larger.
  • Strong regional accents: MLS speakers are mostly metropolitan / continental French. Quebec or West African French may have higher WER; this was not specifically evaluated.
  • Hallucination at the edges: like all Whisper sizes, the model can hallucinate on silence-only inputs (it generates audiobook-style filler). Always pair it with a VAD or duration check upstream.
  • Single-language only: decoding is forced to FR via generate_kwargs={"language": "fr"}. Passing other languages will produce garbage; use the multilingual base if you don't know the language ahead of time.
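As an illustration of the "duration check upstream" mentioned in the hallucination item, the sketch below rejects clips that are too short or too quiet before they reach the model. It is a crude energy gate, not a real VAD: the `should_transcribe` name and both thresholds are hypothetical, and the samples are assumed to be mono floats in [-1, 1].

```python
import math
from typing import Sequence


def should_transcribe(
    samples: Sequence[float],
    sample_rate: int,
    min_duration_s: float = 0.5,   # hypothetical threshold, tune on real data
    min_rms: float = 0.01,         # hypothetical threshold, tune on real data
) -> bool:
    """Return False for clips that are empty, too short, or near-silent."""
    if sample_rate <= 0 or not samples:
        return False
    duration_s = len(samples) / sample_rate
    if duration_s < min_duration_s:
        return False
    # Root-mean-square amplitude as a rough loudness measure.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= min_rms


sr = 16_000
silence = [0.0] * sr                                               # 1 s of silence
tone = [0.1 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]  # 1 s tone
blip = tone[: sr // 10]                                            # 0.1 s clip

print(should_transcribe(silence, sr))  # False
print(should_transcribe(tone, sr))     # True
print(should_transcribe(blip, sr))     # False
```

A dedicated VAD will handle background noise and partial silence far better; this gate only covers the silence-only and very-short cases.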

License

MIT. This model is a derivative of openai/whisper-tiny (MIT) trained on Multilingual LibriSpeech (CC-BY-4.0). Both upstream licenses allow commercial use; this fine-tune adds no additional restrictions.

Citation

If you use this model in a paper or product, please cite the upstream Whisper paper and the MLS dataset:

@misc{radford2022whisper,
  title  = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Alec Radford and Jong Wook Kim and Tao Xu and Greg Brockman and Christine McLeavey and Ilya Sutskever},
  year   = {2022},
  eprint = {2212.04356},
}

@inproceedings{pratap2020mls,
  title     = {{MLS}: A Large-Scale Multilingual Dataset for Speech Research},
  author    = {Pratap, Vineel and Xu, Qiantong and Sriram, Anuroop and Synnaeve, Gabriel and Collobert, Ronan},
  booktitle = {Interspeech},
  year      = {2020},
}
