# Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
## Summary
Pantagruel is a family of self-supervised encoder models for French text and speech, with separate models trained for each modality. Rather than relying only on masked input-level reconstruction, Pantagruel encoders learn contextualized representations in feature space following the data2vec 2.0 / JEPA (Joint-Embedding Predictive Architecture) paradigm.
Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both the speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.
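The two ingredients above, the EMA teacher update and the feature-space regression loss, can be sketched in a few lines. This is a minimal pure-Python illustration of the general data2vec 2.0 recipe, not the actual fairseq implementation; the names `ema_update`, `feature_loss`, and the decay `tau` are illustrative:

```python
def ema_update(teacher, student, tau=0.999):
    """Move each teacher parameter slightly toward the student (EMA with decay tau)."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

def feature_loss(student_pred, teacher_target):
    """Mean squared error in feature space: the student predicts the teacher's
    latent representations at masked positions."""
    n = len(student_pred)
    return sum((p - y) ** 2 for p, y in zip(student_pred, teacher_target)) / n
```

After each student gradient step, `ema_update` nudges the teacher toward the student; the student itself is trained to minimize `feature_loss` on the masked positions only, while the teacher sees the unmasked input.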
The models were pre-trained using the fairseq library (v0.12.2) and converted to HuggingFace's transformers format. For best compatibility, we recommend using transformers==4.57.0 or 4.56.2, together with tokenizers==0.22.1 and sentencepiece==0.1.99.
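For example, the recommended versions can be installed with:

```shell
pip install "transformers==4.57.0" "tokenizers==0.22.1" "sentencepiece==0.1.99"
```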
- Paper: https://arxiv.org/abs/2601.05911
- Pre-training code: to be released soon.
## Speech-only models
Pantagruel speech encoders are trained exclusively with the data2vec 2.0 masked feature-prediction objective on diverse French audio data. Training data includes the French portion of Multilingual LibriSpeech (around 1K hours), LeBenchmark (around 14K hours), and INA-100k, a newly introduced 100,000-hour corpus of French broadcast speech from the Institut National de l'Audiovisuel (INA). This diverse mix of read, conversational, and broadcast audio enables the models to learn robust acoustic representations suitable for a wide range of speech understanding tasks.
Important: Please make sure your audio is mono, sampled at 16 kHz and normalized.
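The two cheap parts of this preprocessing, downmixing to mono and zero-mean/unit-variance normalization (the same normalization a Wav2Vec2-style processor applies), can be sketched as below. This is a pure-Python illustration; in practice you would use numpy or torchaudio, and resampling to 16 kHz is best left to a dedicated library:

```python
import statistics

def to_mono(frames):
    """Downmix by averaging channels: frames is a list of per-sample channel tuples."""
    return [sum(channels) / len(channels) for channels in frames]

def normalize(wav, eps=1e-5):
    """Zero-mean, unit-variance normalization of a waveform."""
    mean = statistics.fmean(wav)
    var = statistics.fmean((x - mean) ** 2 for x in wav)
    return [(x - mean) / (var + eps) ** 0.5 for x in wav]
```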
Note: These pre-trained models do not include a tokenizer, as they are trained on speech data only. To use them for automatic speech recognition (ASR), for example, you must first create a tokenizer and fine-tune the model on labeled speech–text pairs. See this Hugging Face blog post for a step-by-step guide to fine-tuning wav2vec-style models.
The table below presents ASR performance on the French Common Voice v6.1 dataset, measured using word error rate (WER; lower is better).
| HuggingFace name | Model name (paper) | Arch / Params | Pre-training data | WER on CommonVoice v6.1 (FR) (dev / test) |
|---|---|---|---|---|
| speech-base-1K | Pantagruel-B-1k | Base / 93M | French LibriSpeech (1K hours) | 8.92 / 10.46 |
| speech-base-14K | Pantagruel-B-14K | Base / 93M | +LeBenchmark (14K hours) | 8.46 / 9.94 |
| speech-large-14K | Pantagruel-L-14K | Large / 313M | +LeBenchmark (14K hours) | 6.95 / 8.05 |
| speech-large-114K | Pantagruel-L-114K | Large / 313M | +INA-100k (100K hours) | 7.07 / 8.21 |
For additional downstream tasks and evaluation datasets, please refer to our paper.
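For reference, the WER reported above is the word-level Levenshtein (edit) distance between hypothesis and reference transcripts, divided by the number of reference words. A minimal implementation (libraries such as jiwer are typically used in practice):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)
```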
### Usage
Our models can be used with the AutoModel and AutoConfig classes to extract features, as shown below. Other common classes for audio-related downstream tasks, including AutoModelForSequenceClassification, AutoModelForAudioFrameClassification, and AutoModelForCTC, are also supported. We are currently working to merge the modeling files into the official Hugging Face repository, which will enable native use of the Pantagruel classes.
```python
import soundfile as sf
import torch
from transformers import AutoProcessor, AutoModel

# load model
model_name = "PantagrueLLM/speech-base-1K"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# load audio file (must be mono and sampled at 16 kHz)
wav, curr_sample_rate = sf.read("audio.wav", dtype="float32")
feats = torch.from_numpy(wav).float()

# Note: please normalize the audio yourself if not using AutoProcessor
inputs = processor(feats, sampling_rate=16000, return_tensors="pt")

# extract features
with torch.no_grad():
    outputs = model(**inputs)
```
## Text-only models
If you want to check out our text-only models, please visit our text-only collection for more details.
## Citation
If you use these models or find them useful in your research, publications, or applications, please cite the following work:
```bibtex
@article{le2026pantagruel,
  title={Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author={Le, Phuong-Hang and Pelloin, Valentin and Chatelain, Arnault and Bouziane, Maryem and Ghennai, Mohammed and Guan, Qianwen and Milintsevich, Kirill and Mdhaffar, Salima and Mannion, Aidan and Defauw, Nils and others},
  journal={arXiv preprint arXiv:2601.05911},
  year={2026}
}
```
For more information, see the full paper: https://arxiv.org/abs/2601.05911.