# Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

## Summary
Pantagruel is a family of self-supervised encoder models for French text and speech, with separate models trained for each modality. Rather than relying only on masked input-level reconstruction, Pantagruel encoders learn contextualized representations in feature space following the data2vec 2.0 / JEPA (Joint-Embedding Predictive Architecture) paradigm.
Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.
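The EMA teacher update at the heart of this setup can be sketched as follows. This is a minimal illustration, not the Pantagruel training code: the toy linear "encoders" and the decay value `tau=0.999` are assumptions for demonstration only.

```python
import torch
import torch.nn as nn

def ema_update(teacher: nn.Module, student: nn.Module, tau: float = 0.999) -> None:
    """In-place EMA update: teacher = tau * teacher + (1 - tau) * student."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(tau).add_(s_param, alpha=1 - tau)

# Toy stand-ins for the student and teacher encoders
student = nn.Linear(4, 4)
teacher = nn.Linear(4, 4)
teacher.load_state_dict(student.state_dict())  # teacher starts as a copy

# ... a gradient step on the student would happen here ...
ema_update(teacher, student, tau=0.999)
```

With a decay close to 1, the teacher changes slowly and provides stable prediction targets while the student is updated by gradient descent.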
The models were pre-trained with the fairseq library (v0.12.2) and converted to HuggingFace's transformers format. For best compatibility, we recommend `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.
- Paper: https://arxiv.org/abs/2601.05911
- Pre-training code: to be updated soon.
## Text-only models
Pantagruel text encoders are trained on large-scale French text corpora, including Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, text models incorporate masked language modeling (MLM) to better capture fine-grained syntactic and semantic information. These models produce strong sentence and token-level representations for downstream NLP tasks.
Note on model naming convention: models whose names include `camtok` use CamemBERT's tokenizer, which allows comparing our models to a BERT-based counterpart; if no tokenizer is specified, the model uses our custom tokenizer. All text models are trained with the data2vec 2.0 masked feature-prediction objective. Models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside the main data2vec 2.0 objective.
The table below reports natural language inference accuracy on the French portion of the XNLI dataset.
| HuggingFace name | Model name (paper) | Arch / Params | Pretraining dataset | Accuracy on XNLI (FR) (dev / test) |
|---|---|---|---|---|
| text-base-camtok-wiki | Pantagruel-B-camtok-Wk | Base / 110M | French Wikipedia 2019 (4GB) | 76.94% / 77.43% |
| text-base-wiki | Pantagruel-B-Wk | Base / 125M | French Wikipedia 2019 (4GB) | 77.40% / 78.41% |
| text-base-wiki-mlm | Pantagruel-B-Wk-MLM | Base / 125M | French Wikipedia 2019 (4GB) | 78.25% / 78.41% |
| text-base-camtok-oscar | Pantagruel-B-camtok-Osc | Base / 110M | OSCAR 2019 (138GB) | 80.40% / 80.53% |
| text-base-oscar-mlm | Pantagruel-B-Osc-MLM | Base / 125M | OSCAR 2019 (138GB) | 81.11% / 81.52% |
| text-base-croissant-mlm | Pantagruel-B-Crs-MLM | Base / 125M | CroissantLLM (1.5GB) | 81.05% / 80.69% |
For more downstream tasks and evaluation datasets, please refer to our paper.
## Usage
Our models can be used with the AutoModel and AutoConfig classes to extract features, as shown below. Other common classes for text-related downstream tasks are also supported, including AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForMultipleChoice, AutoModelForTokenClassification, and AutoModelForQuestionAnswering. We are currently working to merge the modeling files into the official Hugging Face repository, which will enable native use of the Pantagruel classes.
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
model_name = "PantagrueLLM/text-base-wiki-mlm"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Example input
sentences = [
    "Bonjour, comment allez-vous ?",
    "Le chat dort sur le tapis."
]

# Tokenize input
inputs = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings, shape: (batch_size, sequence_length, hidden_size)
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```
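To turn token-level embeddings like these into fixed-size sentence embeddings, one common (not Pantagruel-specific) recipe is attention-mask-aware mean pooling, which averages over real tokens while ignoring padding. The toy tensors below stand in for real model outputs; the hidden size of 768 matches a typical Base architecture but is an assumption here.

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)  # (B, T, 1)
    summed = (token_embeddings * mask).sum(dim=1)                  # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                       # (B, 1)
    return summed / counts

# Toy tensors in place of `outputs.last_hidden_state` and `inputs["attention_mask"]`
token_embeddings = torch.randn(2, 8, 768)
attention_mask = torch.ones(2, 8, dtype=torch.long)
attention_mask[1, 5:] = 0  # second sentence is shorter: last 3 positions are padding

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings.shape)  # (batch_size, hidden_size)
```

The resulting vectors can then be compared with cosine similarity for retrieval or clustering.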
## Speech-only models
For details on our speech-only models, please visit our speech-only collection.
## Citation
If you use these models or find them useful in your research, publications, or applications, please cite the following work:
```bibtex
@article{le2026pantagruel,
  title={Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author={Le, Phuong-Hang and Pelloin, Valentin and Chatelain, Arnault and Bouziane, Maryem and Ghennai, Mohammed and Guan, Qianwen and Milintsevich, Kirill and Mdhaffar, Salima and Mannion, Aidan and Defauw, Nils and others},
  journal={arXiv preprint arXiv:2601.05911},
  year={2026}
}
```
For more information, see the full paper: https://arxiv.org/abs/2601.05911.