---
license: cc-by-nc-sa-2.0
language:
- fr
pipeline_tag: feature-extraction
library_name: transformers
tags:
- data2vec2
- JEPA
- text
- fairseq
---

# Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

**Summary**

Pantagruel is a family of self-supervised encoder models for French text and speech, with separate models trained for each modality. Rather than relying only on masked input-level reconstruction, Pantagruel encoders learn contextualized representations in feature space, following the [data2vec 2.0](https://arxiv.org/abs/2212.07525) / [JEPA (Joint-Embedding Predictive Architecture)](https://arxiv.org/abs/2301.08243) paradigm.

Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both the speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.
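As an illustration of the EMA teacher update described above (this is not the authors' training code, and the decay value `tau` is a hypothetical hyperparameter), the core rule can be sketched in a few lines:

```python
def ema_update(teacher_params, student_params, tau=0.999):
    """Move each teacher parameter slowly toward the student's value:

        teacher <- tau * teacher + (1 - tau) * student

    Parameters are plain lists of floats here for illustration; in a real
    framework the same rule is applied tensor-by-tensor.
    """
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: with a small tau the teacher tracks the student quickly,
# with tau close to 1 it changes very slowly.
teacher = ema_update([0.0, 0.0], [1.0, 2.0], tau=0.9)
print(teacher)  # approximately [0.1, 0.2]
```

Because the teacher lags behind the student, its representations provide a stable prediction target while still improving over training.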

The models were pre-trained with the `fairseq` library (v0.12.2) and converted to Hugging Face's `transformers` format. For best compatibility, we recommend `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.

- **Paper**: https://arxiv.org/abs/2601.05911
- **Pre-training code**: to be released soon.

## Text-only models

Pantagruel text encoders are trained on large-scale French text corpora, including Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, the text models incorporate masked language modeling (MLM) to better capture fine-grained syntactic and semantic information. These models produce strong sentence- and token-level representations for downstream NLP tasks.

**Note on model naming convention:** Models that include `camtok` in their name use CamemBERT's tokenizer, allowing a direct comparison of our models with a BERT-based counterpart. If no tokenizer is specified, the model uses our custom tokenizer. All text models are trained with the data2vec 2.0 masked feature prediction objective; models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside it.

The table below reports accuracy on the natural language inference task, using the French portion of the XNLI dataset.

| **HuggingFace name** | **Model name (paper)** | **Arch / Params** | **Pretraining dataset** | **Accuracy on XNLI (FR) (dev / test)** |
|----------|------------------------|-----------------|----------------------|---------------------------------------|
| [text-base-camtok-wiki](https://huggingface.co/PantagrueLLM/text-base-camtok-wiki) | Pantagruel-B-camtok-Wk | Base / 110M | French Wikipedia 2019 (4GB) | 76.94% / 77.43% |
| text-base-wiki | Pantagruel-B-Wk | Base / 125M | French Wikipedia 2019 (4GB) | 77.40% / 78.41% |
| [text-base-wiki-mlm](https://huggingface.co/PantagrueLLM/text-base-wiki-mlm) | Pantagruel-B-Wk-MLM | Base / 125M | French Wikipedia 2019 (4GB) | 78.25% / 78.41% |
| [text-base-camtok-oscar](https://huggingface.co/PantagrueLLM/text-base-camtok-oscar) | Pantagruel-B-camtok-Osc | Base / 110M | OSCAR 2019 (138GB) | 80.40% / 80.53% |
| [text-base-oscar-mlm](https://huggingface.co/PantagrueLLM/text-base-oscar-mlm) | Pantagruel-B-Osc-MLM | Base / 125M | OSCAR 2019 (138GB) | 81.11% / 81.52% |
| [text-base-croissant-mlm](https://huggingface.co/PantagrueLLM/text-base-croissant-mlm) | Pantagruel-B-Crs-MLM | Base / 125M | CroissantLLM (1.5GB) | 81.05% / 80.69% |

For more downstream tasks and evaluation datasets, please refer to [our paper](https://arxiv.org/abs/2601.05911).

## Usage

Our models can be used with the `AutoModel` and `AutoConfig` classes to extract features, as shown below. Other common classes for text-related downstream tasks, including `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForMultipleChoice`, `AutoModelForTokenClassification`, and `AutoModelForQuestionAnswering`, are also supported. We are currently working to merge the modeling files into the official Hugging Face repository, which will enable native use of the `Pantagruel` classes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
model_name = "PantagrueLLM/text-base-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Example input
sentences = [
    "Bonjour, comment allez-vous ?",
    "Le chat dort sur le tapis.",
]

# Tokenize input
inputs = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
# Shape: (batch_size, sequence_length, hidden_size)
```
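To reduce the token-level embeddings above to a single sentence-level vector, a common approach (not prescribed by this model card) is attention-mask-aware mean pooling: average only the positions the tokenizer marked as real tokens, skipping padding. The sketch below uses plain Python lists to keep the logic dependency-free; with the tensors above you would apply the same masked average along the sequence dimension.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: per sentence, a [seq_len][hidden] list of float lists
    attention_mask:   per sentence, a [seq_len] list of 0/1 (1 = real token)
    Returns one [hidden] vector per sentence.
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        kept = [v for v, m in zip(vectors, mask) if m == 1]
        hidden = len(vectors[0])
        pooled.append([sum(v[d] for v in kept) / len(kept)
                       for d in range(hidden)])
    return pooled

# Toy example: two real tokens averaged, one padding vector ignored.
embs = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
mask = [[1, 1, 0]]
print(masked_mean_pool(embs, mask))  # [[2.0, 3.0]]
```

Padding vectors are excluded deliberately: including them would bias the sentence vector toward whatever the model emits at padded positions.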

## Speech-only models

For details on our speech-only models, please visit our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models).
## Citation

If you use these models or find them useful in your research, publications, or applications, please cite the following work:

```bibtex
@article{le2026pantagruel,
  title={Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author={Le, Phuong-Hang and Pelloin, Valentin and Chatelain, Arnault and Bouziane, Maryem and Ghennai, Mohammed and Guan, Qianwen and Milintsevich, Kirill and Mdhaffar, Salima and Mannion, Aidan and Defauw, Nils and others},
  journal={arXiv preprint arXiv:2601.05911},
  year={2026}
}
```

For more information, see the full paper: https://arxiv.org/abs/2601.05911.