---
license: cc-by-nc-sa-2.0
language:
- fr
pipeline_tag: feature-extraction
library_name: fairseq
---

# Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
**Summary**
Pantagruel is a family of self-supervised encoder models for French text and speech, with separate models trained for each modality. Rather than relying only on masked input-level reconstruction, Pantagruel encoders learn contextualized representations in feature space following the [data2vec 2.0](https://arxiv.org/abs/2212.07525) / [JEPA (Joint-Embedding Predictive Architecture)](https://arxiv.org/abs/2301.08243) paradigm.
Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.
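
The EMA teacher update can be sketched as follows. This is a minimal illustration, not the actual training code; the decay value here is made up for visibility (real training typically uses a decay close to 1, e.g. 0.999):

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter a small step toward the student:
    teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# Toy example: two scalar "parameters", with an exaggerated decay of 0.9
teacher = ema_update([0.0, 1.0], [1.0, 0.0], decay=0.9)
print(teacher)  # approximately [0.1, 0.9]: the teacher smoothly lags the student
```

Because the teacher is a slowly moving average rather than a copy of the student, its targets change smoothly during training, which stabilizes the feature-space prediction objective.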
The models were pre-trained using the `fairseq` library (v0.12.2) and converted to Hugging Face's `transformers` format. For best compatibility, we recommend using `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.
- **Paper**: https://arxiv.org/abs/2601.05911
- **Pre-training code**: to be released soon.
## Speech-only models
Please refer to our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models) for more details.
## Text-only models
Please refer to our [text-only collection](https://huggingface.co/collections/PantagrueLLM/text-only-models) for more details.
Pantagruel text encoders are trained on large-scale French text corpora, including Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, text models incorporate masked language modeling (MLM) to better capture fine-grained syntactic and semantic information. These models produce strong sentence and token-level representations for downstream NLP tasks.
**Note on model naming convention:** Models that include `camtok` in their name use CamemBERT's tokenizer. If no tokenizer is specified, the model uses our custom tokenizer. All text-based models are trained using the data2vec 2.0 masked feature prediction objective. Models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside the main data2vec 2.0 objective.
The table below reports accuracy on the natural language inference task, using the French portion of the XNLI dataset.
| **Model name (HuggingFace / Paper)** | **Arch / Params** | **Pretraining dataset** | **Accuracy on XNLI (FR) (dev / test)** | **Note** |
|------------------------|-----------------|----------------------|---------------------------------------|----------|
| text-base-camtok-wiki / Pantagruel-B-camtok-Wk | Base (110M) | French Wikipedia 2019 (4GB) | 76.94% / 77.43% | for ablation studies |
| text-base-wiki / Pantagruel-B-Wk | Base (125M) | French Wikipedia 2019 (4GB) | 77.40% / 78.41% | for ablation studies |
| text-base-wiki-mlm / Pantagruel-B-Wk-MLM | Base (125M) | French Wikipedia 2019 (4GB) | 78.25% / 78.41% | |
| text-base-camtok-oscar / Pantagruel-B-camtok-Osc | Base (110M) | OSCAR 2019 (138GB) | 80.40% / 80.53% | |
| text-base-oscar-mlm / Pantagruel-B-Osc-MLM | Base (125M) | OSCAR 2019 (138GB) | 81.11% / 81.52% | |
| text-base-croissant-mlm / Pantagruel-B-Crs-MLM | Base (125M) | croissantLLM (1.5GB) | 80.91% / 81.05% | |
For more downstream tasks and evaluation datasets, please refer to [our paper](https://arxiv.org/abs/2601.05911).
## Usage
Our models can be used with the `AutoModel` and `AutoConfig` classes to extract features, as shown below. Other common classes for text-related downstream tasks, including `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForMultipleChoice`, `AutoModelForTokenClassification`, and `AutoModelForQuestionAnswering`, are also supported. We are currently working to merge the modeling files into the official Hugging Face repository, which will enable native use of the `Pantagruel` classes.
```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
model_name = "PantagrueLLM/text-base-oscar-mlm"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Example input
sentences = [
    "Bonjour, comment allez-vous ?",
    "Le chat dort sur le tapis."
]

# Tokenize input
inputs = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
# Shape: (batch_size, sequence_length, hidden_size)
```
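
The snippet above returns token-level embeddings; a single vector per sentence is often obtained by mean pooling over the non-padding tokens. The helper below is not part of the Pantagruel API, just a common recipe, demonstrated here on a dummy tensor so it runs without downloading the model:

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padding positions, then average over the real tokens only.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Dummy batch: 1 sentence, 3 token positions (last one is padding), hidden size 2
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])
print(mean_pool(hidden, mask))  # tensor([[2., 3.]])
```

With real model outputs, you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above.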
## Citation
If you use these models or find them useful in your research, publications, or applications, please cite the following work:
```bibtex
@article{le2026pantagruel,
  title={Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author={Le, Phuong-Hang and Pelloin, Valentin and Chatelain, Arnault and Bouziane, Maryem and Ghennai, Mohammed and Guan, Qianwen and Milintsevich, Kirill and Mdhaffar, Salima and Mannion, Aidan and Defauw, Nils and others},
  journal={arXiv preprint arXiv:2601.05911},
  year={2026}
}
```
For more information, see the full paper: https://arxiv.org/abs/2601.05911.