---
license: cc-by-nc-sa-2.0
datasets:
- croissantllm/croissant_dataset
language:
- fr
pipeline_tag: feature-extraction
library_name: fairseq
---

# Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

**Summary**

Pantagruel is a family of self-supervised encoder models for French text and speech, with separate models trained for each modality. Rather than relying only on masked input-level reconstruction, Pantagruel encoders learn contextualized representations in feature space, following the [data2vec 2.0](https://arxiv.org/abs/2212.07525) / [JEPA (Joint-Embedding Predictive Architecture)](https://arxiv.org/abs/2301.08243) paradigm.

Pantagruel uses a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is an exponential moving average (EMA) of the student's weights. This feature-space prediction objective is used for both the speech and text models; for text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.
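
The EMA teacher update can be sketched as follows. This is a minimal illustration in plain Python over flat parameter lists; the decay value `tau` and the per-element update are simplifications for exposition, not the exact fairseq implementation.

```python
def ema_update(teacher_params, student_params, tau=0.999):
    """Update teacher parameters as an exponential moving average of the student.

    teacher <- tau * teacher + (1 - tau) * student, applied element-wise.
    """
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: one training step with a decay of 0.9
teacher = [1.0, 2.0]
student = [0.0, 0.0]
teacher = ema_update(teacher, student, tau=0.9)
print(teacher)  # [0.9, 1.8]
```

With `tau` close to 1, the teacher changes slowly, which keeps the prediction targets stable while the student is trained.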

The models were pre-trained with the `fairseq` library (v0.12.2) and converted to Hugging Face's `transformers` format. For best compatibility, we recommend `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.

- **Paper**: https://arxiv.org/abs/2601.05911
- **Pre-training code**: to be released soon.

## Speech-only models

Please refer to our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models) for more details.

## Text-only models

Please refer to our [text-only collection](https://huggingface.co/collections/PantagrueLLM/text-only-models) for more details.

Pantagruel text encoders are trained on large-scale French text corpora: French Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, the text models incorporate a masked language modeling (MLM) objective. These models produce strong sentence- and token-level representations for downstream NLP tasks.

**Note on model naming conventions:** Models with `camtok` in their name use CamemBERT's tokenizer; models without a tokenizer suffix use our custom tokenizer. All text models are trained with the data2vec 2.0 masked feature prediction objective; models with an `MLM` suffix additionally use the masked language modeling objective.

The table below reports accuracy on French natural language inference (XNLI, FR).

| **Model name (HuggingFace / Paper)** | **Arch / Params** | **Pretraining dataset** | **XNLI (FR) accuracy (dev / test)** | **Note** |
|------------------------|-----------------|----------------------|---------------------------------------|----------|
| text-base-camtok-wiki / Pantagruel-B-camtok-Wk | Base (110M) | French Wikipedia 2019 (4 GB) | 76.94% / 77.43% | ablation study |
| text-base-wiki / Pantagruel-B-Wk | Base (125M) | French Wikipedia 2019 (4 GB) | 77.40% / 78.41% | ablation study |
| text-base-wiki-mlm / Pantagruel-B-Wk-MLM | Base (125M) | French Wikipedia 2019 (4 GB) | 78.25% / 78.41% | |
| text-base-camtok-oscar / Pantagruel-B-camtok-Osc | Base (110M) | OSCAR 2019 (138 GB) | 80.40% / 80.53% | |
| text-base-oscar-mlm / Pantagruel-B-Osc-MLM | Base (125M) | OSCAR 2019 (138 GB) | 81.11% / 81.52% | |
| text-base-croissant-mlm / Pantagruel-B-Crs-MLM | Base (125M) | CroissantLLM (1.5 GB) | 80.91% / 81.05% | |

For more downstream tasks and evaluation datasets, please refer to [our paper](https://arxiv.org/abs/2601.05911).

## Usage

Our models can be used with the `AutoModel` and `AutoConfig` classes to extract features, as shown below. Other common classes for text downstream tasks are also supported, including `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForMultipleChoice`, `AutoModelForTokenClassification`, and `AutoModelForQuestionAnswering`. We are currently working to merge the modeling files into the official Hugging Face `transformers` repository, which will enable native use of the `Pantagruel` classes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
model_name = "PantagrueLLM/text-base-croissant-mlm"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Example input
sentences = [
    "Bonjour, comment allez-vous ?",
    "Le chat dort sur le tapis."
]

# Tokenize input
inputs = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
# Shape: (batch_size, sequence_length, hidden_size)
```
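
To turn the token-level embeddings into a single sentence embedding, one common option is attention-mask-aware mean pooling. This is a minimal sketch of that technique; the pooling strategy is our suggestion for illustration, not something prescribed by the model.

```python
import torch

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over real tokens, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len).
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                        # (batch, 1)
    return summed / counts

# Toy example: batch of 1, two real tokens and one padding token
emb = torch.tensor([[[1.0, 3.0], [3.0, 5.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
print(mean_pool(emb, mask))  # the padded position is excluded from the average
```

In the usage example above, this would be called as `mean_pool(outputs.last_hidden_state, inputs["attention_mask"])`.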

## Citation

If you use these models or find them useful in your research, publications, or applications, please cite the following work:

```bibtex
@article{le2026pantagruel,
  title={Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author={Le, Phuong-Hang and Pelloin, Valentin and Chatelain, Arnault and Bouziane, Maryem and Ghennai, Mohammed and Guan, Qianwen and Milintsevich, Kirill and Mdhaffar, Salima and Mannion, Aidan and Defauw, Nils and others},
  journal={arXiv preprint arXiv:2601.05911},
  year={2026}
}
```

For more information, see the full paper: https://arxiv.org/abs/2601.05911.