---
license: cc-by-nc-sa-2.0
language:
- fr
pipeline_tag: feature-extraction
library_name: transformers
tags:
- data2vec2
- JEPA
- text
- fairseq
---

# Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

**Summary**

Pantagruel is a family of self-supervised encoder models for French text and speech, with separate models trained for each modality. Rather than relying only on masked input-level reconstruction, Pantagruel encoders learn contextualized representations in feature space following the [data2vec 2.0](https://arxiv.org/abs/2212.07525) / [JEPA (Joint-Embedding Predictive Architecture)](https://arxiv.org/abs/2301.08243) paradigm.

Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both the speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.

The models were pre-trained with the `fairseq` library (v0.12.2) and converted to Hugging Face's `transformers` format. For best compatibility, we recommend `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.

- **Paper**: https://arxiv.org/abs/2601.05911
- **Pre-training code**: to be released soon.

## Text-only models

Pantagruel text encoders are trained on large-scale French text corpora, including Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, the text models incorporate masked language modeling (MLM) to better capture fine-grained syntactic and semantic information. These models produce strong sentence- and token-level representations for downstream NLP tasks.
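The teacher–student dynamic described above can be illustrated with a minimal PyTorch sketch. This is not the Pantagruel or `fairseq` implementation; the toy encoder, masking, and `ema_update` helper are illustrative stand-ins showing the core idea of data2vec 2.0 training: the student regresses the teacher's features for the unmasked input, and the teacher's weights track an EMA of the student's.

```python
import copy
import torch
import torch.nn as nn

# Toy linear encoder standing in for a transformer student;
# all names here are illustrative, not from the Pantagruel codebase.
student = nn.Linear(16, 16)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)  # the teacher is never updated by gradients

def ema_update(teacher, student, decay=0.999):
    """Teacher weights track an exponential moving average of the student."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

# One training step: the student sees a masked input and is trained to
# predict the teacher's features for the full (unmasked) input.
x = torch.randn(4, 16)
mask = (torch.rand(4, 16) > 0.5).float()
with torch.no_grad():
    target = teacher(x)       # teacher observes the full input
pred = student(x * mask)      # student observes a partially visible input
loss = nn.functional.mse_loss(pred, target)
loss.backward()               # (an optimizer step would go here)
ema_update(teacher, student)  # teacher drifts slowly toward the student
```

In the real models the MSE target is the teacher's contextualized hidden states and, for text, this loss is combined with the MLM objective.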
**Note on model naming convention:** Models that include `camtok` in their name use CamemBERT's tokenizer, which allows comparing our models to a BERT-based counterpart. If no tokenizer is specified, the model uses our custom tokenizer. All text-based models are trained with the data2vec 2.0 masked feature prediction objective. Models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside the main data2vec 2.0 objective.

The table below reports accuracy on the natural language inference task on the French XNLI dataset.

| **HuggingFace name** | **Model name (paper)** | **Arch / Params** | **Pretraining dataset** | **Accuracy on XNLI (FR) (dev / test)** |
|----------|------------------------|-----------------|----------------------|---------------------------------------|
| [text-base-camtok-wiki](https://huggingface.co/PantagrueLLM/text-base-camtok-wiki) | Pantagruel-B-camtok-Wk | Base / 110M | French Wikipedia 2019 (4GB) | 76.94% / 77.43% |
| text-base-wiki | Pantagruel-B-Wk | Base / 125M | French Wikipedia 2019 (4GB) | 77.40% / 78.41% |
| [text-base-wiki-mlm](https://huggingface.co/PantagrueLLM/text-base-wiki-mlm) | Pantagruel-B-Wk-MLM | Base / 125M | French Wikipedia 2019 (4GB) | 78.25% / 78.41% |
| [text-base-camtok-oscar](https://huggingface.co/PantagrueLLM/text-base-camtok-oscar) | Pantagruel-B-camtok-Osc | Base / 110M | OSCAR 2019 (138GB) | 80.40% / 80.53% |
| [text-base-oscar-mlm](https://huggingface.co/PantagrueLLM/text-base-oscar-mlm) | Pantagruel-B-Osc-MLM | Base / 125M | OSCAR 2019 (138GB) | 81.11% / 81.52% |
| [text-base-croissant-mlm](https://huggingface.co/PantagrueLLM/text-base-croissant-mlm) | Pantagruel-B-Crs-MLM | Base / 125M | CroissantLLM (1.5GB) | 81.05% / 80.69% |

For more downstream tasks and evaluation datasets, please refer to [our paper](https://arxiv.org/abs/2601.05911).

## Usage

Our models can be used with the `AutoModel` and `AutoConfig` classes to extract features as shown below.
Other common classes for text-related downstream tasks, including `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForMultipleChoice`, `AutoModelForTokenClassification`, and `AutoModelForQuestionAnswering`, are also supported. We are currently working to merge the modeling files into the official Hugging Face repository, which will enable native use of the `Pantagruel` classes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
model_name = "PantagrueLLM/text-base-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Example input
sentences = [
    "Bonjour, comment allez-vous ?",
    "Le chat dort sur le tapis."
]

# Tokenize input
inputs = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt"
)

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # Shape: (batch_size, sequence_length, hidden_size)
```

## Speech-only models

If you want to check out our speech-only models, please visit our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models) for more details.

## Citation

If you use these models or find them useful in your research, publications, or applications, please cite the following work:

```bibtex
@article{le2026pantagruel,
  title={Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author={Le, Phuong-Hang and Pelloin, Valentin and Chatelain, Arnault and Bouziane, Maryem and Ghennai, Mohammed and Guan, Qianwen and Milintsevich, Kirill and Mdhaffar, Salima and Mannion, Aidan and Defauw, Nils and others},
  journal={arXiv preprint arXiv:2601.05911},
  year={2026}
}
```

For more information, see the full paper: https://arxiv.org/abs/2601.05911.
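The token-level embeddings above can be pooled into fixed-size sentence vectors. A common recipe for encoder models (a generic technique, not something prescribed by the Pantagruel paper) is attention-mask-aware mean pooling, which averages only over real tokens and ignores padding. The toy tensors below stand in for `outputs.last_hidden_state` and the tokenizer's `attention_mask`:

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions only."""
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)       # avoid division by zero
    return summed / counts

# Toy stand-ins with the same shapes as the real model outputs:
# batch of 2 sentences, 5 token positions, hidden size 8.
emb = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0],   # first sentence has 2 padding tokens
                     [1, 1, 1, 1, 1]])
sentence_vectors = mean_pool(emb, mask)
print(sentence_vectors.shape)  # torch.Size([2, 8])
```

With the real model, pass `outputs.last_hidden_state` and `inputs["attention_mask"]` to `mean_pool` to obtain one vector per input sentence.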