---
license: cc-by-nc-sa-2.0
language:
- fr
pipeline_tag: feature-extraction
library_name: transformers
tags:
- data2vec2
- JEPA
- text
- fairseq
---

# Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

**Summary**

Pantagruel is a family of self-supervised encoder models for French text and speech, with separate models trained for each modality. Rather than relying only on masked input-level reconstruction, Pantagruel encoders learn contextualized representations in feature space, following the [data2vec 2.0](https://arxiv.org/abs/2212.07525) / [JEPA (Joint-Embedding Predictive Architecture)](https://arxiv.org/abs/2301.08243) paradigm.

Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both the speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.
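As an illustration of the EMA teacher update described above (this is not the authors' training code, and the decay value `tau` is a hypothetical hyperparameter), the core rule can be sketched in a few lines:

```python
def ema_update(teacher_params, student_params, tau=0.999):
    """Move each teacher parameter slowly toward the student's value:

        teacher <- tau * teacher + (1 - tau) * student

    Parameters are plain lists of floats here for illustration; in a real
    framework the same rule is applied tensor-by-tensor.
    """
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: with a small tau the teacher tracks the student quickly,
# with tau close to 1 it changes very slowly.
teacher = ema_update([0.0, 0.0], [1.0, 2.0], tau=0.9)
print(teacher)  # approximately [0.1, 0.2]
```

Because the teacher lags behind the student, its representations provide a stable prediction target while still improving over training.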

The models were pre-trained with the `fairseq` library (v0.12.2) and converted to Hugging Face's `transformers` format. For best compatibility, we recommend `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.

- **Paper**: https://arxiv.org/abs/2601.05911
- **Pre-training code**: to be released soon.

## Text-only models

Pantagruel text encoders are trained on large-scale French text corpora, including Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, the text models incorporate masked language modeling (MLM) to better capture fine-grained syntactic and semantic information. These models produce strong sentence- and token-level representations for downstream NLP tasks.

**Note on model naming convention:** Models that include `camtok` in their name use CamemBERT's tokenizer, allowing a direct comparison of our models with a BERT-based counterpart. If no tokenizer is specified, the model uses our custom tokenizer. All text models are trained with the data2vec 2.0 masked feature prediction objective; models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside it.

The table below reports accuracy on the natural language inference task, using the French portion of the XNLI dataset.

| **HuggingFace name** | **Model name (paper)** | **Arch / Params** | **Pretraining dataset** | **Accuracy on XNLI (FR) (dev / test)** |
|----------|------------------------|-----------------|----------------------|---------------------------------------|
| [text-base-camtok-wiki](https://huggingface.co/PantagrueLLM/text-base-camtok-wiki) | Pantagruel-B-camtok-Wk | Base / 110M | French Wikipedia 2019 (4GB) | 76.94% / 77.43% |
| text-base-wiki | Pantagruel-B-Wk | Base / 125M | French Wikipedia 2019 (4GB) | 77.40% / 78.41% |
| [text-base-wiki-mlm](https://huggingface.co/PantagrueLLM/text-base-wiki-mlm) | Pantagruel-B-Wk-MLM | Base / 125M | French Wikipedia 2019 (4GB) | 78.25% / 78.41% |
| [text-base-camtok-oscar](https://huggingface.co/PantagrueLLM/text-base-camtok-oscar) | Pantagruel-B-camtok-Osc | Base / 110M | OSCAR 2019 (138GB) | 80.40% / 80.53% |
| [text-base-oscar-mlm](https://huggingface.co/PantagrueLLM/text-base-oscar-mlm) | Pantagruel-B-Osc-MLM | Base / 125M | OSCAR 2019 (138GB) | 81.11% / 81.52% |
| [text-base-croissant-mlm](https://huggingface.co/PantagrueLLM/text-base-croissant-mlm) | Pantagruel-B-Crs-MLM | Base / 125M | CroissantLLM (1.5GB) | 81.05% / 80.69% |

For more downstream tasks and evaluation datasets, please refer to [our paper](https://arxiv.org/abs/2601.05911).

## Usage

Our models can be used with the `AutoModel` and `AutoConfig` classes to extract features, as shown below. Other common classes for text-related downstream tasks, including `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForMultipleChoice`, `AutoModelForTokenClassification`, and `AutoModelForQuestionAnswering`, are also supported. We are currently working to merge the modeling files into the official Hugging Face repository, which will enable native use of the `Pantagruel` classes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
model_name = "PantagrueLLM/text-base-wiki"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

# Example input
sentences = [
    "Bonjour, comment allez-vous ?",
    "Le chat dort sur le tapis.",
]

# Tokenize input
inputs = tokenizer(
    sentences,
    padding=True,
    truncation=True,
    return_tensors="pt",
)

# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs)

# Token-level embeddings
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
# Shape: (batch_size, sequence_length, hidden_size)
```
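To reduce the token-level embeddings above to a single sentence-level vector, a common approach (not prescribed by this model card) is attention-mask-aware mean pooling: average only the positions the tokenizer marked as real tokens, skipping padding. The sketch below uses plain Python lists to keep the logic dependency-free; with the tensors above you would apply the same masked average along the sequence dimension.

```python
def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: per sentence, a [seq_len][hidden] list of float lists
    attention_mask:   per sentence, a [seq_len] list of 0/1 (1 = real token)
    Returns one [hidden] vector per sentence.
    """
    pooled = []
    for vectors, mask in zip(token_embeddings, attention_mask):
        kept = [v for v, m in zip(vectors, mask) if m == 1]
        hidden = len(vectors[0])
        pooled.append([sum(v[d] for v in kept) / len(kept)
                       for d in range(hidden)])
    return pooled

# Toy example: two real tokens averaged, one padding vector ignored.
embs = [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]
mask = [[1, 1, 0]]
print(masked_mean_pool(embs, mask))  # [[2.0, 3.0]]
```

Padding vectors are excluded deliberately: including them would bias the sentence vector toward whatever the model emits at padded positions.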

## Speech-only models

For details on our speech-only models, please visit our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models).
## Citation

If you use these models or find them useful in your research, publications, or applications, please cite the following work:

```bibtex
@article{le2026pantagruel,
  title={Pantagruel: Unified Self-Supervised Encoders for French Text and Speech},
  author={Le, Phuong-Hang and Pelloin, Valentin and Chatelain, Arnault and Bouziane, Maryem and Ghennai, Mohammed and Guan, Qianwen and Milintsevich, Kirill and Mdhaffar, Salima and Mannion, Aidan and Defauw, Nils and others},
  journal={arXiv preprint arXiv:2601.05911},
  year={2026}
}
```

For more information, see the full paper: https://arxiv.org/abs/2601.05911.