Pantagruel is a family of self-supervised encoder models for French text and speech.
Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both the speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.
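As an illustration, the EMA teacher update described above can be sketched as follows (a minimal NumPy sketch; the decay value is a placeholder, not the one used in our training):

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """Teacher <- decay * teacher + (1 - decay) * student, per parameter tensor."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

# Toy example: one 2x2 weight matrix per model.
student = [np.ones((2, 2))]
teacher = [np.zeros((2, 2))]
teacher = ema_update(teacher, student, decay=0.9)
print(teacher[0])  # every entry is 0.1
```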
The models were pre-trained with the `fairseq` library (v0.12.2) and converted to HuggingFace's `transformers` format. For best compatibility, we recommend `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.
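For example, one way to pin those versions (a setup sketch; adapt to your own environment):

```shell
pip install "transformers==4.57.0" "tokenizers==0.22.1" "sentencepiece==0.1.99"
```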
- **Paper**: https://arxiv.org/abs/2601.05911
- **Pre-training code**: coming soon.
## Text-only models
Pantagruel text encoders are trained on large-scale French text corpora, including Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, text models incorporate masked language modeling (MLM) to better capture fine-grained syntactic and semantic information. These models produce strong sentence and token-level representations for downstream NLP tasks.
**Note on model naming convention:** Models whose name includes `camtok` use CamemBERT's tokenizer, which lets us compare our models to a BERT-based counterpart. If no tokenizer is specified in the name, the model uses our custom tokenizer. All text models are trained with the data2vec 2.0 masked feature prediction objective; models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside it.
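Schematically, the training signal for the `MLM`-suffixed models combines the two objectives. A toy sketch (plain NumPy; the MSE feature loss, the equal weighting, and all names here are illustrative placeholders, not the exact formulation from the paper):

```python
import numpy as np

def softmax_xent(logits, targets):
    """Cross-entropy over vocabulary logits at masked positions."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(targets)), targets])

def combined_loss(student_feats, teacher_feats, mlm_logits, mlm_targets, mlm_weight=1.0):
    # data2vec 2.0-style objective: regress teacher features at masked positions
    feat_loss = np.mean((student_feats - teacher_feats) ** 2)
    # MLM objective: predict the original tokens at masked positions
    mlm_loss = softmax_xent(mlm_logits, mlm_targets)
    return feat_loss + mlm_weight * mlm_loss

# Toy shapes: 3 masked positions, feature dim 4, vocabulary size 5.
rng = np.random.default_rng(0)
loss = combined_loss(rng.normal(size=(3, 4)), rng.normal(size=(3, 4)),
                     rng.normal(size=(3, 5)), np.array([1, 0, 3]))
print(loss)
```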
The table below reports accuracy on the natural language inference task of the French portion of the XNLI dataset.
| **HuggingFace name** | **Model name (paper)** | **Arch. / Params** | **Pre-training dataset** | **Accuracy on XNLI (FR) (dev / test)** |
|----------------------|------------------------|--------------------|--------------------------|----------------------------------------|
| text-base-camtok-wiki | Pantagruel-B-camtok-Wk | Base / 110M | French Wikipedia 2019 (4GB) | 76.94% / 77.43% |
| text-base-wiki | Pantagruel-B-Wk | Base / 125M | French Wikipedia 2019 (4GB) | 77.40% / 78.41% |
| text-base-wiki-mlm | Pantagruel-B-Wk-MLM | Base / 125M | French Wikipedia 2019 (4GB) | 78.25% / 78.41% |
| text-base-camtok-oscar | Pantagruel-B-camtok-Osc | Base / 110M | OSCAR 2019 (138GB) | 80.40% / 80.53% |
| text-base-oscar-mlm | Pantagruel-B-Osc-MLM | Base / 125M | OSCAR 2019 (138GB) | 81.11% / 81.52% |
| text-base-croissant-mlm | Pantagruel-B-Crs-MLM | Base / 125M | CroissantLLM (1.5GB) | 81.05% / 80.69% |
For more downstream tasks and evaluation datasets, please refer to [our paper](https://arxiv.org/abs/2601.05911).
```python
print(token_embeddings.shape)
# Shape: (batch_size, sequence_length, hidden_size)
```
## Speech-only models
Please refer to our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models) for more details on our speech-only models.
## Citation
If you use these models or find them useful in your research, publications, or applications, please cite the following work: