flaubert committed
Commit 33a308f · verified · 1 Parent(s): c20ce1d

Update README.md

Files changed (1)
  1. README.md +15 -18
README.md CHANGED
@@ -16,35 +16,27 @@ Pantagruel is a family of self-supervised encoder models for French text and speech

Pantagruel adopts a data2vec 2.0-style teacher–student setup: a student encoder processes partially visible inputs and is trained to predict latent representations produced by a teacher encoder that observes the full, unmasked inputs. The teacher is implemented as an exponential moving average (EMA) of the student. This feature-space prediction objective is used for both speech and text models. For text, it is combined with an additional masked language modeling (MLM) loss to better capture fine-grained syntactic and semantic information.

- The models were pre-trained using `fairseq` library (v0.12.2) and converted to HuggingFacce's `transformers` format. For best compatibility, we recommend using `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.
+ The models were pre-trained using the `fairseq` library (v0.12.2) and converted to HuggingFace's `transformers` format. For best compatibility, we recommend using `transformers==4.57.0` or `4.56.2`, together with `tokenizers==0.22.1` and `sentencepiece==0.1.99`.

- **Paper**: https://arxiv.org/abs/2601.05911
- **Pre-training code**: to be updated soon.


- ## Speech-only models
-
- Please refer to our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models) for more details.
-
- ## Text-only models
-
- Please refer to our [text-only collection](https://huggingface.co/collections/PantagrueLLM/text-only-models) for more details.
-
## Text-only models
Pantagruel text encoders are trained on large-scale French text corpora, including Wikipedia 2019, OSCAR 2019, and CroissantLLM. In addition to feature-space prediction, text models incorporate masked language modeling (MLM) to better capture fine-grained syntactic and semantic information. These models produce strong sentence and token-level representations for downstream NLP tasks.

- **Note on model naming convention:** Models that include `camtok` in their name use CamemBERT's tokenizer. If no tokenizer is specified, the model uses our custom tokenizer. All text-based models are trained using the data2vec 2.0 masked feature prediction objective. Models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside the main data2vec 2.0 objective.
+ **Note on model naming convention:** Models that include `camtok` in their name use CamemBERT's tokenizer, which allows comparing our models to a BERT-based counterpart. If no tokenizer is specified, the model uses our custom tokenizer. All text-based models are trained using the data2vec 2.0 masked feature prediction objective. Models with an `MLM` suffix additionally incorporate the masked language modeling (MLM) objective alongside the main data2vec 2.0 objective.

The table below presents the accuracy of the natural language inference task on the French XNLI dataset.

- | **Model name (HuggingFace / Paper)** | **Arch/ Params** | **Pretrained dataset** | **Accuracy on XNLI (FR) (dev / test)** | **Note** |
- |------------------------|-----------------|----------------------|---------------------------------------|----------|
- | text-base-camtok-wiki / Pantagruel-B-camtok-Wk | Base (110M) | French Wikipedia 2019 (4GB) | 76.94% / 77.43% | for ablation study purpose |
- | text-base-wiki / Pantagruel-B-Wk | Base (125M) | French Wikipedia 2019 (4GB) | 77.40% / 78.41% | for ablation study purpose |
- | text-base-wiki-mlm / Pantagruel-B-Wk-MLM | Base (125M) | French Wikipedia 2019 (4GB) | 78.25% / 78.41% | |
- | text-base-camtok-oscar / Pantagruel-B-camtok-Osc | Base (110M) | OSCAR 2019 (138GB) | 80.40% / 80.53% | |
- | text-base-oscar-mlm / Pantagruel-B-Osc-MLM | Base (125M) | OSCAR 2019 (138GB) | 81.11% / 81.52% | |
- | text-base-croissant-mlm / Pantagruel-B-Crs-MLM | Base (125M) | croissantLLM (1.5GB) | 80.91% / 81.05% | |
+ | **HuggingFace name** | **Model name (paper)** | **Arch / Params** | **Pretrained dataset** | **Accuracy on XNLI (FR) (dev / test)** |
+ |----------------------|------------------------|-------------------|------------------------|----------------------------------------|
+ | text-base-camtok-wiki | Pantagruel-B-camtok-Wk | Base / 110M | French Wikipedia 2019 (4GB) | 76.94% / 77.43% |
+ | text-base-wiki | Pantagruel-B-Wk | Base / 125M | French Wikipedia 2019 (4GB) | 77.40% / 78.41% |
+ | text-base-wiki-mlm | Pantagruel-B-Wk-MLM | Base / 125M | French Wikipedia 2019 (4GB) | 78.25% / 78.41% |
+ | text-base-camtok-oscar | Pantagruel-B-camtok-Osc | Base / 110M | OSCAR 2019 (138GB) | 80.40% / 80.53% |
+ | text-base-oscar-mlm | Pantagruel-B-Osc-MLM | Base / 125M | OSCAR 2019 (138GB) | 81.11% / 81.52% |
+ | text-base-croissant-mlm | Pantagruel-B-Crs-MLM | Base / 125M | croissantLLM (1.5GB) | 81.05% / 80.69% |

For more downstream tasks and evaluation datasets, please refer to [our paper](https://arxiv.org/abs/2601.05911).

@@ -85,6 +77,11 @@ print(token_embeddings.shape)
# Shape: (batch_size, sequence_length, hidden_size)
```

+ ## Speech-only models
+
+ For our speech-only models, please refer to our [speech-only collection](https://huggingface.co/collections/PantagrueLLM/speech-only-models).
+
+
## Citation
If you use these models or find them useful in your research, publications, or applications, please cite the following work:
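The EMA teacher–student objective the README describes can be sketched as follows. This is a minimal, illustrative NumPy example and not the actual fairseq/data2vec 2.0 implementation; the function names and shapes are hypothetical:

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.999):
    """Move each teacher parameter toward the student's (data2vec-style EMA)."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]

def feature_prediction_loss(student_feats, teacher_feats, mask):
    """Squared error between student predictions and teacher targets,
    averaged over masked positions only."""
    diff = (student_feats - teacher_feats) ** 2
    return float(diff[mask].mean())

# Toy example: one weight matrix and features for 6 input positions.
rng = np.random.default_rng(0)
student = [rng.normal(size=(4, 4))]
teacher = [np.zeros((4, 4))]
teacher = ema_update(teacher, student, decay=0.9)  # teacher becomes 0.1 * student

s_feats = rng.normal(size=(6, 8))  # student outputs on the masked input
t_feats = rng.normal(size=(6, 8))  # teacher outputs on the full input
mask = np.array([True, True, False, False, True, False])  # masked positions
loss = feature_prediction_loss(s_feats, t_feats, mask)
```

In the text models, a standard MLM cross-entropy term would be added to this feature-space loss; the decay value and shapes above are placeholders.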