Update README.md
CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.
It is now available on Hugging Face in six different versions, with varying numbers of parameters, amounts of pretraining data, and pretraining data source domains.

## How to use
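The card's own usage example under this heading is elided from this excerpt. As a minimal sketch, not the card's example, filling masks with the `transformers` `pipeline` API could look like the following (the `camembert-base` checkpoint name and the French prompt are assumptions):

```python
from transformers import pipeline

# Minimal fill-mask sketch; "camembert-base" is an assumed checkpoint name,
# since the model card's own example is elided from this excerpt.
camembert_fill_mask = pipeline("fill-mask", model="camembert-base")

# CamemBERT uses "<mask>" as its mask token; each result dict carries the
# predicted token string, its score, and the completed sequence.
results = camembert_fill_mask("Le camembert est <mask> :)")
for r in results:
    print(r["token_str"], round(r["score"], 3))
```

By default the fill-mask pipeline returns the top five candidate tokens ranked by score.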
## Limitations and bias

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).

This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are further detailed in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:

> The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.

> Constructed from Common Crawl, personal and sensitive information might be present.

## Training data

OSCAR, or Open Super-large Crawled Aggregated coRpus, is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.