Bugpie committed · Commit ee5bbd7 · 1 Parent(s): 962dd99

Update README.md

Files changed (1): README.md (+8 −2)
```diff
@@ -10,8 +10,6 @@ datasets:
 CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.
 It is now available on Hugging Face in six different versions with varying number of parameters, amount of pretraining data and pretraining data source domains.
 
-## Intended uses & limitations
-
 ## How to use
 
 -**Filling masks using pipeline**
@@ -45,6 +43,14 @@ It is now available on Hugging Face in six different versions with varying numbe
 
 ## Limitations and bias
 
+Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
+
+This model was pretrained on a subcorpus of the OSCAR multilingual corpus. Some of the limitations and risks associated with the OSCAR dataset, which are further detailed in the [OSCAR dataset card](https://huggingface.co/datasets/oscar), include the following:
+
+> The quality of some OSCAR sub-corpora might be lower than expected, specifically for the lowest-resource languages.
+
+> Constructed from Common Crawl, personal and sensitive information might be present.
+
 ## Training data
 
 OSCAR or Open Super-large Crawled Aggregated coRpus is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
```
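The "Filling masks using pipeline" section that this diff leaves in place can be sketched with the `transformers` fill-mask pipeline. This is a minimal sketch, assuming the `camembert-base` checkpoint (the commit does not name a specific checkpoint) and a network connection to download the weights on first use.

```python
from transformers import pipeline

# Fill-mask pipeline; assumes the camembert-base checkpoint is the intended one.
camembert_fill_mask = pipeline("fill-mask", model="camembert-base")

# The pipeline predicts replacements for the <mask> token and returns
# a list of candidate dicts with "token_str", "score", and "sequence".
results = camembert_fill_mask("Le camembert est <mask> :)")
for r in results:
    print(r["token_str"], round(r["score"], 3))
```

Each result's `score` is the model's probability for that candidate token, so the list is sorted from most to least likely completion.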