Update README.md
README.md CHANGED
@@ -116,7 +116,18 @@ The training corpus consists of several corpora gathered from web crawling and p
 ### Training procedure
 
 The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2)
-used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,262 tokens.
+used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 50,262 tokens.
+
+### Author
+Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
+
+### Contact information
+For further information, send an email to <plantl-gob-es@bsc.es>
+
+### Copyright
+Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
+
+### Licensing information
 The RoBERTa-ca-v2 pretraining consists of a masked language model training that follows the approach employed for the RoBERTa base model
 with the same hyperparameters as in the original work.
 The training lasted a total of 96 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM.
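The hunk above pins down two concrete tokenizer details: byte-level BPE and a 50,262-token vocabulary. As a minimal sketch of that step, assuming the Hugging Face `tokenizers` library (the card itself only links the GPT-2 and fairseq repositories), with a hypothetical corpus path and RoBERTa-style special tokens:

```python
# Minimal sketch of the tokenization step described in the hunk above.
# Assumptions: the Hugging Face `tokenizers` library, a hypothetical corpus
# file, and RoBERTa-style special tokens (not stated in the card).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["catalan_corpus.txt"],  # hypothetical path to the training corpus
    vocab_size=50262,              # vocabulary size given in the card
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt
```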
@@ -233,9 +244,6 @@ If you use any of these resources (datasets or models) in your work, please cite
 
 ### Disclaimer
 
-<details>
-<summary>Click to expand</summary>
-
 The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.
 
 When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models) or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
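Back in the first hunk, the training procedure is a masked-language-model objective "with the same hyperparameters as in the original work". Below is a minimal sketch of that objective, assuming the Hugging Face `transformers` data collator; the public `roberta-base` tokenizer is used purely as a stand-in for the model's own Catalan vocabulary, and 15% is RoBERTa's standard masking rate:

```python
# Minimal sketch of RoBERTa-style masked-language-model data preparation.
# `roberta-base` is a stand-in tokenizer; the actual model uses its own
# 50,262-token Catalan BPE vocabulary.
from transformers import DataCollatorForLanguageModeling, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # standard RoBERTa masking rate
)

# The collator dynamically masks ~15% of tokens and sets labels so the
# model is scored only on the masked positions (-100 elsewhere).
batch = collator([tokenizer("El català és una llengua romànica.")])
print(batch["input_ids"])  # some token ids replaced by the <mask> id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```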