IsGarrido committed
Commit 3d54a71 · verified · 1 Parent(s): ece66f5

Upload PlanTL-GOB-ES-roberta-base-bne-copy version gender_classifier_en_modernbert_base

Files changed (8)
  1. README.md +149 -0
  2. config.json +25 -0
  3. merges.txt +0 -0
  4. pytorch_model.bin +3 -0
  5. special_tokens_map.json +51 -0
  6. tokenizer.json +0 -0
  7. tokenizer_config.json +66 -0
  8. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,149 @@
+ ---
+ license: apache-2.0
+ base_model:
+ - PlanTL-GOB-ES/roberta-base-bne
+ tags:
+ - roberta-base-bne
+ - PlanTL-GOB-ES
+ - MarIA
+ ---
+ # PlanTL-GOB-ES-roberta-base-bne
+
+ ©© **All rights reserved: https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne** ©©
+
+ **Copy of** MarIA (**PlanTL-GOB-ES/roberta-base-bne**), preserved here because the weight files of the original model were permanently removed when it was deprecated.
+
+
+ ## How to use
+ Here is how to use this model:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> from pprint import pprint
+ >>> unmasker = pipeline('fill-mask', model='PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
+ >>> pprint(unmasker("Gracias a los datos de la BNE se ha podido <mask> este modelo del lenguaje."))
+ [{'score': 0.08422081917524338,
+ 'token': 3832,
+ 'token_str': ' desarrollar',
+ 'sequence': 'Gracias a los datos de la BNE se ha podido desarrollar este modelo del lenguaje.'},
+ {'score': 0.06348305940628052,
+ 'token': 3078,
+ 'token_str': ' crear',
+ 'sequence': 'Gracias a los datos de la BNE se ha podido crear este modelo del lenguaje.'},
+ {'score': 0.06148449331521988,
+ 'token': 2171,
+ 'token_str': ' realizar',
+ 'sequence': 'Gracias a los datos de la BNE se ha podido realizar este modelo del lenguaje.'},
+ {'score': 0.056218471378088,
+ 'token': 10880,
+ 'token_str': ' elaborar',
+ 'sequence': 'Gracias a los datos de la BNE se ha podido elaborar este modelo del lenguaje.'},
+ {'score': 0.05133328214287758,
+ 'token': 31915,
+ 'token_str': ' validar',
+ 'sequence': 'Gracias a los datos de la BNE se ha podido validar este modelo del lenguaje.'}]
+ ```
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ >>> from transformers import RobertaTokenizer, RobertaModel
+ >>> tokenizer = RobertaTokenizer.from_pretrained('PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
+ >>> model = RobertaModel.from_pretrained('PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
+ >>> text = "Gracias a los datos de la BNE se ha podido desarrollar este modelo del lenguaje."
+ >>> encoded_input = tokenizer(text, return_tensors='pt')
+ >>> output = model(**encoded_input)
+ >>> print(output.last_hidden_state.shape)
+ torch.Size([1, 19, 768])
+ ```
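The `last_hidden_state` above gives one 768-dimensional vector per token. If you instead want a single fixed-size vector per sentence, one common recipe (our addition, not part of the original model card) is to mean-pool the token vectors under the attention mask; a minimal sketch:

```python
# Sketch (not from the original card): mean-pool token features into one
# sentence embedding per input, ignoring padding positions.
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
model = RobertaModel.from_pretrained('PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')

sentences = ["Gracias a los datos de la BNE se ha podido desarrollar este modelo del lenguaje.",
             "La BNE custodia el patrimonio bibliográfico español."]
enc = tokenizer(sentences, padding=True, return_tensors='pt')
with torch.no_grad():
    hidden = model(**enc).last_hidden_state        # (batch, seq_len, 768)
mask = enc['attention_mask'].unsqueeze(-1)         # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, 768)
print(embeddings.shape)                            # torch.Size([2, 768])
```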
+
+
+ ## Limitations and bias
+
+ At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated. Nevertheless, here is an example of how the model can produce biased predictions:
+
+ ```python
+ >>> from transformers import pipeline, set_seed
+ >>> from pprint import pprint
+ >>> unmasker = pipeline('fill-mask', model='PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
+ >>> set_seed(42)
+ >>> pprint(unmasker("Antonio está pensando en <mask>."))
+ [{'score': 0.07950365543365479,
+ 'sequence': 'Antonio está pensando en ti.',
+ 'token': 486,
+ 'token_str': ' ti'},
+ {'score': 0.03375273942947388,
+ 'sequence': 'Antonio está pensando en irse.',
+ 'token': 13134,
+ 'token_str': ' irse'},
+ {'score': 0.031026942655444145,
+ 'sequence': 'Antonio está pensando en casarse.',
+ 'token': 24852,
+ 'token_str': ' casarse'},
+ {'score': 0.030703715980052948,
+ 'sequence': 'Antonio está pensando en todo.',
+ 'token': 665,
+ 'token_str': ' todo'},
+ {'score': 0.02838558703660965,
+ 'sequence': 'Antonio está pensando en ello.',
+ 'token': 1577,
+ 'token_str': ' ello'}]
+
+ >>> set_seed(42)
+ >>> pprint(unmasker("Mohammed está pensando en <mask>."))
+ [{'score': 0.05433618649840355,
+ 'sequence': 'Mohammed está pensando en morir.',
+ 'token': 9459,
+ 'token_str': ' morir'},
+ {'score': 0.0400255024433136,
+ 'sequence': 'Mohammed está pensando en irse.',
+ 'token': 13134,
+ 'token_str': ' irse'},
+ {'score': 0.03705748915672302,
+ 'sequence': 'Mohammed está pensando en todo.',
+ 'token': 665,
+ 'token_str': ' todo'},
+ {'score': 0.03658654913306236,
+ 'sequence': 'Mohammed está pensando en quedarse.',
+ 'token': 9331,
+ 'token_str': ' quedarse'},
+ {'score': 0.03329474478960037,
+ 'sequence': 'Mohammed está pensando en ello.',
+ 'token': 1577,
+ 'token_str': ' ello'}]
+ ```
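The contrast between the two prompts can also be quantified directly. Below is a minimal sketch (ours, not from the original card; `score_for` is a hypothetical helper) that reads the score the model assigns to one candidate completion under each prompt, assuming a transformers version whose fill-mask pipeline accepts `top_k`:

```python
# Sketch: compare the fill-mask score of a single candidate word under
# two prompts that differ only in the name used.
from transformers import pipeline

unmasker = pipeline('fill-mask', model='PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')

def score_for(prompt: str, word: str) -> float:
    # Return the score of `word` among the top 100 predictions, else 0.0.
    preds = unmasker(prompt, top_k=100)
    return next((p['score'] for p in preds if p['token_str'].strip() == word), 0.0)

for name in ("Antonio", "Mohammed"):
    print(name, score_for(f"{name} está pensando en <mask>.", "morir"))
```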
+
+
+
+ ## Additional information
+
+ ### Author
+ Text Mining Unit (TeMU) from Barcelona Supercomputing Center (<bsc-temu@bsc.es>).
+
+ ### Contact information
+ For further information, send an email to <plantl-gob-es@bsc.es>.
+
+ ### Copyright
+ Copyright by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx).
+
+ ### Licensing information
+ This work is licensed under an [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+
+ ### Funding
+ This work was funded by the [Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA)](https://portal.mineco.gob.es/en-us/digitalizacionIA/Pages/sedia.aspx) within the framework of the Plan-TL.
+
+ ### Citation information
+ If you use this model, please cite the [paper](http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405):
+ ```
+ @article{maria,
+ title = {MarIA: Spanish Language Models},
+ author = {Asier Gutiérrez Fandiño and Jordi Armengol Estapé and Marc Pàmies and Joan Llop Palao and Joaquin Silveira Ocampo and Casimiro Pio Carrino and Carme Armentano Oller and Carlos Rodriguez Penagos and Aitor Gonzalez Agirre and Marta Villegas},
+ doi = {10.26342/2022-68-3},
+ issn = {1135-5948},
+ journal = {Procesamiento del Lenguaje Natural},
+ publisher = {Sociedad Española para el Procesamiento del Lenguaje Natural},
+ url = {https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley},
+ volume = {68},
+ year = {2022},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+ "architectures": [
+ "RobertaForMaskedLM"
+ ],
+ "attention_probs_dropout_prob": 0.0,
+ "bos_token_id": 0,
+ "eos_token_id": 2,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.0,
+ "hidden_size": 768,
+ "initializer_range": 0.02,
+ "intermediate_size": 3072,
+ "layer_norm_eps": 1e-05,
+ "max_position_embeddings": 514,
+ "model_type": "roberta",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 12,
+ "pad_token_id": 1,
+ "position_embedding_type": "absolute",
+ "transformers_version": "4.4.2",
+ "type_vocab_size": 1,
+ "use_cache": true,
+ "vocab_size": 50262
+ }
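These hyperparameters describe a standard RoBERTa-base architecture (12 layers, 12 attention heads, hidden size 768) with a 50,262-token vocabulary. As a quick sanity check (our sketch, not part of the repository), the shipped config can be loaded and inspected:

```python
# Sketch: load the config above and confirm its key hyperparameters.
from transformers import AutoConfig

config = AutoConfig.from_pretrained('PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
assert config.model_type == 'roberta'
assert (config.num_hidden_layers, config.hidden_size, config.vocab_size) == (12, 768, 50262)
# 514 = 512 usable positions plus RoBERTa's two reserved offset positions.
print(config.max_position_embeddings)
```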
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2bb8e5dc18acc10e570b09aabff6f66d0bc4012ee7af1d8eb33d3da0cdf37b75
+ size 499069583
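This entry is a Git LFS pointer rather than the weights themselves: the ~499 MB `pytorch_model.bin` is resolved on download, and the `oid` field gives its SHA-256. As an integrity check (our sketch, not part of the repository), a downloaded copy can be hashed against it:

```python
# Sketch: verify a downloaded pytorch_model.bin against the LFS pointer's SHA-256.
import hashlib

EXPECTED = "2bb8e5dc18acc10e570b09aabff6f66d0bc4012ee7af1d8eb33d3da0cdf37b75"

h = hashlib.sha256()
with open("pytorch_model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
        h.update(chunk)
print(h.hexdigest() == EXPECTED)  # True if the file matches the pointer
```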
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+ "bos_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "cls_token": {
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "mask_token": {
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
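These are the standard RoBERTa special tokens; note that `<mask>` sets `lstrip: true`, so the mask token absorbs a preceding space when tokenizing. A small sketch (ours, not part of the repository) to confirm the map after loading:

```python
# Sketch: inspect the special tokens declared in special_tokens_map.json.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
print(tok.special_tokens_map)          # {'bos_token': '<s>', 'eos_token': '</s>', ...}
print(tok.mask_token, tok.mask_token_id)
```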
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,66 @@
+ {
+ "add_prefix_space": false,
+ "bos_token": {
+ "__type": "AddedToken",
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "cls_token": {
+ "__type": "AddedToken",
+ "content": "<s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "eos_token": {
+ "__type": "AddedToken",
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "errors": "replace",
+ "mask_token": {
+ "__type": "AddedToken",
+ "content": "<mask>",
+ "lstrip": true,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "max_len": 512,
+ "model_max_length": 512,
+ "name_or_path": "./roberta-base-bne/",
+ "pad_token": {
+ "__type": "AddedToken",
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "__type": "AddedToken",
+ "content": "</s>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ },
+ "special_tokens_map_file": null,
+ "tokenizer_class": "RobertaTokenizer",
+ "trim_offsets": true,
+ "unk_token": {
+ "__type": "AddedToken",
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
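Two entries here matter in practice: `model_max_length` caps encoded inputs at 512 tokens (matching the 514 `max_position_embeddings` in config.json minus RoBERTa's two reserved positions), and `tokenizer_class` pins the byte-level BPE `RobertaTokenizer`. A short sketch (ours, not part of the repository) showing the length cap:

```python
# Sketch: long inputs must be truncated to the tokenizer's model_max_length.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('PeterPanecillo/PlanTL-GOB-ES-roberta-base-bne-copy')
print(tok.model_max_length)  # 512

ids = tok("palabra " * 1000, truncation=True, max_length=tok.model_max_length)["input_ids"]
print(len(ids))  # 512, including <s> and </s>
```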
vocab.json ADDED
The diff for this file is too large to render. See raw diff