Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)


FLOR-760M - bnb 4bits
- Model creator: https://huggingface.co/projecte-aina/
- Original model: https://huggingface.co/projecte-aina/FLOR-760M/

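The 4-bit weights can be loaded directly with `transformers` and `bitsandbytes`. Below is a minimal sketch, assuming the quantization config is stored with the checkpoint (usual for bnb exports) and a CUDA-capable GPU is available; the repo id is a placeholder for this repository's actual id, and the sampling settings mirror the original card's example rather than being recommendations.

```python
# Minimal sketch: load these bnb 4-bit weights with transformers.
# Assumes the quantization config is saved with the checkpoint and that
# bitsandbytes plus a CUDA GPU are available.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "RichardErkhov/FLOR-760M-4bits"  # placeholder: substitute this repo's actual id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",           # dispatches the quantized weights onto the GPU
    torch_dtype=torch.bfloat16,  # compute dtype for the non-quantized modules
)

prompt = "Sovint em trobo pensant en tot allò que"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
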
Original model description:
---
language:
- en
- es
- ca
license: apache-2.0
tags:
- FLOR
- bloom
- spanish
- catalan
- english
pipeline_tag: text-generation
widget:
- text: |-
    Respon a la pregunta següent.
    Pregunta: "Quina és la capital de Suècia?"
    Resposta: "La capital de Suècia és Estocolm."
    ----
    Respon a la pregunta següent.
    Pregunta: "Quina beguda es consumeix als matins per despertar-se?"
    Resposta: "La majoria de gent consumeix cafè per despertar-se."
    ----
    Respon a la pregunta següent.
    Pregunta: "Explica com funciona un motor de combustió"
    Resposta:
  example_title: Pregunta-Resposta
- text: |-
    Extrae las entidades nombradas del siguiente texto:
    Texto: "Me llamo Wolfgang y vivo en Berlin"
    Entidades: Wolfgang:PER, Berlin:LOC
    ----
    Extrae las entidades nombradas del siguiente texto:
    Texto: "Hoy voy a visitar el parc güell tras salir del barcelona supercomputing center"
    Entidades: parc güell:LOC, barcelona supercomputing center:LOC
    ----
    Extrae las entidades nombradas del siguiente texto:
    Texto: "Maria y Miguel no tienen ningún problema contigo"
    Entidades: Maria:PER, Miguel:PER
    ----
    Extrae las entidades nombradas del siguiente texto:
    Texto: "Damián se cortó el pelo"
    Entidades: Damián:PER
    ----
    Extrae las entidades nombradas del siguiente texto:
    Texto: "Lo mejor de Barcelona és el bar de mi amigo Pablo"
    Entidades: Pablo:PER, Barcelona:LOC
    ----
    Extrae las entidades nombradas del siguiente texto:
    Texto: "Carlos comparte piso con Marc"
    Entidades:
  example_title: Entidades-Nombradas
---

# FLOR-760M

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

**FLOR-760M** is a 760M-parameter transformer-based causal language model for Catalan, Spanish, and English.
It is the result of a language-adaptation technique applied to [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
which involves modifying the model's vocabulary and embedding layer, then continuing pre-training on 26B tokens in the target languages.

For more details, take a look at [this blogpost](https://medium.com/@mpamies247/flor-6-3b-a-chinchilla-compliant-model-for-catalan-spanish-and-english-7cdb389a9aac) about the project.

## Intended uses and limitations

The **FLOR-760M** model is ready to use only for causal language modeling.
It can perform text-generation tasks and be fine-tuned for specific scenarios.

## How to use
```python
import torch
from transformers import pipeline, AutoTokenizer

input_text = "Sovint em trobo pensant en tot allò que"

model_id = "projecte-aina/FLOR-760M"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a text-generation pipeline that loads the model in bfloat16
# and places it automatically on the available device(s).
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample a continuation, restricting each step to the 10 most likely tokens.
generation = generator(
    input_text,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)

print(f"Result: {generation[0]['generated_text']}")
```

## Limitations and bias
At the time of submission, no measures had been taken to estimate the bias and toxicity embedded in the model.
However, we are well aware that our models may be biased, since the corpora were collected by crawling multiple web sources.
We intend to conduct research in these areas in the future, and if completed, this model card will be updated.


## Training

### Language adaptation and training

The language-adaptation technique used to create FLOR-760M requires the vocabulary of the source model
to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This downsized the original BLOOM embedding layer and therefore compressed the model from 1.1B parameters to 760M.
2) The embeddings of tokens present in both the original and the target vocabulary (matching tokens) were used for initialization.
3) The embeddings of tokens not present in BLOOM's original vocabulary were initialized as the average of all embeddings (a sketch of this initialization follows the list).
4) The model was initialized with the weights from BLOOM-1.1B, together with the adapted tokenizer (step 1) and embeddings (steps 2-3).
5) The model was then trained on a corpus containing a mixture of Catalan, Spanish, and English data.
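
As a rough illustration of steps 2-3, the sketch below rebuilds an embedding matrix for the new vocabulary: matching tokens reuse their BLOOM embedding, and everything else starts from the mean embedding. This is an illustrative reconstruction, not the project's actual adaptation code; the `old_vocab`/`new_vocab` mappings are assumed to be token-string-to-id dictionaries.

```python
# Illustrative sketch of steps 2-3 (not the project's actual code).
import torch

def adapt_embeddings(old_embeddings: torch.Tensor,
                     old_vocab: dict[str, int],
                     new_vocab: dict[str, int]) -> torch.Tensor:
    """Build an embedding matrix for the new vocabulary."""
    # Default initialization: the average of all original embeddings.
    new_embeddings = old_embeddings.mean(dim=0).repeat(len(new_vocab), 1)
    # Matching tokens keep their original BLOOM embedding.
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_embeddings[new_id] = old_embeddings[old_id]
    return new_embeddings
```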

### Training data

The training corpus is the same one used to train [Ǎguila-7B](https://huggingface.co/projecte-aina/aguila-7b).
It consists of 26B tokens drawn from several corpora gathered from web crawls and public-domain data.

| Dataset | Language | Words (per epoch) | Epochs |
|---------------------|----------|--------------------|--------------|
| Wikipedia | en | 2169.97M | 1.428144485 |
| C4_es | es | 53709.80M | 0.1049686196 |
| Biomedical | es | 455.03M | 0.7140722425 |
| Legal | es | 995.70M | 0.7140722425 |
| Wikipedia | es | 693.60M | 1.428144485 |
| Gutenberg | es | 53.18M | 0.7140722425 |
| C4_ca | ca | 2826.00M | 2.142216727 |
| Biomedical | ca | 11.80M | 1.428144485 |
| RacoCatalà Noticias | ca | 17.16M | 2.142216727 |
| RacoCatalà Forums | ca | 333.73M | 2.142216727 |
| CaWaC | ca | 57.79M | 2.142216727 |
| Wikipedia | ca | 228.01M | 3.570361212 |
| Vilaweb | ca | 50.34M | 2.142216727 |

### Languages

The training data contains roughly equal amounts of Catalan and Spanish text, and a smaller amount of English.
The table below shows the final language distribution:

|Language|Percentage|
|--------|----------|
| English (EN) | 16.84% |
| Spanish (ES) | 41.38% |
| Catalan (CA) | 41.79% |

### Training hyperparameters
- seed: 1
- distributed_type: [WSE-2](https://www.cerebras.net/product-chip/)
- num_devices: 1
- train_batch_size: 60
- eval_batch_size: 60
- optimizer: AdamW
- betas: (0.9,0.95)
- epsilon: 1e-08
- weight_decay_rate: 0.1
- learning_rate:
  - scheduler: "Linear"
    initial_learning_rate: 0.0
    end_learning_rate: 4.1e-5
    steps: 3050
  - scheduler: "CosineDecay"
    initial_learning_rate: 4.1e-5
    end_learning_rate: 3.4e-6
    steps: 209133
  - scheduler: "Constant"
    learning_rate: 2.2e-6
- num_epochs: 1.0

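Read together, the three schedulers describe a piecewise schedule: a linear warmup from 0 to 4.1e-5 over 3,050 steps, a cosine decay down to 3.4e-6 over the next 209,133 steps, and a constant 2.2e-6 tail. The sketch below encodes that reading; how the phases chain together is our interpretation of the list above, not Cerebras' implementation.

```python
# Sketch of the learning-rate schedule as read from the hyperparameters above.
import math

WARMUP_STEPS = 3050   # Linear: 0.0 -> 4.1e-5
DECAY_STEPS = 209133  # CosineDecay: 4.1e-5 -> 3.4e-6

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warmup from 0.0 to the peak rate.
        return 4.1e-5 * step / WARMUP_STEPS
    step -= WARMUP_STEPS
    if step < DECAY_STEPS:
        # Cosine decay from the peak rate down to 3.4e-6.
        progress = step / DECAY_STEPS
        return 3.4e-6 + 0.5 * (4.1e-5 - 3.4e-6) * (1.0 + math.cos(math.pi * progress))
    # Constant tail.
    return 2.2e-6
```
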
### Framework versions
Training was conducted on a Cerebras [CS-2 system](https://www.cerebras.net/product-system/)
using the [cs-1.9.1](https://github.com/Cerebras/modelzoo/releases/tag/Release_1.9.1) release of its software.


## Evaluation
FLOR-760M has been evaluated in a 5-shot setting, using EleutherAI's LM Evaluation Harness, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.

The tasks were chosen to cover several evaluation areas and provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.

Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/FLOR-eval); an orientation example of invoking it follows the dataset list below.

The following is a list of evaluation areas and their respective datasets:
- Reading Comprehension: [Belebele](https://huggingface.co/datasets/facebook/belebele)
- Question Answering: [XQuAD](https://huggingface.co/datasets/xquad), [CatalanQA](https://huggingface.co/datasets/projecte-aina/catalanqa), [CoQCat](https://huggingface.co/datasets/projecte-aina/CoQCat)
- Natural Language Inference: [XNLI](https://huggingface.co/datasets/xnli) and its translation to Catalan ([XNLI-ca](https://huggingface.co/datasets/projecte-aina/xnli-ca)), [TE-ca](https://huggingface.co/datasets/projecte-aina/teca)
- Paraphrase Identification: [PAWS-X](https://huggingface.co/datasets/paws-x) and its translation to Catalan ([PAWS-ca](https://huggingface.co/datasets/projecte-aina/PAWS-ca)), [Parafraseja](https://huggingface.co/datasets/projecte-aina/Parafraseja)
- Commonsense Reasoning: [COPA](https://people.ict.usc.edu/~gordon/copa.html) and its translation to Catalan ([COPA-ca](https://huggingface.co/datasets/projecte-aina/COPA-ca)), plus XStoryCloze (reported in the tables below)
- Translation: [FLoRes](https://huggingface.co/datasets/flores)
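
For orientation, a 5-shot run through the harness's Python API might look like the sketch below. The task identifier is a placeholder: the FLOR-eval branch defines its own task registry, so consult it for the actual task names, and note that the `simple_evaluate` call shown follows the 2023-era upstream harness API.

```python
# Orientation sketch (not the exact FLOR-eval invocation): a 5-shot run
# using the 2023-era harness Python API.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=projecte-aina/FLOR-760M",
    tasks=["belebele_ca"],  # placeholder: check the FLOR-eval task registry
    num_fewshot=5,          # matches the 5-shot setting reported below
)
print(results["results"])
```
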
### Reading Comprehension and Question Answering

| Model | Belebele-ca | Belebele-es | Belebele-en | XQuAD-ca | XQuAD-es | XQuAD-en | CatalanQA | CoQCat |
| ------|:-----------:|:-----------:|:-----------:|:--------:|:--------:|:--------:|:---------:|:------:|
| Random | 25.00 | 25.00 | 25.00 | - | - | - | - | - |
| mGPT-1.3B | 26.64 | 25.82 | 28.07 | 0.33 | 0.67 | 0.17 | 0.65 | 0.78 |
| GPT-Neo-1.3B | 39.55 | 37.50 | 42.83 | 19.75 | 29.77 | 51.53 | 22.34 | 23.57 |
| Pythia-1.4B | 38.32 | 36.89 | 44.26 | 26.19 | 34.13 | 52.98 | 27.47 | 25.38 |
| OPT-1.3B | 35.86 | 37.09 | 45.49 | 23.53 | 31.85 | 52.95 | 26.58 | 20.18 |
| Falcon-rw-1.3B | 34.84 | 35.66 | **50.61** | 5.93 | 19.25 | **58.60** | 6.91 | 15.61 |
| Cerebras-GPT-1.3B | 32.79 | 31.76 | 35.04 | 8.56 | 19.98 | 36.00 | 10.87 | 14.12 |
| BLOOM-1.1B | 39.34 | 38.32 | 41.19 | 36.81 | 36.98 | 44.10 | 44.65 | 34.57 |
| FLOR-760M | **41.19** | **39.55** | 36.68 | **41.10** | **41.11** | 40.20 | **51.01** | **41.34** |


### Natural Language Inference and Paraphrase Identification

| Model | XNLI-ca | XNLI-es | XNLI-en | TECA-ca | PAWS-X-ca | PAWS-X-es | PAWS-X-en | Parafraseja |
| ------|:-------:|:-------:|:-------:|:-------:|:---------:|:---------:|:---------:|:-----------:|
| Random | 33.33 | 33.33 | 33.33 | 33.33 | 50.00 | 50.00 | 50.00 | 50.00 |
| mGPT-1.3B | 40.06 | 43.81 | 45.67 | 37.03 | 51.00 | 52.30 | 56.15 | 51.32 |
| GPT-Neo-1.3B | 41.44 | 45.57 | 49.92 | 35.38 | 54.65 | 53.40 | 54.60 | 51.70 |
| Pythia-1.4B | 42.46 | 45.61 | 51.00 | 37.46 | 54.15 | 52.50 | **57.70** | 55.23 |
| OPT-1.3B | 40.08 | 44.53 | **52.48** | 36.14 | 54.10 | 52.55 | 55.90 | 53.23 |
| Falcon-rw-1.3B | 34.53 | 35.85 | 45.73 | 34.96 | 54.25 | **54.05** | 53.65 | 50.60 |
| Cerebras-GPT-1.3B | 36.83 | 38.88 | 47.25 | 35.62 | 52.40 | 52.20 | 55.95 | 52.05 |
| BLOOM-1.1B | **47.19** | **46.39** | 49.44 | 41.38 | **55.05** | 54.05 | 54.75 | 55.65 |
| FLOR-760M | 46.93 | 46.03 | 46.11 | **42.14** | 52.35 | 52.50 | 54.85 | **56.55** |

251
+
252
+ ### Commonsense Reasoning and Translation
253
+
254
+ | Model | XStoryCloze-es | XStoryCloze-en | COPA-ca | COPA-en | FloRes (ca->es) | FloRes (es->ca) | FloRes (ca->en) | FloRes (en->ca) | FloRes (es->en) | FloRes (en->es) |
255
+ | ------|:--------------:|:--------------:|:-------:|:-------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|:---------------:|
256
+ Random | 50.00 | 50.00 | 50.00 | 50.00 | - | - | - | - | - | - |
257
+ mGPT-1.3B | 55.33 | 60.09 | 52.20 | 63.40 | 3.25 | 2.96 | 9.25 | 3.79 | 17.75 | 15.34 |
258
+ GPT-Neo-1.3B | 51.42 | 66.58 | 53.40 | 74.80 | 3.27 | 3.80 | 17.77 | 5.49 | 17.70 | 12.04 |
259
+ Pythia-1.4B | 54.14 | 68.37 | 52.20 | 78.60 | 9.68 | 5.74 | 24.03 | 11.10 | 21.50 | 15.04 |
260
+ OPT-1.3B | 53.94 | 69.95 | 52.60 | 76.20 | 3.14 | 3.52 | 15.39 | 2.00 | 16.33 | 6.53 |
261
+ Falcon-rw-1.3B | 51.09 | **71.34** | 52.40 | **79.60** | 3.03 | 3.59 | 8.89 | 3.01 | 14.17 | 6.50 |
262
+ Cerebras-GPT-1.3B | 49.11 | 60.62 | 51.40 | 66.80 | 2.42 | 1.81 | 2.69 | 0.82 | 3.36 | 1.77 |
263
+ BLOOM-1.1B | 57.91 | 62.48 | 62.80 | 66.40 | 21.62 | 15.28 | 31.16 | 21.28 | **20.92** | 16.84 |
264
+ FLOR-760M | **61.42** | 61.42 | **65.40** | 64.20 | **22.62** | **15.77** | **32.26** | **26.04** | 20.91 | **18.08** |
265
+

## Additional information

### Author
The Language Technologies Unit at the Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <langtech@bsc.es>.

### Copyright
Copyright (c) 2023 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](<https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)>) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose and is made available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it),
or become users of the model themselves, they should note that it is their responsibility to mitigate the risks arising from its use and,
in any event, to comply with applicable regulations, including those regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (Barcelona Supercomputing Center)
be liable for any results arising from the use made of it by third parties.

</details>