license:
- apache-2.0
tags:
- FLOR
- bloom
- spanish
- catalan
  example_title: Entidades-Nombradas
---

# FLOR-760M

## Table of Contents
<details>

## Model description

**FLOR-760M** is a 760M-parameter transformer-based causal language model for Catalan, Spanish, and English.
It is the result of a language adaptation technique performed on [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
which involves modifying the model's vocabulary and embedding layer and continuously pre-training the model with 26B tokens in our target languages.

This model has been developed as part of a scientific research submitted to [LRE
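
The parameter reduction implied by swapping the vocabulary and embedding layer can be sanity-checked with quick arithmetic. A rough sketch, using BLOOM-1.1B's published hidden size and vocabulary; the target vocabulary size below is illustrative, not the documented FLOR value:

```python
# Rough parameter arithmetic for the vocabulary swap described above.
# hidden_size and source_vocab are BLOOM-1.1B's published figures;
# target_vocab is an assumed, illustrative value.
hidden_size = 1536             # BLOOM-1.1B hidden dimension
source_vocab = 250_880         # original BLOOM vocabulary size
target_vocab = 50_257          # hypothetical target vocabulary size
source_params = 1_065_000_000  # approx. BLOOM-1.1B total parameters

removed = (source_vocab - target_vocab) * hidden_size
print(f"embedding parameters removed: ~{removed / 1e6:.0f}M")
print(f"resulting model size: ~{(source_params - removed) / 1e6:.0f}M")
```

Under these assumptions the vocabulary swap alone removes roughly 300M embedding parameters, consistent with a ~1.1B-parameter source model shrinking to roughly 760M.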

## Intended uses and limitations

The **FLOR-760M** model is ready-to-use only for causal language modeling.
It can perform text-generation tasks and be fine-tuned for specific scenarios.

## How to use

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

input_text = "Sovint em trobo pensant en tot allò que"

model_id = "BSC-LT/FLOR-760M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
)
generation = generator(input_text, max_new_tokens=50)
print(generation[0]["generated_text"])
```

### Language adaptation and training

The language adaptation technique used to create FLOR-760M requires the vocabulary of the source model
to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.1B parameters to 760M.
2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
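
The matching-token initialization in step 2 can be sketched as follows. This is an illustrative reimplementation with hypothetical names and dict-based vocabularies, not the actual training code:

```python
import numpy as np

def init_target_embeddings(source_emb, source_vocab, target_vocab, std=0.02, seed=0):
    """Build a target embedding matrix: rows for tokens present in both
    vocabularies are copied from the source model; the remaining rows are
    drawn from a small-variance normal distribution."""
    rng = np.random.default_rng(seed)
    dim = source_emb.shape[1]
    target_emb = rng.normal(0.0, std, size=(len(target_vocab), dim))
    for token, tgt_id in target_vocab.items():
        src_id = source_vocab.get(token)
        if src_id is not None:                      # matching token
            target_emb[tgt_id] = source_emb[src_id]
    return target_emb
```

Reusing the source rows for shared tokens gives the adapted model a warm start, so continued pre-training only has to learn the genuinely new vocabulary entries from scratch.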

## Evaluation

FLOR-760M has been evaluated in a 5-shot setting, using EleutherAI's *LM Evaluation Harness* implementation, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.

The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.
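
In an n-shot setup, each test query is preceded by n solved examples in the prompt. A minimal illustration of how a 5-shot prompt is assembled; the helper and the Q/A template are hypothetical, the harness uses task-specific formats:

```python
def build_few_shot_prompt(shots, query, k=5):
    """Concatenate k solved (question, answer) pairs before the query."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots[:k]]
    parts.append(f"Q: {query}\nA:")  # model is scored on the continuation
    return "\n\n".join(parts)
```

The model is then scored on its continuation (or on the likelihood of candidate answers) after the final `A:`.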

Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/FLOR-eval).

The following is a list of evaluation areas and their respective datasets:
- Reading Comprehension: [Belebele](https://huggingface.co/datasets/facebook/belebele)
- Translation: [FLoRes](https://huggingface.co/datasets/flores)








## Additional information