license:
- apache-2.0
tags:
- FLOR
- bloom
- spanish
- catalan
  example_title: Entidades-Nombradas
---

# FLOR-760M

## Table of Contents
<details>

## Model description

**FLOR-760M** is a 760M-parameter transformer-based causal language model for Catalan, Spanish, and English.
It is the result of a language adaptation technique performed on [BLOOM-1.1B](https://huggingface.co/bigscience/bloom-1b1),
which involves modifying the model's vocabulary and embedding layer and continuously pre-training the model with 26B tokens in our target languages.

This model has been developed as part of a scientific research submitted to [LRE
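
The parameter reduction implied by swapping the vocabulary and embedding layer can be sanity-checked with quick arithmetic. A rough sketch, using BLOOM-1.1B's published hidden size and vocabulary; the target vocabulary size below is illustrative, not the documented FLOR value:

```python
# Rough parameter arithmetic for the vocabulary swap described above.
# hidden_size and source_vocab are BLOOM-1.1B's published figures;
# target_vocab is an assumed, illustrative value.
hidden_size = 1536             # BLOOM-1.1B hidden dimension
source_vocab = 250_880         # original BLOOM vocabulary size
target_vocab = 50_257          # hypothetical target vocabulary size
source_params = 1_065_000_000  # approx. BLOOM-1.1B total parameters

removed = (source_vocab - target_vocab) * hidden_size
print(f"embedding parameters removed: ~{removed / 1e6:.0f}M")
print(f"resulting model size: ~{(source_params - removed) / 1e6:.0f}M")
```

Under these assumptions the vocabulary swap alone removes roughly 300M embedding parameters, consistent with a ~1.1B-parameter source model shrinking to roughly 760M.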

## Intended uses and limitations

The **FLOR-760M** model is ready-to-use only for causal language modeling.
It can perform text-generation tasks and be fine-tuned for specific scenarios.

## How to use

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

input_text = "Sovint em trobo pensant en tot allò que"

model_id = "BSC-LT/FLOR-760M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
)
generation = generator(input_text, max_new_tokens=50)
print(generation[0]["generated_text"])
```

### Language adaptation and training

The language adaptation technique used to create FLOR-760M requires the vocabulary of the source model
to be adapted before continuing its pre-training with data in the target languages. Specifically, we proceeded as follows:
1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of the original BLOOM's embedding layer and, therefore, a model compression from 1.1B parameters to 760M.
2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
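
The matching-token initialization in step 2 can be sketched as follows. This is an illustrative reimplementation with hypothetical names and dict-based vocabularies, not the actual training code:

```python
import numpy as np

def init_target_embeddings(source_emb, source_vocab, target_vocab, std=0.02, seed=0):
    """Build a target embedding matrix: rows for tokens present in both
    vocabularies are copied from the source model; the remaining rows are
    drawn from a small-variance normal distribution."""
    rng = np.random.default_rng(seed)
    dim = source_emb.shape[1]
    target_emb = rng.normal(0.0, std, size=(len(target_vocab), dim))
    for token, tgt_id in target_vocab.items():
        src_id = source_vocab.get(token)
        if src_id is not None:                      # matching token
            target_emb[tgt_id] = source_emb[src_id]
    return target_emb
```

Reusing the source rows for shared tokens gives the adapted model a warm start, so continued pre-training only has to learn the genuinely new vocabulary entries from scratch.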

## Evaluation

FLOR-760M has been evaluated in a 5-shot setting, using EleutherAI's *LM Evaluation Harness* implementation, on several datasets in Catalan, Spanish, and English, with particular emphasis on Catalan datasets.

The tasks were chosen to cover several evaluation areas in order to provide a comprehensive overview of the model's capabilities. The baselines used to compare our results are multilingual and English open-source 1.3B models: mGPT-1.3B, GPT-Neo-1.3B, Pythia-1.4B, OPT-1.3B, Falcon-rw-1.3B, and Cerebras-GPT-1.3B.
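
In an n-shot setup, each test query is preceded by n solved examples in the prompt. A minimal illustration of how a 5-shot prompt is assembled; the helper and the Q/A template are hypothetical, the harness uses task-specific formats:

```python
def build_few_shot_prompt(shots, query, k=5):
    """Concatenate k solved (question, answer) pairs before the query."""
    parts = [f"Q: {q}\nA: {a}" for q, a in shots[:k]]
    parts.append(f"Q: {query}\nA:")  # model is scored on the continuation
    return "\n\n".join(parts)
```

The model is then scored on its continuation (or on the likelihood of candidate answers) after the final `A:`.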

Our implementation of EleutherAI's *LM Evaluation Harness* can be found [here](https://github.com/langtech-bsc/lm-evaluation-harness/tree/FLOR-eval).

The following is a list of evaluation areas and their respective datasets:
- Reading Comprehension: [Belebele](https://huggingface.co/datasets/facebook/belebele)
- Translation: [FLoRes](https://huggingface.co/datasets/flores)








## Additional information