---
license: apache-2.0
pipeline_tag: text-generation
language:
- it
- en
tags:
- pretrained
datasets:
- uonlp/CulturaX
- HuggingFaceFW/fineweb
- togethercomputer/RedPajama-Data-V2
- bigcode/the-stack-v2
inference:
  parameters:
    temperature: 0.5
    do_sample: true
widget:
- text: 'La capitale dell''Italia è '
  example_title: Example 1
- text: 'Nel mezzo del cammin di nostra vita '
  example_title: Example 2
- text: 'Una cena senza vino è come '
  example_title: Example 3
---

<div style="text-align: center; display: flex; flex-direction: column; align-items: center;">
<img src="https://huggingface.co/sapienzanlp/Minerva-7B-instruct-v1.0/resolve/main/minerva-logo.png" style="max-width: 550px; height: auto;">
</div>

# Model Card for Minerva-7B-base-v1.0

Minerva is the first family of **LLMs pretrained from scratch on Italian**, developed by [Sapienza NLP](https://nlp.uniroma1.it)
in collaboration with [Future Artificial Intelligence Research (FAIR)](https://fondazione-fair.it/) and [CINECA](https://www.cineca.it/).
Notably, the Minerva models are truly open (data and model) Italian-English LLMs, with approximately half of the pretraining data
consisting of Italian text.

* [Minerva LLMs - website](https://nlp.uniroma1.it/minerva/)

## Description

This is the model card for **Minerva-7B-base-v1.0**, a 7-billion-parameter model trained on almost 2.5 trillion tokens (1.14 trillion in Italian,
1.14 trillion in English, and 200 billion in code).

This model is part of the Minerva LLM family:

* [Minerva-350M-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0)
* [Minerva-1B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-1B-base-v1.0)
* [Minerva-3B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0)
* [Minerva-7B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0)
* [Minerva-7B-instruct-v1.0](https://huggingface.co/sapienzanlp/Minerva-7B-instruct-v1.0)

## 🚨⚠️🚨 Bias, Risks, and Limitations 🚨⚠️🚨

*This section identifies foreseeable harms and misunderstandings.*

This is a foundation model that has not undergone alignment. The model may:

- Overrepresent some viewpoints and underrepresent others
- Contain stereotypes
- Contain [personal information](#personal-data-and-information)
- Generate:
    - Racist and sexist content
    - Hateful, abusive, or violent language
    - Discriminatory or prejudicial language
    - Content that may not be appropriate for all settings, including sexual content
- Make errors, including presenting incorrect information or historical claims as if they were factual
- Generate irrelevant or repetitive outputs

We are aware of the biases and potentially problematic or toxic content that current pretrained large language models exhibit: as probabilistic models of language (here, Italian and English), they reflect and amplify the biases of their training data.
For more information about this issue, please refer to our survey:

* [Biases in Large Language Models: Origins, Inventory, and Discussion](https://dl.acm.org/doi/full/10.1145/3597307)

## How to use Minerva with Hugging Face transformers

```python
import transformers
import torch

model_id = "sapienzanlp/Minerva-7B-base-v1.0"

# Initialize the text-generation pipeline, loading the weights in bfloat16.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Input text for the model.
input_text = "La capitale dell'Italia è"

# Generate up to 128 new tokens that continue the input text.
output = pipeline(
    input_text,
    max_new_tokens=128,
)

print(output)
```

Example output:

`[{'generated_text': "La capitale dell'Italia è la città di Roma, che si trova a [...]"}]`
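
The call above does not set any sampling parameters explicitly. The metadata at the top of this card configures the hosted inference widget with `temperature: 0.5` and `do_sample: true`; the sketch below shows one way to pass the same values locally. The helper `generate_sampled` and the constant `SAMPLING_KWARGS` are illustrative names, not part of the Minerva release, and the heavy imports are deferred so the settings can be inspected without loading the model.

```python
# Sampling settings copied from the `inference.parameters` metadata of this card.
SAMPLING_KWARGS = {"do_sample": True, "temperature": 0.5}


def generate_sampled(prompt: str, max_new_tokens: int = 128) -> str:
    """Generate a sampled continuation of `prompt` (hypothetical helper)."""
    # Imported lazily: loading the 7B checkpoint needs substantial memory.
    import torch
    import transformers

    pipe = transformers.pipeline(
        "text-generation",
        model="sapienzanlp/Minerva-7B-base-v1.0",
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )
    # Generation keyword arguments are forwarded to the model's generate call.
    outputs = pipe(prompt, max_new_tokens=max_new_tokens, **SAMPLING_KWARGS)
    return outputs[0]["generated_text"]
```

Calling `generate_sampled("Una cena senza vino è come ")` returns the prompt followed by a sampled continuation. Because sampling is enabled, repeated calls will generally produce different text.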

## Model Architecture

Minerva-7B-base-v1.0 is a Transformer model based on the Mistral architecture.
Please look at the configuration file for a detailed breakdown of the hyperparameters we chose for this model.

The Minerva LLM family is composed of:

| Model Name | Tokens | Layers | Hidden Size | Attention Heads | KV Heads | Sliding Window | Max Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Minerva-350M-base-v1.0 | 70B (35B it + 35B en) | 16 | 1152 | 16 | 4 | 2048 | 16384 |
| Minerva-1B-base-v1.0 | 200B (100B it + 100B en) | 16 | 2048 | 16 | 4 | 2048 | 16384 |
| Minerva-3B-base-v1.0 | 660B (330B it + 330B en) | 32 | 2560 | 32 | 8 | 2048 | 16384 |
| Minerva-7B-base-v1.0 | 2.48T (1.14T it + 1.14T en + 200B code) | 32 | 4096 | 32 | 8 | None | 4096 |
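
The attention settings in this table determine how much memory the key-value cache needs at inference time. As a rough sketch, assuming the head dimension equals the hidden size divided by the number of attention heads (the usual convention for Mistral-style models, but not stated in this card) and 2 bytes per value as with bfloat16, the cache for one full-length sequence can be estimated as follows. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(layers, hidden_size, attention_heads, kv_heads,
                   context_length, bytes_per_value=2):
    """Estimate the key-value cache size in bytes for a single sequence."""
    # Assumption: head_dim = hidden_size / attention_heads.
    head_dim = hidden_size // attention_heads
    # Keys and values (factor 2) are stored per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * context_length * bytes_per_value


# Minerva-7B-base-v1.0 row: 32 layers, hidden size 4096, 32 heads, 8 KV heads, 4096 context.
minerva_7b = kv_cache_bytes(32, 4096, 32, 8, 4096)
# The same shape with one KV head per attention head (no grouped-query attention).
full_mha = kv_cache_bytes(32, 4096, 32, 32, 4096)

print(f"8 KV heads:  {minerva_7b / 2**20:.0f} MiB")
print(f"32 KV heads: {full_mha / 2**20:.0f} MiB")
```

Under these assumptions, using 8 KV heads instead of 32 reduces the cache by a factor of four.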

## Model Training

Minerva-7B-base-v1.0 was trained using [llm-foundry 0.8.0](https://github.com/riccorl/llm-foundry) from [MosaicML](https://mosaicml.com/). The hyperparameters used are the following:

| Model Name | Optimizer | lr | betas | eps | weight decay | Scheduler | Warmup Steps | Batch Size (Tokens) | Total Steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Minerva-350M-base-v1.0 | Decoupled AdamW | 2e-4 | (0.9, 0.95) | 1e-8 | 0.0 | Cosine | 2% | 4M | 16,690 |
| Minerva-1B-base-v1.0 | Decoupled AdamW | 2e-4 | (0.9, 0.95) | 1e-8 | 0.0 | Cosine | 2% | 4M | 47,684 |
| Minerva-3B-base-v1.0 | Decoupled AdamW | 2e-4 | (0.9, 0.95) | 1e-8 | 0.0 | Cosine | 2% | 4M | 157,357 |
| Minerva-7B-base-v1.0 | AdamW | 3e-4 | (0.9, 0.95) | 1e-5 | 0.1 | Cosine | 2000 | 4M | 591,558 |
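
The step counts in this table can be cross-checked against the token budgets listed under Model Architecture. The sketch below assumes that a batch size of "4M" means 2^22 = 4,194,304 tokens per step; that reading is an assumption, chosen because it reproduces the stated totals.

```python
# Assumption: "4M" tokens per batch is 2**22 = 4,194,304 tokens.
BATCH_TOKENS = 2**22

# Total optimizer steps, copied from the table above.
TOTAL_STEPS = {
    "Minerva-350M-base-v1.0": 16_690,
    "Minerva-1B-base-v1.0": 47_684,
    "Minerva-3B-base-v1.0": 157_357,
    "Minerva-7B-base-v1.0": 591_558,
}

# Tokens seen during training = steps * tokens per step.
tokens_seen = {name: steps * BATCH_TOKENS for name, steps in TOTAL_STEPS.items()}

for name, tokens in tokens_seen.items():
    print(f"{name}: {tokens / 1e9:.1f}B tokens")
```

With this assumption the products come to roughly 70B, 200B, 660B, and 2.48T tokens, in line with the Tokens column of the architecture table.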

## Model Evaluation

For Minerva's evaluation process, we used [ITA-Bench](https://huggingface.co/collections/sapienzanlp/ita-bench-italian-benchmarks-for-llms-66337ca59e6df7d7d4933896), a new evaluation suite to test the capabilities of Italian-speaking models.
ITA-Bench is a collection of 18 benchmarks that assess the performance of language models on various tasks, including scientific knowledge,
commonsense reasoning, and mathematical problem-solving.

<div style="display: flex; justify-content: space-around;">
<img src="https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0/resolve/main/Minerva%20LLMs%20Results%20Base%20Models.png" alt="Results on base models" style="width: 45%;">
<img src="https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0/resolve/main/Minerva%20LLMs%20Results%20All%20Base%20Models.png" alt="Results on all base models" style="width: 45%;">
</div>

<!-- **Italian** Data: -->
<!-- | Task | Accuracy |
| --- | --- | -->
<!-- | [xcopa](https://huggingface.co/datasets/xcopa) (0-shot) | 0.694 |
| [Hellaswag](https://huggingface.co/datasets/alexandrainst/m_hellaswag) (5-shot) | 0.5293 |
| [Belebele](https://huggingface.co/datasets/facebook/belebele) (5-shot) | 0.2333 |
| [TruthfulQA MC 1](https://huggingface.co/datasets/alexandrainst/m_truthfulqa) (0-shot) | 0.2363 |
| [TruthfulQA MC 2](https://huggingface.co/datasets/alexandrainst/m_truthfulqa) (0-shot) | 0.3731 |
| [M MMLU](https://huggingface.co/datasets/alexandrainst/m_mmlu) (5-shot) | 0.2612 |
| [arc challenge](https://huggingface.co/datasets/alexandrainst/m_arc) (5-shot) | 0.3268 | -->

<!-- **English** Data: -->
<!-- | Task | Accuracy |
| --- | --- | -->
<!-- | [Hellaswag](https://huggingface.co/datasets/Rowan/hellaswag) (5-shot) | 0.6168 |
| [piqa](https://huggingface.co/datasets/piqa) (5-shot) | 0.7535 |
| [sciq](https://huggingface.co/datasets/sciq) (5-shot) | 0.925 |
| [Belebele](https://huggingface.co/datasets/facebook/belebele) (5-shot) | 0.2278 |
| [TruthfulQA MC 1](https://huggingface.co/datasets/truthful_qa) (0-shot) | 0.2142 |
| [TruthfulQA MC 2](https://huggingface.co/datasets/truthful_qa) (0-shot) | 0.3643 |
| [M MMLU](https://huggingface.co/datasets/alexandrainst/m_mmlu) (5-shot) | 0.263 |
| [arc challenge](allenai/ai2_arc) (5-shot) | 0.3319 |
| [arc easy](allenai/ai2_arc) (5-shot) | 0.6540 | -->

## Training Data

Minerva-7B-base-v1.0 is trained on 1.14T Italian tokens, 1.14T English tokens, and 200B code tokens.

The training data is a mixture of the following datasets:

| Dataset | Tokens | Language | Epochs |
| --- | --- | --- | --- |
| RedPajama-Data-V2 | 687,952,502,784 | Italian | 1.3 |
| CulturaX | 158,201,876,480 | Italian | 1.5 |
| Wikipedia | 1,265,135,616 | Italian | 1.0 |
| Gutenberg/Wikisource | 147,017,728 | Italian | 2.0 |
| EurLex | 1,647,013,888 | Italian | 1.0 |
| Gazzetta Ufficiale | 1,654,013,952 | Italian | 1.0 |
| FineWeb | 1,076,406,624,256 | English | 1.0 |
| Wikipedia | 5,259,501,568 | English | 1.0 |
| ArXiv | 33,231,106,048 | English | 1.0 |
| Gutenberg | 6,947,893,248 | English | 1.0 |
| StackExchange | 22,069,268,480 | English | 1.0 |
| The Stack V2 | 200,754,900,992 | Code | 1.0 |
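
The per-language totals quoted in this section can be recovered from the table. The sketch below assumes the Tokens column counts unique tokens per dataset and that each dataset contributes tokens times epochs to training; the card does not state this explicitly, so treat it as one plausible reading.

```python
from collections import defaultdict

# (dataset, tokens, language, epochs), copied from the table above.
DATA_MIX = [
    ("RedPajama-Data-V2", 687_952_502_784, "Italian", 1.3),
    ("CulturaX", 158_201_876_480, "Italian", 1.5),
    ("Wikipedia", 1_265_135_616, "Italian", 1.0),
    ("Gutenberg/Wikisource", 147_017_728, "Italian", 2.0),
    ("EurLex", 1_647_013_888, "Italian", 1.0),
    ("Gazzetta Ufficiale", 1_654_013_952, "Italian", 1.0),
    ("FineWeb", 1_076_406_624_256, "English", 1.0),
    ("Wikipedia", 5_259_501_568, "English", 1.0),
    ("ArXiv", 33_231_106_048, "English", 1.0),
    ("Gutenberg", 6_947_893_248, "English", 1.0),
    ("StackExchange", 22_069_268_480, "English", 1.0),
    ("The Stack V2", 200_754_900_992, "Code", 1.0),
]

# Assumption: tokens seen in training = unique tokens * epochs.
seen_by_language = defaultdict(float)
for _dataset, tokens, language, epochs in DATA_MIX:
    seen_by_language[language] += tokens * epochs

total_seen = sum(seen_by_language.values())
for language, tokens in seen_by_language.items():
    print(f"{language}: {tokens / 1e12:.2f}T ({tokens / total_seen:.1%} of the mix)")
print(f"Total: {total_seen / 1e12:.2f}T")
```

Under this reading the totals come to about 1.14T Italian, 1.14T English, and 0.20T code tokens, or roughly 2.48T overall.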

<!-- We have extracted some statistics on Italian (115B tokens) and English (210B tokens) documents from CulturaX on the selected sources:

*Proportion of number of tokens per domain (Italian)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_it.png?raw=true" alt="italian-tok-counts" border="0" width="1800px">

*Proportion of number of tokens per domain (English)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_en.png?raw=true" alt="english-tok-counts" border="0" width="1800px">
-->

## Tokenizer Fertility

Tokenizer fertility measures the average number of tokens produced per tokenized word.
A tokenizer with high fertility in a particular language typically segments words in that language extensively.
Fertility also affects the model's inference speed in that language:
higher values mean longer token sequences to generate and thus slower inference.
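
As a minimal sketch of this definition, fertility can be computed as the total number of tokens divided by the total number of words. The whitespace word splitting and the toy `chunk_tokenize` function (which cuts each word into pieces of at most four characters) are stand-ins chosen for illustration; they are not the tokenizer or the word segmentation used to produce the figures in this card.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_words = 0
    total_tokens = 0
    for text in texts:
        for word in text.split():
            total_words += 1
            total_tokens += len(tokenize(word))
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def chunk_tokenize(word: str, size: int = 4) -> List[str]:
    """Toy tokenizer (hypothetical): split a word into fixed-size chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


sample = ["Nel mezzo del cammin di nostra vita"]
print(fertility(sample, chunk_tokenize))
```

With a real subword tokenizer, `tokenize` would wrap the tokenizer's own method. Tokenizing whole sentences rather than isolated words may give slightly different counts, so this per-word loop is only an approximation.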

**Fertility computed over a sample of CulturaX (CX) data and Wikipedia (Wp):**

| Model | Voc. Size | Fertility IT (CX) | Fertility EN (CX) | Fertility IT (Wp) | Fertility EN (Wp) |
| --- | --- | --- | --- | --- | --- |
| Mistral-7B-v0.1 | 32000 | 1.87 | 1.32 | 2.05 | 1.57 |
| gemma-7b | 256000 | 1.42 | 1.18 | 1.56 | 1.34 |
| Minerva-3B-base-v1.0 | 32768 | 1.39 | 1.32 | 1.66 | 1.59 |
| Minerva-7B-base-v1.0 | 51200 | 1.32 | 1.26 | 1.56 | 1.51 |
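
Because fertility is tokens per word, the ratio of two fertility values approximates how many tokens one tokenizer needs relative to another for the same text. The short calculation below applies this to the Italian CulturaX column; it is illustrative arithmetic over the table values, not an additional measurement.

```python
# Italian fertility on the CulturaX sample, copied from the table above.
FERTILITY_IT_CX = {
    "Mistral-7B-v0.1": 1.87,
    "gemma-7b": 1.42,
    "Minerva-3B-base-v1.0": 1.39,
    "Minerva-7B-base-v1.0": 1.32,
}

reference = FERTILITY_IT_CX["Minerva-7B-base-v1.0"]
# Relative token count of each tokenizer for the same Italian text.
relative_tokens = {name: value / reference for name, value in FERTILITY_IT_CX.items()}

for name, ratio in relative_tokens.items():
    print(f"{name}: {ratio:.2f}x the tokens of Minerva-7B-base-v1.0")
```

For example, under this estimate Mistral-7B-v0.1 needs about 1.42 times as many tokens as Minerva-7B-base-v1.0 for the same Italian text.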

## Notice

Minerva-7B-base-v1.0 is a pretrained base model and, therefore, has no moderation mechanisms.

## The Sapienza NLP Team

* **Riccardo Orlando:** data preprocessing, model training
* **Pere-Lluis Huguet Cabot:** data preprocessing, vocabulary, evaluation
* **Luca Moroni:** data curation, data analysis, downstream tasks, evaluation
* **Simone Conia:** data curation, evaluation, project supervision
* **Edoardo Barba:** data preprocessing, downstream tasks, project supervision
* **Roberto Navigli:** project lead and coordination

### Special thanks for their support

* Giuseppe Fiameni, Nvidia
* Sergio Orlandini, CINECA

## Acknowledgments

This work was funded by the PNRR MUR project [PE0000013-FAIR](https://fondazione-fair.it) and the [CREATIVE](https://nlp.uniroma1.it/creative/) PRIN project, which is funded by the MUR Progetti di
Rilevante Interesse Nazionale programme (PRIN 2020).
We acknowledge the [CINECA](https://www.cineca.it) award "IscB_medit" under the ISCRA initiative for the availability of high-performance computing resources and support.