| | --- |
| | language: |
| | - nl |
| | license: llama2 |
| | --- |
| | |
| | <p align="center" style="margin:0;padding:0"> |
| | <img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left: auto; margin-right: auto; display: block;"/> |
| | </p> |
| | <div style="margin:auto; text-align:center"> |
| | <h1 style="margin-bottom: 0">ChocoLlama</h1> |
| | <em>A Llama-2/3-based family of Dutch language models</em> |
| | </div> |
| |
|
| | ## ChocoLlama-2-7B-base: Getting Started |
| |
|
| | We present **ChocoLlama-2-7B-base**, a language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104 GB) using LoRA. |
| | Note that this is a base model, not optimized for conversational behavior. |
| | If your use case requires conversational behavior, we recommend fine-tuning this model on your own Dutch data or using the instruction-tuned version of this model, [ChocoLlama-2-7B-instruct](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct). |
| |
|
| | Use the code below to get started with the model. |
| |
|
| | ```python |
| | from transformers import AutoModelForCausalLM, AutoTokenizer |
| | |
| | tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base') |
| | model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base') |
| | ``` |
| |
|
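As a minimal sketch of text completion with the base model, the helper below wraps loading and generation in one function. The name `complete_dutch`, the default `max_new_tokens`, and the choice of greedy decoding are illustrative assumptions, not settings recommended by the authors. It assumes `transformers` with a PyTorch backend is installed; the first call downloads the full 7B checkpoint.

```python
def complete_dutch(prompt, model_id="ChocoLlama/ChocoLlama-2-7B-base", max_new_tokens=50):
    """Continue a Dutch prompt with the base model (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the library.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize the prompt into PyTorch tensors.
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding (an assumption): deterministic, no sampling.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode the full sequence (prompt plus continuation) back to text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `complete_dutch("De hoofdstad van België is")` returns the prompt followed by the model's continuation. Because this is a base model, phrase the input as text to be continued rather than as an instruction.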
| | ## Model Details |
| |
|
| | ChocoLlama is a family of open LLMs specifically adapted to Dutch, advancing the state of the art for open Dutch LLMs in their weight class. |
| |
|
| | We provide six variants (three base models and three instruction-tuned models): |
| | - **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104 GB) using LoRA. |
| | - **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO. |
| | - **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA. |
| | - **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO. |
| | - **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA. |
| | - **Llama-3-ChocoLlama-8B-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO. |
| |
|
| | Benchmark results for all models, including comparisons with their base models and other Dutch LLMs, are reported in [our paper](https://arxiv.org/pdf/2412.07633). |
| |
|
| | ### Model Description |
| |
|
| | - **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe) |
| | - **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA A100-80GB) |
| | - **Language(s):** Dutch |
| | - **License:** [Llama-2 Community License](https://ai.meta.com/llama/license/) |
| | - **Finetuned from model:** [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) |
| |
|
| | ### Model Sources |
| |
|
| | - **Repository:** [on GitHub](https://github.com/ChocoLlamaModel/ChocoLlama) |
| | - **Paper:** [on arXiv](https://arxiv.org/pdf/2412.07633) |
| |
|
| | ## Uses |
| |
|
| | ### Direct Use |
| |
|
| | Since this is a base model, we do not recommend using it for your use-cases directly. We instead recommend: |
| | 1. Fine-tuning this model to your specific use-case |
| | 2. Leveraging the instruction-tuned version of this model |
| |
|
| | ### Downstream Use |
| |
|
| | Since this is a base model, it can easily be adapted to specific use-cases that require Dutch language understanding and generation. |
| | We expect this model to be particularly useful for use-cases in the domains which were explicitly covered in our dataset, e.g. the analysis and/or generation of Dutch job descriptions, corporate filings and legislation. |
| |
|
| | ### Out-of-Scope Use |
| |
|
| | - Use-cases requiring a chat-style interface: since this is a base model, it cannot be used reliably for turn-based chat interaction. Please refer to the instruction-tuned version of this model instead. |
| | - Use-cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in languages other than Dutch, hence we expect significant catastrophic forgetting to have occurred for English, which is the language Llama-2 was originally trained on. |
| |
|
| | ## Bias, Risks, and Limitations |
| |
|
| | We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators. |
| | However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content. |
| |
|
| | ### Recommendations |
| |
|
| | We recommend fine-tuning this model on your own curated data to reduce undesirable outputs as much as possible. |
| |
|
| | ## Training Details |
| |
|
| | ### Training Data |
| |
|
| | We collected a diverse set of Dutch natural-language text from the following sources. |
| |
|
| | 1. **OSCAR** |
| | The bulk of our data comes from the Dutch portion of [OSCAR](https://oscar-corpus.com), January 2023 version, based on Common Crawl. This dataset includes **93 GB** of text (~28.6B tokens). |
| |
|
| | 2. **Open Subtitles** |
| | We collected Dutch text from movie subtitles, focusing on unique movies either in Dutch or with Dutch subtitles. This dataset contains **5 GB** of text (~1.54B tokens) from **214k samples**. |
| |
|
| | 3. **Project Gutenberg** |
| | We downloaded **970 full Dutch books** from [Project Gutenberg](https://www.gutenberg.org) using a public scraper. The dataset includes **0.3 GB** of text (~92M tokens) and is available on [Hugging Face](https://huggingface.co/datasets/ChocoLlama/gutenberg-dutch). |
| |
|
| | 4. **Wikipedia** |
| | Using the March 2023 [Wikipedia dump](https://dumps.wikimedia.org), we included **2.5 GB** of text (~769M tokens). Despite some duplication with OSCAR, Wikipedia's high quality justifies its inclusion. |
| |
|
| | 5. **Job Descriptions (TechWolf)** |
| | A sample of **750k Dutch job descriptions** collected over five years from public websites, provided by TechWolf. This dataset contains **1.5 GB** of text (~462M tokens). |
| |
|
| | 6. **Staatsblad (Bizzy)** |
| | A sample of **80k legal filings** from [Het Belgisch Staatsblad](https://www.ejustice.just.fgov.be/cgi/welcome.pl). Documents were OCR-processed, and personal data was excluded. This dataset includes **1.4 GB** of text (~431M tokens), collected with help from Bizzy. |
| |
|
| | 7. **Legislation (ML6)** |
| | **15k documents** from Flemish legislation accessed via the [Open Data API](https://www.vlaanderen.be/vlaams-parlement/de-vlaamse-codex). This dataset contains **0.2 GB** of text (~62M tokens), collected with support from ML6. |
| |
|
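As a quick consistency check, the per-source figures listed above can be summed and compared with the headline numbers (104 GB, 32B tokens). This is only a sketch: the sizes and token counts are copied from the list, while the dictionary name `SOURCES` and the layout are illustrative.

```python
# (size in GB, tokens in billions) per source, copied from the list above.
SOURCES = {
    "OSCAR": (93.0, 28.6),
    "Open Subtitles": (5.0, 1.54),
    "Project Gutenberg": (0.3, 0.092),
    "Wikipedia": (2.5, 0.769),
    "Job Descriptions (TechWolf)": (1.5, 0.462),
    "Staatsblad (Bizzy)": (1.4, 0.431),
    "Legislation (ML6)": (0.2, 0.062),
}

total_gb = sum(gb for gb, _ in SOURCES.values())
total_tokens_b = sum(tok for _, tok in SOURCES.values())

print(f"Total size:   {total_gb:.1f} GB")
print(f"Total tokens: {total_tokens_b:.2f}B")

# Share of each source in the token budget, largest first.
for name, (_, tok) in sorted(SOURCES.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"{name:<30} {100 * tok / total_tokens_b:5.1f}%")
```

The totals come to roughly 103.9 GB and 31.96B tokens, in line with the rounded figures quoted for the model, with OSCAR accounting for close to 90% of the tokens.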
| | ### Training Procedure |
| |
|
| | This model was fine-tuned using low-rank adaptation (LoRA) with trainable embeddings, for a total of 544M trainable parameters. |
| |
|
| | #### Training Hyperparameters |
| |
|
| | - **Training regime:** bf16 non-mixed precision |
| | - **Epochs:** 1 |
| | - **LoRA parameters:** |
| |   - R: 8 |
| |   - Alpha: 32 |
| |   - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head |
| |   - LoRA dropout: 0.05 |
| | - **Learning Rate:** |
| |   - Scheduler: StepLR |
| |   - Step size: 6212 |
| |   - Learning rate: 0.0003 |
| |   - Gamma: 0.85 |
| | - **Other parameters:** |
| |   - Minibatch size: 16 |
| |   - Gradient accumulation steps: 8 |
| |   - Parallelization factor: 8 |
| |   - Weight decay: 0 |
| | |
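To make the learning-rate schedule concrete, the sketch below evaluates the step-decay rule with the values listed above (initial rate 0.0003, step size 6212, gamma 0.85). It uses the closed form that PyTorch's `StepLR` follows, multiplying the rate by gamma once every `step_size` scheduler steps. The function name `step_lr` is hypothetical, and it is an assumption that the scheduler advances once per optimizer update, which the hyperparameter list does not state.

```python
def step_lr(step, base_lr=3e-4, step_size=6212, gamma=0.85):
    """Learning rate after `step` scheduler steps under step decay."""
    # The rate is multiplied by gamma once for every completed block of step_size steps.
    return base_lr * gamma ** (step // step_size)


# Print the rate just before and after the first decay, and a few later points.
for step in (0, 6211, 6212, 12424, 62120):
    print(f"step {step:>6}: lr = {step_lr(step):.6f}")
```

Under these assumptions the rate stays at 0.0003 for the first 6212 steps and then drops by 15% at each subsequent boundary.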
| | ## Evaluation |
| | |
| | ### Quantitative evaluation |
| | |
| | We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models. |
| | |
| | | Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. | |
| | |----------------------------------------------|----------------|----------------|----------------|----------------|----------------| |
| | | **Llama-3-ChocoLlama-instruct** | **0.48** | **0.66** | **0.49** | **0.49** | **0.53** | |
| | | llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 | |
| | | llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 | |
| | | llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.50 | |
| | | Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 | |
| | | **Llama-3-ChocoLlama-base** | **0.45** | **0.64** | **0.44** | **0.44** | **0.49** | |
| | | zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 | |
| | | geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 | |
| | | **ChocoLlama-2-7B-tokentrans-instruct** | **0.45** | **0.62** | **0.34** | **0.42** | **0.46** | |
| | | mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 | |
| | | **ChocoLlama-2-7B-tokentrans-base** | **0.42** | **0.61** | **0.32** | **0.43** | **0.45** | |
| | | **ChocoLlama-2-7B-instruct** | **0.36** | **0.57** | **0.33** | **0.45** | **0.43** | |
| | | **ChocoLlama-2-7B-base** | **0.35** | **0.56** | **0.31** | **0.43** | **0.41** | |
| | | llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 | |
| | | llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 | |
| | |
| | On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks. |
| | |
| | ### Qualitative evaluation |
| | |
| | In our paper, we also provide a qualitative evaluation of all models, which we empirically find to be more reliable than the benchmarks above. |
| | For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench). |
| | |
| | ### Compute Infrastructure |
| | |
| | All ChocoLlama models were trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM. |
| | |
| | ## Citation |
| | |
| | If you found this useful for your work, kindly cite our paper: |
| | |
| | ``` |
| | @article{meeus2024chocollama, |
| | title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch}, |
| | author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas}, |
| | journal={arXiv preprint arXiv:2412.07633}, |
| | year={2024} |
| | } |
| | ``` |