---
language:
- pt
metrics:
- accuracy
base_model:
- mistralai/Mistral-7B-v0.3
pipeline_tag: text-generation
library_name: transformers
tags:
- legal
- portuguese
- Brazil
---
# Juru: Legal Brazilian Large Language Model from Reputable Sources

This repository hosts the public checkpoints for **Juru-7B**, a Mistral-7B model specialised in the Brazilian legal domain. The model underwent continued pretraining on **1.9 billion** unique tokens from reputable academic and legal sources in Portuguese. For full details on data curation, training, and evaluation, see our paper: <https://arxiv.org/abs/2403.18140>.
## Checkpoints
* Checkpoints were saved every **200** optimization steps up to step **3,800**.
* Each 200-step interval adds **~0.4 billion** tokens of continued pretraining.
* We refer to **Juru-7B** as checkpoint **3,400** (~7.1 billion tokens), which achieved the best score on our Brazilian legal knowledge benchmarks.
> **Note:** The model has **not** been instruction finetuned. For best results, use few-shot inference or perform additional finetuning on your specific task.
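Since the checkpoints are raw (non-instruction-tuned) language models, a typical way to query them is a few-shot completion prompt followed by greedy decoding. The sketch below uses `transformers`; the repository id is a placeholder, and selecting a specific checkpoint via the `revision` argument assumes the checkpoints are published as branches of one Hub repo, so adjust both to how the checkpoints are actually hosted.

```python
# Few-shot inference sketch for a Juru checkpoint with transformers.
# NOTE: REPO_ID is a placeholder, and exposing checkpoints as branches
# selectable via `revision` is an assumption about how they are hosted.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "your-org/Juru-7B"  # placeholder; replace with the actual repo id


def build_few_shot_prompt(examples, question):
    """Concatenate (question, answer) pairs into a completion-style prompt."""
    parts = [f"Pergunta: {q}\nResposta: {a}" for q, a in examples]
    parts.append(f"Pergunta: {question}\nResposta:")
    return "\n\n".join(parts)


def generate_answer(question, examples, revision="main", max_new_tokens=64):
    """Load a checkpoint and greedily complete a few-shot prompt."""
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, revision=revision)
    model = AutoModelForCausalLM.from_pretrained(
        REPO_ID, revision=revision, device_map="auto"
    )
    prompt = build_few_shot_prompt(examples, question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Decode only the newly generated continuation, not the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Greedy decoding (`do_sample=False`) is a reasonable default for short factual completions; for longer free-form text you may prefer sampling with a temperature.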
## Citation information
```bibtex
@misc{junior2024jurulegalbrazilianlarge,
  title={Juru: Legal Brazilian Large Language Model from Reputable Sources},
  author={Roseval Malaquias Junior and Ramon Pires and Roseli Romero and Rodrigo Nogueira},
  year={2024},
  eprint={2403.18140},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2403.18140},
}
```