Add model description to README
README.md
CHANGED
|
@@ -11,4 +11,30 @@ tags:
|
|
| 11 |
- legal
|
| 12 |
- portuguese
|
| 13 |
- Brazil
|
| 14 |
-
---
|
| 11 |
- legal
|
| 12 |
- portuguese
|
| 13 |
- Brazil
|
| 14 |
+
---
|
| 15 |
+
|
| 16 |
+
# Juru: Legal Brazilian Large Language Model from Reputable Sources
|
| 17 |
+
|
| 18 |
+
This repository hosts the public checkpoints for **Juru-7B**, a Mistral-7B model specialised in the Brazilian legal domain. The model underwent continued pretraining on **1.9 billion** unique tokens from reputable academic and legal sources in Portuguese. For full details on data curation, training, and evaluation, see our paper: <https://arxiv.org/abs/2403.18140>.
|
| 19 |
+
|
| 20 |
+
## Checkpoints
|
| 21 |
+
|
| 22 |
+
* Checkpoints were saved every **200** optimization steps up to step **3,800**.
|
| 23 |
+
* Each 200-step interval adds **~0.4 billion** tokens of continued pretraining.
|
| 24 |
+
* **Juru-7B** refers to checkpoint **3,400** (~7.1 billion tokens), which achieved the best score on our Brazilian legal knowledge benchmarks.
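The checkpoint schedule above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the first checkpoint was saved at step 200 and that every optimization step consumes the same number of tokens, neither of which is stated in this README. The function names are hypothetical.

```python
# Values quoted in the checkpoint list above.
SAVE_INTERVAL = 200    # optimization steps between saved checkpoints
FINAL_STEP = 3_800     # last saved checkpoint
JURU_STEP = 3_400      # checkpoint referred to as Juru-7B
JURU_TOKENS_B = 7.1    # approximate tokens seen at JURU_STEP, in billions


def checkpoint_steps(interval=SAVE_INTERVAL, final=FINAL_STEP):
    """Return the saved steps, assuming the first checkpoint is at `interval`."""
    return list(range(interval, final + 1, interval))


def approx_tokens_billion(step):
    """Estimate tokens seen at `step` (billions).

    Assumes a constant number of tokens per step, scaled from the
    ~7.1 billion tokens reported at step 3,400.
    """
    return JURU_TOKENS_B * step / JURU_STEP


if __name__ == "__main__":
    for step in checkpoint_steps():
        print(f"step {step:>5}: ~{approx_tokens_billion(step):.1f}B tokens")
```

Under these assumptions, one 200-step interval works out to roughly 0.42 billion tokens, which is consistent with the ~0.4 billion figure quoted above.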
|
| 25 |
+
|
| 26 |
+
> **Note:** The model has **not** been instruction-finetuned. For best results, use few-shot prompting or finetune it further on your specific task.
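Because the checkpoints are base models, a few-shot prompt typically shows a handful of solved examples and then leaves the last answer open for the model to continue. The sketch below is one possible way to assemble such a prompt. The `build_few_shot_prompt` helper and the `Pergunta:`/`Resposta:` layout are hypothetical choices for illustration, not the format used in the paper's evaluation.

```python
def build_few_shot_prompt(examples, question, instruction=""):
    """Format solved (question, answer) pairs followed by an open question.

    The "Pergunta:/Resposta:" layout is an illustrative assumption.
    """
    parts = [instruction.strip()] if instruction.strip() else []
    for example_question, example_answer in examples:
        parts.append(
            f"Pergunta: {example_question.strip()}\n"
            f"Resposta: {example_answer.strip()}"
        )
    # Leave the final answer empty so the base model continues from here.
    parts.append(f"Pergunta: {question.strip()}\nResposta:")
    return "\n\n".join(parts)


# Hypothetical demonstration examples; replace with ones from your own task.
examples = [
    ("Qual é a capital do Brasil?", "Brasília"),
    ("Em que ano foi promulgada a Constituição Federal vigente?", "1988"),
]
prompt = build_few_shot_prompt(
    examples,
    "Quantos ministros compõem o Supremo Tribunal Federal?",
    instruction="Responda às perguntas a seguir.",
)
print(prompt)
```

The resulting string can be passed as the input text to whichever text-generation interface you use to run the checkpoint. Stopping generation at the next blank line or the next `Pergunta:` keeps the model from inventing further questions.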
|
| 27 |
+
|
| 28 |
+
## Citation information
|
| 29 |
+
|
| 30 |
+
```bibtex
|
| 31 |
+
@misc{junior2024jurulegalbrazilianlarge,
|
| 32 |
+
title={Juru: Legal Brazilian Large Language Model from Reputable Sources},
|
| 33 |
+
author={Roseval Malaquias Junior and Ramon Pires and Roseli Romero and Rodrigo Nogueira},
|
| 34 |
+
year={2024},
|
| 35 |
+
eprint={2403.18140},
|
| 36 |
+
archivePrefix={arXiv},
|
| 37 |
+
primaryClass={cs.CL},
|
| 38 |
+
url={https://arxiv.org/abs/2403.18140},
|
| 39 |
+
}
|
| 40 |
+
```
|