Kiet Bui committed
Commit c9712ee
initial commit
Browse files:
- .gitattributes +35 -0
- README.md +105 -0
- loss.png +0 -0
- vbd_logo.png +0 -0
.gitattributes
ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
---
license: llama2
language:
- en
- vi
---
<p align="center"> <img src="vbd_logo.png" width="600" /> </p>

# VBD-LLaMA2-Chat - a Conversationally-tuned Llama2 for Vietnamese
We release VBD-LLaMA2-7B-Chat, a model based on Meta's LLaMA2-7B and finetuned specifically for the Vietnamese 🇻🇳 language, in an effort to support the community in building Vietnamese Large Language Models (LLMs). The pretrained weights for this model were trained with continuous self-supervised learning (SSL), extending LLaMA2's vocabulary, on a corpus of 100 billion Vietnamese 🇻🇳 tokens and 40 billion English 🇬🇧 tokens. This approach attempts to leverage the full potential of existing language models, adapt them to lower-resource languages, and, in the process, reduce the hardware, time, and data cost of building LLMs for these languages. The subsequent supervised finetuning (SFT) was conducted on our internal SFT dataset of 2 million samples.
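The card does not say how embeddings for the newly added vocabulary entries were initialized. One common technique (an assumption here, not necessarily what VBD used) is to initialize each new row as the mean of the existing embedding rows; a toy sketch with plain Python lists:

```python
# Toy sketch of extending an embedding table when new tokens are added to the
# vocabulary. ASSUMPTION: mean-initialization is one common choice; the card
# does not document VBD's actual initialization scheme.
def extend_embeddings(table, num_new_tokens):
    """table: list of embedding rows (lists of floats). Returns a new table
    with num_new_tokens extra rows, each set to the mean of the old rows."""
    dim = len(table[0])
    n = len(table)
    mean_row = [sum(row[d] for row in table) / n for d in range(dim)]
    return table + [list(mean_row) for _ in range(num_new_tokens)]

old = [[1.0, 2.0], [3.0, 4.0]]       # 2 existing tokens, dim 2
new = extend_embeddings(old, 2)      # add 2 new tokens
```

In practice this corresponds to resizing the model's token-embedding matrix after extending the tokenizer, before continuing pretraining.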
For this release, we are only including the SFT weights based on a checkpoint pretrained on 50B Vietnamese and 20B English tokens.
## Model weights
- VBD-LLaMA2-7B-50b-Chat: a snapshot demonstrating the efficacy of the proposed methodology. The base model was pretrained on 50B Vietnamese tokens and 20B English tokens, then SFT'd on XXXX samples.
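The card does not include a usage example. Since the model is a chat-tuned LLaMA2-7B derivative, one might assume it follows Llama-2's `[INST]`/`<<SYS>>` conversation format; this is an assumption, and the actual template should be confirmed against the released tokenizer. A minimal prompt-formatting sketch:

```python
# Llama-2-style chat prompt builder. ASSUMPTION: VBD-LLaMA2-7B-Chat follows the
# upstream Llama-2 [INST] template; verify against the model's tokenizer config.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

def build_prompt(user_msg, system_msg=""):
    """Wrap a single-turn user message in the Llama-2 chat format."""
    sys_block = f"{B_SYS}{system_msg}{E_SYS}" if system_msg else ""
    return f"{B_INST} {sys_block}{user_msg} {E_INST}"

prompt = build_prompt("Xin chào!", system_msg="You are a helpful assistant.")
```

The resulting string would then be tokenized and passed to the model for generation.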
<blockquote style="color:red"> <p><strong style="color: red">Terms of Use and License</strong>: By using our released weights, you agree to and comply with the terms and conditions specified in Meta's LLaMA-2 license.</p> </blockquote>
Disclaimer: Despite our efforts to limit misleading, inaccurate, and harmful generation, our released model comes with potential risks. We strongly advise using this model only in a highly supervised environment and/or performing extra testing, red-teaming, and alignment. The use of this model must abide by and comply with local governance and regulations. The authors of this model shall not be held liable for any claim, damage, or other liability arising from the use of the released weight(s).
In the following section, we document some benchmarks of the released weight(s).
## Evaluation
We evaluated our model via peer comparison on multiple publicly available datasets using <a href="https://github.com/hieunguyen1053/lm-evaluation-harness">@hieunguyen1053's fork of lm-evaluation-harness</a>. The models were benchmarked on different tasks and metrics; the results are below:
| Organization | Model | Model size | ARC (ACC) | HellaSwag (ACC) | LAMBADA (perplexity) | MMLU (ACC) | IWSLT 2023 en-vi (BLEU) | TruthfulQA (ACC) | Grade 12 Exams (ACC) | hhh_alignment_vi (ACC) | xnli_vi (ACC) |
| ------------ | ---------------------------- | ---------- | --------- | --------------- | -------------------- | ---------- | ----------------------- | ---------------- | -------------------- | ---------------------- | ------------- |
| VietAI | gpt-j-6B-vietnamese-news | ~7B | 0,2419 | 0,3856 | 35,1863 | 0,2282 | 0,6698 | 0,4718 | | | 0,4365 |
| VietAI | gpt-neo-1.3B-vietnamese-news | ~1.5B | 0,2274 | 0,3567 | 64,3972 | 0,229 | 0,5178 | 0,4423 | | | 0,4483 |
| VietGPT | dama-2-7B-chat | ~7B | 0,3417 | 0,5106 | 38,0188 | 0,338 | 24,3101 | 0,4847 | | | 0,4653 |
| VietGPT | dama-2-7B | ~7B | 0,3214 | 0,4892 | 17,6625 | 0,2339 | 25,8764 | 0,4416 | 0,293 | | 0,4469 |
| ViLM | vietcuna-7b-v3 | ~7B | 0,335 | 0,4914 | 21,7747 | 0,336 | 21,0801 | 0,4771 | 0,2992 | | 0,4749 |
| VLSP | hoa-1b4 | ~1.5B | 0,2718 | 0,4228 | 20,3997 | 0,2281 | 28,0573 | 0,4423 | 0,2684 | | 0,4605 |
| VLSP | hoa-7b | ~7B | 0,2855 | 0,4329 | 22,6466 | 0,2536 | 25,5126 | 0,4542 | 0,2705 | | 0,4509 |
| VBD | VBD-LLaMA2-7B-50b | ~7B | 0,3222 | 0,5195 | 13,033 | 0,2964 | | 0,4614 | 0,3197 | | 0,4764 |
| VBD | VBD-LLaMA2-7B-50b-Chat | ~7B | 0,3585 | 0,5207 | 13,419 | 0,3444 | 24,1 | 0,5179 | 0,3299 | 0,5792 | 0,4772 |
| AISingapore | Sealion7b | ~7B | 0,2692 | 0,483 | 16,4388 | 0,267 | | 0,4275 | 0,2725 | | 0,4277 |
| BK Lab | LLaMa-2-BK | ~7B | 0,2966 | 0,4402 | 25,613 | 0,3402 | | 0,4528 | 0,2971 | | 0,4655 |
| Meta | LLaMa-2 | ~7B | 0,3034 | 0,4287 | | 0,3067 | | | | | |
| BigScience | Bloom | ~7B | 0,337 | 0,483 | | 0,281 | | | | | |
| FPT | FPT GenAI | | 0,3581 | 0,5055 | | 0,3143 | | | | | |
| VinAI | PhoGPT SFT | ~7B | 0,2684 | 0,4109 | 55,509 | 0,2499 | | 0,478 | 0,2643 | | 0,4198 |
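Note that the scores above are written with decimal commas (e.g. `0,3585` means 0.3585), and empty cells indicate missing results. A small illustrative helper (not part of the release) for normalizing such cells into floats when comparing models programmatically:

```python
def parse_score(cell):
    """Parse a benchmark cell written with a decimal comma,
    e.g. '0,3585' -> 0.3585. Empty cells (missing results) -> None."""
    cell = cell.strip()
    if not cell:
        return None
    return float(cell.replace(",", "."))

# The VBD-LLaMA2-7B-50b-Chat row's ARC, HellaSwag, LAMBADA cells plus a
# missing cell, taken from the table above:
row = ["0,3585", "0,5207", "13,419", ""]
scores = [parse_score(c) for c in row]
```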

| Organization | Model | Model size | ARC (ACC) | HellaSwag (ACC) | LAMBADA (perplexity) | MMLU (ACC) |
| ------------ | ------------------ | ---------- | --------- | --------------- | -------------------- | ---------- |
| VLSP | hoa-7b | ~7B | 0,2722 | 0,4867 | 18,53 | |
| BK Lab | LLaMA-2-BK | ~7B | 0,4164 | 0,7216 | 5,010 | |
| ViLM | vietcuna-7b-v3 | ~7B | 0,3976 | 0,6309 | 7,125 | |
| BigScience | Bloomz-T0 | ~7B | 0,436 | 0,6401 | 6,542 | 0,3785 |
| TII | Falcon-7B-Instruct | ~7B | 0,4258 | 0,6976 | 7,463 | 0,2584 |
| MosaicML | MPT-7B-Chat | ~7B | 0,4258 | 0,7438 | 5,797 | 0,3762 |
| Meta | LLaMA-2-Chat | ~7B | 0,442 | 0,7547 | 3,968 | 0,4832 |
| AISingapore | Sealion7b | ~7B | 0,3422 | 0,6705 | 6,715 | 0,268 |
| VBD | VBD-LLaMA2-7B-50b-Chat | ~7B | 0,4556 | 0,7384 | 4,645 | 0,4558 |
Based on these results, our model performs on par with or better than most models on Vietnamese tasks. TO_BE_FILLED
## Safety Enhancement in Local Context

TO_BE_FILLED
## Training process

TO_BE_FILLED
The next section details our SSL process.

The SSL dataset distribution is as follows:
The training time for this 7B model is around 8,000 GPU hours (roughly 42 days on a DGX node with 8× A100 40GB GPUs). The snapshot for the 50B checkpoint was taken at around 13,000 steps.
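As a back-of-envelope check on those figures (assuming "GPU hours" counts each of the node's 8 GPUs separately, running in parallel):

```python
# 8,000 GPU-hours spread across 8 GPUs running in parallel gives the
# wall-clock time; dividing by 24 converts to days.
gpu_hours = 8_000
num_gpus = 8
wall_clock_hours = gpu_hours / num_gpus   # 1,000 wall-clock hours
days = wall_clock_hours / 24              # about 41.7, i.e. "roughly 42 days"
```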
<p align="left"> <img src="loss.png" width="500" /> </p>
## Pre-training Strategies

TO_BE_FILLED
## Supervised fine-tuning (SFT) Data

TO_BE_FILLED
## SFT Strategies

TO_BE_FILLED
## Acknowledgement to Our Linguists

We would like to express our special thanks to our professional, native linguists, who helped build, evaluate, and fact-check our sampled pretraining and SFT datasets, and who evaluated our models across different aspects, especially safety.
## Citation

If you find our project useful, we hope you will kindly star our repo and cite our work as follows.

Corresponding authors: v.quangph3@vinbigdata.com, v.kietbs@vinbigdata.com, v.minhtt32@vinbigdata.com
loss.png
ADDED
vbd_logo.png
ADDED