Kiet Bui committed c9712ee (initial commit, 0 parents)

Files changed (4):
  1. .gitattributes +35 -0
  2. README.md +105 -0
  3. loss.png
  4. vbd_logo.png
.gitattributes ADDED

*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED

---
license: llama2
language:
- en
- vi
---

<p align="center"> <img src="vbd_logo.png" width="600" /> </p>

# VBD-LLaMA2-Chat: a Conversationally Tuned LLaMA2 for Vietnamese

We release VBD-LLaMA2-7B-Chat, a model fine-tuned from Meta's LLaMA2-7B specifically for the Vietnamese 🇻🇳 language, in an effort to support the community in building Vietnamese Large Language Models (LLMs). The pretrained weights for this model were obtained through continued self-supervised learning (SSL), after extending LLaMA2's vocabulary, on a corpus of 100 billion Vietnamese 🇻🇳 tokens and 40 billion English 🇬🇧 tokens. This approach attempts to leverage the full potential of existing language models, adapt them to lower-resource languages, and, in the process, reduce the hardware, time, and data cost of building LLMs for those languages. The subsequent supervised fine-tuning (SFT) was conducted on our internal SFT dataset of 2 million samples.
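Extending the vocabulary implies resizing the token-embedding matrix before continued pretraining. The sketch below shows one common way to do that (mean-initializing the new rows); the actual extension size, hidden size, and initialization used for this model are not documented here, so all dimensions are illustrative:

```python
import numpy as np

# Illustrative sizes only: LLaMA2's vocabulary is 32,000 tokens; the number
# of added Vietnamese tokens (8,000 here) and the hidden size (128 here,
# 4,096 in the real model) are assumptions for this sketch.
old_vocab, added_tokens, hidden = 32_000, 8_000, 128

rng = np.random.default_rng(0)
old_embed = rng.normal(scale=0.02, size=(old_vocab, hidden))

# Initialize each new row to the mean of the existing embeddings so new
# tokens start close to the learned distribution before SSL adjusts them.
new_rows = np.tile(old_embed.mean(axis=0), (added_tokens, 1))
embed = np.vstack([old_embed, new_rows])

print(embed.shape)  # (40000, 128)
```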
For this release, we are only including the SFT weights based on a checkpoint pretrained on 50B Vietnamese and 20B English tokens.

Model weights:
- VBD-LLaMA2-7B-50b-Chat: a snapshot demonstrating the efficacy of the proposed methodology. Its base model is pretrained on 50B Vietnamese tokens and 20B English tokens, then supervised fine-tuned on XXXX samples.

<blockquote style="color:red"> <p><strong style="color: red">Terms of Use and License</strong>: By using our released weights, you agree to and comply with the terms and conditions specified in Meta's LLaMA-2 license.</p> </blockquote>
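Since this is a chat checkpoint, inference requires wrapping the conversation in a prompt template. This card does not specify the template, so the helper below assumes Meta's standard LLaMA-2 chat format (`[INST]` and `<<SYS>>` markers); treat it as a hedged sketch, not the documented format for this model:

```python
def build_llama2_prompt(messages, system=None):
    """Render a chat history into Meta's standard LLaMA-2 chat format.

    `messages` is a list of (role, text) pairs alternating "user"/"assistant".
    NOTE: assuming the stock LLaMA-2 template; the template actually used to
    fine-tune VBD-LLaMA2-7B-Chat is not documented in this card.
    """
    sys_block = f"<<SYS>>\n{system}\n<</SYS>>\n\n" if system else ""
    prompt = ""
    for i, (role, text) in enumerate(messages):
        if role == "user":
            # The system prompt is folded into the first user turn.
            content = sys_block + text if i == 0 else text
            prompt += f"<s>[INST] {content} [/INST]"
        else:  # an assistant turn closes the preceding [INST] block
            prompt += f" {text} </s>"
    return prompt

print(build_llama2_prompt([("user", "Xin chào!")], system="Bạn là trợ lý AI."))
```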
20
+
21
+
22
+
23
+ Disclaimer: Despite our efforts in limiting misleading, inaccurate and harmful generation, our released model will come with potential risks. We strongly advise to only use this model under highly supervised environment and/or perform extra testing, red teaming and aligning. The use of this model must abide by and comply with local governance and regulations. The authors of this model shall not be held liable for any claim, damage or other liability arise from the use of the released weight(s).
24
+
25
+
26
+
27
+ In the following section, we document some of the benchmark of the released weight(s).
28
+
29
+ Evaluation
30
+
31
+
32
+ We evaluated our model via peer comparison on multiple publicly available dataset using
33
+ <a href="https://github.com/hieunguyen1053/lm-evaluation-harness"> @hieunguyen1053 fork of lm-evaluation-harness </a>. The models are benchmark on different task and metrics. The results are below:
| Organization | Model                        | Model size | ARC (ACC) | HellaSwag (ACC) | LAMBADA (perplexity) | MMLU (ACC) | IWSLT 2023 en-vi (BLEU) | TruthfulQA (ACC) | Grade 12 Exams (ACC) | hhh_alignment_vi (ACC) | xnli_vi (ACC) |
| ------------ | ---------------------------- | ---------- | --------- | --------------- | -------------------- | ---------- | ----------------------- | ---------------- | -------------------- | ---------------------- | ------------- |
| VietAI       | gpt-j-6B-vietnamese-news     | ~7B        | 0.2419    | 0.3856          | 35.1863              | 0.2282     | 0.6698                  | 0.4718           |                      |                        | 0.4365        |
| VietAI       | gpt-neo-1.3B-vietnamese-news | ~1.5B      | 0.2274    | 0.3567          | 64.3972              | 0.229      | 0.5178                  | 0.4423           |                      |                        | 0.4483        |
| VietGPT      | dama-2-7B-chat               | ~7B        | 0.3417    | 0.5106          | 38.0188              | 0.338      | 24.3101                 | 0.4847           |                      |                        | 0.4653        |
| VietGPT      | dama-2-7B                    | ~7B        | 0.3214    | 0.4892          | 17.6625              | 0.2339     | 25.8764                 | 0.4416           | 0.293                |                        | 0.4469        |
| ViLM         | vietcuna-7b-v3               | ~7B        | 0.335     | 0.4914          | 21.7747              | 0.336      | 21.0801                 | 0.4771           | 0.2992               |                        | 0.4749        |
| VLSP         | hoa-1b4                      | ~1.5B      | 0.2718    | 0.4228          | 20.3997              | 0.2281     | 28.0573                 | 0.4423           | 0.2684               |                        | 0.4605        |
| VLSP         | hoa-7b                       | ~7B        | 0.2855    | 0.4329          | 22.6466              | 0.2536     | 25.5126                 | 0.4542           | 0.2705               |                        | 0.4509        |
| VBD          | VBD-LLaMA2-7B-50b            | ~7B        | 0.3222    | 0.5195          | 13.033               | 0.2964     |                         | 0.4614           | 0.3197               |                        | 0.4764        |
| VBD          | VBD-LLaMA2-7B-50b-Chat       | ~7B        | 0.3585    | 0.5207          | 13.419               | 0.3444     | 24.1                    | 0.5179           | 0.3299               | 0.5792                 | 0.4772        |
| AISingapore  | Sealion7b                    | ~7B        | 0.2692    | 0.483           | 16.4388              | 0.267      |                         | 0.4275           | 0.2725               |                        | 0.4277        |
| BK Lab       | LLaMa-2-BK                   | ~7B        | 0.2966    | 0.4402          | 25.613               | 0.3402     |                         | 0.4528           | 0.2971               |                        | 0.4655        |
| Meta         | LLaMa-2                      | ~7B        | 0.3034    | 0.4287          |                      | 0.3067     |                         |                  |                      |                        |               |
| BigScience   | Bloom                        | ~7B        | 0.337     | 0.483           |                      | 0.281      |                         |                  |                      |                        |               |
| FPT          | FPT GenAI                    |            | 0.3581    | 0.5055          |                      | 0.3143     |                         |                  |                      |                        |               |
| VinAI        | PhoGPT SFT                   | ~7B        | 0.2684    | 0.4109          | 55.509               | 0.2499     |                         | 0.478            | 0.2643               |                        | 0.4198        |
| Organization | Model                  | Model size | ARC (ACC) | HellaSwag (ACC) | LAMBADA (perplexity) | MMLU (ACC) |
| ------------ | ---------------------- | ---------- | --------- | --------------- | -------------------- | ---------- |
| VLSP         | hoa-7b                 | ~7B        | 0.2722    | 0.4867          | 18.53                |            |
| BK Lab       | LLaMA-2-BK             | ~7B        | 0.4164    | 0.7216          | 5.010                |            |
| ViLM         | vietcuna-7b-v3         | ~7B        | 0.3976    | 0.6309          | 7.125                |            |
| BigScience   | Bloomz-T0              | ~7B        | 0.436     | 0.6401          | 6.542                | 0.3785     |
| TII          | Falcon-7B-Instruct     | ~7B        | 0.4258    | 0.6976          | 7.463                | 0.2584     |
| MosaicML     | MPT-7B-Chat            | ~7B        | 0.4258    | 0.7438          | 5.797                | 0.3762     |
| Meta         | LLaMA-2-Chat           | ~7B        | 0.442     | 0.7547          | 3.968                | 0.4832     |
| AISingapore  | Sealion7b              | ~7B        | 0.3422    | 0.6705          | 6.715                | 0.268      |
| VBD          | VBD-LLaMA2-7B-50b-Chat | ~7B        | 0.4556    | 0.7384          | 4.645                | 0.4558     |
Based on these results, our model performs on par with or better than most models on Vietnamese tasks. TO_BE_FILLED
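A note on reading the LAMBADA column: unlike the ACC columns, it reports perplexity, the exponential of the average per-token negative log-likelihood, so lower is better. A minimal illustration, with made-up log-probabilities rather than numbers from the tables:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood over target tokens)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities for a short continuation.
logprobs = [-2.1, -0.4, -1.3]
print(round(perplexity(logprobs), 3))
```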
## Safety Enhancement in Local Context
TO_BE_FILLED

## Training process
TO_BE_FILLED

The next section describes our SSL process.

The SSL dataset distribution is as follows:

The training time for this 7B model is around 8,000 GPU hours (roughly 42 days on one DGX node with 8 A100 40GB GPUs). The snapshot for the 50B checkpoint was taken at around 13,000 steps.
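The wall-clock estimate follows from dividing total GPU hours evenly across the node's 8 GPUs; a quick sanity check of the arithmetic:

```python
gpu_hours = 8_000
gpus = 8  # one DGX node with 8x A100 40GB

# Wall-clock days if all 8 GPUs run continuously in parallel.
wall_clock_days = gpu_hours / gpus / 24
print(round(wall_clock_days, 1))  # → 41.7, i.e. roughly the 42 days quoted
```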
<p align="left"> <img src="loss.png" width="500" /> </p>

## Pre-training Strategies
TO_BE_FILLED

## Supervised fine-tuning (SFT) Data
TO_BE_FILLED

## SFT Strategies
TO_BE_FILLED

## Acknowledgement to Our Linguists
We would like to express our special thanks to our professional and native linguists, who helped build, evaluate, and fact-check our sampled pretraining and SFT datasets, as well as evaluate our models across different aspects, especially safety.

## Citation
If you find our project useful, we hope you will kindly star our repo and cite our work. Corresponding authors: v.quangph3@vinbigdata.com, v.kietbs@vinbigdata.com, v.minhtt32@vinbigdata.com
loss.png ADDED
vbd_logo.png ADDED