Files changed (1)
  1. README.md +174 -166
README.md CHANGED
@@ -1,167 +1,175 @@
- ---
- license: mit
- language:
- - vi
- - en
- - zh
- - id
- - th
- base_model:
- - Qwen/Qwen2.5-14B-Instruct
- pipeline_tag: text2text-generation
- ---
-
-
- # GreenMind-Medium-14B-R1
-
- We release **GreenMind-Medium-14B-R1**, a medium-sized Vietnamese language model that handles questions requiring intermediate-level reasoning across general knowledge, mathematics, natural science, and social science. The model is fine-tuned with Group Relative Policy Optimization (GRPO) to encourage logically coherent responses.
-
- ## Model Description
-
- - **Model Type:** Causal language model
- - **Base Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
- - **Parameters:** 14.7B
- - **Context Length:** 131,072 tokens in total, with up to 8,192 generated tokens
- - **Language:** Vietnamese
-
- ## Quickstart
-
- The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate a response.
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_name = "GreenNode/GreenMind-Medium-14B-R1"
-
- model = AutoModelForCausalLM.from_pretrained(
-     model_name,
-     torch_dtype="auto",
-     device_map="auto"
- )
-
- tokenizer = AutoTokenizer.from_pretrained(
-     model_name,
-     revision='main',
-     trust_remote_code=False,
- )
- prompt = r"""Vừa gà vừa chó
- Bó lại cho tròn
- Ba mươi sáu con
- Một trăm chân chẵn
- Hỏi có bao nhiêu con gà, bao nhiêu con chó?"""
-
- messages = [
-     {
-         "role": "system",
-         "content": "Bạn là một trợ lý ảo hữu ích trong việc trả lời câu hỏi. Hãy suy luận từng bước, và đưa ra đáp án trong thẻ <answer> </answer>."
-     },
-     {
-         "role": "user",
-         "content": f"{prompt} Hãy suy luận từng bước trong thẻ <think> </think>. Và trả về đáp án trong thẻ <answer> </answer>."
-     },
-     {
-         "role": "assistant",
-         "content": "Hãy để tôi giải quyết từng bước.\n<think>"
-     }
- ]
-
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     continue_final_message=True)
-
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=1024
- )
-
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
- print(response)
- # Đầu tiên, chúng ta cần thiết lập hai phương trình dựa trên thông tin đề bài:
- # 1. Tổng số con gà và chó là 36: x + y = 36
- # 2. Tổng số chân là 100: 2x + 4y = 100
- # Trong đó, x là số con gà và y là số con chó.
- # Tiếp theo, chúng ta giải hệ phương trình này:
- # Từ phương trình thứ nhất, ta có: x = 36 - y
- # Thay vào phương trình thứ hai: 2(36 - y) + 4y = 100
- # => 72 - 2y + 4y = 100
- # => 2y = 28
- # => y = 14 (số con chó)
- # Thay y = 14 vào phương trình x + y = 36:
- # => x = 36 - 14 = 22 (số con gà)
- # Vậy, có 22 con gà và 14 con chó.
- # </think>
- # <answer>Có 22 con gà và 14 con chó.</answer>
- ```
-
- ## Evaluation
-
- **Table 1. SeaExam Dataset.** GreenMind-Medium-14B-R1 compared with its base model and several larger models.
- | **Model** | **SeaExam-ID** | **SeaExam-TH** | **SeaExam-VI** | **Avg** |
- |----------------------------------|----------------|----------------|----------------|----------|
- | Meta-Llama-3.1-70B-Instruct | 65.8 | **70.6** | 72.6 | 69.7 |
- | gemma3-27b-it | 64.4 | 67.5 | 73.1 | 68.4 |
- | Qwen2.5-14B-Instruct | 67.6 | 68.8 | 73.1 | 69.8 |
- | **GreenMind-Medium-14B-R1** | **74.36** | 69.75 | **74.44** | **72.79** |
-
- **Table 2. VLSP 2023 Challenge.** Our model outperforms the other listed models on every metric (↑ higher is better, ↓ lower is better).
-
- | **Model** | **ComprehensionQA-vi ↑** | **Exams-vi ↑** | **LAMBADA-vi ↓** | **WikiQA-vi ↑** | **MMLU-vi ↑** |
- |----------------------------------|---------------------------|----------------|------------------|-----------------|---------------|
- | cpt-smartbot-13b | 0.6633 | 0.3473 | 21.9864 | 0.4455 | 0.414 |
- | ura-llama-13b | 0.6556 | 0.342 | 17.5614 | 0.438 | 0.3973 |
- | greennode-7b (prior work) | 0.6122 | 0.2892 | 189.7782 | 0.3335 | 0.387 |
- | greennode-14b (prior work) | 0.6711 | 0.3672 | 29.5967 | 0.468 | 0.5281 |
- | **GreenMind-Medium-14B-R1 (Ours)** | **0.8689** | **0.7796** | **10.7609** | **0.7915** | **0.7124** |
-
- **Table 3. VMLU Dataset.** Performance compared with other fine-tuned models.
-
- | **Model** | **Access** | **STEM** | **Social Science** | **Humanities** | **Others** | **Avg** |
- |----------------------------------|-----------|----------|---------------------|----------------|------------|----------|
- | VNPTAI.IO-Medium-R1 | Private | 77.09 | 82.3 | 78.85 | 69.98 | 77.43 |
- | MISA-Llama3-v1.1 | Private | 77.5 | 80.75 | 76.62 | 71.6 | 76.87 |
- | BnK-AI-Medium-v2 | Private | 80.94 | 80.76 | 70.7 | 74.06 | 76.66 |
- | VNPTAI.IO-Large-v4 | Private | 78.05 | 79.05 | 75.39 | 70.37 | 76.21 |
- | GreenNode-xMedium-v1 | Private | 75.7 | 81.09 | 75.25 | 69.33 | 75.5 |
- | **GreenMind-Medium-14B-R1 (Ours)** | Weight | 76.78 | 77.36 | 72.32 | 69.03 | 74.29 |
- | CakebyVPBank-Large | Private | 77.75 | 78.11 | 70.38 | 67.82 | 73.99 |
- | DeepSeek-R1-Distill-Llama-70B | Weight | 76.77 | 76.23 | 67.98 | 66.82 | 72.41 |
-
- ## Follow us
-
- https://x.com/greennode23
-
- ## Support
-
- https://discord.gg/B6MJFM3J3a
-
- ## License
-
- This repository and the model weights are licensed under the [MIT License](LICENSE).
-
- ## Citation
-
- If you find our work helpful, please consider citing it.
-
- ```
- @misc{tung2025greenmindnextgenerationvietnameselarge,
-       title={GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning},
-       author={Luu Quy Tung and Hoang Quoc Viet and Vo Trong Thu},
-       year={2025},
-       eprint={2504.16832},
-       archivePrefix={arXiv},
-       primaryClass={cs.CL},
-       url={https://arxiv.org/abs/2504.16832},
- }
- ```
-
- ## Contact Us
-
- - General & Collaboration: tung.vu@greennode.ai, thuvt@greennode.ai
+ ---
+ license: mit
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-14B-Instruct
+ pipeline_tag: text2text-generation
+ ---
+
+
+ # GreenMind-Medium-14B-R1
+
+ We release **GreenMind-Medium-14B-R1**, a medium-sized Vietnamese language model that handles questions requiring intermediate-level reasoning across general knowledge, mathematics, natural science, and social science. The model is fine-tuned with Group Relative Policy Optimization (GRPO) to encourage logically coherent responses.
+
+ ## Model Description
+
+ - **Model Type:** Causal language model
+ - **Base Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
+ - **Parameters:** 14.7B
+ - **Context Length:** 131,072 tokens in total, with up to 8,192 generated tokens
+ - **Language:** Vietnamese
+
+ ## Quickstart
+
+ The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate a response.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "GreenNode/GreenMind-Medium-14B-R1"
+
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     model_name,
+     revision='main',
+     trust_remote_code=False,
+ )
+ prompt = r"""Vừa gà vừa chó
+ Bó lại cho tròn
+ Ba mươi sáu con
+ Một trăm chân chẵn
+ Hỏi có bao nhiêu con gà, bao nhiêu con chó?"""
+
+ messages = [
+     {
+         "role": "system",
+         "content": "Bạn là một trợ lý ảo hữu ích trong việc trả lời câu hỏi. Hãy suy luận từng bước, và đưa ra đáp án trong thẻ <answer> </answer>."
+     },
+     {
+         "role": "user",
+         "content": f"{prompt} Hãy suy luận từng bước trong thẻ <think> </think>. Và trả về đáp án trong thẻ <answer> </answer>."
+     },
+     {
+         "role": "assistant",
+         "content": "Hãy để tôi giải quyết từng bước.\n<think>"
+     }
+ ]
+
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     continue_final_message=True)
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=1024
+ )
+
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ # Đầu tiên, chúng ta cần thiết lập hai phương trình dựa trên thông tin đề bài:
+ # 1. Tổng số con gà và chó là 36: x + y = 36
+ # 2. Tổng số chân là 100: 2x + 4y = 100
+ # Trong đó, x là số con gà và y là số con chó.
+ # Tiếp theo, chúng ta giải hệ phương trình này:
+ # Từ phương trình thứ nhất, ta có: x = 36 - y
+ # Thay vào phương trình thứ hai: 2(36 - y) + 4y = 100
+ # => 72 - 2y + 4y = 100
+ # => 2y = 28
+ # => y = 14 (số con chó)
+ # Thay y = 14 vào phương trình x + y = 36:
+ # => x = 36 - 14 = 22 (số con gà)
+ # Vậy, có 22 con gà và 14 con chó.
+ # </think>
+ # <answer>Có 22 con gà và 14 con chó.</answer>
+ ```
+
+ ## Evaluation
+
+ **Table 1. SeaExam Dataset.** GreenMind-Medium-14B-R1 compared with its base model and several larger models.
+ | **Model** | **SeaExam-ID** | **SeaExam-TH** | **SeaExam-VI** | **Avg** |
+ |----------------------------------|----------------|----------------|----------------|----------|
+ | Meta-Llama-3.1-70B-Instruct | 65.8 | **70.6** | 72.6 | 69.7 |
+ | gemma3-27b-it | 64.4 | 67.5 | 73.1 | 68.4 |
+ | Qwen2.5-14B-Instruct | 67.6 | 68.8 | 73.1 | 69.8 |
+ | **GreenMind-Medium-14B-R1** | **74.36** | 69.75 | **74.44** | **72.79** |
+
+ **Table 2. VLSP 2023 Challenge.** Our model outperforms the other listed models on every metric (↑ higher is better, ↓ lower is better).
+
+ | **Model** | **ComprehensionQA-vi ↑** | **Exams-vi ↑** | **LAMBADA-vi ↓** | **WikiQA-vi ↑** | **MMLU-vi ↑** |
+ |----------------------------------|---------------------------|----------------|------------------|-----------------|---------------|
+ | cpt-smartbot-13b | 0.6633 | 0.3473 | 21.9864 | 0.4455 | 0.414 |
+ | ura-llama-13b | 0.6556 | 0.342 | 17.5614 | 0.438 | 0.3973 |
+ | greennode-7b (prior work) | 0.6122 | 0.2892 | 189.7782 | 0.3335 | 0.387 |
+ | greennode-14b (prior work) | 0.6711 | 0.3672 | 29.5967 | 0.468 | 0.5281 |
+ | **GreenMind-Medium-14B-R1 (Ours)** | **0.8689** | **0.7796** | **10.7609** | **0.7915** | **0.7124** |
+
+ **Table 3. VMLU Dataset.** Performance compared with other fine-tuned models.
+
+ | **Model** | **Access** | **STEM** | **Social Science** | **Humanities** | **Others** | **Avg** |
+ |----------------------------------|-----------|----------|---------------------|----------------|------------|----------|
+ | VNPTAI.IO-Medium-R1 | Private | 77.09 | 82.3 | 78.85 | 69.98 | 77.43 |
+ | MISA-Llama3-v1.1 | Private | 77.5 | 80.75 | 76.62 | 71.6 | 76.87 |
+ | BnK-AI-Medium-v2 | Private | 80.94 | 80.76 | 70.7 | 74.06 | 76.66 |
+ | VNPTAI.IO-Large-v4 | Private | 78.05 | 79.05 | 75.39 | 70.37 | 76.21 |
+ | GreenNode-xMedium-v1 | Private | 75.7 | 81.09 | 75.25 | 69.33 | 75.5 |
+ | **GreenMind-Medium-14B-R1 (Ours)** | Weight | 76.78 | 77.36 | 72.32 | 69.03 | 74.29 |
+ | CakebyVPBank-Large | Private | 77.75 | 78.11 | 70.38 | 67.82 | 73.99 |
+ | DeepSeek-R1-Distill-Llama-70B | Weight | 76.77 | 76.23 | 67.98 | 66.82 | 72.41 |
+
+ ## Follow us
+
+ https://x.com/greennode23
+
+ ## Support
+
+ https://discord.gg/B6MJFM3J3a
+
+ ## License
+
+ This repository and the model weights are licensed under the [MIT License](LICENSE).
+
+ ## Citation
+
+ If you find our work helpful, please consider citing it.
+
+ ```
+ @misc{tung2025greenmindnextgenerationvietnameselarge,
+       title={GreenMind: A Next-Generation Vietnamese Large Language Model for Structured and Logical Reasoning},
+       author={Luu Quy Tung and Hoang Quoc Viet and Vo Trong Thu},
+       year={2025},
+       eprint={2504.16832},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2504.16832},
+ }
+ ```
+
+ ## Contact Us
+
+ - General & Collaboration: tung.vu@greennode.ai, thuvt@greennode.ai
  - Technical: viethq5@greennode.ai
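
Note on the Quickstart in this README: the decoded response wraps its reasoning in `<think>` tags and its final result in `<answer>` tags. A minimal sketch of how a caller might pull the answer out of such a response and sanity-check the riddle's arithmetic is shown here. The helper names `extract_answer` and `solve_heads_legs` are hypothetical and not part of the GreenMind repository, and `sample` is a hand-written string modeled on the expected output in the README, not real model output.

```python
import re
from typing import Optional, Tuple


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside the last <answer>...</answer> pair, or None if absent."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def solve_heads_legs(heads: int, legs: int) -> Optional[Tuple[int, int]]:
    """Solve x + y = heads and 2x + 4y = legs for (chickens x, dogs y).

    Returns None when there is no non-negative integer solution.
    """
    twice_dogs = legs - 2 * heads  # subtracting 2*(x + y) from 2x + 4y leaves 2y
    if twice_dogs < 0 or twice_dogs % 2 != 0:
        return None
    dogs = twice_dogs // 2
    chickens = heads - dogs
    if chickens < 0:
        return None
    return chickens, dogs


# Hand-written sample (assumption), shaped like the README's expected output.
sample = "=> 2y = 28\n=> y = 14\n</think>\n<answer>Có 22 con gà và 14 con chó.</answer>"

print(extract_answer(sample))
print(solve_heads_legs(36, 100))
```

Taking the last `<answer>` match is a design choice for this sketch: the prompt itself mentions the tag, so if the prompt were ever left in the decoded text, the final match is the one most likely to be the model's own answer.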