Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +121 -112
README.md CHANGED
@@ -1,112 +1,121 @@
- 
- ---
- 
- base_model:
- - qnguyen3/VyLinh-3B
- - Qwen/Qwen2.5-3B-Instruct
- library_name: transformers
- tags:
- - mergekit
- - merge
- language:
- - vi
- 
- ---
- 
- [![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)
- 
- 
- # QuantFactory/Arcee-VyLinh-GGUF
- This is a quantized version of [arcee-ai/Arcee-VyLinh](https://huggingface.co/arcee-ai/Arcee-VyLinh) created using llama.cpp.
- 
- # Original Model Card
- 
- **Quantized Version**: [arcee-ai/Arcee-VyLinh-GGUF](https://huggingface.co/arcee-ai/Arcee-VyLinh-GGUF)
- 
- # Arcee-VyLinh
- 
- Arcee-VyLinh is a 3B-parameter instruction-following model specifically optimized for Vietnamese language understanding and generation. Built through a multi-stage training process combining evolved hard questions and iterative Direct Preference Optimization (DPO), it achieves strong performance despite its compact size.
- 
- ## Model Details
- 
- - **Architecture:** Based on Qwen2.5-3B
- - **Parameters:** 3 billion
- - **Context Length:** 32K tokens
- - **Training Data:** Custom evolved dataset + ORPO-Mix-40K (Vietnamese)
- - **Training Method:** Multi-stage process including EvolKit, proprietary merging, and iterative DPO
- - **Input Format:** Supports both English and Vietnamese; optimized for Vietnamese
- 
- ## Intended Use
- 
- - Vietnamese-language chat and instruction following
- - Text generation and completion
- - Question answering
- - General language understanding tasks
- - Content creation and summarization
- 
- ## Performance and Limitations
- 
- ### Strengths
- 
- - Exceptional performance on complex Vietnamese-language tasks
- - Efficient 3B-parameter architecture
- - Strong instruction-following capabilities
- - Competitive with larger models (4B-8B parameters)
- 
- ### Benchmarks
- 
- Tested on the Vietnamese subset of m-ArenaHard (CohereForAI), with Claude 3.5 Sonnet as judge:
- 
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/630430583926de1f7ec62c6b/m1bTn0vkiPKZ3uECC4b0L.png)
- 
- ### Limitations
- 
- - May still hallucinate on culture-specific content
- - Primary focus on Vietnamese language understanding
- - May not perform optimally in specialized technical domains
- 
- ## Training Process
- 
- Our training pipeline consisted of several stages:
- 
- 1. **Base Model Selection:** Started with Qwen2.5-3B
- 2. **Hard Question Evolution:** Generated 20K challenging questions using EvolKit
- 3. **Initial Training:** Created VyLinh-SFT through supervised fine-tuning
- 4. **Model Merging:** Proprietary merging technique with Qwen2.5-3B-Instruct
- 5. **DPO Training:** 6 epochs of iterative DPO using ORPO-Mix-40K
- 6. **Final Merge:** Combined with Qwen2.5-3B-Instruct for optimal performance
- 
- ## Usage Examples
- 
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- 
- # Load the model and tokenizer
- model = AutoModelForCausalLM.from_pretrained("arcee-ai/Arcee-VyLinh")
- tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Arcee-VyLinh")
- 
- prompt = "Một cộng một bằng mấy?"  # "What does one plus one equal?"
- messages = [
-     {"role": "system", "content": "Bạn là trợ lí hữu ích."},  # "You are a helpful assistant."
-     {"role": "user", "content": prompt}
- ]
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)  # move inputs to the model's device
- 
- generated_ids = model.generate(
-     model_inputs.input_ids,
-     max_new_tokens=1024,
-     eos_token_id=tokenizer.eos_token_id,
-     do_sample=True, temperature=0.25,  # sampling must be enabled for temperature to take effect
- )
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
- 
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
- print(response)
- ```
+ ---
+ base_model:
+ - qnguyen3/VyLinh-3B
+ - Qwen/Qwen2.5-3B-Instruct
+ library_name: transformers
+ tags:
+ - mergekit
+ - merge
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ ---
+ 
+ [![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)
+ 
+ 
+ # QuantFactory/Arcee-VyLinh-GGUF
+ This is a quantized version of [arcee-ai/Arcee-VyLinh](https://huggingface.co/arcee-ai/Arcee-VyLinh) created using llama.cpp.
+ 
+ # Original Model Card
+ 
+ **Quantized Version**: [arcee-ai/Arcee-VyLinh-GGUF](https://huggingface.co/arcee-ai/Arcee-VyLinh-GGUF)
+ 
+ # Arcee-VyLinh
+ 
+ Arcee-VyLinh is a 3B-parameter instruction-following model specifically optimized for Vietnamese language understanding and generation. Built through a multi-stage training process combining evolved hard questions and iterative Direct Preference Optimization (DPO), it achieves strong performance despite its compact size.
+ 
+ ## Model Details
+ 
+ - **Architecture:** Based on Qwen2.5-3B
+ - **Parameters:** 3 billion
+ - **Context Length:** 32K tokens
+ - **Training Data:** Custom evolved dataset + ORPO-Mix-40K (Vietnamese)
+ - **Training Method:** Multi-stage process including EvolKit, proprietary merging, and iterative DPO
+ - **Input Format:** Supports both English and Vietnamese; optimized for Vietnamese
+ 
+ ## Intended Use
+ 
+ - Vietnamese-language chat and instruction following
+ - Text generation and completion
+ - Question answering
+ - General language understanding tasks
+ - Content creation and summarization
+ 
+ ## Performance and Limitations
+ 
+ ### Strengths
+ 
+ - Exceptional performance on complex Vietnamese-language tasks
+ - Efficient 3B-parameter architecture
+ - Strong instruction-following capabilities
+ - Competitive with larger models (4B-8B parameters)
+ 
+ ### Benchmarks
+ 
+ Tested on the Vietnamese subset of m-ArenaHard (CohereForAI), with Claude 3.5 Sonnet as judge:
+ 
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/630430583926de1f7ec62c6b/m1bTn0vkiPKZ3uECC4b0L.png)
+ 
+ ### Limitations
+ 
+ - May still hallucinate on culture-specific content
+ - Primary focus on Vietnamese language understanding
+ - May not perform optimally in specialized technical domains
+ 
+ ## Training Process
+ 
+ Our training pipeline consisted of several stages:
+ 
+ 1. **Base Model Selection:** Started with Qwen2.5-3B
+ 2. **Hard Question Evolution:** Generated 20K challenging questions using EvolKit
+ 3. **Initial Training:** Created VyLinh-SFT through supervised fine-tuning
+ 4. **Model Merging:** Proprietary merging technique with Qwen2.5-3B-Instruct
+ 5. **DPO Training:** 6 epochs of iterative DPO using ORPO-Mix-40K
+ 6. **Final Merge:** Combined with Qwen2.5-3B-Instruct for optimal performance
+ 
+ ## Usage Examples
+ 
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ # Load the model and tokenizer
+ model = AutoModelForCausalLM.from_pretrained("arcee-ai/Arcee-VyLinh")
+ tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Arcee-VyLinh")
+ 
+ prompt = "Một cộng một bằng mấy?"  # "What does one plus one equal?"
+ messages = [
+     {"role": "system", "content": "Bạn là trợ lí hữu ích."},  # "You are a helpful assistant."
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)  # move inputs to the model's device
+ 
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=1024,
+     eos_token_id=tokenizer.eos_token_id,
+     do_sample=True, temperature=0.25,  # sampling must be enabled for temperature to take effect
+ )
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+ 
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
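
The substance of this PR is the `language` list in the README's YAML front matter, which the Hub reads when indexing the model. As a rough illustration of what the change affects, here is a minimal sketch of pulling that list back out of a README; the helper name `front_matter_languages` is hypothetical, and the hand-rolled parser handles only the flat `key:` / `- item` layout shown in this diff, not general YAML:

```python
def front_matter_languages(readme_text):
    """Extract the `language:` list from a README's YAML front matter.

    Minimal parser for the flat `key:` / `- item` layout used above;
    not a general YAML parser.
    """
    lines = readme_text.splitlines()
    # Front matter must start with a `---` line at the top of the file
    if not lines or lines[0].strip() != "---":
        return []
    langs, in_language_block = [], False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":          # closing delimiter ends the front matter
            break
        if stripped == "language:":
            in_language_block = True
        elif in_language_block and stripped.startswith("- "):
            langs.append(stripped[2:]) # collect one language code per list item
        elif in_language_block and stripped:
            in_language_block = False  # a new key ends the language list
    return langs

# Excerpt of the new front matter from this PR (language list shortened)
new_front_matter = """---
base_model:
- qnguyen3/VyLinh-3B
- Qwen/Qwen2.5-3B-Instruct
library_name: transformers
tags:
- mergekit
- merge
language:
- zho
- eng
- vie
---
"""
print(front_matter_languages(new_front_matter))  # → ['zho', 'eng', 'vie']
```

A real pipeline would parse the front matter with a YAML library instead; the sketch only shows which block of the README the PR is touching.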