lbourdois committed
Commit fd91d95 · verified · 1 Parent(s): 8297720

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +99 -87
README.md CHANGED
@@ -1,88 +1,100 @@
- ---
- license: apache-2.0
- datasets:
- - TigerResearch/pretrain_zh
- language:
- - zh
- base_model:
- - Qwen/Qwen2.5-3B
- tags:
- - qwen2.5
- - text-generation-inference
- - Text Generation
- - Character
- ---
-
- **Qwen2.5-3B-Character**
-
- **Introduction:**
-
- **Qwen2.5-3B-Character** is a character-level version of the [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) model. It is developed from Qwen2.5-3B and is specifically designed for character-to-character transformation and generation tasks.
-
- **Core Contributions:**
-
- 1. **Modified Token Vocabulary:** The original model's token vocabulary has been revised to remove tokens representing phrases and multiple characters. This refinement sharpens the model's focus on individual-character processing.
-
- 2. **Continued Pre-training:** Based on the modified vocabulary, the model has undergone further pre-training to optimize its performance and adaptability for character-level tasks.
-
-
- **Training Dataset:**
-
- The model has been trained on the `TigerResearch/pretrain_zh` dataset, a comprehensive Chinese pre-training dataset provided by **TigerResearch**. For more information about the dataset, please visit: [TigerResearch/pretrain_zh](https://huggingface.co/datasets/TigerResearch/pretrain_zh).
-
-
- **Training Code:**
-
- The training process was carried out with **LLaMA-Factory**, an open-source project that provides tools and frameworks for training language models. The LLaMA-Factory codebase is available at: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
-
-
- **Results**
-
- To assess the efficacy of Qwen2.5-3B-Character, we evaluated its performance on three widely used benchmarks: C-Eval, CMMLU, and MMLU. The results are tabulated as follows:
-
- | Model | C-Eval | CMMLU | MMLU |
- | :--- | :---: | :---: | :---: |
- | Qwen2.5-3B | 74.37 | 74.94 | 65.87 |
- | Qwen2.5-3B-filter | 70.43 | 69.69 | 65.53 |
- | Qwen2.5-3B-Character | 71.97 | 71.94 | 65.18 |
-
- For a clearer comparison, the table also reports results for the original Qwen2.5-3B and the token-modified Qwen2.5-3B (Qwen2.5-3B-filter).
-
-
- **Quickstart**
-
- The latest version of `transformers` is recommended (at least 4.37.0). The following code snippet shows how to use the model for chat-style generation with `transformers`:
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
-
- model_name = 'Henry94/Qwen2.5-3B-Character'
-
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
-
- # Prompt: "Please give a brief introduction to large language models."
- prompt = "请简单介绍一下大型语言模型."
- messages = [
-     {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
-     {"role": "user", "content": prompt}
- ]
- text = tokenizer.apply_chat_template(
-     messages,
-     tokenize=False,
-     add_generation_prompt=True
- )
- model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
-
- generated_ids = model.generate(
-     **model_inputs,
-     max_new_tokens=512
- )
- # Strip the prompt tokens so only the newly generated text is decoded
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
-
- print(response)
  ```
 
+ ---
+ license: apache-2.0
+ datasets:
+ - TigerResearch/pretrain_zh
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-3B
+ tags:
+ - qwen2.5
+ - text-generation-inference
+ - Text Generation
+ - Character
+ ---
+
+ **Qwen2.5-3B-Character**
+
+ **Introduction:**
+
+ **Qwen2.5-3B-Character** is a character-level version of the [Qwen2.5-3B](https://huggingface.co/Qwen/Qwen2.5-3B) model. It is developed from Qwen2.5-3B and is specifically designed for character-to-character transformation and generation tasks.
+
+ **Core Contributions:**
+
+ 1. **Modified Token Vocabulary:** The original model's token vocabulary has been revised to remove tokens representing phrases and multiple characters. This refinement sharpens the model's focus on individual-character processing.
+
+ 2. **Continued Pre-training:** Based on the modified vocabulary, the model has undergone further pre-training to optimize its performance and adaptability for character-level tasks.
+
+
+ **Training Dataset:**
+
+ The model has been trained on the `TigerResearch/pretrain_zh` dataset, a comprehensive Chinese pre-training dataset provided by **TigerResearch**. For more information about the dataset, please visit: [TigerResearch/pretrain_zh](https://huggingface.co/datasets/TigerResearch/pretrain_zh).
+
+
+ **Training Code:**
+
+ The training process was carried out with **LLaMA-Factory**, an open-source project that provides tools and frameworks for training language models. The LLaMA-Factory codebase is available at: [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory).
+
+
+ **Results**
+
+ To assess the efficacy of Qwen2.5-3B-Character, we evaluated its performance on three widely used benchmarks: C-Eval, CMMLU, and MMLU. The results are tabulated as follows:
+
+ | Model | C-Eval | CMMLU | MMLU |
+ | :--- | :---: | :---: | :---: |
+ | Qwen2.5-3B | 74.37 | 74.94 | 65.87 |
+ | Qwen2.5-3B-filter | 70.43 | 69.69 | 65.53 |
+ | Qwen2.5-3B-Character | 71.97 | 71.94 | 65.18 |
+
+ For a clearer comparison, the table also reports results for the original Qwen2.5-3B and the token-modified Qwen2.5-3B (Qwen2.5-3B-filter).
+
+
+ **Quickstart**
+
+ The latest version of `transformers` is recommended (at least 4.37.0). The following code snippet shows how to use the model for chat-style generation with `transformers`:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+
+ model_name = 'Henry94/Qwen2.5-3B-Character'
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
+
+ # Prompt: "Please give a brief introduction to large language models."
+ prompt = "请简单介绍一下大型语言模型."
+ messages = [
+     {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=512
+ )
+ # Strip the prompt tokens so only the newly generated text is decoded
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+
+ print(response)
  ```
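
As an illustration of the "Modified Token Vocabulary" step described in the README above, the sketch below shows one way multi-character tokens could be identified and suppressed with a Qwen2.5 tokenizer. It is a minimal sketch under stated assumptions, not the authors' actual procedure: the `is_single_char` heuristic and the `bad_words_ids` suggestion are illustrative choices added here.

```python
from transformers import AutoTokenizer

# Starting point for the illustration: the original Qwen2.5-3B tokenizer.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")

def is_single_char(token_id: int) -> bool:
    """Rough heuristic: a token counts as 'single-character' if it decodes to
    at most one visible character (whitespace and partial UTF-8 bytes pass)."""
    return len(tokenizer.decode([token_id]).strip()) <= 1

special_ids = set(tokenizer.all_special_ids)
multi_char_ids = [
    i for i in range(len(tokenizer))
    if i not in special_ids and not is_single_char(i)
]
print(f"{len(multi_char_ids)} of {len(tokenizer)} tokens decode to multiple characters")

# A non-invasive way to 'remove' these tokens is to forbid them at generation time:
#   model.generate(**model_inputs, bad_words_ids=[[i] for i in multi_char_ids], ...)
# The released model instead revises the vocabulary itself and continues pre-training
# on TigerResearch/pretrain_zh, as described in the README.
```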
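The quickstart snippet imports `TextStreamer` but never uses it. If streaming output is wanted, the sketch below shows a typical `transformers` pattern; the model name and prompt are reused from the quickstart, while the streaming setup itself is an assumption added for illustration rather than part of the original card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name = 'Henry94/Qwen2.5-3B-Character'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "请简单介绍一下大型语言模型."},  # "Please briefly introduce large language models."
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# TextStreamer prints tokens to stdout as they are generated instead of waiting
# for the full sequence; skip_prompt avoids echoing the input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**model_inputs, max_new_tokens=512, streamer=streamer)
```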