Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +192 -181
README.md CHANGED
@@ -1,181 +1,192 @@
-
----
-
-library_name: transformers
-base_model: Qwen/Qwen2.5-7B-Instruct
-license: apache-2.0
-datasets:
-- shibing624/chinese_text_correction
-language:
-- zh
-metrics:
-- f1
-tags:
-- text-generation-inference
-widget:
-- text: "文本纠错:\n少先队员因该为老人让坐。"
-
----
+---
+library_name: transformers
+base_model: Qwen/Qwen2.5-7B-Instruct
+license: apache-2.0
+datasets:
+- shibing624/chinese_text_correction
+language:
+- zho
+- eng
+- fra
+- spa
+- por
+- deu
+- ita
+- rus
+- jpn
+- kor
+- vie
+- tha
+- ara
+metrics:
+- f1
+tags:
+- text-generation-inference
+widget:
+- text: '文本纠错:
+
+    少先队员因该为老人让坐。'
+---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/chinese-text-correction-7b-GGUF
This is a quantized version of [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) created with llama.cpp.
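As a minimal sketch, the GGUF files in this repo can be run with llama.cpp's CLI. The quant filename below is an assumption — substitute a `.gguf` file actually present in this repo — and flags and chat-template handling may vary by llama.cpp version:

```shell
# Hypothetical quant filename -- replace with a .gguf file from this repo
llama-cli -m chinese-text-correction-7b.Q4_K_M.gguf \
  --temp 0 -n 128 \
  -p "文本纠错:\n少先队员因该为老人让坐。"
```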

# Original Model Card

# Chinese Text Correction Model
Chinese text correction model `chinese-text-correction-7b`, for spelling and grammar correction.

Evaluation of `shibing624/chinese-text-correction-7b` on test data:

The overall CSC performance on the **test** set:

|input_text|predict_text|
|:--- |:--- |
|文本纠错:\n少先队员因该为老人让坐。|少先队员应该为老人让座。|
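To see exactly which characters a correction changed, an input/prediction pair like the one above can be compared character by character; a small sketch using Python's `difflib`:

```python
import difflib

def char_edits(src: str, tgt: str):
    """Return (position, original, corrected) tuples for edited spans."""
    sm = difflib.SequenceMatcher(None, src, tgt)
    return [(i1, src[i1:i2], tgt[j1:j2])
            for op, i1, i2, j1, j2 in sm.get_opcodes()
            if op != "equal"]

print(char_edits("少先队员因该为老人让坐。", "少先队员应该为老人让座。"))
# -> [(4, '因', '应'), (10, '坐', '座')]
```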

# Models

| Name | Base Model | Download |
|-----------------|-------------------|-----------------------------------------------------------------------|
| chinese-text-correction-1.5b | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b) |
| chinese-text-correction-1.5b-lora | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora) |
| chinese-text-correction-7b | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b) |
| chinese-text-correction-7b-lora | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b-lora) |

### Evaluation Results
- Metric: F1
- CSC (Chinese Spelling Correction): spelling-correction models that handle length-aligned errors such as phonetic confusions, visual confusions, and grammar mistakes
- CTC (Chinese Text Correction): text-correction models that, in addition to length-aligned spelling and grammar errors, handle length-changing errors such as extra or missing characters
- GPU: Tesla V100, 32 GB VRAM

| Model Name | Model Link | Base Model | Avg | SIGHAN-2015 | EC-LAW | MCSC | GPU/CPU | QPS |
|:-----------------|:------------------------------------------------------------------------------------------------------------------------|:---------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
| Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | 0.3409 | 0.3147 | 0.3763 | 0.3317 | CPU | 9 |
| Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | 0.3984 | 0.7758 | 0.3156 | 0.1039 | GPU | 214 |
| ERNIE-CSC | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353 | 0.8383 | 0.3357 | 0.1318 | GPU | 114 |
| MacBERT-CSC | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | 0.3993 | 0.8314 | 0.1610 | 0.2055 | GPU | **224** |
| ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | 0.4538 | 0.6572 | 0.4369 | 0.2672 | GPU | 3 |
| Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) | Qwen/Qwen2.5-1.5B-Instruct | 0.6802 | 0.3032 | 0.7846 | 0.9529 | GPU | 6 |
| Qwen2.5-7B-CTC | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) | Qwen/Qwen2.5-7B-Instruct | **0.8225** | 0.4917 | 0.9798 | 0.9959 | GPU | 3 |
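For intuition about the F1 scores above, sentence-level correction F1 can be sketched as below. This is a simplified scheme for illustration only; the benchmarks' official scorers may differ in details (e.g. character-level credit, detection vs. correction scoring):

```python
def correction_f1(sources, predictions, references):
    """Simplified sentence-level correction F1: a true positive is a changed
    sentence that exactly matches the reference; any other change is a false
    positive; an uncorrected erroneous sentence is a false negative."""
    tp = fp = fn = 0
    for src, pred, ref in zip(sources, predictions, references):
        if pred != src:          # the model proposed a correction
            if pred == ref:
                tp += 1
            else:
                fp += 1
        elif ref != src:         # an error existed but the input was kept
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```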

## Usage (pycorrector)

This model is released as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports fine-tuning large models for text correction. Call it as follows:

Install the package:
```shell
pip install -U pycorrector
```

```python
from pycorrector.gpt.gpt_corrector import GptCorrector

if __name__ == '__main__':
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]
    m = GptCorrector("shibing624/chinese-text-correction-7b")

    batch_res = m.correct_batch(error_sentences)
    for i in batch_res:
        print(i)
        print()
```

## Usage (HuggingFace Transformers)
Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model directly: pass your input through the model, then decode the generated sentence.

Install the package:
```shell
pip install transformers
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "shibing624/chinese-text-correction-7b"
device = "cuda"  # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

input_content = "文本纠错:\n少先队员因该为老人让坐。"

messages = [{"role": "user", "content": input_content}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)

print(input_text)

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
# temperature is omitted: it has no effect when do_sample=False (greedy decoding)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

# Note: this decodes the prompt together with the correction; slice
# outputs[0][inputs.shape[-1]:] to keep only the newly generated text.
print(tokenizer.decode(outputs[0]))
```

output:
```shell
少先队员应该为老人让座。
```

Model files:
```
shibing624/chinese-text-correction-7b
|-- added_tokens.json
|-- config.json
|-- generation_config.json
|-- merges.txt
|-- model.safetensors
|-- model.safetensors.index.json
|-- README.md
|-- special_tokens_map.json
|-- tokenizer_config.json
|-- tokenizer.json
`-- vocab.json
```

#### Training Parameters

- num_epochs: 8
- batch_size: 2
- steps: 36000
- eval_loss: 0.12
- base model: Qwen/Qwen2.5-7B-Instruct
- train data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
- train time: 10 days
- eval_loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/eval_loss_7b.png)
- train_loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/train_loss_7b.png)

### Training Dataset
#### Chinese Correction Dataset

- Data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)


To train your own Qwen-based correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector) or [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT)
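As a rough illustration of what such fine-tuning consumes, supervised pairs can be packed as chat-style records. The record shape and field names below are assumptions for illustration — check the dataset card and the training frameworks above for the actual format:

```python
def build_example(src: str, tgt: str) -> dict:
    """Build one hypothetical SFT record: the prompt mirrors the model's
    '文本纠错:' instruction format, the response is the corrected text."""
    return {
        "conversations": [
            {"role": "user", "content": f"文本纠错:\n{src}"},
            {"role": "assistant", "content": tgt},
        ]
    }

ex = build_example("少先队员因该为老人让坐。", "少先队员应该为老人让座。")
```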

## Citation

```latex
@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Implementation of language model finetune},
  year = {2024},
  url = {https://github.com/shibing624/pycorrector},
}
```