mrbingzhao committed on
Commit
7909ae7
·
1 Parent(s): 3dd9513

Upload 7 files

Files changed (7)
  1. README.md +172 -0
  2. added_tokens.json +1 -0
  3. arch1.png +0 -0
  4. config.json +26 -0
  5. special_tokens_map.json +1 -0
  6. tokenizer_config.json +1 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,172 @@
---
language:
- zh
tags:
- bert
- pytorch
- zh
license: "apache-2.0"
---

# MacBERT for Chinese Spelling Correction (macbert4csc) Model

A Chinese spelling correction model.

Evaluation of `macbert4csc-base-chinese` on the SIGHAN2015 test set:

- Char level: precision 0.9372, recall 0.8640, F1 0.8991
- Sentence level: precision 0.8264, recall 0.7366, F1 0.7789

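Each F1 above is the harmonic mean of the corresponding precision and recall, which is easy to check (a quick sanity check, not part of the model code):

```python
def f1_score(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# char-level and sentence-level scores from the list above
print(round(f1_score(0.9372, 0.8640), 4))  # 0.8991
print(round(f1_score(0.8264, 0.7366), 4))  # 0.7789
```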
Because the training data includes the SIGHAN2015 training set (to reproduce the paper), the model reaches SOTA on the SIGHAN2015 test set.

Model architecture, adapted from SoftMasked-BERT:

![arch](arch1.png)

## Usage

This model is released as part of the Chinese text correction project [pycorrector](https://github.com/shibing624/pycorrector), which supports the macbert4csc model and can be called as follows:

```python
from pycorrector.macbert.macbert_corrector import MacBertCorrector

# load the model once and reuse the bound correction function
nlp = MacBertCorrector("shibing624/macbert4csc-base-chinese").macbert_correct

corrected = nlp('今天新情很好')
print(corrected)
```

Of course, you can also call the model through the official huggingface/transformers API:

*Please use the 'Bert'-related classes to load this model!*

```python
import operator

import torch
from transformers import BertTokenizer, BertForMaskedLM

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = BertTokenizer.from_pretrained("shibing624/macbert4csc-base-chinese")
model = BertForMaskedLM.from_pretrained("shibing624/macbert4csc-base-chinese")
model.to(device)

texts = ["今天新情很好", "你找到你最喜欢的工作,我也很高心。"]
with torch.no_grad():
    outputs = model(**tokenizer(texts, padding=True, return_tensors='pt').to(device))


def get_errors(corrected_text, origin_text):
    """Align the corrected text with the original and collect (wrong, right, start, end) tuples."""
    sub_details = []
    for i, ori_char in enumerate(origin_text):
        if ori_char in [' ', '“', '”', '‘', '’', '琊', '\n', '…', '—', '擤']:
            # characters the tokenizer drops or maps to [UNK]: re-insert them unchanged
            corrected_text = corrected_text[:i] + ori_char + corrected_text[i:]
            continue
        if i >= len(corrected_text):
            continue
        if ori_char != corrected_text[i]:
            if ori_char.lower() == corrected_text[i]:
                # keep the original casing of English letters
                corrected_text = corrected_text[:i] + ori_char + corrected_text[i + 1:]
                continue
            sub_details.append((ori_char, corrected_text[i], i, i + 1))
    sub_details = sorted(sub_details, key=operator.itemgetter(2))
    return corrected_text, sub_details


result = []
for ids, text in zip(outputs.logits, texts):
    _text = tokenizer.decode(torch.argmax(ids, dim=-1), skip_special_tokens=True).replace(' ', '')
    corrected_text = _text[:len(text)]
    corrected_text, details = get_errors(corrected_text, text)
    print(text, ' => ', corrected_text, details)
    result.append((corrected_text, details))
print(result)
```

output:
```shell
今天新情很好 => 今天心情很好 [('新', '心', 2, 3)]
你找到你最喜欢的工作,我也很高心。 => 你找到你最喜欢的工作,我也很高兴。 [('心', '兴', 15, 16)]
```

Model files:
```
macbert4csc-base-chinese
├── config.json
├── added_tokens.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```

### Training data
#### SIGHAN+Wang271K Chinese correction dataset

| Dataset | Corpus | Download | Archive size |
| :------- | :--------- | :---------: | :---------: |
| **`SIGHAN+Wang271K Chinese correction dataset`** | SIGHAN+Wang271K (270k sentences) | [Baidu Netdisk (password: 01b9)](https://pan.baidu.com/s/1BV5tr9eONZCI0wERFvr0gQ) | 106M |
| **`Original SIGHAN dataset`** | SIGHAN 13/14/15 | [official csc.html](http://nlp.ee.ncu.edu.tw/resource/csc.html) | 339K |
| **`Original Wang271K dataset`** | Wang271K | [Automatic-Corpus-Generation, provided by dimmywang](https://github.com/wdimmy/Automatic-Corpus-Generation/blob/master/corpus/train.sgml) | 93M |

Data format of the SIGHAN+Wang271K Chinese correction dataset:
```json
[
    {
        "id": "B2-4029-3",
        "original_text": "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。",
        "wrong_ids": [
            5,
            31
        ],
        "correct_text": "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
    }
]
```
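The `wrong_ids` field stores the 0-based character positions where `original_text` and `correct_text` differ, which can be verified with a short check (a sketch, assuming equal-length sentence pairs as in this dataset):

```python
def diff_positions(original_text, correct_text):
    # 0-based indices where the two equal-length sentences disagree
    return [i for i, (a, b) in enumerate(zip(original_text, correct_text)) if a != b]

original = "晚间会听到嗓音,白天的时候大家都不会太在意,但是在睡觉的时候这嗓音成为大家的恶梦。"
correct = "晚间会听到噪音,白天的时候大家都不会太在意,但是在睡觉的时候这噪音成为大家的恶梦。"
print(diff_positions(original, correct))  # [5, 31]
```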

The trained model directory layout:
```shell
macbert4csc
├── config.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```

To train macbert4csc yourself, see [https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert](https://github.com/shibing624/pycorrector/tree/master/pycorrector/macbert)

### About MacBERT

**MacBERT** is an improved BERT with a novel **M**LM **a**s **c**orrection pre-training task, which mitigates the discrepancy between pre-training and fine-tuning.

Here is an example of the pre-training task:

| Task | Example |
| -------------- | ----------------- |
| **Original sentence** | we use a language model to predict the probability of the next word. |
| **MLM** | we use a language [M] to [M] ##di ##ct the pro [M] ##bility of the next word . |
| **Whole word masking** | we use a language [M] to [M] [M] [M] the [M] [M] [M] of the next word . |
| **N-gram masking** | we use a [M] [M] to [M] [M] [M] the [M] [M] [M] [M] [M] next word . |
| **MLM as correction** | we use a text system to ca ##lc ##ulate the po ##si ##bility of the next word . |

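The masking variants in the table differ mainly in which span of tokens gets replaced. A toy illustration over whitespace-split words (a sketch only; the real implementation works on WordPiece subwords, and MLM as correction substitutes similar words rather than [M]):

```python
def mask_ngram(tokens, start, n):
    # replace an n-gram of whole words starting at `start` with [M] markers;
    # n=1 over each word of a multi-subword word corresponds to whole word masking
    return [
        "[M]" if start <= i < start + n else tok
        for i, tok in enumerate(tokens)
    ]

tokens = "we use a language model to predict the probability of the next word .".split()
print(" ".join(mask_ngram(tokens, 3, 2)))  # masks "language model"
```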
In addition to the new pre-training task, MacBERT also incorporates the following techniques:

- Whole Word Masking (WWM)
- N-gram masking
- Sentence-Order Prediction (SOP)

**Note that MacBERT can directly replace the original BERT, as there are no differences in the main neural architecture.**

For more technical details, please check the paper: [Revisiting Pre-trained Models for Chinese Natural Language Processing](https://arxiv.org/abs/2004.13922)

## Citation

```latex
@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Text Error Correction Tool},
  year = {2021},
  url = {https://github.com/shibing624/pycorrector},
}
```
added_tokens.json ADDED
@@ -0,0 +1 @@
{}
arch1.png ADDED
config.json ADDED
@@ -0,0 +1,26 @@
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128
}
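The shapes in this config follow the standard BERT-base constraint that `hidden_size` divides evenly across the attention heads; a quick stdlib-only sanity check (illustrative, not part of the model code):

```python
import json

# the relevant fields from config.json above
config = json.loads("""
{"hidden_size": 768, "num_attention_heads": 12,
 "num_hidden_layers": 12, "vocab_size": 21128}
""")

# BERT splits the hidden vector evenly across attention heads
head_dim, rem = divmod(config["hidden_size"], config["num_attention_heads"])
assert rem == 0
print(head_dim)  # 64, the per-head dimension
```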
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "name_or_path": "shibing624/macbert4csc-base-chinese", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render.