Update README.md
README.md
[Github](https://github.com/IgarashiAkatuki/zh-CN-Multi-Mask-Bert)

# zh-CN-Multi-Mask-Bert (CNMBert)



---

### CNMBert

| Model | Weights | Memory Usage (FP16) | Model Size | QPS | MRR | Acc |
| --------------- | ----------------------------------------------------------- | ------------------- | ---------- | ----- | ----- | ----- |
| CNMBert-Default | [Huggingface](https://huggingface.co/Midsummra/CNMBert) | 0.4GB | 131M | 12.56 | 59.70 | 49.74 |
| CNMBert-MoE | [Huggingface](https://huggingface.co/Midsummra/CNMBert-MoE) | 0.8GB | 329M | 3.20 | 61.53 | 51.86 |

* All models were trained on the same 2 million lines of Wikipedia and Zhihu corpus.
* QPS is queries per second (performance is currently poor because `predict` has not been rewritten in C...).
* MRR is mean reciprocal rank (see the sketch after this list).
* Acc is accuracy.
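
As a quick illustration of how MRR is computed (a minimal sketch, not the repository's evaluation code; the candidate lists below are hypothetical):

```python
# Mean reciprocal rank: for each query, take 1/rank of the first correct
# candidate (0 if it never appears), then average over all queries.
def mean_reciprocal_rank(runs):
    total = 0.0
    for candidates, gold in runs:
        rank = next((i + 1 for i, c in enumerate(candidates) if c == gold), None)
        total += 1.0 / rank if rank is not None else 0.0
    return total / len(runs)

# Two hypothetical queries: the gold answer is ranked 1st, then 2nd.
print(mean_reciprocal_rank([(["块钱", "块前"], "块钱"),
                            (["块前", "块钱"], "块钱")]))  # (1.0 + 0.5) / 2 = 0.75
```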
```python
from transformers import AutoTokenizer, BertConfig

from CustomBertModel import predict
from MoELayer import BertWwmMoE
```
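
The README then loads the MoE variant with `BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config)` (visible in the diff's hunk context). A minimal sketch of that step; the tokenizer and config lines and the `'cuda'` device are assumptions:

```python
# Sketch: load the MoE variant. The from_pretrained call mirrors the diff's
# hunk context; tokenizer/config loading and the target device are assumptions.
tokenizer = AutoTokenizer.from_pretrained('Midsummra/CNMBert-MoE')
config = BertConfig.from_pretrained('Midsummra/CNMBert-MoE')
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).to('cuda')
```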
Predict words:

```python
print(predict("我有两千kq", "kq", model, tokenizer)[:5])
print(predict("快去给魔理沙看b吧", "b", model, tokenizer)[:5])
```

> ['块钱', 1.2056937473156175], ['块前', 0.05837443749364857], ['开千', 0.0483869208528063], ['可千', 0.03996622172280445], ['口气', 0.037183335575008414]

> ['病', 1.6893256306648254], ['吧', 0.1642467901110649], ['呗', 0.026976384222507477], ['包', 0.021441461518406868], ['报', 0.01396679226309061]

---

```python
# The default predict function uses beam search
def predict(sentence: str,
            predict_word: str,
            model,
            tokenizer,
            top_k=8,
            beam_size=16,  # beam width
            threshold=0.005,  # pruning threshold
            fast_mode=True,  # whether to use fast mode
            strict_mode=True):  # whether to check the output

# Brute-force search using backtracking, without pruning
def backtrack_predict(sentence: str,
                      predict_word: str,
                      model,
                      tokenizer,
                      top_k=10,
                      fast_mode=True,
                      strict_mode=True):
```

> Due to BERT's autoencoding nature, the order in which [MASK] tokens are predicted changes the prediction result. If `fast_mode` is enabled, the input is predicted both forward and backward, which improves accuracy slightly (around 2%) but incurs a larger performance overhead.

> `strict_mode` checks the predicted word to determine whether it is a real, existing Chinese word.

### How to fine-tune the model

See [TrainExample.ipynb](https://github.com/IgarashiAkatuki/CNMBert/blob/main/TrainExample.ipynb). As for the dataset format, just make sure the first column of the CSV contains the training corpus, as in the sketch below.
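
A minimal sketch of preparing such a CSV, assuming pandas; the file name, column label, and example sentences are placeholders:

```python
import pandas as pd

# Each row's first column is one line of training corpus.
corpus = ["我有两千块钱", "快去给魔理沙看病吧"]
pd.DataFrame({"text": corpus}).to_csv("train.csv", index=False)
```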

### Q&A

Q: The accuracy feels a bit low.

A: Try setting `fast_mode` and `strict_mode` to `False` (see the example below). The model was pretrained on a very small dataset (2M lines), so limited generalization is to be expected. You can fine-tune it on a larger dataset or on a more specialized domain; the procedure differs little from [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm): just replace the `DataCollator` with `DataCollatorForMultiMask` from `CustomBertModel.py`.
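
For instance, re-running the earlier query with both checks disabled (a hypothetical invocation; the argument names come from the `predict` signature above):

```python
# Disable the forward+backward double pass and the real-word check.
print(predict("我有两千kq", "kq", model, tokenizer,
              fast_mode=False, strict_mode=False)[:5])
```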

### Citation

If you are interested in the implementation details of CNMBert, see:

```
@misc{feng2024cnmbertmodelhanyupinyin,
      title={CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task},
      author={Zishuo Feng and Feng Cao},
      year={2024},
      eprint={2411.11770},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.11770},
}
```