---
language: zh
tags:
- glycebert
inference: False
---
# ChineseBert_pytorch

Code: https://github.com/JunnYu/ChineseBert_pytorch

This project mainly customizes the `ChineseBertTokenizerFast` class in `tokenization_chinesebert_fast.py`, so that the tokenizer can be loaded directly from huggingface.co:
```python
from chinesebert import ChineseBertTokenizerFast

pretrained_tokenizer_name = "junnyu/ChineseBERT-base"
tokenizer = ChineseBertTokenizerFast.from_pretrained(pretrained_tokenizer_name)
```
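As a quick sanity check, you can inspect what the tokenizer returns. The sketch below is a minimal example; since ChineseBERT feeds pinyin features to the model alongside the usual input ids, an extra pinyin field is expected in the encoding, but the exact key name (`pinyin_ids`) is an assumption based on the upstream repository:

```python
from chinesebert import ChineseBertTokenizerFast

tokenizer = ChineseBertTokenizerFast.from_pretrained("junnyu/ChineseBERT-base")

# Encode a short sentence and look at the fields the tokenizer returns.
# Besides the usual input_ids/attention_mask, ChineseBERT also needs pinyin
# features; the "pinyin_ids" key is an assumption based on the upstream repo.
enc = tokenizer("北京是中国的首都。", return_tensors="pt")
print(list(enc.keys()))

# Round-trip the ids back to tokens to see where [CLS]/[SEP] land.
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
```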
# Paper
**[ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/pdf/2106.16038.pdf)**
*Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu and Jiwei Li*
# Install
```bash
pip install chinesebert
# or install from source
pip install git+https://github.com/JunnYu/ChineseBert_pytorch.git
```
# Usage
```python
import torch
from chinesebert import ChineseBertForMaskedLM, ChineseBertTokenizerFast

pretrained_model_name = "junnyu/ChineseBERT-base"
tokenizer = ChineseBertTokenizerFast.from_pretrained(pretrained_model_name)
chinese_bert = ChineseBertForMaskedLM.from_pretrained(pretrained_model_name)

text = "北京是[MASK]国的首都。"
inputs = tokenizer(text, return_tensors="pt")
print(inputs)

# [CLS] is prepended at position 0, so the [MASK] token sits at index 4
# ([CLS] 北 京 是 [MASK] ...).
maskpos = 4

with torch.no_grad():
    o = chinese_bert(**inputs)

# Top-10 predictions for the masked position.
value, index = o.logits.softmax(-1)[0, maskpos].topk(10)
pred_tokens = tokenizer.convert_ids_to_tokens(index.tolist())
pred_values = value.tolist()

outputs = []
for t, p in zip(pred_tokens, pred_values):
    outputs.append(f"{t}|{round(p, 4)}")
print(outputs)
# base  ['中|0.711', '我|0.2488', '祖|0.016', '法|0.0057', '美|0.0048', '全|0.0042', '韩|0.0015', '英|0.0011', '两|0.0008', '王|0.0006']
# large ['中|0.8341', '我|0.1479', '祖|0.0157', '全|0.0007', '国|0.0005', '帝|0.0001', '该|0.0001', '法|0.0001', '一|0.0001', '咱|0.0001']
```
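Hardcoding `maskpos` is brittle if the prompt changes. A small helper can locate the `[MASK]` position from the encoding instead. This is a minimal sketch built only on the calls shown above; `fill_mask` is a hypothetical helper name, and it assumes the tokenizer exposes the standard Hugging Face `mask_token_id` attribute:

```python
import torch
from chinesebert import ChineseBertForMaskedLM, ChineseBertTokenizerFast

def fill_mask(text, tokenizer, model, topk=10):
    """Return the top-k (token, probability) pairs for the first [MASK] in text.

    Hypothetical helper; assumes the standard Hugging Face mask_token_id attribute.
    """
    inputs = tokenizer(text, return_tensors="pt")
    # Locate the first [MASK] instead of hardcoding its position.
    maskpos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits
    probs, ids = logits.softmax(-1)[0, maskpos].topk(topk)
    tokens = tokenizer.convert_ids_to_tokens(ids.tolist())
    return list(zip(tokens, probs.tolist()))

tokenizer = ChineseBertTokenizerFast.from_pretrained("junnyu/ChineseBERT-base")
model = ChineseBertForMaskedLM.from_pretrained("junnyu/ChineseBERT-base")
print(fill_mask("北京是[MASK]国的首都。", tokenizer, model))
```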
# Reference
https://github.com/ShannonAI/ChineseBert