
	# ChineseBERT-base

	This repository contains the code, model, and dataset for ChineseBERT, presented at ACL 2021.

	paper:
	[ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information](https://arxiv.org/abs/2106.16038)
	Zijun Sun, Xiaoya Li, Xiaofei Sun, Yuxian Meng, Xiang Ao, Qing He, Fei Wu and Jiwei Li

	code:
	[ChineseBERT GitHub repository](https://github.com/ShannonAI/ChineseBert)

	## Model description
	We propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese
	characters into language model pretraining.

	First, for each Chinese character, we obtain three kinds of embeddings:
	- Char Embedding: the same as the original BERT token embedding.
	- Glyph Embedding: captures visual features from different fonts of a Chinese character.
	- Pinyin Embedding: captures phonetic features from the pinyin sequence of a Chinese character.

	Then, the char embedding, glyph embedding, and pinyin embedding
	are concatenated and mapped to a D-dimensional embedding through a fully
	connected layer to form the fusion embedding.
	Finally, the fusion embedding is added to the position embedding, and the sum is fed as input to the BERT model.
	The following image shows an overview of the ChineseBERT architecture.

	![MODEL](https://raw.githubusercontent.com/ShannonAI/ChineseBert/main/images/ChineseBERT.png)
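
The fusion step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code from the ChineseBERT repository: the sizes `seq_len` and `d`, the random stand-in embeddings, the helper name `fuse`, and the assumption that each of the three embeddings is itself D-dimensional are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
seq_len, d = 4, 8

# Random stand-ins for the three per-character embeddings and the position
# embedding. In the real model these are learned; here they only fix shapes.
# Assumption: each embedding is D-dimensional before fusion.
char_emb = rng.normal(size=(seq_len, d))
glyph_emb = rng.normal(size=(seq_len, d))
pinyin_emb = rng.normal(size=(seq_len, d))
position_emb = rng.normal(size=(seq_len, d))

# Fully connected layer that maps the 3*D concatenation back to D dimensions.
W = rng.normal(size=(3 * d, d)) * 0.02
b = np.zeros(d)


def fuse(char, glyph, pinyin, weight, bias):
    """Concatenate the three embeddings and project them to D dimensions."""
    concat = np.concatenate([char, glyph, pinyin], axis=-1)  # (seq_len, 3*D)
    return concat @ weight + bias                            # (seq_len, D)


fusion_emb = fuse(char_emb, glyph_emb, pinyin_emb, W, b)

# The fusion embedding is added to the position embedding; the sum would be
# the input to the BERT encoder.
bert_input = fusion_emb + position_emb
print(bert_input.shape)
```

The point of the sketch is the shape flow: three per-character vectors become one 3D-wide vector, the linear layer brings it back to width D, and only then is the position embedding added.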

	ChineseBERT leverages the glyph and pinyin information of Chinese
	characters to enhance the model's ability to capture
	contextual semantics from surface character forms and
	to disambiguate polyphonic characters in Chinese.