---
license: apache-2.0
language:
- zh
---

# modernbert-base-chinese

A ModernBERT model pretrained on a corpus of Simplified Chinese, Traditional Chinese, and Cantonese.

**Note:** This model is undertrained due to budget constraints.

The tokenizer is a character-based `BertTokenizer`, where each Chinese character is a separate token. This design decision facilitates sequence tagging tasks. The tokenizer also handles mixed Chinese and English text.

## How to use

You can use this model directly with a pipeline for masked language modeling. Since the tokenizer is character-based, you should mask only single characters.

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="ming030890/modernbert-base-chinese",
    tokenizer="ming030890/modernbert-base-chinese",
)

# Mandarin (Simplified Chinese): "The weather is really nice today."
result = fill_mask("今天天[MASK]真好。")
print(result)

# Traditional Chinese: "This bowl of beef noodles is delicious."
result = fill_mask("這碗牛[MASK]麵好吃。")
print(result)

# Cantonese: "It's really mean of you to act like that."
result = fill_mask("你咁樣做真係[MASK]衰。")
print(result)

# Mixed Chinese and English (code switching): "I just bought a new laptop."
result = fill_mask("我啱啱買咗[MASK]新laptop。")
print(result)
```

You can also load the model and tokenizer directly:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ming030890/modernbert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("ming030890/modernbert-base-chinese")
```
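
To illustrate what character-based tokenization means for mixed-script input, here is a minimal sketch. This is **not** the actual `BertTokenizer` vocabulary logic, only an approximation of its observable behavior: each CJK character becomes its own token, while runs of Latin letters or digits stay whole.

```python
import re

def char_tokenize(text):
    """Approximate character-level tokenization for mixed Chinese/English text.

    Each CJK character is emitted as a separate token; runs of Latin
    letters/digits are kept whole; any other non-space character (e.g.
    punctuation) is its own token. Illustrative only, not the real tokenizer.
    """
    pattern = re.compile(
        r"[\u4e00-\u9fff\u3400-\u4dbf]"  # one CJK ideograph at a time
        r"|[A-Za-z0-9]+"                 # a whole Latin/digit run
        r"|[^\s]"                        # any other single non-space char
    )
    return pattern.findall(text)

print(char_tokenize("我買咗新laptop。"))
# ['我', '買', '咗', '新', 'laptop', '。']
```

Because `laptop` stays a single token while each Chinese character is separate, masking a single `[MASK]` position always corresponds to exactly one Chinese character.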