---
license: apache-2.0
language:
- zh
---

# modernbert-base-chinese

A ModernBERT model pretrained on a corpus of Simplified Chinese, Traditional Chinese, and Cantonese.

**Note:** This model is undertrained due to budget constraints.

The tokenizer is a character-based `BertTokenizer`, where each Chinese character is a separate token. This design decision facilitates sequence tagging tasks. The tokenizer also handles mixed Chinese and English text.

## How to use

You can use this model directly with a pipeline for masked language modeling. Since the tokenizer is character-based, you should mask only single characters.

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="ming030890/modernbert-base-chinese",
    tokenizer="ming030890/modernbert-base-chinese",
)

# Mandarin (Simplified Chinese): "The weather is really nice today."
result = fill_mask("今天天[MASK]真好。")
print(result)

# Traditional Chinese: "This bowl of beef noodles is delicious."
result = fill_mask("這碗牛[MASK]麵好吃。")
print(result)

# Cantonese: "It's really mean of you to act like that."
result = fill_mask("你咁樣做真係[MASK]衰。")
print(result)

# Mixed Chinese and English (code switching): "I just bought a new laptop."
result = fill_mask("我啱啱買咗[MASK]新laptop。")
print(result)
```

You can also load the model and tokenizer directly:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("ming030890/modernbert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("ming030890/modernbert-base-chinese")
```
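
To illustrate what character-based tokenization means for mixed-script input, here is a minimal sketch. This is **not** the actual `BertTokenizer` vocabulary logic, only an approximation of its observable behavior: each CJK character becomes its own token, while runs of Latin letters or digits stay whole.

```python
import re

def char_tokenize(text):
    """Approximate character-level tokenization for mixed Chinese/English text.

    Each CJK character is emitted as a separate token; runs of Latin
    letters/digits are kept whole; any other non-space character (e.g.
    punctuation) is its own token. Illustrative only, not the real tokenizer.
    """
    pattern = re.compile(
        r"[\u4e00-\u9fff\u3400-\u4dbf]"  # one CJK ideograph at a time
        r"|[A-Za-z0-9]+"                 # a whole Latin/digit run
        r"|[^\s]"                        # any other single non-space char
    )
    return pattern.findall(text)

print(char_tokenize("我買咗新laptop。"))
# ['我', '買', '咗', '新', 'laptop', '。']
```

Because `laptop` stays a single token while each Chinese character is separate, masking a single `[MASK]` position always corresponds to exactly one Chinese character.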