Model Card
A ModernBERT model pretrained from scratch on a Chinese novel corpus with the masked language modeling (MLM) objective.
Model Description
- Architecture: ModernBERT‑base
- Pretraining objective: Masked Language Modeling
- Language: Chinese
Data
- The dataset was built from Chivi's novel corpus, containing approximately 325M sentences.
- Preprocessing: normalization → tokenization with a custom BPE tokenizer → random masking of 15% of tokens.
- The BPE tokenizer, with a 25k-token vocabulary, was trained from scratch. This ensures the tokenizer is well suited to novel-style content, including names, informal phrases, and rare words.
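The 15% random masking step can be sketched as follows. This is a minimal illustration, not the card's actual pipeline: the helper name `mask_tokens` and the exact-count sampling strategy are assumptions, and a real MLM setup typically also applies BERT's 80/10/10 mask/random/keep replacement scheme.

```python
import random

def mask_tokens(token_ids, mask_id, ratio=0.15, seed=0):
    # Hypothetical helper: replace a fixed `ratio` of positions with the mask id.
    rng = random.Random(seed)
    n_mask = int(len(token_ids) * ratio)
    positions = set(rng.sample(range(len(token_ids)), n_mask))
    masked = [mask_id if i in positions else t for i, t in enumerate(token_ids)]
    return masked, positions

# Example: mask 15% of a 1000-token sequence.
masked, positions = mask_tokens(list(range(1000)), mask_id=-1)
print(len(positions))  # 150 positions masked
```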
Training Config
- Epochs: 3
- Optimizer: AdamW
- Learning rate: 1e‑4 with 20k warm-up steps
- Batch size: 128
- Max sequence length: 1024
- Total training steps: ~7M (~800 hours of training)
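As a rough sanity check, the reported step count is consistent with the data size, epoch count, and batch size above (assuming one sentence per training example):

```python
sentences = 325_000_000  # ~325M sentences in the corpus
epochs = 3
batch_size = 128

total_steps = sentences * epochs // batch_size
print(f"{total_steps:,}")  # ~7.6M, in the same ballpark as the reported ~7M
```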
How to use
Use the model with a pipeline for the masked language modeling task:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("chi-vi/chivi-modern-bert")
model = AutoModelForMaskedLM.from_pretrained("chi-vi/chivi-modern-bert")
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# The fill-mask pipeline expects the tokenizer's mask token in the input.
print(pipe(f"9月14号周日晚间,美林公司同意以440亿美元出售{tokenizer.mask_token}米国银行。"))
```
Base model: answerdotai/ModernBERT-base