---
library_name: transformers
tags:
  - modernbert
  - fill-mask
  - multilingual
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
---

# mmBERT-base

A Transformers v5-compatible checkpoint of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base).

| Spec | Value |
|---|---|
| Parameters | 307M |
| Hidden size | 768 |
| Layers | 22 |
| Attention heads | 12 |
| Max seq length | 8,192 |
| RoPE theta | 160,000 (both global & local) |

## Usage (transformers v5)

```python
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertModel.from_pretrained("datalama/mmBERT-base")

# Korean: "AI technology is advancing rapidly."
inputs = tokenizer("인공지능 기술은 빠르게 발전하고 있습니다.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (768-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
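If a single sentence vector is needed, mean pooling over non-padding tokens is a common alternative to the `[CLS]` embedding. A minimal sketch, with dummy NumPy arrays standing in for `outputs.last_hidden_state` and `inputs["attention_mask"]`:

```python
import numpy as np

# Stand-ins: batch of 1, 5 tokens (last 2 are padding), hidden size 768
last_hidden_state = np.random.randn(1, 5, 768).astype(np.float32)
attention_mask = np.array([[1, 1, 1, 0, 0]], dtype=np.float32)

# Zero out padding positions, then average over the real tokens only
mask = attention_mask[..., None]                 # (1, 5, 1)
summed = (last_hidden_state * mask).sum(axis=1)  # (1, 768)
counts = mask.sum(axis=1)                        # (1, 1)
mean_pooled = summed / counts                    # (1, 768)
```

With real model outputs, substitute `outputs.last_hidden_state` (as a NumPy array or torch tensor) and the tokenizer's attention mask; the arithmetic is identical.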

For masked language modeling:

```python
from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)

# Decode the top prediction at the [MASK] position
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(outputs.logits[0, mask_idx].argmax(dim=-1)))
```

## Migration Details

This checkpoint was migrated from jhu-clsp/mmBERT-base with the following changes:

1. **Weight format:** `pytorch_model.bin` → `model.safetensors`

   - Tied weights (`model.embeddings.tok_embeddings.weight` → `decoder.weight`) were cloned into separate tensors before saving
   - All 138 tensors verified bitwise equal after conversion

2. **Config:** added explicit `rope_parameters` for transformers v5

   ```json
   {
     "global_rope_theta": 160000,
     "local_rope_theta": 160000,
     "rope_parameters": {
       "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
       "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
     }
   }
   ```

The original flat fields (global_rope_theta, local_rope_theta) are preserved for backward compatibility. In transformers v5, ModernBertConfig defaults sliding_attention.rope_theta to 10,000 — but mmBERT uses 160,000 for both, so explicit rope_parameters are required.
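The fallback behavior described above can be sketched as a plain-Python check (the `rope_thetas` helper and inline dicts are illustrative, not part of this repo or of transformers):

```python
# Config as shipped in this checkpoint: explicit rope_parameters
config = {
    "global_rope_theta": 160000,
    "local_rope_theta": 160000,
    "rope_parameters": {
        "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
        "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0},
    },
}

# Legacy config with only the flat v4-era fields
legacy = {"global_rope_theta": 160000, "local_rope_theta": 160000}

def rope_thetas(cfg):
    """Return (full_attention, sliding_attention) rope_theta,
    preferring rope_parameters and falling back to the flat fields."""
    rp = cfg.get("rope_parameters", {})
    full = rp.get("full_attention", {}).get("rope_theta", cfg.get("global_rope_theta"))
    sliding = rp.get("sliding_attention", {}).get("rope_theta", cfg.get("local_rope_theta"))
    return full, sliding
```

Without the explicit `rope_parameters`, a v5 loader that defaults `sliding_attention.rope_theta` to 10,000 would silently change local-attention RoPE; the explicit block pins both values to 160,000.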

## Verification

Cross-environment verification was performed between transformers v4 (original) and v5 (this checkpoint):

| Check | Result |
|---|---|
| RoPE config | `rope_parameters` present, theta = 160,000 for both attention types |
| Weight integrity | All 138 tensors bitwise equal (jhu-clsp `.bin` vs datalama `.safetensors`) |
| Inference output | v4 vs v5 max diff across 4 multilingual sentences: 7.63e-06 |
| Fine-tuning readiness | Tokenizer roundtrip, forward+backward pass, gradient propagation all OK |
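The weight-integrity and inference-output checks amount to the following kind of comparison (a toy sketch with stand-in NumPy arrays; the actual verification ran over all 138 mmBERT tensors and 4 real sentences):

```python
import numpy as np

# Stand-in for a tensor loaded from the .bin and .safetensors files
w_bin = np.random.randn(4, 8).astype(np.float32)
w_safetensors = w_bin.copy()

# Weight integrity: exact (bitwise) equality, no tolerance
assert np.array_equal(w_bin, w_safetensors)

# Inference output: max absolute difference between two forward passes
# (dummy values here; the real check compared v4 vs v5 hidden states)
out_v4 = np.random.randn(1, 10, 768).astype(np.float32)
out_v5 = out_v4 + np.float32(1e-7) * np.random.randn(1, 10, 768).astype(np.float32)
max_diff = np.abs(out_v4 - out_v5).max()
```

The exact-equality check is deliberately stricter than `np.allclose`: format conversion alone should not perturb any bit of any weight.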

## Credit

Original model by JHU CLSP. See the original model card for training details and benchmarks.