mmBERT-small

A Transformers v5-compatible checkpoint of jhu-clsp/mmBERT-small.

| Property | Value |
| --- | --- |
| Parameters | 140M |
| Hidden size | 384 |
| Layers | 22 |
| Attention heads | 6 |
| Max seq length | 8,192 |
| RoPE theta | 160,000 (both global & local) |

Usage (transformers v5)

from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-small")
model = ModernBertModel.from_pretrained("datalama/mmBERT-small")

# Korean: "Artificial intelligence technology is advancing rapidly."
inputs = tokenizer("인공지능 기술은 빠르게 발전하고 있습니다.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (384-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]

For masked language modeling:

from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-small")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-small")

# Use the tokenizer's own mask token rather than hard-coding "[MASK]"
inputs = tokenizer(f"The capital of France is {tokenizer.mask_token}.", return_tensors="pt")
outputs = model(**inputs)
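
The forward pass above returns logits for every position, so the prediction for the masked token still has to be read out. A minimal sketch of that step follows; the helper name `top_k_mask_predictions` and the toy tensors are illustrative assumptions, not part of this checkpoint or of the transformers API.

```python
import torch


def top_k_mask_predictions(logits, input_ids, mask_token_id, k=5):
    """Return the k highest-scoring token ids at every masked position.

    logits:    float tensor of shape (batch, seq_len, vocab_size)
    input_ids: integer tensor of shape (batch, seq_len)
    """
    predictions = []
    # Each row of nonzero() is a (batch index, sequence index) pair of a mask token.
    for batch_idx, seq_idx in (input_ids == mask_token_id).nonzero().tolist():
        top_ids = logits[batch_idx, seq_idx].topk(k).indices.tolist()
        predictions.append(top_ids)
    return predictions


# Toy stand-ins so the sketch runs without downloading the model:
# one sequence of 4 tokens, a vocabulary of 10, hypothetical mask id 9 at position 2.
toy_input_ids = torch.tensor([[1, 2, 9, 3]])
toy_logits = torch.zeros(1, 4, 10)
toy_logits[0, 2, 7] = 5.0  # highest score at the masked position
toy_logits[0, 2, 3] = 4.0  # second-highest score
toy_top2 = top_k_mask_predictions(toy_logits, toy_input_ids, mask_token_id=9, k=2)
print(toy_top2)
```

With the real model, the same call would take `outputs.logits`, `inputs["input_ids"]`, and `tokenizer.mask_token_id`, and each returned id can be turned back into text with `tokenizer.decode([token_id])`.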

Migration Details

This checkpoint was migrated from jhu-clsp/mmBERT-small with the following changes:

1. Weight format: pytorch_model.bin → model.safetensors

  • Tied weights (model.embeddings.tok_embeddings.weight ↔ decoder.weight) were cloned to separate tensors before saving
  • All 138 tensors verified bitwise equal after conversion
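
The cloning step matters because `safetensors.torch.save_file` rejects tensors that share memory, which is exactly what tied weights are. The sketch below shows one way to untie a state dict and to confirm that the copies are bitwise identical. The helper names and the toy 8x4 tensor are assumptions for illustration; the actual migration script may have differed.

```python
import torch


def clone_shared_tensors(state_dict):
    """Return a new dict in which no two entries start at the same memory address.

    This only detects tensors that begin at the same address, which covers
    plain weight tying but not partially overlapping views.
    """
    seen_ptrs = set()
    untied = {}
    for name, tensor in state_dict.items():
        ptr = tensor.data_ptr()
        if ptr in seen_ptrs:
            # Memory already used by an earlier entry: store an independent copy.
            tensor = tensor.clone()
        else:
            seen_ptrs.add(ptr)
        untied[name] = tensor
    return untied


def bitwise_equal(a, b):
    """True if both tensors have the same dtype, shape, and raw bytes."""
    if a.dtype != b.dtype or a.shape != b.shape:
        return False
    a_bytes = a.contiguous().reshape(-1).view(torch.uint8)
    b_bytes = b.contiguous().reshape(-1).view(torch.uint8)
    return torch.equal(a_bytes, b_bytes)


# Toy stand-in for the tied input and output embeddings.
shared = torch.randn(8, 4)
tied = {
    "model.embeddings.tok_embeddings.weight": shared,
    "decoder.weight": shared,
}
untied = clone_shared_tensors(tied)
emb = untied["model.embeddings.tok_embeddings.weight"]
dec = untied["decoder.weight"]
print(emb.data_ptr() != dec.data_ptr(), bitwise_equal(emb, dec))
```

Comparing raw bytes is stricter than comparing values: it also distinguishes `0.0` from `-0.0` and treats identical NaN payloads as equal.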

2. Config: Added explicit rope_parameters for transformers v5

{
  "global_rope_theta": 160000,
  "local_rope_theta": 160000,
  "rope_parameters": {
    "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
  }
}

The original flat fields (global_rope_theta, local_rope_theta) are preserved for backward compatibility. In transformers v5, ModernBertConfig defaults sliding_attention.rope_theta to 10,000, but mmBERT uses 160,000 for both attention types, so explicit rope_parameters are required.
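
To make the effect of that default concrete, the sketch below compares rotary wavelengths under both theta values. It assumes the standard RoPE frequency schedule (inverse frequency `theta ** (-2i / head_dim)`) and a head dimension equal to hidden size divided by attention heads (384 / 6 = 64); neither detail is stated in this card.

```python
import math


def rope_wavelengths(theta, head_dim):
    """Wavelength in tokens of each rotary pair: 2 * pi * theta ** (2i / head_dim)."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 384 // 6  # assumed: hidden size / attention heads

trained = rope_wavelengths(160_000.0, HEAD_DIM)  # theta used by mmBERT
fallback = rope_wavelengths(10_000.0, HEAD_DIM)  # v5 default for sliding attention

# The fastest-rotating pair is identical; the slowest one diverges the most.
print(f"shortest wavelength: {trained[0]:.2f} vs {fallback[0]:.2f} tokens")
print(f"longest wavelength:  {trained[-1]:.0f} vs {fallback[-1]:.0f} tokens")
```

Every pair except the first rotates at a different rate under the two settings, so loading the weights with the wrong theta would change the positional signal the model was trained on.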

Verification

Cross-environment verification was performed between transformers v4 (original) and v5 (this checkpoint):

| Check | Result |
| --- | --- |
| RoPE config | rope_parameters present, theta=160,000 for both attention types |
| Weight integrity | 138 tensors bitwise equal (jhu-clsp .bin vs datalama .safetensors) |
| Inference output | v4 vs v5 max diff across 4 multilingual sentences: 1.14e-05 |
| Fine-tuning readiness | Tokenizer roundtrip, forward+backward pass, gradient propagation: all OK |
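
The RoPE config check can be reproduced without loading any weights. The sketch below validates a parsed config against the expected theta; `check_rope_config` is a hypothetical helper, and the inline JSON simply repeats the fragment shown under Migration Details.

```python
import json

EXPECTED_THETA = 160_000.0


def check_rope_config(config, expected_theta=EXPECTED_THETA):
    """Return a list of problems; an empty list means the check passes."""
    rope = config.get("rope_parameters")
    if rope is None:
        return ["rope_parameters missing"]
    problems = []
    for attn_type in ("full_attention", "sliding_attention"):
        theta = rope.get(attn_type, {}).get("rope_theta")
        if theta != expected_theta:
            problems.append(f"{attn_type}: rope_theta={theta!r}, expected {expected_theta}")
    return problems


# Config fragment copied from this card; in practice it would be read
# from the checkpoint's config.json.
config = json.loads("""
{
  "global_rope_theta": 160000,
  "local_rope_theta": 160000,
  "rope_parameters": {
    "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
  }
}
""")
problems = check_rope_config(config)
print(problems or "RoPE config OK")
```

A config that carries only the flat fields would fail this check with `rope_parameters missing`, which is the situation the migration was meant to avoid.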

Credit

Original model by JHU CLSP. See the original model card for training details and benchmarks.
