---
library_name: transformers
tags:
- modernbert
- fill-mask
- multilingual
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
---

# mmBERT-base

Transformers v5-compatible checkpoint of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base).

|  |  |
|---|---|
| **Parameters** | 307M |
| **Hidden size** | 768 |
| **Layers** | 22 |
| **Attention heads** | 12 |
| **Max seq length** | 8,192 |
| **RoPE theta** | 160,000 (both global & local) |
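
A quick way to sanity-check these numbers against the shipped config (a small sketch; the attribute names are the standard `ModernBertConfig` fields and are assumed here, not quoted from the original card):

```python
from transformers import AutoConfig

# Only config.json is fetched; no weights are downloaded.
config = AutoConfig.from_pretrained("datalama/mmBERT-base")

print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 22
print(config.num_attention_heads)      # 12
print(config.max_position_embeddings)  # 8192
```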

## Usage (transformers v5)

```python
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertModel.from_pretrained("datalama/mmBERT-base")

# Korean example sentence: "AI technology is advancing rapidly."
inputs = tokenizer("인공지능 기술은 빠르게 발전하고 있습니다.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (768-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
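
If you need a single sentence-level vector, mean pooling over non-padding tokens is a common alternative to the `[CLS]` embedding. A minimal sketch continuing from the snippet above (the pooling choice is not prescribed by the original card):

```python
import torch

sentences = [
    "인공지능 기술은 빠르게 발전하고 있습니다.",
    "AI technology is advancing rapidly.",
]
inputs = tokenizer(sentences, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Zero out padding positions, then average the remaining token embeddings.
mask = inputs["attention_mask"].unsqueeze(-1).float()           # (batch, seq, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)          # (batch, 768)
sentence_embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)  # (batch, 768)
```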

For masked language modeling:

```python
from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
```
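
To read off an actual prediction, take the highest-scoring token at the `[MASK]` position; a short sketch continuing from the block above:

```python
# Find the [MASK] position and take the argmax over the vocabulary there.
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = outputs.logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # expected: a token like "Paris"
```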

## Migration Details

This checkpoint was migrated from [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) with the following changes:

**1. Weight format**: `pytorch_model.bin` → `model.safetensors`
- Tied weights (`model.embeddings.tok_embeddings.weight` ↔ `decoder.weight`) were cloned to separate tensors before saving (see the sketch below)
- All 138 tensors verified bitwise equal after conversion
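
The conversion script itself is not part of this repo; the following is an illustrative sketch of the step described above, assuming `torch` and `safetensors` are available (file names are placeholders):

```python
import torch
from safetensors.torch import save_file

# Load the original PyTorch weights on CPU.
state_dict = torch.load("pytorch_model.bin", map_location="cpu", weights_only=True)

# safetensors will not serialize tensors that share storage, so clone everything,
# which also separates the tied embedding/decoder weights into independent tensors.
state_dict = {name: tensor.clone().contiguous() for name, tensor in state_dict.items()}

save_file(state_dict, "model.safetensors")
```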

**2. Config**: Added explicit `rope_parameters` for transformers v5

```json
{
  "global_rope_theta": 160000,
  "local_rope_theta": 160000,
  "rope_parameters": {
    "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
  }
}
```

The original flat fields (`global_rope_theta`, `local_rope_theta`) are preserved for backward compatibility.
In transformers v5, `ModernBertConfig` defaults `sliding_attention.rope_theta` to 10,000, but mmBERT uses 160,000 for both attention types, so the explicit `rope_parameters` block is required.
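
A quick way to confirm the loaded configuration picks up these values (a minimal sketch; it reads the serialized config dict rather than assuming a particular attribute layout):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("datalama/mmBERT-base")

# Read the serialized form of the config.
rope = config.to_dict().get("rope_parameters", {})
print(rope.get("full_attention"))     # expect {'rope_type': 'default', 'rope_theta': 160000.0}
print(rope.get("sliding_attention"))  # expect {'rope_type': 'default', 'rope_theta': 160000.0}
```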

## Verification

Cross-environment verification was performed between transformers v4 (original) and v5 (this checkpoint):

| Check | Result |
|---|---|
| **RoPE config** | `rope_parameters` present, theta=160,000 for both attention types |
| **Weight integrity** | 138 tensors bitwise equal (jhu-clsp `.bin` vs datalama `.safetensors`) |
| **Inference output** | v4 vs v5 max diff across 4 multilingual sentences: **7.63e-06** |
| **Fine-tuning readiness** | Tokenizer roundtrip, forward + backward pass, gradient propagation all OK |
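
The inference-output row corresponds to comparing hidden states across the two environments. A rough sketch of the v5 side, assuming the v4 reference activations were exported to a hypothetical `reference_v4.pt` file (a dict mapping each test sentence to its `last_hidden_state`):

```python
import torch
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertModel.from_pretrained("datalama/mmBERT-base").eval()

# Hypothetical reference file produced in the transformers v4 environment.
reference = torch.load("reference_v4.pt")

max_diff = 0.0
for text, v4_hidden in reference.items():
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        v5_hidden = model(**inputs).last_hidden_state
    max_diff = max(max_diff, (v5_hidden - v4_hidden).abs().max().item())

print(f"max abs diff: {max_diff:.2e}")
```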

## Credit

Original model by [JHU CLSP](https://huggingface.co/jhu-clsp). See the [original model card](https://huggingface.co/jhu-clsp/mmBERT-base) for training details and benchmarks.