---
library_name: transformers
tags:
- modernbert
- fill-mask
- multilingual
license: apache-2.0
base_model: jhu-clsp/mmBERT-base
---

# mmBERT-base

Transformers v5 compatible checkpoint of [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base).

| | |
|---|---|
| **Parameters** | 307M |
| **Hidden size** | 768 |
| **Layers** | 22 |
| **Attention heads** | 12 |
| **Max seq length** | 8,192 |
| **RoPE theta** | 160,000 (both global & local) |

## Usage (transformers v5)

```python
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertModel.from_pretrained("datalama/mmBERT-base")

# Korean: "AI technology is advancing rapidly."
inputs = tokenizer("인공지능 기술은 빠르게 발전하고 있습니다.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (768-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]
```

For masked language modeling:

```python
from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-base")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-base")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
```

## Migration Details

This checkpoint was migrated from [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) with the following changes:

**1. Weight format**: `pytorch_model.bin` → `model.safetensors`

- Tied weights (`model.embeddings.tok_embeddings.weight` ↔ `decoder.weight`) were cloned to separate tensors before saving
- All 138 tensors verified bitwise equal after conversion

**2. 
Config**: Added explicit `rope_parameters` for transformers v5

```json
{
  "global_rope_theta": 160000,
  "local_rope_theta": 160000,
  "rope_parameters": {
    "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
  }
}
```

The original flat fields (`global_rope_theta`, `local_rope_theta`) are preserved for backward compatibility. In transformers v5, `ModernBertConfig` defaults `sliding_attention.rope_theta` to 10,000, but mmBERT uses 160,000 for both attention types, so the explicit `rope_parameters` block is required.

## Verification

Cross-environment verification was performed between transformers v4 (the original) and v5 (this checkpoint):

| Check | Result |
|---|---|
| **RoPE config** | `rope_parameters` present, theta=160,000 for both attention types |
| **Weight integrity** | 138 tensors bitwise equal (jhu-clsp `.bin` vs datalama `.safetensors`) |
| **Inference output** | v4 vs v5 max diff across 4 multilingual sentences: **7.63e-06** |
| **Fine-tuning readiness** | Tokenizer roundtrip, forward+backward pass, gradient propagation: all OK |

## Credit

Original model by [JHU CLSP](https://huggingface.co/jhu-clsp). See the [original model card](https://huggingface.co/jhu-clsp/mmBERT-base) for training details and benchmarks.
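The mapping from the flat v4-era theta fields to the explicit v5 `rope_parameters` block can be sketched as follows (a minimal illustration; the dict literal stands in for the checkpoint's `config.json`):

```python
import json

# Flat v4-era RoPE fields, as found in the original config.json.
config = {"global_rope_theta": 160000, "local_rope_theta": 160000}

# Build the explicit v5 `rope_parameters` block from the flat fields, so
# that sliding (local) attention keeps mmBERT's theta of 160,000 instead
# of falling back to the ModernBertConfig default of 10,000.
config["rope_parameters"] = {
    "full_attention": {
        "rope_type": "default",
        "rope_theta": float(config["global_rope_theta"]),
    },
    "sliding_attention": {
        "rope_type": "default",
        "rope_theta": float(config["local_rope_theta"]),
    },
}

print(json.dumps(config["rope_parameters"], sort_keys=True))
```

Keeping the flat fields alongside the new block is what preserves backward compatibility with v4 loaders.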
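The tied-weight cloning and the bitwise-equality check described in the migration details can be sketched like this (a toy illustration with one small tensor; the actual migration applied the same idea to all 138 tensors of the checkpoint):

```python
import torch

def state_dicts_bitwise_equal(a: dict, b: dict) -> bool:
    """True iff both state dicts have identical keys and every pair of
    tensors is bitwise equal (same values, shape, and dtype)."""
    return a.keys() == b.keys() and all(torch.equal(a[k], b[k]) for k in a)

# Toy stand-in for the checkpoint: embeddings tied to the decoder head
# (both keys point at the same storage).
emb = torch.arange(6, dtype=torch.float32).reshape(3, 2)
original = {
    "model.embeddings.tok_embeddings.weight": emb,
    "decoder.weight": emb,  # tied: shares storage with the embeddings
}

# safetensors rejects tensors that share storage, so the tied weights
# are cloned into separate tensors before saving.
converted = {k: v.clone() for k, v in original.items()}

print(state_dicts_bitwise_equal(original, converted))  # True
```

After cloning, the tensors no longer share storage but remain bitwise identical, which is exactly what the weight-integrity check in the table verifies.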