Update README: add migration details, verification results, and v5 usage examples
README.md CHANGED

@@ -12,22 +12,75 @@ base_model: jhu-clsp/mmBERT-small
````diff
 Transformers v5 compatible checkpoint of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small).

-- **Config**: Added explicit `rope_parameters` for transformers v5 compatibility
-- **Parameters**: 140M
-
-```python
-from transformers import AutoModel, AutoTokenizer
-
-model = AutoModel.from_pretrained("datalama/mmBERT-small")
-tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-small")
-```
-
 ## Credit

-Original model by [JHU CLSP](https://huggingface.co/jhu-clsp).
-See the [original model card](https://huggingface.co/jhu-clsp/mmBERT-small) for training details and benchmarks.
````
Transformers v5 compatible checkpoint of [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small).

| Property | Value |
|---|---|
| **Parameters** | 140M |
| **Hidden size** | 384 |
| **Layers** | 22 |
| **Attention heads** | 6 |
| **Max seq length** | 8,192 |
| **RoPE theta** | 160,000 (both global & local) |
## Usage (transformers v5)

```python
from transformers import ModernBertModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-small")
model = ModernBertModel.from_pretrained("datalama/mmBERT-small")

# Korean: "AI technology is advancing rapidly."
inputs = tokenizer("인공지능 기술은 빠르게 발전하고 있습니다.", return_tensors="pt")
outputs = model(**inputs)

# [CLS] embedding (384-dim)
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
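The example above takes the [CLS] vector; for sentence embeddings, masked mean pooling over `last_hidden_state` is a common alternative. A minimal sketch, using random tensors as a stand-in for real model output (the `mean_pool` helper is illustrative, not part of this repo):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (B, S, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # (B, 1)
    return summed / counts

# stand-in for outputs.last_hidden_state (batch=2, seq=5, hidden=384)
hidden = torch.randn(2, 5, 384)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
sentence_emb = mean_pool(hidden, mask)  # shape (2, 384)
```

With the real model, pass `outputs.last_hidden_state` and `inputs["attention_mask"]` instead of the random tensors.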
For masked language modeling:

```python
from transformers import ModernBertForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("datalama/mmBERT-small")
model = ModernBertForMaskedLM.from_pretrained("datalama/mmBERT-small")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
outputs = model(**inputs)
```
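The forward pass above returns logits but does not decode them. A small sketch of recovering the top prediction at each [MASK] position, shown with toy tensors (the `predicted_mask_ids` helper is illustrative, not part of the model card):

```python
import torch

def predicted_mask_ids(logits: torch.Tensor, input_ids: torch.Tensor, mask_token_id: int) -> torch.Tensor:
    """Top-1 token id at each [MASK] position. With the real model, pass
    outputs.logits, inputs["input_ids"], and tokenizer.mask_token_id."""
    batch_idx, pos_idx = (input_ids == mask_token_id).nonzero(as_tuple=True)
    return logits[batch_idx, pos_idx].argmax(dim=-1)

# toy example: vocab of 10, mask id 3 sits at position 2, logits favor token 7 there
input_ids = torch.tensor([[5, 9, 3, 2]])
logits = torch.zeros(1, 4, 10)
logits[0, 2, 7] = 5.0
print(predicted_mask_ids(logits, input_ids, mask_token_id=3))  # tensor([7])
```

With the real checkpoint, `tokenizer.decode` on the returned ids yields the predicted word.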
## Migration Details

This checkpoint was migrated from [jhu-clsp/mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) with the following changes:

**1. Weight format**: `pytorch_model.bin` → `model.safetensors`

- Tied weights (`model.embeddings.tok_embeddings.weight` ↔ `decoder.weight`) were cloned into separate tensors before saving
- All 138 tensors verified bitwise equal after conversion
**2. Config**: Added explicit `rope_parameters` for transformers v5

```json
{
  "global_rope_theta": 160000,
  "local_rope_theta": 160000,
  "rope_parameters": {
    "full_attention": {"rope_type": "default", "rope_theta": 160000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 160000.0}
  }
}
```
The original flat fields (`global_rope_theta`, `local_rope_theta`) are preserved for backward compatibility. In transformers v5, `ModernBertConfig` defaults `sliding_attention.rope_theta` to 10,000, while mmBERT uses 160,000 for both attention types, so the explicit `rope_parameters` entry is required.
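A hedged sketch of that config step: deriving the explicit `rope_parameters` from the existing flat theta fields while leaving them in place (the `add_rope_parameters` helper is illustrative; the actual migration script is not published here):

```python
import json

def add_rope_parameters(config: dict) -> dict:
    """Derive v5-style rope_parameters from the flat v4 theta fields."""
    config["rope_parameters"] = {
        "full_attention": {
            "rope_type": "default",
            "rope_theta": float(config["global_rope_theta"]),
        },
        "sliding_attention": {
            "rope_type": "default",
            "rope_theta": float(config["local_rope_theta"]),
        },
    }
    return config  # flat fields stay in place for backward compatibility

cfg = add_rope_parameters({"global_rope_theta": 160000, "local_rope_theta": 160000})
print(json.dumps(cfg["rope_parameters"]["sliding_attention"]))
# {"rope_type": "default", "rope_theta": 160000.0}
```

Without this step, a v5 load would silently fall back to the 10,000 default for sliding attention.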
## Verification

Cross-environment verification was performed between transformers v4 (the original) and v5 (this checkpoint):

| Check | Result |
|---|---|
| **RoPE config** | `rope_parameters` present, theta = 160,000 for both attention types |
| **Weight integrity** | All 138 tensors bitwise equal (jhu-clsp `.bin` vs. datalama `.safetensors`) |
| **Inference output** | v4 vs. v5 max diff across 4 multilingual sentences: **1.14e-05** |
| **Fine-tuning readiness** | Tokenizer roundtrip, forward + backward pass, gradient propagation: all OK |
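The weight-integrity and inference checks can be sketched as two small helpers, shown here on toy tensors (illustrative only; the actual verification script is not part of this card):

```python
import torch

def bitwise_equal(sd_a: dict, sd_b: dict) -> bool:
    """True iff both state dicts have the same keys and bit-identical tensors."""
    return sd_a.keys() == sd_b.keys() and all(torch.equal(sd_a[k], sd_b[k]) for k in sd_a)

def max_abs_diff(a: torch.Tensor, b: torch.Tensor) -> float:
    """Largest elementwise deviation between two model outputs."""
    return (a - b).abs().max().item()

# toy check with a one-tensor "state dict"
sd = {"w": torch.ones(3, 3)}
print(bitwise_equal(sd, {"w": sd["w"].clone()}))  # True
print(bitwise_equal(sd, {"w": sd["w"] * 1.001}))  # False
print(max_abs_diff(sd["w"], sd["w"] * 1.001))     # ~1e-3
```

For the real check, `bitwise_equal` would compare the `.bin` and `.safetensors` state dicts, and `max_abs_diff` the `last_hidden_state` tensors from the v4 and v5 runs.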
## Credit

Original model by [JHU CLSP](https://huggingface.co/jhu-clsp). See the [original model card](https://huggingface.co/jhu-clsp/mmBERT-small) for training details and benchmarks.