---
license: mit
datasets:
- Salesforce/wikitext
- kd13/stack-v2-mini
language:
- en
metrics:
- perplexity
base_model:
- answerdotai/ModernBERT-base
new_version: kd13/ModernBERT-base-mlm-wiki-code
pipeline_tag: fill-mask
library_name: transformers
tags:
- code
- mlm
- wiki
---
# ModernBERT-base-mlm-wiki-code
## Model Summary
**ModernBERT-base-mlm-wiki-code** is a continued pre-training of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), further trained with **Masked Language Modeling (MLM)** on a large mixed corpus of English natural-language text and multi-language source code.
Training used a challenging **30% masking probability** (vs. the standard 15% in BERT) over a **2048-token context window** and reached a final combined perplexity of **1.9598** (see the evaluation table below), indicating solid modeling of both natural language and code.
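The exact training script is not published with this card; the sketch below only illustrates the described MLM setup with the Hugging Face `Trainer`. The 30% masking probability and 2048-token context come from this card, while the dataset split, batch size, and other hyperparameters are illustrative assumptions (the code portion of the corpus, listed as `kd13/stack-v2-mini` in the metadata, is omitted here for brevity).

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the ModernBERT-base checkpoint
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Natural-language portion of the corpus (config/split are assumptions)
dataset = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    # 2048-token context window, as described on this card
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# 30% of tokens are masked, vs. the 15% BERT default
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.30)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-mlm-wiki-code", per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```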
---
## Evaluation Results
Evaluation was performed on each domain separately, as well as on the combined corpus.
| Dataset | Eval Loss | Perplexity |
|--------------------------------|-----------|------------|
| WikiText (Natural Language) | 1.3994 | 4.0526 |
| The Stack V2 (Code) | 0.4091 | 1.5054 |
| **Combined (NL + Code)** | **0.6728**| **1.9598** |
> MLM Probability: 30% — Context Length: 2048 tokens
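
The perplexity column is simply the exponential of the per-token eval loss, which can be checked directly:

```python
import math

# Perplexity = exp(eval loss); matches the table above up to rounding
for name, loss in [("WikiText", 1.3994), ("The Stack V2", 0.4091), ("Combined", 0.6728)]:
    print(f"{name:14s} perplexity ≈ {math.exp(loss):.4f}")
```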
---
## Model Details
| Property | Value |
|-----------------------|--------------------------------------------------|
| **Base Model** | `answerdotai/ModernBERT-base` |
| **Model Type** | Masked Language Model (MLM) |
| **Architecture** | ModernBERT (Encoder-only Transformer) |
| **Context Length** | 2048 tokens |
| **MLM Probability** | 30% |
| **Languages** | English, C++, Go, Java, JavaScript, Python |
| **License**            | MIT                                              |
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Load the Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model = AutoModelForMaskedLM.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model.config.reference_compile = False  # disable the compiled reference kernels (avoids torch.compile issues in some environments)
```
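With the model and tokenizer loaded as above, masked tokens can also be predicted directly from the logits, without the `fill-mask` pipeline. This is a minimal sketch of that pattern:

```python
import torch

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring tokens
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```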
### Fill-Mask — Natural Language
```python
from transformers import pipeline
pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")
result = pipe("The capital of France is [MASK].")
for r in result:
    print(f"{r['token_str']:15s} → {r['score']:.4f}")
```
### Fill-Mask — Source Code
```python
from transformers import pipeline
pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")
result = pipe("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) [MASK] fibonacci(n-2)")
for r in result:
    print(f"{r['token_str']:15s} → {r['score']:.4f}")
```
### Feature Extraction (Embeddings)
```python
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model = AutoModel.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
text = "def quicksort(arr): return arr if len(arr) <= 1 else ..."
inputs = tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# CLS token embedding — shape: (1, 768)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)
```
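The snippet above uses the `[CLS]` token embedding. Attention-mask-weighted mean pooling is a common alternative for sentence-level embeddings; it is not prescribed by this card, but continuing from the code above it looks like this:

```python
# Alternative: attention-mask-weighted mean pooling over all token embeddings
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
mean_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embedding.shape)  # torch.Size([1, 768])
```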
---
## Citation
If you use this model, please cite the original ModernBERT paper:
```bibtex
@article{modernbert2024,
title = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
author = {Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
year = {2024},
url = {https://arxiv.org/abs/2412.13663}
}
```
---
## Acknowledgements
- Base model by [Answer.AI](https://huggingface.co/answerdotai)
- Training data from [WikiText](https://huggingface.co/datasets/Salesforce/wikitext) and [BigCode The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2)