---
license: mit
datasets:
- Salesforce/wikitext
- kd13/stack-v2-mini
language:
- en
metrics:
- perplexity
base_model:
- answerdotai/ModernBERT-base
new_version: kd13/ModernBERT-base-mlm-wiki-code
pipeline_tag: fill-mask
library_name: transformers
tags:
- code
- mlm
- wiki
---
# ModernBERT-base-mlm-wiki-code
## Model Summary
**ModernBERT-base-mlm-wiki-code** is a continued pre-training of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base), further trained with **Masked Language Modeling (MLM)** on a large mixed corpus of English natural-language text and multi-language source code.
Training used a challenging **30% masking probability** (vs. the standard 15% in BERT) over a **2048-token context window** and reached a final combined perplexity of **1.9598** (see the evaluation table below), indicating solid modeling of both natural language and code.
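The exact training script is not published with this card; the sketch below only illustrates the described MLM setup with the Hugging Face `Trainer`. The 30% masking probability and 2048-token context come from this card, while the dataset split, batch size, and other hyperparameters are illustrative assumptions (the code portion of the corpus, listed as `kd13/stack-v2-mini` in the metadata, is omitted here for brevity).

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the ModernBERT-base checkpoint
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

# Natural-language portion of the corpus (config/split are assumptions)
dataset = load_dataset("Salesforce/wikitext", "wikitext-103-raw-v1", split="train")

def tokenize(batch):
    # 2048-token context window, as described on this card
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# 30% of tokens are masked, vs. the 15% BERT default
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.30)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-mlm-wiki-code", per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```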
---
## Evaluation Results
Evaluation was performed on each domain separately, as well as on the combined corpus.
| Dataset | Eval Loss | Perplexity |
|--------------------------------|-----------|------------|
| WikiText (Natural Language) | 1.3994 | 4.0526 |
| The Stack V2 (Code) | 0.4091 | 1.5054 |
| **Combined (NL + Code)** | **0.6728**| **1.9598** |
> MLM Probability: 30% — Context Length: 2048 tokens
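
The perplexity column is simply the exponential of the per-token eval loss, which can be checked directly:

```python
import math

# Perplexity = exp(eval loss); matches the table above up to rounding
for name, loss in [("WikiText", 1.3994), ("The Stack V2", 0.4091), ("Combined", 0.6728)]:
    print(f"{name:14s} perplexity ≈ {math.exp(loss):.4f}")
```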
---
## Model Details
| Property | Value |
|-----------------------|--------------------------------------------------|
| **Base Model** | `answerdotai/ModernBERT-base` |
| **Model Type** | Masked Language Model (MLM) |
| **Architecture** | ModernBERT (Encoder-only Transformer) |
| **Context Length** | 2048 tokens |
| **MLM Probability** | 30% |
| **Languages** | English, C++, Go, Java, JavaScript, Python |
| **License**            | MIT                                              |
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Load the Model
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model = AutoModelForMaskedLM.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model.config.reference_compile = False  # disable the compiled reference kernels (avoids torch.compile issues in some environments)
```
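With the model and tokenizer loaded as above, masked tokens can also be predicted directly from the logits, without the `fill-mask` pipeline. This is a minimal sketch of that pattern:

```python
import torch

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring tokens
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```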
### Fill-Mask — Natural Language
```python
from transformers import pipeline
pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")
result = pipe("The capital of France is [MASK].")
for r in result:
    print(f"{r['token_str']:15s} → {r['score']:.4f}")
```
### Fill-Mask — Source Code
```python
from transformers import pipeline
pipe = pipeline("fill-mask", model="kd13/ModernBERT-base-mlm-wiki-code")
result = pipe("def fibonacci(n): return n if n <= 1 else fibonacci(n-1) [MASK] fibonacci(n-2)")
for r in result:
    print(f"{r['token_str']:15s} → {r['score']:.4f}")
```
### Feature Extraction (Embeddings)
```python
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
model = AutoModel.from_pretrained("kd13/ModernBERT-base-mlm-wiki-code")
text = "def quicksort(arr): return arr if len(arr) <= 1 else ..."
inputs = tokenizer(text, return_tensors="pt", max_length=2048, truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
# CLS token embedding — shape: (1, 768)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)
```
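The snippet above uses the `[CLS]` token embedding. Attention-mask-weighted mean pooling is a common alternative for sentence-level embeddings; it is not prescribed by this card, but continuing from the code above it looks like this:

```python
# Alternative: attention-mask-weighted mean pooling over all token embeddings
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (1, seq_len, 1)
mean_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_embedding.shape)  # torch.Size([1, 768])
```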
---
## Citation
If you use this model, please cite the original ModernBERT paper:
```bibtex
@article{modernbert2024,
title = {Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
author = {Benjamin Warner and Antoine Chaffin and Benjamin Clavié and Orion Weller and Oskar Hallström and Said Taghadouini and Alexis Gallagher and Raja Biswas and Faisal Ladhak and Tom Aarsen and Nathan Cooper and Griffin Adams and Jeremy Howard and Iacopo Poli},
year = {2024},
url = {https://arxiv.org/abs/2412.13663}
}
```
---
## Acknowledgements
- Base model by [Answer.AI](https://huggingface.co/answerdotai)
- Training data from [WikiText](https://huggingface.co/datasets/Salesforce/wikitext) and [BigCode The Stack V2](https://huggingface.co/datasets/bigcode/the-stack-v2)