# ModernBERT-base-32K

A long-context extension of ModernBERT-base, fine-tuned for a 32,768-token context window using YaRN (Yet another RoPE extensioN) scaling.
## Model Description
This model extends the original ModernBERT-base from 8,192 tokens to 32,768 tokens context length while preserving the base model's capabilities. It was fine-tuned using specialized techniques to maintain long-range retrieval performance:
- YaRN RoPE Scaling: Factor of 4.0x to extend positional embeddings
- Retrieval-Aware Masking: Custom MLM objective that encourages long-range token dependencies
- Elastic Weight Consolidation (EWC): Regularization to prevent catastrophic forgetting
- Conservative Learning Rate: 1e-5 with constant warmup schedule
## Model Details
| Property | Value |
|---|---|
| Parameters | 149M |
| Context Length | 32,768 tokens |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Layers | 22 |
| Vocabulary | 50,368 |
| Architecture | ModernBERT (RoPE + Sliding Window + Global Attention) |
## Training
### Dataset

- Source: Leooyii/Slimpajama_downsample_32k_1B (subset)
- Size: ~1B tokens of long-form text (≥32K tokens per document)
- Preprocessing: Concatenated documents with proper sequence boundaries
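The concatenation step can be sketched roughly as follows. This is a hypothetical illustration, not the actual preprocessing script: documents are tokenized, joined with a separator token at each boundary, and packed into fixed 32K-token sequences (the function name and remainder-dropping policy are assumptions).

```python
# Hypothetical sketch: pack tokenized documents into fixed-length
# sequences, inserting a separator token at each document boundary.
def pack_documents(doc_token_ids, sep_id, max_length=32768):
    sequences, current = [], []
    for ids in doc_token_ids:
        for tok in ids + [sep_id]:
            current.append(tok)
            if len(current) == max_length:
                sequences.append(current)
                current = []
    return sequences  # any trailing partial sequence is dropped

# Toy example: two short "documents" packed into one length-8 sequence.
packed = pack_documents([[1, 2, 3], [4, 5, 6, 7]], sep_id=0, max_length=8)
print(packed)  # [[1, 2, 3, 0, 4, 5, 6, 7]]
```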
### Hyperparameters

```yaml
learning_rate: 1e-5
lr_scheduler: constant_with_warmup
warmup_ratio: 0.1
epochs: 1
batch_size: 6
gradient_accumulation: 1
precision: bf16
max_length: 32768

# RoPE Scaling
rope_scaling_type: yarn
rope_scaling_factor: 4.0

# Retrieval Masking
mlm_probability: 0.30
retrieval_probability: 0.10
min_distance_for_retrieval: 512

# EWC Regularization
ewc_lambda: 1000.0
ewc_samples: 200
```
### Training Infrastructure
- Hardware: AMD MI300X GPU (192GB HBM3)
- Framework: PyTorch + Transformers
- Training Time: ~4 hours
## Evaluation Results

### Comparison with Base Model (at 32K context)
| Metric | This Model | Base ModernBERT |
|---|---|---|
| Basic MLM Accuracy | 100% | 100% |
| Perplexity @ 32K | 1.00 | 1.00 |
| Passkey Retrieval @ 512 | 60% | 50% |
| Passkey Retrieval @ 1K | 50% | 50% |
| Long-Range Coreference | 100% | 100% |
| Position Accuracy (early) | 90% | 85% |
| Position Accuracy (late) | 71% | 64% |
| Repeated Info Consistency | 100% | 100% |
### Distance-Based Retrieval Accuracy
| Distance | Accuracy |
|---|---|
| 64 tokens | 80% |
| 128 tokens | 60% |
| 256 tokens | 73% |
| 512 tokens | 27% |
| 1024 tokens | 40% |
| 2048 tokens | 47% |
| 4096 tokens | 53% |
| 8192 tokens | 20% |
### Perplexity by Context Length
| Context | Perplexity |
|---|---|
| 512 | 1.84 |
| 1024 | 1.29 |
| 2048 | 1.11 |
| 4096 | 1.02 |
| 8192 | 1.00 |
| 16384 | 1.00 |
| 24576 | 1.00 |
| 32768 | 1.00 |
## Usage

### Basic Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("llm-semantic-router/modernbert-base-32k")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/modernbert-base-32k")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get top-5 predictions for the masked position
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, mask_idx, :]
top_tokens = logits.topk(5).indices[0]
print([tokenizer.decode(t) for t in top_tokens])
# ['Paris', 'Lyon', 'Nancy', ...]
```
### Long Context Usage

```python
# For sequences longer than 8,192 tokens, YaRN scaling is applied automatically
long_text = "..." * 10000  # Very long document
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    max_length=32768,
    truncation=True,
)
outputs = model(**inputs)
```
### Feature Extraction

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("llm-semantic-router/modernbert-base-32k")
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # [batch, seq_len, 768]

# Mean pooling over non-padding tokens for a sentence embedding
attention_mask = inputs["attention_mask"]
masked_embeddings = embeddings * attention_mask.unsqueeze(-1)
sentence_embedding = masked_embeddings.sum(1) / attention_mask.sum(1, keepdim=True)
```
## Limitations

- **Passkey Retrieval at Long Distances**: Like the base model, this model struggles with needle-in-a-haystack retrieval beyond ~1K tokens. This is a fundamental limitation of MLM architectures, not of the fine-tuning.
- **Context Utilization**: At very long contexts (16K+), the model may not always benefit from the additional context for MLM predictions. This is expected behavior for encoder-only models.
- **Memory Requirements**: Processing 32K tokens requires significant GPU memory (~16 GB+ for inference).
- **Domain**: Trained primarily on web text; the model may not generalize well to specialized domains without additional fine-tuning.
## Intended Use
- Long-document understanding: Processing documents that exceed typical BERT context limits
- Information retrieval: Embedding long documents for semantic search
- Long-form text classification: Sentiment analysis, topic classification on long texts
- Coreference resolution: Resolving references across long documents
- Feature extraction: Getting embeddings for downstream tasks
## Training Methodology
This model uses several techniques to preserve long-range capabilities during fine-tuning:
### 1. YaRN RoPE Scaling
Extends positional embeddings from 8K to 32K using Yet another RoPE extensioN with:
- Scaling factor: 4.0
- Original base frequency: 10000
- Attention scaling for numerical stability
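A minimal sketch of the frequency adjustment behind YaRN-style scaling, assuming the "NTK-by-parts" scheme: low-frequency RoPE dimensions are interpolated by the scaling factor, high-frequency dimensions are left untouched, and a linear ramp blends the two regimes in between. The threshold constants (`beta_fast`, `beta_slow`) and the function itself are illustrative, not this model's exact implementation.

```python
import math

def yarn_inv_freq(dim=64, base=10000.0, factor=4.0,
                  orig_ctx=8192, beta_fast=32, beta_slow=1):
    # Standard RoPE inverse frequencies: base^(-2i/dim)
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]

    def num_rotations(i):
        # Full rotations dimension i completes over the original context
        return orig_ctx * inv_freq[i] / (2 * math.pi)

    out = []
    for i, f in enumerate(inv_freq):
        r = num_rotations(i)
        if r > beta_fast:            # high frequency: keep as-is
            out.append(f)
        elif r < beta_slow:          # low frequency: full interpolation
            out.append(f / factor)
        else:                        # linear ramp between the two regimes
            t = (r - beta_slow) / (beta_fast - beta_slow)
            out.append(f / factor * (1 - t) + f * t)
    return out
```

The key property is that the fastest-rotating dimensions, which encode fine local positions, are preserved, while the slowest dimensions are stretched by the 4.0x factor to cover 32K positions.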
### 2. Retrieval-Aware Masking
Custom MLM objective that creates explicit long-range dependencies:
- Standard MLM masking (30%)
- Additional masking of tokens that appear in distant context (10% probability)
- Minimum distance threshold: 512 tokens
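The masking objective can be sketched as below. This is a hypothetical simplification of the described scheme (function name and bookkeeping are assumptions): on top of standard MLM masking, a token that also occurred at least `min_distance` positions earlier is masked with an extra probability, so recovering it requires attending to distant context.

```python
import random

def retrieval_mask(token_ids, mlm_prob=0.30, retrieval_prob=0.10,
                   min_distance=512, seed=0):
    rng = random.Random(seed)
    last_seen = {}                    # token id -> most recent position
    mask = [False] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:   # standard MLM masking (30%)
            mask[pos] = True
        elif tok in last_seen and pos - last_seen[tok] >= min_distance:
            # Token repeats far from its earlier occurrence:
            # extra masking encourages long-range retrieval (10%)
            if rng.random() < retrieval_prob:
                mask[pos] = True
        last_seen[tok] = pos
    return mask
```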
### 3. Elastic Weight Consolidation
Prevents catastrophic forgetting by:
- Computing Fisher information matrix on base model
- Penalizing changes to important weights during fine-tuning
- Lambda: 1000.0
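The EWC penalty itself is a Fisher-weighted quadratic pull toward the base model's weights. A minimal sketch in PyTorch, assuming the Fisher information has already been estimated (e.g. from squared gradients on a few base-model batches); the function name and dictionary layout are illustrative:

```python
import torch

def ewc_penalty(model, base_params, fisher, ewc_lambda=1000.0):
    # Sum of F_i * (theta_i - theta_i^base)^2 over all parameters,
    # scaled by lambda and added to the task loss during fine-tuning.
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - base_params[name]) ** 2).sum()
    return ewc_lambda * loss

# Toy usage with a single linear layer.
model = torch.nn.Linear(4, 4)
base = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}
print(ewc_penalty(model, base, fisher).item())  # 0.0 before any update
```

With a large lambda like 1000.0, even small drifts in high-Fisher (important) weights dominate the loss, which is what keeps the base model's capabilities intact.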
## Citation

If you use this model, please cite:

```bibtex
@misc{modernbert-32k,
  title={ModernBERT-base-32K: Long-Context Extension of ModernBERT},
  author={LLM Semantic Router},
  year={2026},
  url={https://huggingface.co/llm-semantic-router/modernbert-base-32k}
}
```
Also cite the original ModernBERT paper:

```bibtex
@article{modernbert2024,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
  journal={arXiv preprint arXiv:2412.13663},
  year={2024}
}
```
## License
Apache 2.0 (same as base model)