# ModernBERT-base-32K

A long-context extension of ModernBERT-base, fine-tuned for a 32,768-token context window using YaRN (Yet another RoPE extensioN) scaling.
## Model Description
This model extends the original ModernBERT-base from 8,192 tokens to 32,768 tokens context length while preserving the base model's capabilities. It was fine-tuned using specialized techniques to maintain long-range retrieval performance:
- YaRN RoPE Scaling: Factor of 4.0x to extend positional embeddings
- Retrieval-Aware Masking: Custom MLM objective that encourages long-range token dependencies
- Elastic Weight Consolidation (EWC): Regularization to prevent catastrophic forgetting
- Conservative Learning Rate: 1e-5 with constant warmup schedule
## Model Details
| Property | Value |
|---|---|
| Parameters | 149M |
| Context Length | 32,768 tokens |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Layers | 22 |
| Vocabulary | 50,368 |
| Architecture | ModernBERT (RoPE + Sliding Window + Global Attention) |
## Training
### Dataset

- Source: Leooyii/Slimpajama_downsample_32k_1B (subset)
- Size: ~1B tokens of long-form text (≥32K tokens per document)
- Preprocessing: Concatenated documents with proper sequence boundaries
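The concatenation step can be sketched roughly as follows. This is a hypothetical illustration, not the actual preprocessing script: documents are tokenized, joined with a separator token at each boundary, and packed into fixed 32K-token sequences (the function name and remainder-dropping policy are assumptions).

```python
# Hypothetical sketch: pack tokenized documents into fixed-length
# sequences, inserting a separator token at each document boundary.
def pack_documents(doc_token_ids, sep_id, max_length=32768):
    sequences, current = [], []
    for ids in doc_token_ids:
        for tok in ids + [sep_id]:
            current.append(tok)
            if len(current) == max_length:
                sequences.append(current)
                current = []
    return sequences  # any trailing partial sequence is dropped

# Toy example: two short "documents" packed into one length-8 sequence.
packed = pack_documents([[1, 2, 3], [4, 5, 6, 7]], sep_id=0, max_length=8)
print(packed)  # [[1, 2, 3, 0, 4, 5, 6, 7]]
```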
### Hyperparameters

```yaml
learning_rate: 1e-5
lr_scheduler: constant_with_warmup
warmup_ratio: 0.1
epochs: 1
batch_size: 6
gradient_accumulation: 1
precision: bf16
max_length: 32768

# RoPE Scaling
rope_scaling_type: yarn
rope_scaling_factor: 4.0

# Retrieval Masking
mlm_probability: 0.30
retrieval_probability: 0.10
min_distance_for_retrieval: 512

# EWC Regularization
ewc_lambda: 1000.0
ewc_samples: 200
```
### Training Infrastructure
- Hardware: AMD MI300X GPU (192GB HBM3)
- Framework: PyTorch + Transformers
- Training Time: ~4 hours
## Evaluation Results

### Comparison with Base Model (at 32K context)
| Metric | This Model | Base ModernBERT |
|---|---|---|
| Basic MLM Accuracy | 100% | 100% |
| Perplexity @ 32K | 1.00 | 1.00 |
| Passkey Retrieval @ 512 | 60% | 50% |
| Passkey Retrieval @ 1K | 50% | 50% |
| Long-Range Coreference | 100% | 100% |
| Position Accuracy (early) | 90% | 85% |
| Position Accuracy (late) | 71% | 64% |
| Repeated Info Consistency | 100% | 100% |
### Distance-Based Retrieval Accuracy
| Distance | Accuracy |
|---|---|
| 64 tokens | 80% |
| 128 tokens | 60% |
| 256 tokens | 73% |
| 512 tokens | 27% |
| 1024 tokens | 40% |
| 2048 tokens | 47% |
| 4096 tokens | 53% |
| 8192 tokens | 20% |
### Perplexity by Context Length
| Context | Perplexity |
|---|---|
| 512 | 1.84 |
| 1024 | 1.29 |
| 2048 | 1.11 |
| 4096 | 1.02 |
| 8192 | 1.00 |
| 16384 | 1.00 |
| 24576 | 1.00 |
| 32768 | 1.00 |
## Usage

### Basic Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("llm-semantic-router/modernbert-base-32k")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/modernbert-base-32k")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get top-5 predictions for the masked position
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, mask_idx, :]
top_tokens = logits.topk(5).indices[0]
print([tokenizer.decode(t) for t in top_tokens])
# ['Paris', 'Lyon', 'Nancy', ...]
```
### Long Context Usage

```python
# For sequences longer than 8,192 tokens, YaRN scaling is applied automatically
long_text = "..." * 10000  # Very long document
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    max_length=32768,
    truncation=True,
)
outputs = model(**inputs)
```
### Feature Extraction

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("llm-semantic-router/modernbert-base-32k")
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # [batch, seq_len, 768]

# Mean pooling over non-padding tokens for a sentence embedding
attention_mask = inputs["attention_mask"]
masked_embeddings = embeddings * attention_mask.unsqueeze(-1)
sentence_embedding = masked_embeddings.sum(1) / attention_mask.sum(1, keepdim=True)
```
## Limitations

- **Passkey Retrieval at Long Distances**: Like the base model, this model struggles with needle-in-a-haystack retrieval beyond ~1K tokens. This is a fundamental limitation of MLM architectures, not of the fine-tuning.
- **Context Utilization**: At very long contexts (16K+), the model may not always benefit from the additional context for MLM predictions. This is expected behavior for encoder-only models.
- **Memory Requirements**: Processing 32K tokens requires significant GPU memory (~16 GB+ for inference).
- **Domain**: Trained primarily on web text; the model may not generalize well to specialized domains without additional fine-tuning.
## Intended Use
- Long-document understanding: Processing documents that exceed typical BERT context limits
- Information retrieval: Embedding long documents for semantic search
- Long-form text classification: Sentiment analysis, topic classification on long texts
- Coreference resolution: Resolving references across long documents
- Feature extraction: Getting embeddings for downstream tasks
## Training Methodology
This model uses several techniques to preserve long-range capabilities during fine-tuning:
### 1. YaRN RoPE Scaling
Extends positional embeddings from 8K to 32K using Yet another RoPE extensioN with:
- Scaling factor: 4.0
- Original base frequency: 10000
- Attention scaling for numerical stability
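A minimal sketch of the frequency adjustment behind YaRN-style scaling, assuming the "NTK-by-parts" scheme: low-frequency RoPE dimensions are interpolated by the scaling factor, high-frequency dimensions are left untouched, and a linear ramp blends the two regimes in between. The threshold constants (`beta_fast`, `beta_slow`) and the function itself are illustrative, not this model's exact implementation.

```python
import math

def yarn_inv_freq(dim=64, base=10000.0, factor=4.0,
                  orig_ctx=8192, beta_fast=32, beta_slow=1):
    # Standard RoPE inverse frequencies: base^(-2i/dim)
    inv_freq = [base ** (-2 * i / dim) for i in range(dim // 2)]

    def num_rotations(i):
        # Full rotations dimension i completes over the original context
        return orig_ctx * inv_freq[i] / (2 * math.pi)

    out = []
    for i, f in enumerate(inv_freq):
        r = num_rotations(i)
        if r > beta_fast:            # high frequency: keep as-is
            out.append(f)
        elif r < beta_slow:          # low frequency: full interpolation
            out.append(f / factor)
        else:                        # linear ramp between the two regimes
            t = (r - beta_slow) / (beta_fast - beta_slow)
            out.append(f / factor * (1 - t) + f * t)
    return out
```

The key property is that the fastest-rotating dimensions, which encode fine local positions, are preserved, while the slowest dimensions are stretched by the 4.0x factor to cover 32K positions.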
### 2. Retrieval-Aware Masking
Custom MLM objective that creates explicit long-range dependencies:
- Standard MLM masking (30%)
- Additional masking of tokens that appear in distant context (10% probability)
- Minimum distance threshold: 512 tokens
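The masking objective can be sketched as below. This is a hypothetical simplification of the described scheme (function name and bookkeeping are assumptions): on top of standard MLM masking, a token that also occurred at least `min_distance` positions earlier is masked with an extra probability, so recovering it requires attending to distant context.

```python
import random

def retrieval_mask(token_ids, mlm_prob=0.30, retrieval_prob=0.10,
                   min_distance=512, seed=0):
    rng = random.Random(seed)
    last_seen = {}                    # token id -> most recent position
    mask = [False] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:   # standard MLM masking (30%)
            mask[pos] = True
        elif tok in last_seen and pos - last_seen[tok] >= min_distance:
            # Token repeats far from its earlier occurrence:
            # extra masking encourages long-range retrieval (10%)
            if rng.random() < retrieval_prob:
                mask[pos] = True
        last_seen[tok] = pos
    return mask
```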
### 3. Elastic Weight Consolidation
Prevents catastrophic forgetting by:
- Computing Fisher information matrix on base model
- Penalizing changes to important weights during fine-tuning
- Lambda: 1000.0
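The EWC penalty itself is a Fisher-weighted quadratic pull toward the base model's weights. A minimal sketch in PyTorch, assuming the Fisher information has already been estimated (e.g. from squared gradients on a few base-model batches); the function name and dictionary layout are illustrative:

```python
import torch

def ewc_penalty(model, base_params, fisher, ewc_lambda=1000.0):
    # Sum of F_i * (theta_i - theta_i^base)^2 over all parameters,
    # scaled by lambda and added to the task loss during fine-tuning.
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - base_params[name]) ** 2).sum()
    return ewc_lambda * loss

# Toy usage with a single linear layer.
model = torch.nn.Linear(4, 4)
base = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}
print(ewc_penalty(model, base, fisher).item())  # 0.0 before any update
```

With a large lambda like 1000.0, even small drifts in high-Fisher (important) weights dominate the loss, which is what keeps the base model's capabilities intact.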
## Citation

If you use this model, please cite:

```bibtex
@misc{modernbert-32k,
  title={ModernBERT-base-32K: Long-Context Extension of ModernBERT},
  author={LLM Semantic Router},
  year={2026},
  url={https://huggingface.co/llm-semantic-router/modernbert-base-32k}
}
```
Also cite the original ModernBERT paper:

```bibtex
@article{modernbert2024,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
  journal={arXiv preprint arXiv:2412.13663},
  year={2024}
}
```
## License
Apache 2.0 (same as base model)