ModernBERT-base-32K

A long-context extension of ModernBERT-base fine-tuned for 32K token context length using YaRN (Yet another RoPE extensioN) scaling.

Model Description

This model extends the original ModernBERT-base from 8,192 tokens to 32,768 tokens context length while preserving the base model's capabilities. It was fine-tuned using specialized techniques to maintain long-range retrieval performance:

  • YaRN RoPE Scaling: Factor of 4.0x to extend positional embeddings
  • Retrieval-Aware Masking: Custom MLM objective that encourages long-range token dependencies
  • Elastic Weight Consolidation (EWC): Regularization to prevent catastrophic forgetting
  • Conservative Learning Rate: 1e-5 with a constant-with-warmup schedule

Model Details

| Property | Value |
|---|---|
| Parameters | 149M |
| Context Length | 32,768 tokens |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Layers | 22 |
| Vocabulary | 50,368 |
| Architecture | ModernBERT (RoPE + Sliding Window + Global Attention) |

Training

Dataset

  • Source: Leooyii/Slimpajama_downsample_32k_1B (subset)
  • Size: ~1B tokens of long-form text (≥32K tokens per document)
  • Preprocessing: documents concatenated with sequence boundaries preserved

Hyperparameters

```yaml
learning_rate: 1e-5
lr_scheduler: constant_with_warmup
warmup_ratio: 0.1
epochs: 1
batch_size: 6
gradient_accumulation: 1
precision: bf16
max_length: 32768

# RoPE Scaling
rope_scaling_type: yarn
rope_scaling_factor: 4.0

# Retrieval Masking
mlm_probability: 0.30
retrieval_probability: 0.10
min_distance_for_retrieval: 512

# EWC Regularization
ewc_lambda: 1000.0
ewc_samples: 200
```

Training Infrastructure

  • Hardware: AMD MI300X GPU (192GB HBM3)
  • Framework: PyTorch + Transformers
  • Training Time: ~4 hours

Evaluation Results

Comparison with Base Model (at 32K context)

| Metric | This Model | Base ModernBERT |
|---|---|---|
| Basic MLM Accuracy | 100% | 100% |
| Perplexity @ 32K | 1.00 | 1.00 |
| Passkey Retrieval @ 512 | 60% | 50% |
| Passkey Retrieval @ 1K | 50% | 50% |
| Long-Range Coreference | 100% | 100% |
| Position Accuracy (early) | 90% | 85% |
| Position Accuracy (late) | 71% | 64% |
| Repeated Info Consistency | 100% | 100% |

Distance-Based Retrieval Accuracy

| Distance | Accuracy |
|---|---|
| 64 tokens | 80% |
| 128 tokens | 60% |
| 256 tokens | 73% |
| 512 tokens | 27% |
| 1024 tokens | 40% |
| 2048 tokens | 47% |
| 4096 tokens | 53% |
| 8192 tokens | 20% |

Perplexity by Context Length

| Context | Perplexity |
|---|---|
| 512 | 1.84 |
| 1024 | 1.29 |
| 2048 | 1.11 |
| 4096 | 1.02 |
| 8192 | 1.00 |
| 16384 | 1.00 |
| 24576 | 1.00 |
| 32768 | 1.00 |

Usage

Basic Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("llm-semantic-router/modernbert-base-32k")
tokenizer = AutoTokenizer.from_pretrained("llm-semantic-router/modernbert-base-32k")

text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Get top-5 predictions for the masked position
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
logits = outputs.logits[0, mask_idx, :]
top_tokens = logits.topk(5).indices[0]
print([tokenizer.decode(t) for t in top_tokens])
# ['Paris', 'Lyon', 'Nancy', ...]
```

Long Context Usage

```python
# For sequences longer than 8,192 tokens, YaRN scaling is applied automatically
long_text = "..." * 10000  # stand-in for a very long document
inputs = tokenizer(
    long_text,
    return_tensors="pt",
    max_length=32768,
    truncation=True,
)
outputs = model(**inputs)
```
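
Documents beyond 32,768 tokens still have to be truncated; a common workaround (not a feature of this model) is to split the token ids into overlapping windows and encode each window separately. A minimal sketch, where the helper name and the stride value are illustrative choices:

```python
def chunk_token_ids(ids, max_length=32768, stride=4096):
    """Split a long token-id list into windows of at most `max_length`
    tokens, with `stride` tokens of overlap between consecutive windows
    so no span of text loses all of its context at a boundary."""
    step = max_length - stride
    chunks = []
    for start in range(0, max(len(ids) - stride, 1), step):
        chunks.append(ids[start:start + max_length])
    return chunks
```

Each chunk can then be fed to the model independently and the results aggregated (e.g. pooled embeddings averaged).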

Feature Extraction

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("llm-semantic-router/modernbert-base-32k")

inputs = tokenizer(text, return_tensors="pt")  # tokenizer loaded as above
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # [batch, seq_len, 768]

# Mean pooling for a sentence embedding (padding excluded via the mask)
attention_mask = inputs["attention_mask"]
masked_embeddings = embeddings * attention_mask.unsqueeze(-1)
sentence_embedding = masked_embeddings.sum(1) / attention_mask.sum(1, keepdim=True)
```
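
For semantic search over such pooled embeddings, a common pattern (not specific to this model) is to L2-normalize and rank by dot product, which then equals cosine similarity. A minimal sketch with stand-in tensors, so no model download is needed:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# stand-in pooled embeddings: 4 documents and 1 query, 768-dim each
doc_embeddings = torch.randn(4, 768)
query_embedding = doc_embeddings[2:3].clone()  # query identical to doc 2

# L2-normalize so the dot product equals cosine similarity
docs = F.normalize(doc_embeddings, dim=-1)
query = F.normalize(query_embedding, dim=-1)

scores = query @ docs.T        # [1, 4] cosine similarities in [-1, 1]
best = scores.argmax(dim=-1)   # doc 2, since the query matches it exactly
```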

Limitations

  1. Passkey Retrieval at Long Distances: Like the base model, this model struggles with needle-in-a-haystack retrieval beyond ~1K tokens. This is a fundamental limitation of MLM-style encoder architectures, not an artifact of the fine-tuning.

  2. Context Utilization: At very long contexts (16K+), the model may not always benefit from additional context for MLM predictions. This is expected behavior for encoder-only models.

  3. Memory Requirements: Processing 32K tokens requires significant GPU memory (~16GB+ for inference).

  4. Domain: Trained primarily on web text; may not generalize well to specialized domains without additional fine-tuning.

Intended Use

  • Long-document understanding: Processing documents that exceed typical BERT context limits
  • Information retrieval: Embedding long documents for semantic search
  • Long-form text classification: Sentiment analysis, topic classification on long texts
  • Coreference resolution: Resolving references across long documents
  • Feature extraction: Getting embeddings for downstream tasks

Training Methodology

This model uses several techniques to preserve long-range capabilities during fine-tuning:

1. YaRN RoPE Scaling

Extends the RoPE position encoding from 8K to 32K tokens with:

  • Scaling factor: 4.0
  • Original base frequency: 10000
  • Attention scaling for numerical stability
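
The "NTK-by-parts" interpolation at the core of YaRN can be sketched as follows. The `dim`, `alpha`, and `beta` defaults are illustrative values from the YaRN paper, not confirmed settings of this fine-tune:

```python
import math

def yarn_scaled_inv_freq(dim=64, base=10000.0, scale=4.0,
                         orig_ctx=8192, alpha=1.0, beta=32.0):
    """YaRN 'NTK-by-parts' scaling of RoPE inverse frequencies.
    High-frequency dims (many rotations over the original context)
    are kept as-is; low-frequency dims are interpolated by `scale`."""
    scaled = []
    for i in range(dim // 2):
        inv_freq = base ** (-2 * i / dim)
        # full rotations this dimension makes over the original context
        rotations = orig_ctx * inv_freq / (2 * math.pi)
        # ramp: 0 -> fully interpolate, 1 -> keep original frequency
        gamma = min(max((rotations - alpha) / (beta - alpha), 0.0), 1.0)
        scaled.append((1 - gamma) * inv_freq / scale + gamma * inv_freq)
    return scaled

def yarn_attn_scale(scale=4.0):
    """YaRN's attention temperature ('mscale') for numerical stability."""
    return 0.1 * math.log(scale) + 1.0
```

With `scale=4.0` the fastest-rotating dimensions are left untouched while the slowest are stretched by the full factor of 4, which is what lets pre-trained short-range behavior survive the extension.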

2. Retrieval-Aware Masking

Custom MLM objective that creates explicit long-range dependencies:

  • Standard MLM masking (30%)
  • Additional masking of tokens that appear in distant context (10% probability)
  • Minimum distance threshold: 512 tokens
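
The exact objective is not published; one plausible reading of the bullets above, sketched with the card's hyperparameters as defaults:

```python
import random

def retrieval_aware_mask(token_ids, mlm_prob=0.30, retrieval_prob=0.10,
                         min_distance=512, seed=0):
    """Sketch of retrieval-aware masking: standard random MLM masking,
    plus extra masking of tokens whose id already appeared at least
    `min_distance` positions earlier, so the model must look far back
    to recover them.  Returns a boolean mask of positions to hide."""
    rng = random.Random(seed)
    mask = [rng.random() < mlm_prob for _ in token_ids]  # standard MLM
    first_seen = {}
    for pos, tok in enumerate(token_ids):
        if tok in first_seen:
            if pos - first_seen[tok] >= min_distance and rng.random() < retrieval_prob:
                mask[pos] = True  # long-range repeat: force retrieval
        else:
            first_seen[tok] = pos
    return mask
```

(A full MLM pipeline would additionally apply the usual 80/10/10 mask/random/keep split to the selected positions.)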

3. Elastic Weight Consolidation

Prevents catastrophic forgetting by:

  • Computing Fisher information matrix on base model
  • Penalizing changes to important weights during fine-tuning
  • Lambda: 1000.0
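
The penalty itself is a quadratic anchor on the base model's weights, weighted by the diagonal Fisher estimate. A minimal sketch (variable names are illustrative):

```python
import torch

def ewc_penalty(model, fisher, anchor, lam=1000.0):
    """EWC regularizer: lam/2 * sum_i F_i * (theta_i - theta*_i)^2,
    where `fisher` maps parameter names to diagonal Fisher estimates
    and `anchor` holds the frozen base-model weights theta*."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - anchor[name]) ** 2).sum()
    return lam / 2.0 * penalty
```

During fine-tuning this term is added to the MLM loss, so weights the Fisher matrix marks as important are pulled back toward their pre-trained values.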

Citation

If you use this model, please cite:

```bibtex
@misc{modernbert-32k,
  title={ModernBERT-base-32K: Long-Context Extension of ModernBERT},
  author={LLM Semantic Router},
  year={2026},
  url={https://huggingface.co/llm-semantic-router/modernbert-base-32k}
}
```

Also cite the original ModernBERT paper:

```bibtex
@article{modernbert2024,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and Weller, Orion and Hallstr{\"o}m, Oskar and Taghadouini, Said and Gallagher, Alexis and Biswas, Raja and Ladhak, Faisal and Aarsen, Tom and Cooper, Nathan and Adams, Griffin and Howard, Jeremy and Poli, Iacopo},
  journal={arXiv preprint arXiv:2412.13663},
  year={2024}
}
```

License

Apache 2.0 (same as base model)

Acknowledgments

  • Answer.AI for the original ModernBERT model
  • Leooyii for the SlimPajama dataset
  • AMD for the MI300X GPU