language:
- pl
- en
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- chunking
- semantic-segmentation
- token-classification
- modernbert
- nlp
- rag
pipeline_tag: token-classification
datasets:
- wikimedia/wikipedia
ModernBERT Chunker Base
This model is a fine-tuned version of ModernBERT-base, specialized in semantic boundary detection. It is designed to be used with the fine-chunker library for high-quality text segmentation in RAG applications.
Model Highlights
- Context Length: 8192 tokens (full ModernBERT capacity).
- Architecture: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
- Training Strategy: Sequential packing of full Wikipedia articles with weighted Cross-Entropy.
- Languages: Bilingual support for Polish and English.
Usage
The easiest way to use this model is through the official library:
from fine_chunker import Chunker
# Load the model (runs optimally on CUDA or CPU)
chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)
text = "Your long multi-topic document..."
chunks = chunker.chunk(text)
for chunk in chunks:
    print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")
Training Details
Dataset
The model was trained on Wikipedia (20231101 version) for both Polish and English.
- Preprocessing: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter was lowercased in 40% of chunks, and the final period was removed in 40% of chunks.
- Ground Truth: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
- Packing: Multiple articles were packed into single `8192`-token sequences to maximize training efficiency.
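The preprocessing, labeling, and packing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's training code: whitespace-separated words stand in for ModernBERT subword tokens, a "chunk" is taken to be one paragraph, the boundary label is placed on the first token of each paragraph after the first, the two 40% perturbations are drawn independently, and packing is greedy. The function names `augment`, `label_article`, and `pack` are hypothetical, and how labels are handled where two articles meet is not specified in the source, so the sketch leaves them unchanged.

```python
import random

def augment(paragraph, rng, p=0.4):
    """With probability p each: lowercase the first letter, drop the final period."""
    if paragraph and rng.random() < p:
        paragraph = paragraph[0].lower() + paragraph[1:]
    if paragraph.endswith(".") and rng.random() < p:
        paragraph = paragraph[:-1]
    return paragraph

def label_article(article, rng=None):
    """Return (tokens, labels) for one article; label 1 marks a paragraph start."""
    tokens, labels = [], []
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    for i, paragraph in enumerate(paragraphs):
        if rng is not None:
            paragraph = augment(paragraph, rng)
        words = paragraph.split()  # stand-in for subword tokenization
        if not words:
            continue
        tokens.extend(words)
        # Assumption: only the first token of a non-initial paragraph is a boundary.
        labels.extend([1 if i > 0 else 0] + [0] * (len(words) - 1))
    return tokens, labels

def pack(examples, max_len=8192):
    """Greedily concatenate labeled articles into sequences of at most max_len tokens."""
    packed, cur_tokens, cur_labels = [], [], []
    for tokens, labels in examples:
        # Articles longer than max_len are simply truncated in this sketch.
        tokens, labels = tokens[:max_len], labels[:max_len]
        if cur_tokens and len(cur_tokens) + len(tokens) > max_len:
            packed.append((cur_tokens, cur_labels))
            cur_tokens, cur_labels = [], []
        cur_tokens = cur_tokens + tokens
        cur_labels = cur_labels + labels
    if cur_tokens:
        packed.append((cur_tokens, cur_labels))
    return packed

example = label_article("Alpha beta.\n\nGamma delta epsilon.")
sequences = pack([example, example], max_len=8)
```

With `max_len=8`, the two five-token articles do not fit together, so each ends up in its own sequence; with the real 8192-token budget, many short articles would share one sequence.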
Training Configuration
- Hardware: 4x NVIDIA A100-SXM4-40GB.
- Duration: 1 day, 6 hours, 1 minute.
- Precision: `bfloat16` with Flash Attention 2.
- Epochs: 1
- Optimization:
  - Loss Function: Weighted Cross-Entropy (`[1.0, 7.0]`) to address boundary sparsity.
  - Gradient Accumulation: 8 steps.
  - Dropout: 0.1.
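To make the effect of the class weights concrete, here is a small dependency-free sketch of a weighted cross-entropy over per-token logits. It assumes that class index 0 is "non-boundary" and index 1 is "boundary" (so the rare boundary class receives the weight 7.0), and that the loss is normalized by the sum of the applied weights, which is the convention PyTorch's `CrossEntropyLoss` uses with class weights and mean reduction. The source does not confirm either detail, and the function name is hypothetical.

```python
import math

CLASS_WEIGHTS = [1.0, 7.0]  # assumed order: [non-boundary, boundary]

def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted-mean cross-entropy over tokens.

    logits:  one [score_non_boundary, score_boundary] pair per token
    targets: one 0/1 label per token
    """
    total_loss, total_weight = 0.0, 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the two scores.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        nll = log_norm - scores[target]  # negative log-likelihood of the target
        total_loss += weights[target] * nll
        total_weight += weights[target]
    return total_loss / total_weight

# A model that always predicts "non-boundary" on a sequence with one boundary.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0])
```

In this toy case the weighted loss is larger than the unweighted one, because the missed boundary token counts seven times as much as the correctly classified non-boundary token. That is the mechanism by which the weighting counteracts the scarcity of boundary tokens.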
Architecture Details
Unlike standard token classifiers that use a single linear layer, this model uses a deep classification head:
- `Linear(hidden_size, hidden_size)`
- `ReLU`
- `Dropout(0.1)`
- `Linear(hidden_size, 2)` (Boundary vs. Non-boundary)
This allows the model to learn more complex semantic cues for segmentation.
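The forward pass of such a head, applied to the hidden state of a single token, can be sketched without any framework as shown below. This is only an illustration of the Linear-ReLU-Dropout-Linear layer order: the weights are random, the hidden size of 8 is a toy value, and the class and helper names are hypothetical. A real implementation would use framework layers such as `torch.nn.Linear` on top of the encoder output.

```python
import random

def linear(x, weight, bias):
    """Compute W x + b for one vector x; weight is a list of rows [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def dropout(x, p, rng, training):
    """Inverted dropout: active only in training, identity at inference."""
    if not training:
        return x
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]

def init_linear(n_out, n_in, rng):
    """Random weights and zero biases, for illustration only."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

class DeepHead:
    """Linear(hidden, hidden) -> ReLU -> Dropout(p) -> Linear(hidden, 2)."""

    def __init__(self, hidden_size, num_labels=2, p=0.1, seed=0):
        self.rng = random.Random(seed)
        self.p = p
        self.w1, self.b1 = init_linear(hidden_size, hidden_size, self.rng)
        self.w2, self.b2 = init_linear(num_labels, hidden_size, self.rng)

    def forward(self, hidden_state, training=False):
        h = relu(linear(hidden_state, self.w1, self.b1))
        h = dropout(h, self.p, self.rng, training)
        return linear(h, self.w2, self.b2)  # two logits for this token

head = DeepHead(hidden_size=8)
logits = head.forward([0.5] * 8)
predicted_class = logits.index(max(logits))  # index of the larger logit
```

Compared with a single linear layer, the extra hidden layer and nonlinearity let the head combine encoder features before the two-way decision, at the cost of one additional `hidden_size` by `hidden_size` weight matrix.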
Intended Use
- RAG Pipelines: Generating semantic chunks that preserve context better than fixed-size splitting.
- Long Document Analysis: Segmenting reports, legal documents, or books into logical chapters/sections.
- Pre-processing for LLMs: Ensuring input fragments are semantically complete.
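The difference between fixed-size splitting and boundary-based splitting can be illustrated with a toy example. The boundary flags below are written by hand as a stand-in for model predictions, words stand in for tokens, and both function names are hypothetical. How the fine-chunker library actually turns predictions into chunks is not described here, so treat this as a sketch of the general idea only.

```python
def split_fixed(tokens, size):
    """Cut the token list every `size` tokens, ignoring content."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def split_at_boundaries(tokens, flags):
    """Start a new chunk at every token whose flag is 1."""
    chunks, current = [], []
    for token, flag in zip(tokens, flags):
        if flag == 1 and current:
            chunks.append(current)
            current = []
        current.append(token)
    if current:
        chunks.append(current)
    return chunks

text = "Cats are mammals. They purr. Rust is a language. It has ownership."
tokens = text.split()

# Hypothetical predictions: a single boundary where the topic changes ("Rust").
flags = [0] * len(tokens)
flags[tokens.index("Rust")] = 1

fixed_chunks = split_fixed(tokens, 4)
semantic_chunks = split_at_boundaries(tokens, flags)

for chunk in semantic_chunks:
    print(" ".join(chunk))
```

Here the fixed-size split produces a middle chunk that mixes the end of the first topic with the start of the second, while the boundary-based split keeps each topic in its own chunk.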
Limitations
- While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
- Performance is best on texts with clear logical structures.
Evaluation
Status: Under Development. Systematic evaluation of the model's performance across different domains and languages is currently in progress.
Author
Developed by Jerzy Boksa.
- Contact: devjerzy@gmail.com
- GitHub: fine-chunker
Acknowledgements
This model was trained using the infrastructure provided by Cyfronet (Academic Computer Centre Cyfronet AGH) as part of an educational grant.
Citation
If you use this model or the fine-chunker library in your research or project, please cite it as follows:
@misc{boksa2024modernbertchunker,
author = {Jerzy Boksa},
title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
}