metadata
language:
  - pl
  - en
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
  - chunking
  - semantic-segmentation
  - token-classification
  - modernbert
  - nlp
  - rag
pipeline_tag: token-classification
datasets:
  - wikimedia/wikipedia

ModernBERT Chunker Base 🚀

This model is a fine-tuned version of ModernBERT-base, specialized in semantic boundary detection. It is designed to be used with the fine-chunker library for high-quality text segmentation in RAG applications.

Model Highlights

  • Context Length: 8192 tokens (full ModernBERT capacity).
  • Architecture: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
  • Training Strategy: Sequential packing of full Wikipedia articles with weighted Cross-Entropy.
  • Languages: Bilingual support for Polish and English.

Usage

The easiest way to use this model is through the official library:

from fine_chunker import Chunker

# Load the model (runs on either CUDA or CPU)
chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)

text = "Your long multi-topic document..."
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")

Training Details

Dataset

The model was trained on Wikipedia (20231101 version) for both Polish and English.

  • Preprocessing: Full articles were cleaned of wiki-noise (references, external links, metadata). As augmentation, the first letter of 40% of chunks was lowercased, and the final period was removed from 40% of chunks, so that the model does not rely only on capitalization and punctuation to find boundaries.
  • Ground Truth: Segmentation was based on natural paragraph boundaries (\n\n) found in well-structured Wikipedia articles.
  • Packing: Multiple articles were packed into single 8192 token sequences to maximize training efficiency.
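
The three steps above can be sketched roughly as follows. This is a minimal illustration and not the actual training pipeline: whitespace-separated words stand in for tokenizer tokens, the label 1 is placed on the first word of each paragraph (the real labeling position is not stated here), and the helper names `augment`, `paragraphs_to_labels`, and `pack` are hypothetical.

```python
import random


def augment(paragraph, rng, p=0.4):
    """Randomly weaken surface cues (assumed reading of the 40% rules)."""
    # With probability p, lowercase the first character of the chunk.
    if paragraph and rng.random() < p:
        paragraph = paragraph[0].lower() + paragraph[1:]
    # Independently, with probability p, drop a trailing period.
    if paragraph.endswith(".") and rng.random() < p:
        paragraph = paragraph[:-1]
    return paragraph


def paragraphs_to_labels(article):
    """Split on blank lines; label the first word of each paragraph with 1."""
    words, labels = [], []
    for para in article.split("\n\n"):
        para_words = para.split()
        if not para_words:
            continue  # skip empty paragraphs
        words.extend(para_words)
        labels.extend([1] + [0] * (len(para_words) - 1))
    return words, labels


def pack(examples, max_len=8192):
    """Greedily concatenate (words, labels) pairs into sequences <= max_len."""
    packed, cur_words, cur_labels = [], [], []
    for words, labels in examples:
        # Assumption: over-long articles are simply truncated.
        words, labels = words[:max_len], labels[:max_len]
        if cur_words and len(cur_words) + len(words) > max_len:
            packed.append((cur_words, cur_labels))
            cur_words, cur_labels = [], []
        cur_words = cur_words + words
        cur_labels = cur_labels + labels
    if cur_words:
        packed.append((cur_words, cur_labels))
    return packed


rng = random.Random(0)
article = "First paragraph here.\n\nSecond paragraph follows."
augmented = "\n\n".join(augment(p, rng) for p in article.split("\n\n"))
words, labels = paragraphs_to_labels(augmented)
sequences = pack([(words, labels)], max_len=8192)
```

In a real pipeline the labels would be aligned to tokenizer output rather than to whitespace words, and the packing would count tokens.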

Training Configuration

  • Hardware: 4x NVIDIA A100-SXM4-40GB.
  • Duration: 1 day, 6 hours, 1 minute.
  • Precision: bfloat16 with Flash Attention 2.
  • Epochs: 1
  • Optimization:
    • Loss Function: Weighted Cross-Entropy ([1.0, 7.0]) to address boundary sparsity.
    • Gradient Accumulation: 8 steps.
    • Dropout: 0.1.
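
To make the effect of the class weights concrete, here is a small pure-Python sketch of a weighted cross-entropy over two classes. It assumes that index 0 is the non-boundary class and index 1 the boundary class (the class order is not stated here), and it normalizes by the sum of the applied weights, which is how `torch.nn.CrossEntropyLoss` behaves with a `weight` tensor and `reduction="mean"`. The function name is illustrative only.

```python
import math

# Assumed order: [non-boundary, boundary].
CLASS_WEIGHTS = [1.0, 7.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted mean of per-token negative log-likelihoods.

    logits:  one [score_class0, score_class1] pair per token
    targets: one class id (0 or 1) per token
    """
    total, norm = 0.0, 0.0
    for scores, y in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        nll = log_z - scores[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# One confidently correct non-boundary token, one uncertain boundary token.
logits = [[5.0, 0.0], [0.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0])
```

Because the uncertain token belongs to the rare boundary class, the weighted loss is larger than the unweighted one, so missed boundaries contribute more to the gradient.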

Architecture Details

Unlike standard token classifiers that use a single linear layer, this model uses a deep classification head:

  1. Linear(hidden_size, hidden_size)
  2. ReLU
  3. Dropout(0.1)
  4. Linear(hidden_size, 2) (Boundary vs. Non-boundary)

This allows the model to learn more complex semantic cues for segmentation.
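
A head with the four layers listed above could be written in PyTorch roughly as shown below. This is a sketch, not the released implementation: the class name `DeepTokenClassificationHead` is hypothetical, and the default `hidden_size=768` assumes the hidden width of ModernBERT-base.

```python
import torch
from torch import nn


class DeepTokenClassificationHead(nn.Module):
    """Linear -> ReLU -> Dropout -> Linear, applied to every token position."""

    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size)
        # returns logits: (batch, seq_len, num_labels)
        return self.net(hidden_states)


head = DeepTokenClassificationHead()
head.eval()  # disable dropout for a deterministic forward pass

# Random tensor standing in for the encoder's last hidden states.
hidden_states = torch.randn(1, 16, 768)
with torch.no_grad():
    logits = head(hidden_states)
```

The head produces one pair of logits per token, so a sequence of 16 tokens yields a tensor of shape `(1, 16, 2)`.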

Intended Use

  • RAG Pipelines: Generating semantic chunks that preserve context better than fixed-size splitting.
  • Long Document Analysis: Segmenting reports, legal documents, or books into logical chapters/sections.
  • Pre-processing for LLMs: Ensuring input fragments are semantically complete.
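
The difference between fixed-size splitting and boundary-based splitting can be illustrated with a toy example. The sketch below is independent of the fine-chunker library: both function names are hypothetical, and the boundary offsets are supplied by hand, whereas in practice they would come from the model's predictions.

```python
def fixed_size_chunks(text, size):
    """Cut every `size` characters, regardless of content."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_at_boundaries(text, boundary_offsets):
    """Cut at character offsets that each mark the start of a new chunk."""
    cuts = sorted({o for o in boundary_offsets if 0 < o < len(text)})
    starts = [0] + cuts
    ends = cuts + [len(text)]
    pieces = [text[s:e].strip() for s, e in zip(starts, ends)]
    return [p for p in pieces if p]


text = "Cats purr. Dogs bark. Rust is a language."

# Fixed-size splitting cuts through the middle of a sentence.
naive = fixed_size_chunks(text, 16)

# Hand-picked offset 22 is where the topic changes from animals to Rust.
semantic = split_at_boundaries(text, [22])
```

Here `naive` breaks the text mid-sentence, while `semantic` keeps the two animal sentences together and places the unrelated sentence in its own chunk.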

Limitations

  • While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
  • Performance is best on texts with clear logical structures.

Evaluation

Status: Under Development

Systematic evaluation of the model's performance across different domains and languages is currently in progress.

Author

Developed by Jerzy Boksa.

  • Contact: devjerzy@gmail.com
  • GitHub: fine-chunker

Acknowledgements

This model was trained using the infrastructure provided by Cyfronet (Academic Computer Centre Cyfronet AGH) as part of an educational grant.

Citation

If you use this model or the fine-chunker library in your research or project, please cite it as follows:

@misc{boksa2024modernbertchunker,
  author = {Jerzy Boksa},
  title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
}