	---
	language:
	- pl
	- en
	license: apache-2.0
	base_model: answerdotai/ModernBERT-base
	tags:
	- chunking
	- semantic-segmentation
	- token-classification
	- modernbert
	- nlp
	- rag
	pipeline_tag: token-classification
	datasets:
	- wikimedia/wikipedia
	---

	# ModernBERT Chunker Base 🚀

	This model is a fine-tuned version of ModernBERT-base, specialized in semantic boundary detection. It is designed to be used with the [fine-chunker](https://github.com/JerzyCode/fine-chunker) library for high-quality text segmentation in RAG applications.

	## Model Highlights
	- Context Length: 8192 tokens (full ModernBERT capacity).
	- Architecture: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
	- Training Strategy: Sequential packing of full Wikipedia articles with weighted Cross-Entropy.
	- Languages: Bilingual support for Polish and English.

	## Usage

	The easiest way to use this model is through the official library:

	```python
	from fine_chunker import Chunker

	# Load the model (runs on either CUDA or CPU)
	chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)

	text = "Your long multi-topic document..."
	chunks = chunker.chunk(text)

	for chunk in chunks:
	    print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")
	```
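For readers curious about what a token-classification chunker has to do after the forward pass, the sketch below turns per-token boundary predictions into text chunks. It is an illustration only, not the `fine-chunker` implementation: the function name `split_on_boundaries`, the `(start, end)` character offsets, and the convention that label `1` marks the first token of a new chunk are all assumptions.

```python
from typing import List, Tuple


def split_on_boundaries(
    text: str,
    offsets: List[Tuple[int, int]],
    labels: List[int],
) -> List[str]:
    """Cut `text` wherever a token is labelled as a chunk start.

    Assumption (not confirmed by the model card): label 1 marks the first
    token of a new chunk, label 0 marks every other token.
    """
    if len(offsets) != len(labels):
        raise ValueError("offsets and labels must have the same length")

    # Character positions where a new chunk begins; the text start is always one.
    cut_points = [0]
    for (start, _end), label in zip(offsets, labels):
        if label == 1 and start > cut_points[-1]:
            cut_points.append(start)
    cut_points.append(len(text))

    # Slice between consecutive cut points and drop empty pieces.
    pieces = [text[a:b].strip() for a, b in zip(cut_points, cut_points[1:])]
    return [piece for piece in pieces if piece]


text = "Cats are mammals. Rust is a language."
# Hypothetical whitespace tokens with their character offsets.
offsets = [(0, 4), (5, 8), (9, 17), (18, 22), (23, 25), (26, 27), (28, 37)]
labels = [1, 0, 0, 1, 0, 0, 0]  # "Cats" and "Rust" are predicted chunk starts
chunks = split_on_boundaries(text, offsets, labels)
```

Working on character offsets rather than decoded token strings keeps the original whitespace and casing of each chunk intact.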

	## Training Details

	### Dataset
	The model was trained on Wikipedia (20231101 version) for both Polish and English.
	- Preprocessing: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first character of 40% of chunks was lowercased, and the final period was removed from 40% of chunks.
	- Ground Truth: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
	- Packing: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
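The three dataset steps above can be sketched roughly as follows. This is a simplified illustration, not the actual training pipeline: whitespace-separated words stand in for tokenizer tokens, label `1` is assumed to mark the first token of a paragraph, the two augmentations are assumed to be sampled independently, and the packing is assumed to concatenate articles in order and cut the stream into fixed-size windows. The names `augment`, `label_article`, and `pack` are hypothetical.

```python
import random
from typing import List, Tuple

Example = Tuple[List[str], List[int]]


def augment(paragraph: str, rng: random.Random, p: float = 0.4) -> str:
    """Randomly lowercase the first character and drop a trailing period."""
    if paragraph and rng.random() < p:
        paragraph = paragraph[0].lower() + paragraph[1:]
    if paragraph.endswith(".") and rng.random() < p:
        paragraph = paragraph[:-1]
    return paragraph


def label_article(article: str, rng: random.Random) -> Example:
    """Split on blank lines; label 1 = first word of a paragraph, 0 = any other word."""
    words: List[str] = []
    labels: List[int] = []
    for paragraph in article.split("\n\n"):
        tokens = augment(paragraph.strip(), rng).split()
        if not tokens:
            continue  # skip empty paragraphs
        words.extend(tokens)
        labels.extend([1] + [0] * (len(tokens) - 1))
    return words, labels


def pack(examples: List[Example], max_len: int = 8192) -> List[Example]:
    """Concatenate labelled articles in order and cut into windows of at most max_len."""
    flat_words: List[str] = []
    flat_labels: List[int] = []
    for words, labels in examples:
        flat_words.extend(words)
        flat_labels.extend(labels)
    return [
        (flat_words[i:i + max_len], flat_labels[i:i + max_len])
        for i in range(0, len(flat_words), max_len)
    ]


rng = random.Random(0)
article = "First paragraph here.\n\nSecond paragraph follows."
words, labels = label_article(article, rng)
packed = pack([(words, labels), (words, labels)], max_len=8)
```

Removing capitalization and final punctuation from a share of the chunks presumably discourages the model from relying on those surface cues alone, although the card does not state the motivation.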

	### Training Configuration
	- Hardware: 4x NVIDIA A100-SXM4-40GB.
	- Duration: 1 day, 6 hours, 1 minute.
	- Precision: `bfloat16` with Flash Attention 2.
	- Epochs: 1
	- Optimization:
	  - Loss Function: Weighted Cross-Entropy (`[1.0, 7.0]`) to address boundary sparsity.
	  - Gradient Accumulation: 8 steps.
	  - Dropout: 0.1.
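To make the effect of the `[1.0, 7.0]` class weights concrete, here is a small dependency-free sketch of weighted cross-entropy. It assumes that class `1` is the boundary class and that the loss is normalized by the sum of the applied weights (the convention PyTorch's `CrossEntropyLoss` uses with `weight` and the default `mean` reduction); the card does not confirm either detail, and `weighted_cross_entropy` is a hypothetical name.

```python
import math
from typing import Sequence


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    class_weights: Sequence[float] = (1.0, 7.0),
) -> float:
    """Weighted mean of per-token negative log-likelihoods."""
    total_loss = 0.0
    total_weight = 0.0
    for token_logits, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class logits.
        max_logit = max(token_logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in token_logits)
        )
        nll = log_sum_exp - token_logits[target]
        weight = class_weights[target]
        total_loss += weight * nll
        total_weight += weight
    return total_loss / total_weight


# A lazy model that predicts "non-boundary" for every token.
logits = [[2.0, -2.0]] * 8
targets = [0, 0, 0, 0, 0, 0, 0, 1]  # one boundary among eight tokens

unweighted = weighted_cross_entropy(logits, targets, class_weights=(1.0, 1.0))
weighted = weighted_cross_entropy(logits, targets, class_weights=(1.0, 7.0))
```

With these weights the single missed boundary carries as much weight as the seven non-boundary tokens combined, so always predicting "non-boundary" is penalized far more heavily than under the unweighted loss.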

	### Architecture Details
	Unlike standard token classifiers that use a single linear layer, this model uses a deep classification head:
	1. `Linear(hidden_size, hidden_size)`
	2. `ReLU`
	3. `Dropout(0.1)`
	4. `Linear(hidden_size, 2)` (Boundary vs. Non-boundary)

	This allows the model to learn more complex semantic cues for segmentation.
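The four layers listed above could be written in PyTorch roughly as follows. This is a minimal sketch rather than the released model code: the class name `BoundaryHead` is hypothetical, `hidden_size=768` is the hidden size of ModernBERT-base, and a random tensor stands in for the encoder output.

```python
import torch
from torch import nn


class BoundaryHead(nn.Module):
    """Linear -> ReLU -> Dropout -> Linear head applied to every token."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        return self.net(hidden_states)


head = BoundaryHead().eval()  # eval() disables dropout for inference
with torch.no_grad():
    # Random tensor standing in for ModernBERT-base encoder output.
    logits = head(torch.randn(2, 16, 768))
```

The head produces two logits per token, one for each of the two classes.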

	## Intended Use
	- RAG Pipelines: Generating semantic chunks that preserve context better than fixed-size splitting.
	- Long Document Analysis: Segmenting reports, legal documents, or books into logical chapters/sections.
	- Pre-processing for LLMs: Ensuring input fragments are semantically complete.

	## Limitations & Future Work

	- Training Data Focus: The current version was trained exclusively on the English and Polish Wikipedia datasets. It is best suited to structured, informative prose and has not been exposed to noisy data, conversational text, or journalistic styles such as news.
	- Base Model Version: This is a general-purpose base model. Specialized domains (e.g., legal contracts, medical records, or minified code) might require additional fine-tuning for optimal boundary detection.
	- Logical Structure: Performance is best on documents with clear paragraph breaks and logical flow, similar to the encyclopedic style of its training data.
	- Niche Domains: If you're working with datasets far removed from Wikipedia's structure, feel free to reach out or share your feedback. Domain-specific refinements are under consideration.

	## Evaluation
	Status: Under Development. Systematic evaluation of the model's performance across different domains and languages is currently in progress.


	## Author
	Developed by Jerzy Boksa.
	Contact: devjerzy@gmail.com
	GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)

	## Acknowledgements
	This model was trained using the infrastructure provided by Cyfronet (Academic Computer Centre Cyfronet AGH) as part of an educational grant.



	## Citation

	If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:

	```bibtex
	@misc{boksa2026modernbertchunker,
	  author = {Jerzy Boksa},
	  title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
	  year = {2026},
	  publisher = {Hugging Face},
	  journal = {Hugging Face Model Hub},
	  howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
	}
	```