	---
	language:
	- pl
	- en
	license: apache-2.0
	base_model: answerdotai/ModernBERT-base
	tags:
	- chunking
	- semantic-segmentation
	- token-classification
	- modernbert
	- nlp
	- rag
	pipeline_tag: token-classification
	datasets:
	- wikimedia/wikipedia
	---

	# ModernBERT Chunker Base 🚀

	This model is a fine-tuned version of ModernBERT-base, specialized in semantic boundary detection. It is designed to be used with the [fine-chunker](https://github.com/JerzyCode/fine-chunker) library for high-quality text segmentation in RAG applications.

	## Model Highlights
	- Context Length: 8192 tokens (full ModernBERT capacity).
	- Architecture: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
	- Training Strategy: Sequential packing of full Wikipedia articles with weighted Cross-Entropy.
	- Languages: Bilingual support for Polish and English.

	## Usage

	The easiest way to use this model is through the official library:

	```python
	from fine_chunker import Chunker

	# Load the model (runs on CUDA or CPU)
	chunker = Chunker.from_pretrained(device="cpu", use_onnx=True)

	text = "Your long multi-topic document..."
	chunks = chunker.chunk(text)

	for chunk in chunks:
	    print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")
	```

	## Training Details

	### Dataset
	The model was trained on Wikipedia (20231101 version) for both Polish and English.
	- Preprocessing: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter was lowercased in 40% of chunks, and the final period was removed from 40% of chunks.
	- Ground Truth: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
	- Packing: Multiple articles were packed into single 8192-token sequences to maximize training efficiency.
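
The preprocessing, labeling, and packing steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual training code: the function names (`augment_chunk`, `label_article`, `pack`) are hypothetical, whitespace splitting stands in for the real tokenizer, and two labeling conventions are assumptions (the first token of a paragraph carries the boundary label, and an article that starts inside a pack also counts as a boundary).

```python
import random


def augment_chunk(chunk: str, rng: random.Random, p: float = 0.4) -> str:
    """Lowercase the first letter and drop the final period,
    each independently with probability p (hypothetical helper)."""
    if chunk and rng.random() < p:
        chunk = chunk[0].lower() + chunk[1:]
    if chunk.endswith(".") and rng.random() < p:
        chunk = chunk[:-1]
    return chunk


def label_article(article: str, rng: random.Random, p: float = 0.4):
    """Split an article on blank lines and return (tokens, labels).
    Label 1 marks the first token of every paragraph after the first
    (assumed convention); all other tokens get label 0."""
    paragraphs = [par.strip() for par in article.split("\n\n") if par.strip()]
    tokens, labels = [], []
    for i, par in enumerate(paragraphs):
        # Whitespace split is a stand-in for a real subword tokenizer.
        words = augment_chunk(par, rng, p).split()
        if not words:
            continue
        tokens.extend(words)
        labels.extend([1 if i > 0 else 0] + [0] * (len(words) - 1))
    return tokens, labels


def pack(examples, max_len: int = 8192):
    """Greedily concatenate labeled articles into sequences of at most
    max_len tokens. Overlong articles are truncated in this sketch."""
    packed, cur_tokens, cur_labels = [], [], []
    for tokens, labels in examples:
        tokens, labels = tokens[:max_len], labels[:max_len]
        if not tokens:
            continue
        if cur_tokens and len(cur_tokens) + len(tokens) > max_len:
            packed.append((cur_tokens, cur_labels))
            cur_tokens, cur_labels = [], []
        if cur_tokens:
            # Assumption: an article starting mid-pack is also a boundary.
            labels = [1] + labels[1:]
        cur_tokens = cur_tokens + tokens
        cur_labels = cur_labels + labels
    if cur_tokens:
        packed.append((cur_tokens, cur_labels))
    return packed


rng = random.Random(0)
example = label_article("First paragraph here.\n\nSecond paragraph.", rng)
sequences = pack([example], max_len=8192)
```

Passing `p=0.0` disables the augmentation, which is convenient for checking the labels in isolation.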

	### Training Configuration
	- Hardware: 4x NVIDIA A100-SXM4-40GB.
	- Duration: 1 day, 6 hours, 1 minute.
	- Precision: `bfloat16` with Flash Attention 2.
	- Epochs: 1
	- Optimization:
	  - Loss Function: Weighted Cross-Entropy (`[1.0, 7.0]`) to address boundary sparsity.
	  - Gradient Accumulation: 8 steps.
	  - Dropout: 0.1.
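
To illustrate what the `[1.0, 7.0]` class weights do, here is a minimal pure-Python sketch of weighted cross-entropy over per-token logits. It assumes class index 1 is the boundary class (inferred from the larger weight, not stated above), and the function name is hypothetical. The normalization by the sum of target weights mirrors what `torch.nn.CrossEntropyLoss` does when given a `weight` tensor with mean reduction.

```python
import math

# Assumed ordering: index 0 = non-boundary, index 1 = boundary.
CLASS_WEIGHTS = [1.0, 7.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted cross-entropy averaged over tokens.

    logits:  list of per-token [score_class0, score_class1]
    targets: list of per-token class indices
    The sum of weighted losses is divided by the sum of the target weights.
    """
    total, norm = 0.0, 0.0
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])
        norm += weights[y]
    return total / norm


# Two tokens, both scored as "non-boundary"; the second is really a boundary.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
weighted = weighted_cross_entropy(logits, targets)
unweighted = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0])
print(f"weighted={weighted:.4f} unweighted={unweighted:.4f}")
```

With the 7x weight, the missed boundary dominates the loss, so the model is penalized more for missing the rare boundary tokens than for mislabeling the common non-boundary ones.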

	### Architecture Details
	Unlike standard token classifiers that use a single linear layer, this model uses a deep classification head:
	1. `Linear(hidden_size, hidden_size)`
	2. `ReLU`
	3. `Dropout(0.1)`
	4. `Linear(hidden_size, 2)` (Boundary vs. Non-boundary)

	This allows the model to learn more complex semantic cues for segmentation.
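
The head described above could be written in PyTorch roughly as follows. This is a sketch rather than the model's actual source: the class name `BoundaryHead` is hypothetical, and `hidden_size=768` assumes the hidden dimension of ModernBERT-base.

```python
import torch
from torch import nn


class BoundaryHead(nn.Module):
    """Linear -> ReLU -> Dropout -> Linear head applied to every token."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the encoder
        # returns:       (batch, seq_len, num_labels) per-token logits
        return self.net(hidden_states)


head = BoundaryHead().eval()  # eval() disables dropout for inference
with torch.no_grad():
    logits = head(torch.randn(1, 16, 768))  # random stand-in for encoder output
```

The output has one pair of logits per input token, so a boundary decision can be made at every position in the sequence.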

	## Intended Use
	- RAG Pipelines: Generating semantic chunks that preserve context better than fixed-size splitting.
	- Long Document Analysis: Segmenting reports, legal documents, or books into logical chapters/sections.
	- Pre-processing for LLMs: Ensuring input fragments are semantically complete.

	## Limitations
	- While effective on general-knowledge text, the model may require further fine-tuning for highly specialized domains (e.g., medical text or highly technical code documentation).
	- Performance is best on texts with clear logical structures.

	## Evaluation
	Status: Under Development

	> Systematic evaluation of the model's performance across different domains and languages is currently in progress.


	## Author
	Developed by Jerzy Boksa.
	Contact: devjerzy@gmail.com
	GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)

	## Acknowledgements
	This model was trained using the infrastructure provided by Cyfronet (Academic Computer Centre Cyfronet AGH) as part of an educational grant.

	## Citation

	If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:

	```bibtex
	@misc{boksa2024modernbertchunker,
	  author = {Jerzy Boksa},
	  title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
	  year = {2026},
	  publisher = {Hugging Face},
	  journal = {Hugging Face Model Hub},
	  howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
	}
	```