Update README.md
README.md
CHANGED
@@ -1,3 +1,84 @@
---
language:
- pl
- en
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- chunking
- semantic-segmentation
- token-classification
- modernbert
- nlp
- rag
pipeline_tag: token-classification
datasets:
- wikimedia/wikipedia
---

# ModernBERT Chunker Base 🚀

This model is a fine-tuned version of **ModernBERT-base**, specialized in **semantic boundary detection**. It is designed to be used with the [modbert-chunker](https://github.com/jboksa/modbert-chunker) library for high-quality text segmentation in RAG applications.

## Model Highlights
- **Context Length**: 8192 tokens (full ModernBERT capacity).
- **Architecture**: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
- **Training Strategy**: Sequential packing of full Wikipedia articles, trained with a weighted cross-entropy loss.
- **Languages**: Bilingual support for **Polish** and **English**.

## Usage

The easiest way to use this model is through the official library:

```python
from modbert_chunker import Chunker

# Load the model (runs on either CUDA or CPU)
chunker = Chunker.from_pretrained("jboksa/modbert-chunker-base")

text = "Your long multi-topic document..."
chunks = chunker.chunk(text)

for chunk in chunks:
    print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")
```
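
The library's internals are not described in this card. As a rough illustration of how per-token boundary predictions can be turned into chunks, the sketch below cuts the text after every token whose boundary probability reaches a threshold. `SketchChunk`, `split_on_boundaries`, the `0.5` threshold, the whitespace tokens, and the probabilities are all hypothetical stand-ins, and it assumes a boundary label marks the last token of a chunk.

```python
from dataclasses import dataclass


@dataclass
class SketchChunk:
    """Hypothetical stand-in for the library's chunk object."""
    index: int
    content: str


def split_on_boundaries(text, offsets, boundary_probs, threshold=0.5):
    """Cut `text` after every token whose boundary probability >= threshold.

    offsets: (start, end) character span of each token in `text`.
    boundary_probs: one probability per token (assumed model output).
    """
    chunks, start = [], 0
    for (_, tok_end), prob in zip(offsets, boundary_probs):
        if prob >= threshold:
            piece = text[start:tok_end].strip()
            if piece:
                chunks.append(SketchChunk(len(chunks), piece))
            start = tok_end
    # Whatever follows the last boundary becomes the final chunk.
    tail = text[start:].strip()
    if tail:
        chunks.append(SketchChunk(len(chunks), tail))
    return chunks


text = "Cats purr. Dogs bark. Rust is fast."
# Whitespace "tokens" with character offsets (a real tokenizer would supply these).
offsets, pos = [], 0
for word in text.split():
    begin = text.index(word, pos)
    offsets.append((begin, begin + len(word)))
    pos = begin + len(word)
# Made-up probabilities: a single boundary after "bark.".
probs = [0.1, 0.2, 0.1, 0.9, 0.1, 0.1, 0.3]
chunks = split_on_boundaries(text, offsets, probs)
```

Working on character offsets rather than decoded tokens keeps the chunk text identical to the input, which matters when chunks are later cited back to the source document.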

## Training Details

### Dataset
The model was trained on the **20231101 Wikipedia snapshot** (`wikimedia/wikipedia`) for both Polish and English.
- **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata).
- **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
- **Packing**: Multiple articles were packed into single 8192-token sequences to maximize training efficiency.
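
The labeling and packing steps above can be sketched as follows. This is a minimal illustration and not the actual training pipeline: it uses whitespace-separated words in place of ModernBERT tokens, assumes the boundary label sits on the last token of each paragraph, and uses a simple greedy packer. The names `label_article` and `pack` are hypothetical.

```python
def label_article(article):
    """Return (tokens, labels); label 1 marks the last token of each paragraph."""
    tokens, labels = [], []
    for paragraph in article.split("\n\n"):
        words = paragraph.split()  # stand-in for a real subword tokenizer
        if not words:
            continue
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def pack(articles, max_len=8192):
    """Greedily concatenate labeled articles into sequences of at most max_len tokens."""
    sequences, cur_tokens, cur_labels = [], [], []
    for article in articles:
        tokens, labels = label_article(article)
        # Articles longer than max_len are split into max_len-sized pieces.
        for i in range(0, len(tokens), max_len):
            piece_t, piece_l = tokens[i:i + max_len], labels[i:i + max_len]
            if cur_tokens and len(cur_tokens) + len(piece_t) > max_len:
                sequences.append((cur_tokens, cur_labels))
                cur_tokens, cur_labels = [], []
            cur_tokens = cur_tokens + piece_t
            cur_labels = cur_labels + piece_l
    if cur_tokens:
        sequences.append((cur_tokens, cur_labels))
    return sequences


articles = [
    "Intro text here.\n\nSecond paragraph.",
    "Another article.\n\nMore text follows.",
]
packed = pack(articles, max_len=8)  # tiny limit so the example splits
```

With `max_len=8` the two five-token articles do not fit together and end up in separate sequences; with a larger limit they share one sequence, and the end of the first article is itself a labeled boundary.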

### Training Configuration
- **Hardware**: 4x NVIDIA A100-SXM4-40GB.
- **Duration**: 1 day, 6 hours, 1 minute.
- **Precision**: `bfloat16` with Flash Attention 2.
- **Epochs**: 1
- **Optimization**:
  - **Loss Function**: Weighted Cross-Entropy (`[1.0, 7.0]`) to address boundary sparsity.
  - **Gradient Accumulation**: 8 steps.
  - **Dropout**: 0.1.
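
To make the effect of the `[1.0, 7.0]` class weights concrete, here is a small pure-Python sketch of a weighted cross-entropy over token logits. It assumes class index 1 is the boundary class and that the loss is normalized by the sum of the target weights, which is how PyTorch's `CrossEntropyLoss(weight=...)` behaves with its default mean reduction. The function name and the example logits are made up for illustration.

```python
import math


def weighted_cross_entropy(logits, targets, class_weights=(1.0, 7.0)):
    """Weighted mean cross-entropy over tokens.

    logits: one [non_boundary_logit, boundary_logit] pair per token.
    targets: one 0/1 label per token (1 = boundary, by assumption).
    """
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_norm - row[target]  # negative log-likelihood of the true class
        total += class_weights[target] * nll
        weight_sum += class_weights[target]
    return total / weight_sum


# Two confident non-boundary tokens and one undecided boundary token.
logits = [[2.0, -1.0], [1.5, 0.0], [0.0, 0.0]]
targets = [0, 0, 1]
plain = weighted_cross_entropy(logits, targets, class_weights=(1.0, 1.0))
weighted = weighted_cross_entropy(logits, targets)
```

In this example the unweighted loss is about 0.31, while the weighted loss is about 0.57, because the single uncertain boundary token now carries seven times the weight of each non-boundary token.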

### Architecture Details
Unlike standard token classifiers that use a single linear layer, this model uses a **deep classification head**:
1. `Linear(hidden_size, hidden_size)`
2. `ReLU`
3. `Dropout(0.1)`
4. `Linear(hidden_size, 2)` (Boundary vs. Non-boundary)

The extra non-linear layer is intended to let the model learn more complex semantic cues for segmentation than a single linear projection would.
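
The four layers listed above map directly onto a small PyTorch module. The sketch below is an illustration of that layer stack only: the class name `DeepTokenHead` is hypothetical, random tensors stand in for the encoder's hidden states, and how the head is wired to ModernBERT in the released checkpoint is not shown here.

```python
import torch
from torch import nn


class DeepTokenHead(nn.Module):
    """Linear -> ReLU -> Dropout -> Linear, applied to every token's hidden state."""

    def __init__(self, hidden_size: int, num_labels: int = 2, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, num_labels),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, hidden_size) -> (batch, seq_len, num_labels)
        return self.net(hidden_states)


head = DeepTokenHead(hidden_size=768)  # 768 is the hidden size of ModernBERT-base
head.eval()  # disable dropout for a deterministic forward pass
with torch.no_grad():
    # Random tensor standing in for encoder output: 2 sequences of 16 tokens.
    logits = head(torch.randn(2, 16, 768))
```

Because `nn.Linear` acts on the last dimension, the same head scores every token position independently, yielding two logits per token.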

## Intended Use
- **RAG Pipelines**: Generating semantic chunks that preserve context better than fixed-size splitting.
- **Long Document Analysis**: Segmenting reports, legal documents, or books into logical chapters/sections.
- **Pre-processing for LLMs**: Ensuring input fragments are semantically complete.

## Limitations
- While effective on general-knowledge text, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
- Performance is best on texts with a clear logical structure.

## Author
Developed by **Jerzy Boksa**.
GitHub: [modbert-chunker](https://github.com/jboksa/modbert-chunker)