jboksa committed on
Commit 28f451d · verified · 1 Parent(s): 5a62cee

Update README.md

  1. README.md +82 -1
README.md CHANGED
@@ -1,3 +1,84 @@
1
  ---
2
- license: mit
3
  ---
1
  ---
2
+ language:
3
+ - pl
4
+ - en
5
+ license: apache-2.0
6
+ base_model: answerdotai/ModernBERT-base
7
+ tags:
8
+ - chunking
9
+ - semantic-segmentation
10
+ - token-classification
11
+ - modernbert
12
+ - nlp
13
+ - rag
14
+ pipeline_tag: token-classification
15
+ datasets:
16
+ - wikimedia/wikipedia
17
  ---
18
+
19
+ # ModernBERT Chunker Base 🚀
20
+
21
+ This model is a fine-tuned version of **ModernBERT-base**, specialized in **semantic boundary detection**. It is designed to be used with the [modbert-chunker](https://github.com/jboksa/modbert-chunker) library for high-quality text segmentation in RAG applications.
22
+
23
+ ## Model Highlights
24
+ - **Context Length**: 8192 tokens (full ModernBERT capacity).
25
+ - **Architecture**: ModernBERT-base + Deep Classification Head (Linear-ReLU-Dropout-Linear).
26
+ - **Training Strategy**: Sequential packing of full Wikipedia articles with weighted Cross-Entropy.
27
+ - **Languages**: Bilingual support for **Polish** and **English**.
28
+
29
+ ## Usage
30
+
31
+ The easiest way to use this model is through the official library:
32
+
33
+ ```python
34
+ from modbert_chunker import Chunker
35
+
36
+ # Load the model (runs on CUDA or CPU)
37
+ chunker = Chunker.from_pretrained("jboksa/modbert-chunker-base")
38
+
39
+ text = "Your long multi-topic document..."
40
+ chunks = chunker.chunk(text)
41
+
42
+ for chunk in chunks:
43
+     print(f"Index: {chunk.index} | Content: {chunk.content[:100]}...")
44
+ ```
45
+
46
+ ## Training Details
47
+
48
+ ### Dataset
49
+ The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
50
+ - **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata).
51
+ - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
52
+ - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
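The ground-truth and packing steps described above can be sketched in plain Python. This is a minimal illustration, not the author's preprocessing code: the function names `paragraph_boundary_labels` and `pack_articles` are hypothetical, whitespace-separated words stand in for the real ModernBERT tokenizer, placing the boundary label on the last token of each paragraph is an assumption, and truncating over-long articles is a simplification.

```python
def paragraph_boundary_labels(article):
    """Label the last token of each paragraph 1 (boundary), all others 0.

    Assumption: whitespace tokens stand in for real tokenizer output, and the
    boundary label sits on the final token of each paragraph.
    """
    tokens, labels = [], []
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    for paragraph in paragraphs:
        words = paragraph.split()
        tokens.extend(words)
        labels.extend([0] * (len(words) - 1) + [1])
    return tokens, labels


def pack_articles(labeled_articles, max_len=8192):
    """Greedily concatenate (tokens, labels) pairs into sequences of at most max_len."""
    sequences = []
    current_tokens, current_labels = [], []
    for tokens, labels in labeled_articles:
        # Simplification: an article longer than max_len is truncated.
        tokens, labels = tokens[:max_len], labels[:max_len]
        if current_tokens and len(current_tokens) + len(tokens) > max_len:
            sequences.append((current_tokens, current_labels))
            current_tokens, current_labels = [], []
        current_tokens = current_tokens + tokens
        current_labels = current_labels + labels
    if current_tokens:
        sequences.append((current_tokens, current_labels))
    return sequences
```

Because articles are appended whole, a paragraph is never split across two packed sequences unless a single article exceeds `max_len` on its own.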
53
+
54
+ ### Training Configuration
55
+ - **Hardware**: 4x NVIDIA A100-SXM4-40GB.
56
+ - **Duration**: 1 day, 6 hours, 1 minute.
57
+ - **Precision**: `bfloat16` with Flash Attention 2.
58
+ - **Epochs**: 1
59
+ - **Optimization**:
60
+   - **Loss Function**: Weighted Cross-Entropy (`[1.0, 7.0]`) to address boundary sparsity.
61
+   - **Gradient Accumulation**: 8 steps.
62
+   - **Dropout**: 0.1.
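To make the effect of the `[1.0, 7.0]` class weights concrete, here is a small pure-Python sketch of weighted cross-entropy. It assumes the weight order is `[non-boundary, boundary]` and uses the weighted-mean normalization that PyTorch's `CrossEntropyLoss` applies by default. Neither detail is confirmed by the model card, and the function names are illustrative.

```python
import math

CLASS_WEIGHTS = [1.0, 7.0]  # assumed order: [non-boundary, boundary]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(logits_per_token, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of per-token negative log-likelihoods."""
    total_loss = 0.0
    total_weight = 0.0
    for logits, label in zip(logits_per_token, labels):
        w = weights[label]
        total_loss += -w * log_softmax(logits)[label]
        total_weight += w
    return total_loss / total_weight
```

With these weights, a missed boundary token contributes seven times as much to the loss as an equally wrong non-boundary token, which counteracts the fact that most tokens are not boundaries.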
63
+
64
+ ### Architecture Details
65
+ Unlike standard token classifiers that use a single linear layer, this model uses a **deep classification head**:
66
+ 1. `Linear(hidden_size, hidden_size)`
67
+ 2. `ReLU`
68
+ 3. `Dropout(0.1)`
69
+ 4. `Linear(hidden_size, 2)` (Boundary vs. Non-boundary)
70
+
71
+ This allows the model to learn more complex semantic cues for segmentation.
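Written as a formula, the four-layer head listed above maps each token's final hidden state $h_t \in \mathbb{R}^{d}$ to two logits. The sketch below assumes both linear layers include bias terms, which the model card does not state; $d$ is the encoder's `hidden_size`.

$$
z_t = W_2\,\mathrm{Dropout}_{0.1}\big(\mathrm{ReLU}(W_1 h_t + b_1)\big) + b_2,
\qquad W_1 \in \mathbb{R}^{d \times d},\quad W_2 \in \mathbb{R}^{2 \times d}
$$

$$
p_t = \mathrm{softmax}(z_t) \in \mathbb{R}^{2}
$$

A standard single-layer token classifier would instead compute $z_t = W h_t + b$ with $W \in \mathbb{R}^{2 \times d}$, so the deep head adds one nonlinear transformation before the final projection.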
72
+
73
+ ## Intended Use
74
+ - **RAG Pipelines**: Generating semantic chunks that preserve context better than fixed-size splitting.
75
+ - **Long Document Analysis**: Segmenting reports, legal documents, or books into logical chapters/sections.
76
+ - **Pre-processing for LLMs**: Ensuring input fragments are semantically complete.
77
+
78
+ ## Limitations
79
+ - The model is effective on general-knowledge text, but it may need further fine-tuning for highly specialized domains (e.g., medical text or technical code documentation).
80
+ - Performance is best on texts with a clear logical structure.
81
+
82
+ ## Author
83
+ Developed by **Jerzy Boksa**.
84
+ GitHub: [modbert-chunker](https://github.com/jboksa/modbert-chunker)