Update README.md
README.md
CHANGED
|
@@ -47,7 +47,7 @@ for chunk in chunks:
|
|
| 47 |
|
| 48 |
### Dataset
|
| 49 |
The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
|
| 50 |
-
- **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, 40% of chunk starts were replaced by lowercase letter, and 40% of last
|
| 51 |
- **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
|
| 52 |
- **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
|
| 53 |
|
|
@@ -79,6 +79,24 @@ This allows the model to learn more complex semantic cues for segmentation.
|
|
| 79 |
- While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
|
| 80 |
- Performance is best on texts with clear logical structures.
|
| 81 |
|
| 82 |
## Author
|
| 83 |
Developed by **Jerzy Boksa**.
|
| 84 |
-
GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
|
| 47 |
|
| 48 |
### Dataset
|
| 49 |
The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
|
| 50 |
+
- **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter of 40% of chunks was lowercased, and the final period of 40% of chunks was removed.
|
| 51 |
- **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
|
| 52 |
- **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
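
The three dataset steps above can be illustrated with a minimal sketch. It is not the author's training pipeline. The function names, the independent application of the two 40% augmentations, the greedy packing strategy, and the whitespace-based `count_tokens` stand-in for a real tokenizer are all assumptions made for illustration.

```python
import random

MAX_TOKENS = 8192  # sequence length stated in the README


def split_paragraphs(article: str) -> list[str]:
    """Ground-truth chunks: non-empty paragraphs separated by blank lines."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]


def augment(chunk: str, rng: random.Random, p: float = 0.4) -> str:
    """Lowercase the first character and drop a trailing period.

    Each change is applied with probability p; treating the two
    changes as independent is an assumption of this sketch.
    """
    if chunk and rng.random() < p:
        chunk = chunk[0].lower() + chunk[1:]
    if chunk.endswith(".") and rng.random() < p:
        chunk = chunk[:-1]
    return chunk


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack(articles: list[list[str]], max_tokens: int = MAX_TOKENS) -> list[list[str]]:
    """Greedily pack whole articles (lists of chunks) into sequences.

    A new sequence is started when the next article would exceed
    max_tokens. An article that is larger than max_tokens on its own
    is kept whole rather than truncated in this sketch.
    """
    sequences, current, used = [], [], 0
    for chunks in articles:
        size = sum(count_tokens(c) for c in chunks)
        if current and used + size > max_tokens:
            sequences.append(current)
            current, used = [], 0
        current.extend(chunks)
        used += size
    if current:
        sequences.append(current)
    return sequences
```

In practice `count_tokens` would be replaced by the length of the model tokenizer's output, and each chunk boundary inside a packed sequence would become a positive label for the boundary-detection task.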
|
| 53 |
|
|
|
|
| 79 |
- While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
|
| 80 |
- Performance is best on texts with clear logical structures.
|
| 81 |
|
| 82 |
+
## Evaluation
|
| 83 |
+
Status: Under Development. Systematic evaluation of the model's performance across different domains and languages is currently in progress.
|
| 84 |
+
|
| 85 |
+
|
| 86 |
## Author
|
| 87 |
Developed by **Jerzy Boksa**.
|
| 88 |
+
GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
|
| 89 |
+
|
| 90 |
+
## Citation
|
| 91 |
+
|
| 92 |
+
If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:
|
| 93 |
+
|
| 94 |
+
```bibtex
|
| 95 |
+
@misc{boksa2024modernbertchunker,
|
| 96 |
+
author = {Jerzy Boksa},
|
| 97 |
+
title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
|
| 98 |
+
year = {2026},
|
| 99 |
+
publisher = {Hugging Face},
|
| 100 |
+
journal = {Hugging Face Model Hub},
|
| 101 |
+
howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
|
| 102 |
+
}
|