jboksa/modbert-chunker-base

jboksa committed on Mar 21

Commit 43794e2 · verified · 1 parent: 5e11b67

Update README.md

Files changed (1)

README.md +20 -2

@@ -47,7 +47,7 @@ for chunk in chunks:
 ### Dataset
 The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
-- **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, 40% of chunk starts were replaced by lowercase letter, and 40% of last chunk dot, were removed.
 - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
 - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
@@ -79,6 +79,24 @@ This allows the model to learn more complex semantic cues for segmentation.
 - While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
 - Performance is best on texts with clear logical structures.
 ## Author
 Developed by **Jerzy Boksa**.
-GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)

 ### Dataset
 The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
+- **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter of 40% of chunks was lowercased, and the final period was removed from 40% of chunks.
 - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
 - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
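
The dataset steps listed above can be sketched roughly as follows. This is an illustrative reconstruction, not the author's training code: the function names (`split_paragraphs`, `add_boundary_noise`, `pack_lengths`) are hypothetical, the two 40% perturbations are assumed to be drawn independently per chunk, and packing is assumed to be a simple greedy grouping over precomputed token counts.

```python
import random


def split_paragraphs(article: str) -> list[str]:
    """Ground-truth chunks: non-empty paragraphs separated by blank lines."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]


def add_boundary_noise(chunks, rng, p_lower=0.4, p_drop_dot=0.4):
    """Weaken surface cues at chunk boundaries (assumed independent draws)."""
    noisy = []
    for chunk in chunks:
        # Lowercase the first character of roughly p_lower of the chunks.
        if chunk and rng.random() < p_lower:
            chunk = chunk[0].lower() + chunk[1:]
        # Drop the trailing period of roughly p_drop_dot of the chunks.
        if chunk.endswith(".") and rng.random() < p_drop_dot:
            chunk = chunk[:-1]
        noisy.append(chunk)
    return noisy


def pack_lengths(lengths, max_tokens=8192):
    """Greedily group article indices into packs of at most max_tokens.

    An article that alone exceeds max_tokens still gets its own pack;
    truncating or splitting it is left out of this sketch.
    """
    packs, current, used = [], [], 0
    for i, n in enumerate(lengths):
        if current and used + n > max_tokens:
            packs.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        packs.append(current)
    return packs


if __name__ == "__main__":
    rng = random.Random(0)
    chunks = split_paragraphs("First paragraph.\n\nSecond paragraph.")
    print(add_boundary_noise(chunks, rng))
    print(pack_lengths([5000, 3000, 200], max_tokens=8192))
```

Removing the capital letter and the final period from some chunks presumably keeps the model from relying on those two surface cues alone, which fits the README's stated goal of learning more complex semantic cues for segmentation.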
 - While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
 - Performance is best on texts with clear logical structures.
+## Evaluation
+**Status: Under development.** Systematic evaluation of the model's performance across different domains and languages is currently in progress.
 ## Author
 Developed by **Jerzy Boksa**.
+GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
+## Citation
+If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:
+```bibtex
+@misc{boksa2024modernbertchunker,
+  author = {Jerzy Boksa},
+  title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
+  year = {2026},
+  publisher = {Hugging Face},
+  journal = {Hugging Face Model Hub},
+  howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
+}