jboksa committed on
Commit 43794e2 · verified · 1 Parent(s): 5e11b67

Update README.md

Files changed (1)
  1. README.md +20 -2
README.md CHANGED
@@ -47,7 +47,7 @@ for chunk in chunks:
47
 
48
  ### Dataset
49
  The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
50
- - **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, 40% of chunk starts were replaced by lowercase letter, and 40% of last chunk dot, were removed.
51
  - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
52
  - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
53
 
@@ -79,6 +79,24 @@ This allows the model to learn more complex semantic cues for segmentation.
79
  - While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
80
  - Performance is best on texts with clear logical structures.
81
 
82
  ## Author
83
  Developed by **Jerzy Boksa**.
84
- GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
47
 
48
  ### Dataset
49
  The model was trained on **Wikipedia (20231101 version)** for both Polish and English.
50
+ - **Preprocessing**: Full articles were cleaned of wiki-noise (references, external links, metadata). Additionally, the first letter of 40% of chunks was lowercased, and the final period was removed from 40% of chunks.
51
  - **Ground Truth**: Segmentation was based on natural paragraph boundaries (`\n\n`) found in well-structured Wikipedia articles.
52
  - **Packing**: Multiple articles were packed into single `8192` token sequences to maximize training efficiency.
53
 
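The dataset steps listed above could look roughly like the following sketch. It is not taken from the training code: the function names (`split_paragraphs`, `augment_chunk`, `pack`) are hypothetical, the two 40% augmentations are assumed to be drawn independently, and truncating over-long sequences is an assumption. Tokenization is omitted, so `pack` works on plain lists of token IDs.

```python
import random


def split_paragraphs(article: str) -> list[str]:
    """Split an article on blank lines (`\n\n`), the ground-truth boundary."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]


def augment_chunk(
    chunk: str,
    rng: random.Random,
    p_lower: float = 0.4,     # share of chunks whose first letter is lowercased
    p_drop_dot: float = 0.4,  # share of chunks whose final period is removed
) -> str:
    """Randomly lowercase the first character and/or drop a trailing period."""
    if chunk and rng.random() < p_lower:
        chunk = chunk[0].lower() + chunk[1:]
    if chunk.endswith(".") and rng.random() < p_drop_dot:
        chunk = chunk[:-1]
    return chunk


def pack(sequences: list[list[int]], max_len: int = 8192) -> list[list[int]]:
    """Greedily concatenate token-ID lists into packs of at most max_len tokens."""
    packs: list[list[int]] = []
    current: list[int] = []
    for seq in sequences:
        seq = seq[:max_len]  # assumption: over-long sequences are truncated
        if current and len(current) + len(seq) > max_len:
            packs.append(current)
            current = []
        current.extend(seq)
    if current:
        packs.append(current)
    return packs
```

A seeded `random.Random` keeps the augmentation reproducible, for example `augment_chunk(paragraph, random.Random(0))`.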
 
79
  - While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
80
  - Performance is best on texts with clear logical structures.
81
 
82
+ ## Evaluation
83
+ **Status**: Under development. Systematic evaluation of the model's performance across different domains and languages is currently in progress.
84
+
85
+
86
  ## Author
87
  Developed by **Jerzy Boksa**.
88
+ GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
89
+
90
+ ## Citation
91
+
92
+ If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:
93
+
94
+ ```bibtex
95
+ @misc{boksa2024modernbertchunker,
96
+ author = {Jerzy Boksa},
97
+ title = {ModernBERT Chunker Base: Specialized Semantic Boundary Detection for RAG},
98
+ year = {2026},
99
+ publisher = {Hugging Face},
100
+ journal = {Hugging Face Model Hub},
101
+ howpublished = {\url{https://huggingface.co/jboksa/modbert-chunker-base}}
102
+ }