jboksa committed on
Commit
1f86a92
·
verified ·
1 Parent(s): ddb8a99

Update README.md

  1. README.md +8 -3
README.md CHANGED
@@ -75,9 +75,12 @@ This allows the model to learn more complex semantic cues for segmentation.
  - **Long Document Analysis**: Segmenting reports, legal documents, or books into logical chapters/sections.
  - **Pre-processing for LLMs**: Ensuring input fragments are semantically complete.
 
- ## Limitations
- - While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
- - Performance is best on texts with clear logical structures.
 
  ## Evaluation
  Status: Under Development > Systematic evaluation of the model's performance across different domains and languages is currently in progress.
@@ -91,6 +94,8 @@ GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
  ## Acknowledgements
  This model was trained using the infrastructure provided by **Cyfronet** (Academic Computer Centre Cyfronet AGH) as part of an educational grant.
 
  ## Citation
 
  If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:
 
  - **Long Document Analysis**: Segmenting reports, legal documents, or books into logical chapters/sections.
  - **Pre-processing for LLMs**: Ensuring input fragments are semantically complete.
 
+ ## Limitations & Future Work
+
+ - **Training Data Focus**: The current version was trained exclusively on **Wikipedia datasets** (English and Polish). While it excels at structured, informative prose, it hasn't been exposed to noisy data, conversational text, or specific journalistic styles (news).
+ - **Base Model Version**: This is a general-purpose base model. While it performs excellently on standard structured text, specialized domains (e.g., legal contracts, medical records, or minified code) might require additional fine-tuning for optimal boundary detection.
+ - **Logical Structure**: Performance is best on documents with clear paragraph breaks and logical flow, similar to the encyclopedic style of its training data.
+ - **Niche Domains**: If you're working with datasets far removed from Wikipedia's structure, feel free to reach out or share your feedback; we're looking into domain-specific refinements.
 
  ## Evaluation
  Status: Under Development > Systematic evaluation of the model's performance across different domains and languages is currently in progress.
 
  ## Acknowledgements
  This model was trained using the infrastructure provided by **Cyfronet** (Academic Computer Centre Cyfronet AGH) as part of an educational grant.
 
+
+
  ## Citation
 
  If you use this model or the `fine-chunker` library in your research or project, please cite it as follows: