jboksa/modbert-chunker-base

Token Classification

semantic-segmentation


jboksa committed on Mar 21

Commit 1f86a92 · verified · 1 Parent(s): ddb8a99

Update README.md

Files changed (1)

README.md +8 -3

@@ -75,9 +75,12 @@ This allows the model to learn more complex semantic cues for segmentation.
 - **Long Document Analysis**: Segmenting reports, legal documents, or books into logical chapters/sections.
 - **Pre-processing for LLMs**: Ensuring input fragments are semantically complete.
-## Limitations
-- While effective on general knowledge, it may require further fine-tuning for extremely niche domains (e.g., medical or highly technical code documentation).
-- Performance is best on texts with clear logical structures.
+## Limitations & Future Work
+- **Training Data Focus**: The current version was trained exclusively on **Wikipedia datasets** (English and Polish). While it excels at structured, informative prose, it hasn't been exposed to noisy data, conversational text, or specific journalistic styles (news).
+- **Base Model Version**: This is a general-purpose base model. While it performs excellently on standard structured text, specialized domains (e.g., legal contracts, medical records, or minified code) might require additional fine-tuning for optimal boundary detection.
+- **Logical Structure**: Performance is best on documents with clear paragraph breaks and logical flow, similar to the encyclopedic style of its training data.
+- **Niche Domains**: If you're working with datasets far removed from Wikipedia's structure, feel free to reach out or share your feedback—we're looking into domain-specific refinements.
 ## Evaluation
 Status: Under Development > Systematic evaluation of the model's performance across different domains and languages is currently in progress.
@@ -91,6 +94,8 @@ GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
 ## Acknowledgements
 This model was trained using the infrastructure provided by **Cyfronet** (Academic Computer Centre Cyfronet AGH) as part of a educational grant.
 ## Citation
 If you use this model or the `fine-chunker` library in your research or project, please cite it as follows: