Update README.md
README.md
CHANGED
|
@@ -75,9 +75,12 @@ This allows the model to learn more complex semantic cues for segmentation.
|
|
| 75 |
- **Long Document Analysis**: Segmenting reports, legal documents, or books into logical chapters/sections.
|
| 76 |
- **Pre-processing for LLMs**: Ensuring input fragments are semantically complete.
|
| 77 |
|
| 78 |
-
## Limitations
|
| 79 |
-
|
| 80 |
-
-
|
| 81 |
|
| 82 |
## Evaluation
|
| 83 |
Status: Under Development. Systematic evaluation of the model's performance across different domains and languages is in progress.
|
|
@@ -91,6 +94,8 @@ GitHub: [fine-chunker](https://github.com/JerzyCode/fine-chunker)
|
|
| 91 |
## Acknowledgements
|
| 92 |
This model was trained using the infrastructure provided by **Cyfronet** (Academic Computer Centre Cyfronet AGH) as part of an educational grant.
|
| 93 |
|
| 94 |
## Citation
|
| 95 |
|
| 96 |
If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:
|
|
|
|
| 75 |
- **Long Document Analysis**: Segmenting reports, legal documents, or books into logical chapters/sections.
|
| 76 |
- **Pre-processing for LLMs**: Ensuring input fragments are semantically complete.
|
| 77 |
|
| 78 |
+
## Limitations & Future Work
|
| 79 |
+
|
| 80 |
+
- **Training Data Focus**: The current version was trained exclusively on **Wikipedia datasets** (English and Polish). It is strongest on structured, informative prose and has not been exposed to noisy data, conversational text, or journalistic styles such as news.
|
| 81 |
+
- **Base Model Version**: This is a general-purpose base model. It is suited to standard structured text; specialized domains (e.g., legal contracts, medical records, or minified code) may require additional fine-tuning for reliable boundary detection.
|
| 82 |
+
- **Logical Structure**: Performance is best on documents with clear paragraph breaks and logical flow, similar to the encyclopedic style of its training data.
|
| 83 |
+
- **Niche Domains**: If you're working with datasets far removed from Wikipedia's structure, feel free to reach out or share your feedback. We're looking into domain-specific refinements.
|
| 84 |
|
| 85 |
## Evaluation
|
| 86 |
Status: Under Development. Systematic evaluation of the model's performance across different domains and languages is in progress.
|
|
|
|
| 94 |
## Acknowledgements
|
| 95 |
This model was trained using the infrastructure provided by **Cyfronet** (Academic Computer Centre Cyfronet AGH) as part of an educational grant.
|
| 96 |
|
| 97 |
+
|
| 98 |
+
|
| 99 |
## Citation
|
| 100 |
|
| 101 |
If you use this model or the `fine-chunker` library in your research or project, please cite it as follows:
|