[GitHub](https://github.com/jackfsuia/bert-chunker/tree/main/bc3)
bert-chunker-3 is a text chunker based on BertForTokenClassification that predicts the start token of each chunk (for use in RAG, etc.) and, using a sliding window, cuts documents of any size into chunks. We see it as an alternative to the [Kamradt semantic chunker](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb); notably, it works not only on structured texts but also on **unstructured and messy texts**.
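The sliding-window idea can be sketched as below. This is a minimal illustration, not the released implementation: `predict_chunk_starts` is a hypothetical stand-in for the BertForTokenClassification head (the real model, tokenizer, and its probability threshold are in the GitHub repository linked above).

```python
# Sketch of sliding-window chunking: a token classifier marks tokens that start
# a new chunk; the window slides over the document, so arbitrarily long inputs
# are handled in O(N). `predict_chunk_starts` is a toy stand-in for the real
# BERT token classifier.

def predict_chunk_starts(window_tokens):
    # Toy rule for illustration only: start a new chunk after a sentence end.
    # The real model instead thresholds per-token classification probabilities.
    return [i for i in range(1, len(window_tokens))
            if window_tokens[i - 1].endswith(".")]

def chunk(tokens, window=8, stride=8):
    boundaries = [0]
    for offset in range(0, len(tokens), stride):
        window_tokens = tokens[offset:offset + window]
        boundaries += [offset + i for i in predict_chunk_starts(window_tokens)]
    # Turn consecutive boundary positions into chunks of text.
    boundaries = sorted(set(boundaries)) + [len(tokens)]
    return [" ".join(tokens[a:b]) for a, b in zip(boundaries, boundaries[1:])]

doc = "One sentence here. Another one follows. And a third.".split()
print(chunk(doc))
# → ['One sentence here.', 'Another one follows.', 'And a third.']
```

In practice the real chunker works on model tokens and overlapping windows rather than whitespace words, but the control flow (classify starts within a window, advance, merge boundaries) is the same.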
Unlike [bc-2](https://huggingface.co/tim1900/bert-chunker-2) and [bc](https://huggingface.co/tim1900/bert-chunker), its training data were labeled by an LLM to overcome data distribution shift, and the training pipeline was improved, so it is **more stable**. It has **competitive** [**performance**](#evaluation).
Evaluation is done by code from [brandonstarxel/chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation).

Cluster | 400 (~182) | 0 | 91.3 ± 25.4 | 4.5 ± 3.4 | 20.7 ± 14.5 | 4.5 ± 3.4 | O(N<sup>2</sup>)| No
Cluster | 200 (~103) | 0 | 87.3 ± 29.8 | **8.0 ± 6.0** | **34.0 ± 19.7** | **8.0 ± 6.0** | O(N<sup>2</sup>)| No
LLM (GPT4o) | N/A (~240) | 0 | **91.9 ± 26.5** | 3.9 ± 3.2 | 19.9 ± 16.3 | 3.9 ± 3.2 | O(N<sup>2</sup>)| No
semchunk | <= 400 | 0 | 90.0 ± 29.1 | 3.6 ± 2.8 | 17.3 ± 12.6 | 3.6 ± 2.8 | **O(N)** | **Yes**
semchunk | <= 200 | 0 | 89.3 ± 28.7 | 6.8 ± 5.2 | 28.9 ± 17.1 | 6.7 ± 5.1 | **O(N)** | **Yes**
★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 400 | 0 | 91.3 ± 26.6 | 5.4 ± 4.7 | 23.1 ± 17.6 | 5.4 ± 4.7 | **O(N)** | **Yes**
★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 200 | 0 | 89.7 ± 27.9 | 7.6 ± 6.0 | 30.9 ± 19.1 | 7.7 ± 5.8 | **O(N)** | **Yes**
★ bert-chunker-3 (prob_threshold=0.50543) | N/A | 0 | 90.4 ± 28.7 | 3.3 ± 3.1 | 16.0 ± 17.0 | 3.3 ± 3.1 | **O(N)** | No