[GitHub](https://github.com/jackfsuia/bert-chunker/tree/main/bc3)
bert-chunker-3 is a text chunker based on BertForTokenClassification that predicts the start token of each chunk (for use in RAG, etc.) and, using a sliding window, cuts documents of any size into chunks. We see it as an alternative to the [Kamradt semantic chunker](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb); notably, it works not only on structured texts but also on **unstructured and messy texts**.
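The sliding-window idea can be sketched as below. This is a minimal illustration, not the released implementation: `predict_chunk_starts` is a hypothetical stand-in for the BertForTokenClassification head (the real model, tokenizer, and its probability threshold are in the GitHub repository linked above).

```python
# Sketch of sliding-window chunking: a token classifier marks tokens that start
# a new chunk; the window slides over the document, so arbitrarily long inputs
# are handled in O(N). `predict_chunk_starts` is a toy stand-in for the real
# BERT token classifier.

def predict_chunk_starts(window_tokens):
    # Toy rule for illustration only: start a new chunk after a sentence end.
    # The real model instead thresholds per-token classification probabilities.
    return [i for i in range(1, len(window_tokens))
            if window_tokens[i - 1].endswith(".")]

def chunk(tokens, window=8, stride=8):
    boundaries = [0]
    for offset in range(0, len(tokens), stride):
        window_tokens = tokens[offset:offset + window]
        boundaries += [offset + i for i in predict_chunk_starts(window_tokens)]
    # Turn consecutive boundary positions into chunks of text.
    boundaries = sorted(set(boundaries)) + [len(tokens)]
    return [" ".join(tokens[a:b]) for a, b in zip(boundaries, boundaries[1:])]

doc = "One sentence here. Another one follows. And a third.".split()
print(chunk(doc))
# → ['One sentence here.', 'Another one follows.', 'And a third.']
```

In practice the real chunker works on model tokens and overlapping windows rather than whitespace words, but the control flow (classify starts within a window, advance, merge boundaries) is the same.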
Unlike [bc-2](https://huggingface.co/tim1900/bert-chunker-2) and [bc](https://huggingface.co/tim1900/bert-chunker), its training data were labeled by an LLM to overcome data distribution shift, and the training pipeline was improved, so it is **more stable**. It has **competitive** [**performance**](#evaluation).
Evaluation is done by code from [brandonstarxel/chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation).

Cluster | 400 (~182) | 0 | 91.3 ± 25.4 | 4.5 ± 3.4 | 20.7 ± 14.5 | 4.5 ± 3.4 | O(N<sup>2</sup>)| No
Cluster | 200 (~103) | 0 | 87.3 ± 29.8 | **8.0 ± 6.0** | **34.0 ± 19.7** | **8.0 ± 6.0** | O(N<sup>2</sup>)| No
LLM (GPT4o) | N/A (~240) | 0 | **91.9 ± 26.5** | 3.9 ± 3.2 | 19.9 ± 16.3 | 3.9 ± 3.2 | O(N<sup>2</sup>)| No
semchunk | <= 400 | 0 | 90.0 ± 29.1 | 3.6 ± 2.8 | 17.3 ± 12.6 | 3.6 ± 2.8 | **O(N)** | **Yes**
semchunk | <= 200 | 0 | 89.3 ± 28.7 | 6.8 ± 5.2 | 28.9 ± 17.1 | 6.7 ± 5.1 | **O(N)** | **Yes**
★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 400 | 0 | 91.3 ± 26.6 | 5.4 ± 4.7 | 23.1 ± 17.6 | 5.4 ± 4.7 | **O(N)** | **Yes**
★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 200 | 0 | 89.7 ± 27.9 | 7.6 ± 6.0 | 30.9 ± 19.1 | 7.7 ± 5.8 | **O(N)** | **Yes**
★ bert-chunker-3 (prob_threshold=0.50543) | N/A | 0 | 90.4 ± 28.7 | 3.3 ± 3.1 | 16.0 ± 17.0 | 3.3 ± 3.1 | **O(N)** | No