Update README.md
README.md
CHANGED
@@ -410,21 +410,22 @@ for i, (c, t) in enumerate(zip(chunks, token_pos)):
 Evaluation is done by code from [brandonstarxel/chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation), most of the following results come from [Evaluating Chunking Strategies for Retrieval](https://research.trychroma.com/evaluating-chunking).
 | Chunking | Size| Overlap | Recall | Precision | PrecisionΩ | IoU | Time complexity by token number N | Is chunk size strictly controlable|
 |---------|---------|---------|---------|---------|---------|---------|---------|---------|
-| Recursive | <= 800 | 400 | 85.4 ± 34.9 | 1.5 ± 1.3 | 6.7 ± 5.2 | 1.5 ± 1.3| O(N) | Yes
-| TokenText | 800 | 400 | 87.9 ± 31.7| 1.4 ± 1.1 | 4.7 ± 3.1 | 1.4 ± 1.1
-| Recursive | <= 400 | 200 |88.1 ± 31.6 | 3.3 ± 2.7 | 13.9 ± 10.4 | 3.3 ± 2.7 | O(N) | Yes
-| TokenText | 400 | 200 | 88.6 ± 29.7 | 2.7 ± 2.2 | 8.4 ± 5.1 | 2.7 ± 2.2 | O(N) | Yes
-| Recursive | <= 400 | 0 | 89.5 ± 29.7 | 3.6 ± 3.2 | 17.7 ± 14.0 | 3.6 ± 3.2 | O(N) | Yes
-| TokenText | 400 |0 | 89.2 ± 29.2 | 2.7 ± 2.2 | 12.5 ± 8.1 | 2.7 ± 2.2 | O(N) | Yes
-| Recursive | <= 200 | 0 | 88.1 ± 30.1 | 7.0 ± 5.6 | 29.9 ± 18.4 | 6.9 ± 5.6 | O(N) | Yes
-| TokenText | 200 | 0 | 87.0 ± 30.8 | 5.2 ± 4.1 | 21.0 ± 11.9 | 5.1 ± 4.1 | O(N) | Yes
-| Kamradt | N/A (~660) | 0 | 83.6 ± 36.8 | 1.5 ± 1.6 | 7.4 ± 10.2 | 1.5 ± 1.6 | O(N) | No
-KamradtMod | <= 300 | 0 | 87.1 ± 31.9 | 2.1 ± 2.0 | 10.5 ± 12.3 | 2.1 ± 2.0 | O(N)
+| Recursive | <= 800 | 400 | 85.4 ± 34.9 | 1.5 ± 1.3 | 6.7 ± 5.2 | 1.5 ± 1.3| **O(N)** | **Yes**
+| TokenText | 800 | 400 | 87.9 ± 31.7| 1.4 ± 1.1 | 4.7 ± 3.1 | 1.4 ± 1.1 |**O(N)** | **Yes**
+| Recursive | <= 400 | 200 |88.1 ± 31.6 | 3.3 ± 2.7 | 13.9 ± 10.4 | 3.3 ± 2.7 | **O(N)** | **Yes**
+| TokenText | 400 | 200 | 88.6 ± 29.7 | 2.7 ± 2.2 | 8.4 ± 5.1 | 2.7 ± 2.2 | **O(N)** | **Yes**
+| Recursive | <= 400 | 0 | 89.5 ± 29.7 | 3.6 ± 3.2 | 17.7 ± 14.0 | 3.6 ± 3.2 | **O(N)** | **Yes**
+| TokenText | 400 |0 | 89.2 ± 29.2 | 2.7 ± 2.2 | 12.5 ± 8.1 | 2.7 ± 2.2 | **O(N)** | **Yes**
+| Recursive | <= 200 | 0 | 88.1 ± 30.1 | 7.0 ± 5.6 | 29.9 ± 18.4 | 6.9 ± 5.6 | **O(N)** | **Yes**
+| TokenText | 200 | 0 | 87.0 ± 30.8 | 5.2 ± 4.1 | 21.0 ± 11.9 | 5.1 ± 4.1 | **O(N)** | **Yes**
+| Kamradt | N/A (~660) | 0 | 83.6 ± 36.8 | 1.5 ± 1.6 | 7.4 ± 10.2 | 1.5 ± 1.6 | **O(N)** | No
+KamradtMod | <= 300 | 0 | 87.1 ± 31.9 | 2.1 ± 2.0 | 10.5 ± 12.3 | 2.1 ± 2.0 | **O(N)**| **Yes**
 Cluster | 400 (~182) | 0 | 91.3 ± 25.4 | 4.5 ± 3.4 | 20.7 ± 14.5 | 4.5 ± 3.4 | O(N<sup>2</sup>)| No
 Cluster | 200 (~103) | 0 | 87.3 ± 29.8 | **8.0 ± 6.0** | **34.0 ± 19.7** | **8.0 ± 6.0** | O(N<sup>2</sup>)| No
 LLM (GPT4o) | N/A (~240) | 0 | **91.9 ± 26.5** | 3.9 ± 3.2 | 19.9 ± 16.3 | 3.9 ± 3.2 | O(N<sup>2</sup>)| No
-★ bert-chunker-3 (prob_threshold=0.50543) | <= 400 | 0 | 91.2 ± 26.6 | 5.3 ± 4.5 | 23.2 ± 18.1 | 5.3 ± 4.5
-★ bert-chunker-3 (prob_threshold=0.50543) | <= 200 | 0 | 90.5 ± 27.3 | 7.1 ± 5.5 | 29.3 ± 19.0 | 7.1 ± 5.4
+★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 400 | 0 | 91.2 ± 26.6 | 5.3 ± 4.5 | 23.2 ± 18.1 | 5.3 ± 4.5 |**O(N)** | **Yes**
+★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 200 | 0 | 90.5 ± 27.3 | 7.1 ± 5.5 | 29.3 ± 19.0 | 7.1 ± 5.4 |**O(N)**| **Yes**
+★ bert-chunker-3 (prob_threshold=0.50543) | N/A | 0 | 90.4 ± 28.7 | 3.3 ± 3.1 | 16.0 ± 17.0 | 3.3 ± 3.1 |**O(N)**| No
 ## Citation
 ```bibtex
 @article{bert-chunker,