tim1900 committed on
Commit 9bc98d7 · verified · 1 Parent(s): ca852f5

Update README.md

  1. README.md +13 -12
README.md CHANGED
@@ -410,21 +410,22 @@ for i, (c, t) in enumerate(zip(chunks, token_pos)):
410
  Evaluation uses the code from [brandonstarxel/chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation); most of the following results come from [Evaluating Chunking Strategies for Retrieval](https://research.trychroma.com/evaluating-chunking).
411
  | Chunking | Size | Overlap | Recall | Precision | PrecisionΩ | IoU | Time complexity by token number N | Is chunk size strictly controllable |
412
  |---------|---------|---------|---------|---------|---------|---------|---------|---------|
413
- | Recursive | <= 800 | 400 | 85.4 ± 34.9 | 1.5 ± 1.3 | 6.7 ± 5.2 | 1.5 ± 1.3| O(N) | Yes
414
- | TokenText | 800 | 400 | 87.9 ± 31.7| 1.4 ± 1.1 | 4.7 ± 3.1 | 1.4 ± 1.1 |O(N) | Yes
415
- | Recursive | <= 400 | 200 |88.1 ± 31.6 | 3.3 ± 2.7 | 13.9 ± 10.4 | 3.3 ± 2.7 | O(N) | Yes
416
- | TokenText | 400 | 200 | 88.6 ± 29.7 | 2.7 ± 2.2 | 8.4 ± 5.1 | 2.7 ± 2.2 | O(N) | Yes
417
- | Recursive | <= 400 | 0 | 89.5 ± 29.7 | 3.6 ± 3.2 | 17.7 ± 14.0 | 3.6 ± 3.2 | O(N) | Yes
418
- | TokenText | 400 |0 | 89.2 ± 29.2 | 2.7 ± 2.2 | 12.5 ± 8.1 | 2.7 ± 2.2 | O(N) | Yes
419
- | Recursive | <= 200 | 0 | 88.1 ± 30.1 | 7.0 ± 5.6 | 29.9 ± 18.4 | 6.9 ± 5.6 | O(N) | Yes
420
- | TokenText | 200 | 0 | 87.0 ± 30.8 | 5.2 ± 4.1 | 21.0 ± 11.9 | 5.1 ± 4.1 | O(N) | Yes
421
- | Kamradt | N/A (~660) | 0 | 83.6 ± 36.8 | 1.5 ± 1.6 | 7.4 ± 10.2 | 1.5 ± 1.6 | O(N) | No
422
- KamradtMod | <= 300 | 0 | 87.1 ± 31.9 | 2.1 ± 2.0 | 10.5 ± 12.3 | 2.1 ± 2.0 | O(N)| Yes
423
  Cluster | 400 (~182) | 0 | 91.3 ± 25.4 | 4.5 ± 3.4 | 20.7 ± 14.5 | 4.5 ± 3.4 | O(N<sup>2</sup>)| No
424
  Cluster | 200 (~103) | 0 | 87.3 ± 29.8 | **8.0 ± 6.0** | **34.0 ± 19.7** | **8.0 ± 6.0** | O(N<sup>2</sup>)| No
425
  LLM (GPT4o) | N/A (~240) | 0 | **91.9 ± 26.5** | 3.9 ± 3.2 | 19.9 ± 16.3 | 3.9 ± 3.2 | O(N<sup>2</sup>)| No
426
- ★ bert-chunker-3 (prob_threshold=0.50543) | <= 400 | 0 | 91.2 ± 26.6 | 5.3 ± 4.5 | 23.2 ± 18.1 | 5.3 ± 4.5 |O(N) | Yes
427
- ★ bert-chunker-3 (prob_threshold=0.50543) | <= 200 | 0 | 90.5 ± 27.3 | 7.1 ± 5.5 | 29.3 ± 19.0 | 7.1 ± 5.4 |O(N)| Yes
 
428
  ## Citation
429
  ```bibtex
430
  @article{bert-chunker,
 
410
  Evaluation uses the code from [brandonstarxel/chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation); most of the following results come from [Evaluating Chunking Strategies for Retrieval](https://research.trychroma.com/evaluating-chunking).
411
  | Chunking | Size | Overlap | Recall | Precision | PrecisionΩ | IoU | Time complexity by token number N | Is chunk size strictly controllable |
412
  |---------|---------|---------|---------|---------|---------|---------|---------|---------|
413
+ | Recursive | <= 800 | 400 | 85.4 ± 34.9 | 1.5 ± 1.3 | 6.7 ± 5.2 | 1.5 ± 1.3| **O(N)** | **Yes**
414
+ | TokenText | 800 | 400 | 87.9 ± 31.7| 1.4 ± 1.1 | 4.7 ± 3.1 | 1.4 ± 1.1 |**O(N)** | **Yes**
415
+ | Recursive | <= 400 | 200 |88.1 ± 31.6 | 3.3 ± 2.7 | 13.9 ± 10.4 | 3.3 ± 2.7 | **O(N)** | **Yes**
416
+ | TokenText | 400 | 200 | 88.6 ± 29.7 | 2.7 ± 2.2 | 8.4 ± 5.1 | 2.7 ± 2.2 | **O(N)** | **Yes**
417
+ | Recursive | <= 400 | 0 | 89.5 ± 29.7 | 3.6 ± 3.2 | 17.7 ± 14.0 | 3.6 ± 3.2 | **O(N)** | **Yes**
418
+ | TokenText | 400 |0 | 89.2 ± 29.2 | 2.7 ± 2.2 | 12.5 ± 8.1 | 2.7 ± 2.2 | **O(N)** | **Yes**
419
+ | Recursive | <= 200 | 0 | 88.1 ± 30.1 | 7.0 ± 5.6 | 29.9 ± 18.4 | 6.9 ± 5.6 | **O(N)** | **Yes**
420
+ | TokenText | 200 | 0 | 87.0 ± 30.8 | 5.2 ± 4.1 | 21.0 ± 11.9 | 5.1 ± 4.1 | **O(N)** | **Yes**
421
+ | Kamradt | N/A (~660) | 0 | 83.6 ± 36.8 | 1.5 ± 1.6 | 7.4 ± 10.2 | 1.5 ± 1.6 | **O(N)** | No
422
+ KamradtMod | <= 300 | 0 | 87.1 ± 31.9 | 2.1 ± 2.0 | 10.5 ± 12.3 | 2.1 ± 2.0 | **O(N)**| **Yes**
423
  Cluster | 400 (~182) | 0 | 91.3 ± 25.4 | 4.5 ± 3.4 | 20.7 ± 14.5 | 4.5 ± 3.4 | O(N<sup>2</sup>)| No
424
  Cluster | 200 (~103) | 0 | 87.3 ± 29.8 | **8.0 ± 6.0** | **34.0 ± 19.7** | **8.0 ± 6.0** | O(N<sup>2</sup>)| No
425
  LLM (GPT4o) | N/A (~240) | 0 | **91.9 ± 26.5** | 3.9 ± 3.2 | 19.9 ± 16.3 | 3.9 ± 3.2 | O(N<sup>2</sup>)| No
426
+ ★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 400 | 0 | 91.2 ± 26.6 | 5.3 ± 4.5 | 23.2 ± 18.1 | 5.3 ± 4.5 |**O(N)** | **Yes**
427
+ ★ bert-chunker-3 (experimental, prob_threshold=0.50543) | <= 200 | 0 | 90.5 ± 27.3 | 7.1 ± 5.5 | 29.3 ± 19.0 | 7.1 ± 5.4 |**O(N)**| **Yes**
428
+ ★ bert-chunker-3 (prob_threshold=0.50543) | N/A | 0 | 90.4 ± 28.7 | 3.3 ± 3.1 | 16.0 ± 17.0 | 3.3 ± 3.1 |**O(N)**| No
429
  ## Citation
430
  ```bibtex
431
  @article{bert-chunker,