tim1900 committed on
Commit ca852f5 · verified · 1 Parent(s): e223f78

Update README.md

Files changed (1)
  1. README.md +19 -0
README.md CHANGED
@@ -406,6 +406,25 @@ for i, (c, t) in enumerate(zip(chunks, token_pos)):
  print(f"-----chunk: {i}----token_idx: {t}--------")
  print(c)
  ```
+ ## Evaluation
+ Evaluation uses the code from [brandonstarxel/chunking_evaluation](https://github.com/brandonstarxel/chunking_evaluation). Most of the results below come from [Evaluating Chunking Strategies for Retrieval](https://research.trychroma.com/evaluating-chunking).
+ | Chunking | Size | Overlap | Recall | Precision | PrecisionΩ | IoU | Time complexity in token count N | Chunk size strictly controllable? |
+ |---------|---------|---------|---------|---------|---------|---------|---------|---------|
+ | Recursive | <= 800 | 400 | 85.4 ± 34.9 | 1.5 ± 1.3 | 6.7 ± 5.2 | 1.5 ± 1.3 | O(N) | Yes |
+ | TokenText | 800 | 400 | 87.9 ± 31.7 | 1.4 ± 1.1 | 4.7 ± 3.1 | 1.4 ± 1.1 | O(N) | Yes |
+ | Recursive | <= 400 | 200 | 88.1 ± 31.6 | 3.3 ± 2.7 | 13.9 ± 10.4 | 3.3 ± 2.7 | O(N) | Yes |
+ | TokenText | 400 | 200 | 88.6 ± 29.7 | 2.7 ± 2.2 | 8.4 ± 5.1 | 2.7 ± 2.2 | O(N) | Yes |
+ | Recursive | <= 400 | 0 | 89.5 ± 29.7 | 3.6 ± 3.2 | 17.7 ± 14.0 | 3.6 ± 3.2 | O(N) | Yes |
+ | TokenText | 400 | 0 | 89.2 ± 29.2 | 2.7 ± 2.2 | 12.5 ± 8.1 | 2.7 ± 2.2 | O(N) | Yes |
+ | Recursive | <= 200 | 0 | 88.1 ± 30.1 | 7.0 ± 5.6 | 29.9 ± 18.4 | 6.9 ± 5.6 | O(N) | Yes |
+ | TokenText | 200 | 0 | 87.0 ± 30.8 | 5.2 ± 4.1 | 21.0 ± 11.9 | 5.1 ± 4.1 | O(N) | Yes |
+ | Kamradt | N/A (~660) | 0 | 83.6 ± 36.8 | 1.5 ± 1.6 | 7.4 ± 10.2 | 1.5 ± 1.6 | O(N) | No |
+ | KamradtMod | <= 300 | 0 | 87.1 ± 31.9 | 2.1 ± 2.0 | 10.5 ± 12.3 | 2.1 ± 2.0 | O(N) | Yes |
+ | Cluster | 400 (~182) | 0 | 91.3 ± 25.4 | 4.5 ± 3.4 | 20.7 ± 14.5 | 4.5 ± 3.4 | O(N<sup>2</sup>) | No |
+ | Cluster | 200 (~103) | 0 | 87.3 ± 29.8 | **8.0 ± 6.0** | **34.0 ± 19.7** | **8.0 ± 6.0** | O(N<sup>2</sup>) | No |
+ | LLM (GPT4o) | N/A (~240) | 0 | **91.9 ± 26.5** | 3.9 ± 3.2 | 19.9 ± 16.3 | 3.9 ± 3.2 | O(N<sup>2</sup>) | No |
+ | ★ bert-chunker-3 (prob_threshold=0.50543) | <= 400 | 0 | 91.2 ± 26.6 | 5.3 ± 4.5 | 23.2 ± 18.1 | 5.3 ± 4.5 | O(N) | Yes |
+ | ★ bert-chunker-3 (prob_threshold=0.50543) | <= 200 | 0 | 90.5 ± 27.3 | 7.1 ± 5.5 | 29.3 ± 19.0 | 7.1 ± 5.4 | O(N) | Yes |
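
Reviewer note (not part of the commit): to help read the Recall, Precision, and IoU columns, here is a minimal sketch of how token-level retrieval metrics of this kind can be computed. It is an illustration under stated assumptions, not the code in chunking_evaluation: the function name `token_metrics` is hypothetical, tokens are represented as plain integer positions, and IoU is the ordinary set intersection over union, so the referenced evaluation may treat overlapping chunks differently.

```python
def token_metrics(relevant, retrieved_chunks):
    """Return (recall, precision, iou) for one query.

    relevant:         set of token positions that answer the query
    retrieved_chunks: list of sets, one set of token positions per retrieved chunk

    Assumption: plain set semantics, so a token that appears in several
    overlapping chunks is counted once.
    """
    retrieved = set().union(*retrieved_chunks)
    hit = relevant & retrieved      # relevant tokens that were retrieved
    union = relevant | retrieved

    recall = len(hit) / len(relevant) if relevant else 0.0
    precision = len(hit) / len(retrieved) if retrieved else 0.0
    iou = len(hit) / len(union) if union else 0.0
    return recall, precision, iou


# One 200-token chunk that fully contains a 50-token relevant excerpt.
relevant = set(range(100, 150))
chunks = [set(range(0, 200))]
print(token_metrics(relevant, chunks))  # (1.0, 0.25, 0.25)
```

In this sketch, precision and IoU coincide whenever the retrieved chunks cover the whole relevant excerpt, and both shrink as chunks grow, which is one way to read the near-identical Precision and IoU values for the larger chunk sizes in the table.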
  ## Citation
  ```bibtex
  @article{bert-chunker,