---
license: other
license_name: youtu-hichunk
license_link: https://huggingface.co/tencent/Youtu-HiChunk/blob/main/LICENSE.txt
library_name: transformers
base_model_relation: finetune
language:
- zh
---
<div align="center">

# <img src="assets/logo.png" alt="Youtu-HiChunk Logo" height="100px">

[📃 License](./LICENSE.txt) • [👨‍💻 GitHub](https://github.com/TencentCloudADP/hichunk.git) • [📑 Paper](https://arxiv.org/pdf/2509.11552)

</div>


## 🎯 Introduction

**Youtu-HiChunk** is a hierarchical document chunking framework developed by Tencent Youtu Lab. Combined with the Auto-Merge retrieval algorithm, it can dynamically adjust the semantic granularity of retrieval fragments, mitigating issues of incomplete information caused by chunking.

- **Hierarchical Document Structuring**
HiChunk is a hierarchical document structuring framework designed to address the limitations of traditional linear chunking methods in RAG systems. It focuses on modeling multi-level semantic granularity (e.g., sections, subsections, paragraphs) rather than flat text sequences, enabling RAG systems to retrieve information at contextually appropriate abstraction levels.


- **Auto-Merge Retrieval Algorithm**
The Auto-Merge retrieval algorithm dynamically adjusts chunk granularity via three complementary conditions, balancing semantic completeness and retrieval quality for both evidence-dense and evidence-sparse tasks.

<div align="center">
<img src="./assets/framework.png" width="800"/>
</div>
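
To make the idea of merging retrieved chunks up a hierarchy concrete, here is a minimal sketch. It is not HiChunk's implementation: the three merge conditions are not spelled out in this README, so the sketch assumes a single, hypothetical coverage threshold. The names `Node`, `auto_merge`, and `min_ratio` are illustrative only and are not part of the HiChunk API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical chunk tree (document, section, or leaf chunk)."""
    node_id: str
    children: list["Node"] = field(default_factory=list)


def auto_merge(node: Node, retrieved: set[str], min_ratio: float = 0.5) -> list[str]:
    """Return the ids to pass downstream after merging bottom-up.

    Assumed rule (illustrative, not the paper's): if more than `min_ratio`
    of a node's children are selected as whole units, return the parent
    instead of the individual children.
    """
    if not node.children:
        # Leaf chunk: selected only if the retriever returned it.
        return [node.node_id] if node.node_id in retrieved else []

    selected: list[str] = []
    whole_children = 0
    for child in node.children:
        picked = auto_merge(child, retrieved, min_ratio)
        selected.extend(picked)
        if picked == [child.node_id]:
            whole_children += 1

    if whole_children / len(node.children) > min_ratio:
        return [node.node_id]  # coarser granularity: the parent replaces its children
    return selected


# Toy hierarchy: one document, two sections, three leaf chunks each.
doc = Node("doc", [
    Node("s1", [Node("c1"), Node("c2"), Node("c3")]),
    Node("s2", [Node("c4"), Node("c5"), Node("c6")]),
])

print(auto_merge(doc, {"c1", "c2", "c4"}))  # ['s1', 'c4']
```

In this toy run, two of the three chunks under `s1` were retrieved, so the whole section is returned, while `s2` contributes only the single chunk `c4`. A real system would also need to account for a retrieval token budget, which this sketch leaves out.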

<a id="benchmarks"></a>

## 📊 Performance
### 1. RAG pipeline performance

<div align="center">
<img src="./assets/perf1.png" width="800"/>
</div>

### 2. Performance at various retrieval sizes
<div align="center">
<img src="./assets/perf2.png" width="800"/>
</div>


<a id="quickstart"></a>

## 🚀 Quick Start
### Install packages
```bash
uv venv hichunk --python 3.12
source hichunk/bin/activate
uv pip install torch==2.7.0 vllm==0.9.1 transformers==4.53.0 liger_kernel
uv pip install nltk
python -c "import nltk; nltk.download('punkt_tab')"
```
Then deploy the HiChunk model by following the [local deployment guide](https://youtu-rag-docs.vercel.app/docs/en/hichunk/deploying-locally).

### Usage
```python
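import os

# Address of the deployed HiChunk server; replace the {serve_ip} and {serve_port} placeholders.
os.environ['OPENAI_BASE_URL'] = "http://{serve_ip}:{serve_port}"

from HiChunk import HiChunkInferenceEngine, PROMPT

engine = HiChunkInferenceEngine(window_size=16*1024, line_max_len=100, max_level=10, prompt=PROMPT)

# Read the document to chunk; the context manager closes the file afterwards.
with open('doc.txt', 'r', encoding='utf-8') as f:
    document_text = f.read()

chunked_document, chunks = engine.inference(document_text, recurrent_type=2)
print(chunked_document)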
```

## 🎨 Visualization
### Case 1
<div align="center">
<img src="./assets/case1.png" width="800"/>
</div>

### Case 2

<div align="center">
<img src="./assets/case2.png" width="800"/>
</div>


## 🤝 Acknowledgements
This project builds on the excellent work of several open-source projects:
- [Youtu-LLM](https://github.com/TencentCloudADP/youtu-tip/tree/master/youtu-llm)
- [LongBench](https://github.com/THUDM/LongBench/tree/main)
- [GraphRAG-Benchmark](https://github.com/GraphRAG-Bench/GraphRAG-Benchmark/tree/main)


## 📚 Citation

If you find our work useful in your research, please consider citing the following paper:
```
@misc{hi-chunk-2025,
  title={HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking},
  author={Tencent Youtu Lab},
  year={2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/TencentYoutuResearch/HiChunk.git}},
}
```