| | --- |
| | license: other |
| | license_name: youtu-hichunk |
| | license_link: https://huggingface.co/tencent/Youtu-HiChunk/blob/main/LICENSE.txt |
| | library_name: transformers |
| | base_model_relation: finetune |
| | language: |
| | - zh |
| | --- |
| | <div align="center"> |
| |
|
| | # <img src="assets/logo.png" alt="Youtu-HiChunk Logo" height="100px"> |
| |
|
| | [License](./LICENSE.txt) • [GitHub](https://github.com/TencentCloudADP/hichunk.git) • [Paper](https://arxiv.org/pdf/2509.11552) |
| |
|
| | </div> |
| |
|
| |
|
| | ## Introduction |
| |
|
| | **Youtu-HiChunk** is a hierarchical document chunking framework developed by Tencent Youtu Lab. Combined with the Auto-Merge retrieval algorithm, it dynamically adjusts the semantic granularity of retrieved chunks, mitigating the loss of information that chunking can cause. |
| |
|
| | - **Hierarchical Document Structuring** |
| | HiChunk is a hierarchical document structuring framework designed to address the limitations of traditional linear chunking methods in RAG systems. It focuses on modeling multi-level semantic granularity (e.g., sections, subsections, paragraphs) rather than flat text sequences, enabling RAG systems to retrieve information at contextually appropriate abstraction levels. |
| |
|
| |
|
| | - **Auto-Merge Retrieval Algorithm** |
| | The Auto-Merge retrieval algorithm dynamically adjusts chunk granularity via three complementary conditions, balancing semantic completeness and retrieval quality for both evidence-dense and evidence-sparse tasks. |
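The two ideas above can be illustrated with a small, self-contained sketch. It is not the HiChunk implementation: the `Node` class, the `auto_merge` function, and the `min_ratio` parameter are hypothetical names, and a single sibling-ratio threshold stands in for the three conditions, which this README does not spell out. The sketch stores chunks as a tree and replaces retrieved sibling chunks with their parent once a large enough share of that parent's children has been retrieved.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based equality and hashing; avoids recursive comparisons
class Node:
    """One node of a hierarchical chunk tree (hypothetical structure)."""
    node_id: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: list["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach `child` under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def auto_merge(retrieved: list[Node], min_ratio: float = 0.5) -> list[Node]:
    """Replace retrieved siblings by their parent when more than `min_ratio`
    of the parent's children were retrieved. Repeats until nothing merges.

    Assumption: one ratio threshold only; the real algorithm uses three conditions.
    """
    selected = list(dict.fromkeys(retrieved))  # deduplicate, keep retrieval order
    merged = True
    while merged:
        merged = False
        hits: dict[Node, list[Node]] = {}
        for node in selected:
            if node.parent is not None:
                hits.setdefault(node.parent, []).append(node)
        for parent, kids in hits.items():
            if len(kids) / len(parent.children) > min_ratio:
                # Put the parent where its first retrieved child was, drop the rest.
                selected[selected.index(kids[0])] = parent
                selected = list(dict.fromkeys(n for n in selected if n not in kids))
                merged = True
                break  # regroup from scratch after every merge
    return selected


# Toy document: root -> sec1 (p1, p2, p3) and sec2 (p4, p5)
root = Node("root")
sec1, sec2 = root.add(Node("sec1")), root.add(Node("sec2"))
p1, p2, p3 = (sec1.add(Node(f"p{i}")) for i in (1, 2, 3))
p4, p5 = (sec2.add(Node(f"p{i}")) for i in (4, 5))

print([n.node_id for n in auto_merge([p1, p2, p4])])  # ['sec1', 'p4']
```

Here two of the three paragraphs under `sec1` were retrieved, so they are merged into the section, while the single hit under `sec2` stays at paragraph granularity.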
| |
|
| | <div align="center"> |
| | <img src="./assets/framework.png" width="800"/> |
| | </div> |
| |
|
| | <a id="benchmarks"></a> |
| |
|
| | ## Performance |
| | ### 1. RAG pipeline performance |
| |
|
| | <div align="center"> |
| | <img src="./assets/perf1.png" width="800"/> |
| | </div> |
| |
|
| | ### 2. Performance across retrieval sizes |
| | <div align="center"> |
| | <img src="./assets/perf2.png" width="800"/> |
| | </div> |
| |
|
| |
|
| | <a id="quickstart"></a> |
| |
|
| | ## Quick Start |
| | ### Install packages |
| | ```bash |
| | uv venv hichunk --python 3.12 |
| | source hichunk/bin/activate |
| | uv pip install torch==2.7.0 vllm==0.9.1 transformers==4.53.0 liger_kernel |
| | uv pip install nltk |
| | python -c "import nltk; nltk.download('punkt_tab')" |
| | ``` |
| | Then deploy the HiChunk model by following the [local deployment guide](https://youtu-rag-docs.vercel.app/docs/en/hichunk/deploying-locally). |
| |
|
| | ### Usage |
| | ```python |
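| | import os |
| | |
| | # Endpoint of the deployed HiChunk server; substitute your server's IP and port. |
| | os.environ['OPENAI_BASE_URL'] = "http://{serve_ip}:{serve_port}" |
| | from HiChunk import HiChunkInferenceEngine, PROMPT |
| | |
| | engine = HiChunkInferenceEngine(window_size=16*1024, line_max_len=100, max_level=10, prompt=PROMPT) |
| | |
| | # Read the document to chunk; the context manager closes the file afterwards. |
| | with open('doc.txt', 'r', encoding='utf-8') as f: |
| |     document_text = f.read() |
| | |
| | chunked_document, chunks = engine.inference(document_text, recurrent_type=2) |
| | print(chunked_document) |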
| | ``` |
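The `{serve_ip}` and `{serve_port}` placeholders in the usage snippet are plain text and must be replaced by hand. As a convenience, a small helper can build the URL and export it. This is only a sketch: `set_openai_base_url` is a hypothetical name and not part of HiChunk, and the URL format is copied from the snippet above.

```python
import os


def set_openai_base_url(serve_ip: str, serve_port: int, scheme: str = "http") -> str:
    """Build '<scheme>://<ip>:<port>' and export it as OPENAI_BASE_URL.

    Hypothetical helper; only the environment variable name and URL shape
    come from the usage snippet.
    """
    if not serve_ip:
        raise ValueError("serve_ip must be a non-empty host name or IP address")
    if not 0 < serve_port < 65536:
        raise ValueError(f"serve_port out of range: {serve_port}")
    url = f"{scheme}://{serve_ip}:{serve_port}"
    os.environ["OPENAI_BASE_URL"] = url
    return url


# Example values only; use the address of your own deployment.
set_openai_base_url("127.0.0.1", 8000)
```

The usage snippet sets the variable before importing `HiChunk`, so calling the helper ahead of that import keeps the same order.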
| |
|
| | ## Visualization |
| | ### Case 1 |
| | <div align="center"> |
| | <img src="./assets/case1.png" width="800"/> |
| | </div> |
| |
|
| | ### Case 2 |
| |
|
| | <div align="center"> |
| | <img src="./assets/case2.png" width="800"/> |
| | </div> |
| |
|
| |
|
| | ## Acknowledgements |
| | This project builds on the excellent work of several open-source projects: |
| | - [Youtu-LLM](https://github.com/TencentCloudADP/youtu-tip/tree/master/youtu-llm) |
| | - [LongBench](https://github.com/THUDM/LongBench/tree/main) |
| | - [GraphRAG-Benchmark](https://github.com/GraphRAG-Bench/GraphRAG-Benchmark/tree/main) |
| |
|
| |
|
| | ## Citation |
| |
|
| | If you find our work useful in your research, please consider citing the following paper: |
| | ``` |
| | @misc{hi-chunk-2025, |
| | title={HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking}, |
| | author={Tencent Youtu Lab}, |
| | year={2025}, |
| | publisher = {GitHub}, |
| | journal = {GitHub repository}, |
| | howpublished = {\url{https://github.com/TencentYoutuResearch/HiChunk.git}}, |
| | } |
| | ``` |