	---
	license: other
	license_name: youtu-hichunk
	license_link: https://huggingface.co/tencent/Youtu-HiChunk/blob/main/LICENSE.txt
	library_name: transformers
	base_model_relation: finetune
	language:
	- zh
	---
	<div align="center">

	# <img src="assets/logo.png" alt="Youtu-HiChunk Logo" height="100px">

	[📃 License](./LICENSE.txt) • [👨‍💻 Github](https://github.com/TencentCloudADP/hichunk.git) • [📑 Paper](https://arxiv.org/pdf/2509.11552)

	</div>


	## 🎯 Introduction

	Youtu-HiChunk is a hierarchical document chunking framework developed by Tencent Youtu Lab. Combined with the Auto-Merge retrieval algorithm, it can dynamically adjust the semantic granularity of retrieval fragments, mitigating issues of incomplete information caused by chunking.

	- Hierarchical Document Structuring
	HiChunk is a hierarchical document structuring framework designed to address the limitations of traditional linear chunking methods in RAG systems. It focuses on modeling multi-level semantic granularity (e.g., sections, subsections, paragraphs) rather than flat text sequences, enabling RAG systems to retrieve information at contextually appropriate abstraction levels.


	- Auto-Merge Retrieval Algorithm
	The Auto-Merge retrieval algorithm dynamically adjusts chunk granularity using three complementary conditions, balancing semantic completeness against retrieval quality for both evidence-dense and evidence-sparse tasks.
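
To make the idea of multi-level granularity concrete, here is a minimal sketch of a chunk hierarchy in plain Python. It is not the HiChunk data model: `ChunkNode`, `build_tree`, and `chunks_at_level` are hypothetical names introduced for illustration, and the input is assumed to be a list of `(level, title, text)` tuples in document order with levels starting at 1.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkNode:
    """One node in a hypothetical chunk hierarchy (not the HiChunk data model)."""
    title: str
    level: int
    text: str = ""
    children: list["ChunkNode"] = field(default_factory=list)


def build_tree(items: list[tuple[int, str, str]]) -> ChunkNode:
    """Build a tree from (level, title, text) tuples given in document order.

    Assumes every level is >= 1; level 0 is reserved for the synthetic root.
    """
    root = ChunkNode(title="ROOT", level=0)
    stack = [root]  # path from the root to the most recently added node
    for level, title, text in items:
        # Pop until the top of the stack is a strictly shallower node.
        while stack[-1].level >= level:
            stack.pop()
        node = ChunkNode(title=title, level=level, text=text)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def chunks_at_level(node: ChunkNode, level: int) -> list[str]:
    """Collect the titles of all nodes at one depth, i.e. one granularity."""
    if node.level == level:
        return [node.title]
    found = []
    for child in node.children:
        found.extend(chunks_at_level(child, level))
    return found


doc = [
    (1, "1 Setup", "Install the tools."),
    (2, "1.1 Linux", "Use the package manager."),
    (2, "1.2 macOS", "Use Homebrew."),
    (1, "2 Usage", "Run the command."),
]
tree = build_tree(doc)
print(chunks_at_level(tree, 1))  # coarse granularity: sections
print(chunks_at_level(tree, 2))  # fine granularity: subsections
```

A flat chunker would only ever see one of these levels; keeping the tree lets a retriever choose between them per query.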

	<div align="center">
	<img src="./assets/framework.png" width="800"/>
	</div>
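
The merging behaviour described in the introduction can be sketched as follows. This README does not spell out the three conditions, so the ones used here are assumptions chosen for illustration: enough retrieved siblings, enough coverage of the parent, and enough room in the token budget. The function name `auto_merge`, its parameters, and the default thresholds are likewise hypothetical and do not come from the HiChunk code base.

```python
def auto_merge(retrieved, parent_of, length, budget, min_children=2, min_ratio=0.5):
    """Replace groups of retrieved sibling chunks with their parent chunk.

    Illustrative sketch only. Assumes `retrieved` holds leaf chunk ids in rank
    order, `parent_of` maps a chunk id to its parent id, and `length` maps
    every chunk id to its token count.
    """
    selected = list(dict.fromkeys(retrieved))  # de-duplicate, keep rank order
    changed = True
    while changed:
        changed = False
        # Group the currently selected chunks by their parent.
        groups = {}
        for chunk in selected:
            parent = parent_of.get(chunk)
            if parent is not None:
                groups.setdefault(parent, []).append(chunk)
        total = sum(length[c] for c in selected)
        for parent, kids in groups.items():
            covered = sum(length[k] for k in kids)
            # Assumed condition 1: several siblings were retrieved.
            enough_children = len(kids) >= min_children
            # Assumed condition 2: the siblings cover enough of the parent.
            enough_coverage = covered >= min_ratio * length[parent]
            # Assumed condition 3: swapping in the parent stays within budget.
            fits_budget = total - covered + length[parent] <= budget
            if enough_children and enough_coverage and fits_budget:
                # Put the parent where its best-ranked child was.
                position = selected.index(kids[0])
                selected = [c for c in selected if c not in kids]
                selected.insert(position, parent)
                changed = True
                break  # the selection changed, so regroup before continuing
    return selected


parent_of = {"1.1": "1", "1.2": "1", "1.3": "1",
             "2.1": "2", "2.2": "2", "1": "doc", "2": "doc"}
length = {"1.1": 100, "1.2": 120, "1.3": 80, "2.1": 200, "2.2": 300,
          "1": 300, "2": 500, "doc": 800}

print(auto_merge(["1.2", "2.1", "1.1"], parent_of, length, budget=600))
print(auto_merge(["1.2", "2.1", "1.1"], parent_of, length, budget=450))
```

With a budget of 600 tokens the two retrieved subsections of section 1 are replaced by the whole section; with a budget of 450 the merge would overflow, so the fragments are kept as retrieved. Because the loop regroups after each merge, merges can cascade upward when the budget allows.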

	<a id="benchmarks"></a>

	## 📊 Performance
	### 1. RAG pipeline performance

	<div align="center">
	<img src="./assets/perf1.png" width="800"/>
	</div>

	### 2. Performance across retrieval sizes
	<div align="center">
	<img src="./assets/perf2.png" width="800"/>
	</div>


	<a id="quickstart"></a>

	## 🚀 Quick Start
	### Install packages
	```bash
	uv venv hichunk --python 3.12
	source hichunk/bin/activate
	uv pip install torch==2.7.0 vllm==0.9.1 transformers==4.53.0 liger_kernel
	uv pip install nltk
	python -c "import nltk; nltk.download('punkt_tab')"
	```
	Then deploy the HiChunk model by following the [local deployment guide](https://youtu-rag-docs.vercel.app/docs/en/hichunk/deploying-locally).

	### Usage
	```python
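	import os

	# Replace {serve_ip} and {serve_port} with the address of the deployed HiChunk server.
	os.environ['OPENAI_BASE_URL'] = "http://{serve_ip}:{serve_port}"

	from HiChunk import HiChunkInferenceEngine, PROMPT

	engine = HiChunkInferenceEngine(window_size=16*1024, line_max_len=100, max_level=10, prompt=PROMPT)

	# Read the document to chunk; the context manager closes the file afterwards.
	with open('doc.txt', 'r', encoding='utf-8') as f:
	    document_text = f.read()

	chunked_document, chunks = engine.inference(document_text, recurrent_type=2)
	print(chunked_document)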
	```

	## 🎨 Visualization
	### Case 1
	<div align="center">
	<img src="./assets/case1.png" width="800"/>
	</div>

	### Case 2

	<div align="center">
	<img src="./assets/case2.png" width="800"/>
	</div>


	## 🤝 Acknowledgements
	This project builds on the excellent work of several open-source projects:
	- [Youtu-LLM](https://github.com/TencentCloudADP/youtu-tip/tree/master/youtu-llm)
	- [LongBench](https://github.com/THUDM/LongBench/tree/main)
	- [GraphRAG-Benchmark](https://github.com/GraphRAG-Bench/GraphRAG-Benchmark/tree/main)


	## 📚 Citation

	If you find our work useful in your research, please consider citing the following paper:
	```
	@misc{hi-chunk-2025,
	  title = {HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking},
	  author = {Tencent Youtu Lab},
	  year = {2025},
	  publisher = {GitHub},
	  journal = {GitHub repository},
	  howpublished = {\url{https://github.com/TencentYoutuResearch/HiChunk.git}},
	}
	```