## CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

<div align="center">
<img src="figs/clara_logo.jpg" width="400"/>
</div>

<div align="center">
<a href="https://arxiv.org/abs/2511.18659"><img src="https://img.shields.io/badge/arXiv-2511.18659-b31b1b.svg" alt="arXiv"></a>
<a href="LICENSE"><img src="https://img.shields.io/badge/License-Apple-blue" alt="License"></a>
<a href="https://huggingface.co/apple/CLaRa-7B-Base"><img src="https://img.shields.io/badge/Hugging%20Face-CLaRa_Base-FFEB3B" alt="model"></a>
<a href="https://huggingface.co/apple/CLaRa-7B-Instruct"><img src="https://img.shields.io/badge/Hugging%20Face-CLaRa_Instruct-FFEB3B" alt="model"></a>
<a href="https://huggingface.co/apple/CLaRa-7B-E2E"><img src="https://img.shields.io/badge/Hugging%20Face-CLaRa_End_to_end-FFEB3B" alt="model"></a>
<a href="https://huggingface.co/datasets/apple/CLaRa_multi_stage"><img src="https://img.shields.io/badge/Hugging%20Face-CLaRa_Data-FFEB3B" alt="data"></a>
</div>
This is the official open-source release of CLaRa, a state-of-the-art, end-to-end retrieval-augmented generation model.

### Updates

- Dec 11, 2025. All training data are available on [Hugging Face](https://huggingface.co/datasets/apple/CLaRa_multi_stage).
- Dec 10, 2025. We are working on an MLX version of the model, to be announced soon.
- Dec 3, 2025. Evaluation data are available in `./evaluation/evaluation_data`.
- Nov 25, 2025. Models are available on Hugging Face.
### Motivation

Retrieval-Augmented Generation (RAG) enhances large language models with external knowledge but suffers from **long contexts** and **disjoint retrieval-generation optimization**. Existing soft compression frameworks face two key limitations: (i) reconstruction-based objectives bias compressors toward surface patterns rather than semantic preservation; and (ii) retrievers and compressors are trained separately, so documents must be encoded twice even though compressed vectors are inherently retrievable.

In this work, we investigate:

- **How can we improve semantic preservation in compressed representations through better pretraining objectives?**
- **How can we unify retrieval and generation optimization to avoid redundant encoding and disjoint objectives?**

<div align="center">
<img src="figs/intro.png" width="100%"/>
</div>
We design a three-stage training approach and introduce document compression techniques to improve RAG efficiency. The key findings are listed below.

### Findings

- **Efficient Compression**: CLaRa achieves high compression rates (32x-64x) while preserving the information needed for accurate answer generation.
- **Three-Stage Training**: A carefully designed three-stage pipeline (compression pretraining, compression instruction tuning, and end-to-end fine-tuning) enables effective learning of both retrieval and generation.

For more findings, please refer to our paper!
---

### Three-Stage Training

CLaRa uses a carefully designed three-stage training approach:

**Stage 1: Compression Pretraining**

- Train the compressor with the SCP (Salient Compressor Pretraining) framework on QA pairs and paraphrases
- Retain key semantics through QA-based and paraphrase-guided supervision
- Support compression rates of 1x-256x

**Stage 2: Compression Instruction Tuning**

- Fine-tune the compressor on instruction-following tasks for downstream QA
- Use text-based QA outputs to ensure compressed representations retain sufficient semantics

**Stage 3: End-to-End Fine-tuning (CLaRa)**

- Jointly train the reranker and generator via a single language-modeling loss
- Unify retrieval and generation in a shared continuous space using a differentiable top-k estimator (see the sketch below)
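To illustrate how a differentiable top-k estimator lets the language-modeling loss reach the reranker, here is a minimal PyTorch sketch of one common relaxation (iterative softmax with masking). This is illustrative only, not the exact estimator used in the paper:

```python
import torch
import torch.nn.functional as F

def soft_topk(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Differentiable relaxation of top-k: k rounds of soft argmax,
    masking out (softly) selected items between rounds."""
    masked = scores.clone()
    picks = []
    for _ in range(k):
        probs = F.softmax(masked / tau, dim=-1)  # soft argmax over remaining docs
        picks.append(probs)
        # Down-weight the (softly) chosen document for the next round.
        masked = masked + torch.log1p(-probs.clamp(max=1 - 1e-6))
    return torch.stack(picks)  # (k, n) soft selection weights

# Toy usage: rerank 20 candidates, softly select the top 5.
scores = torch.randn(20, requires_grad=True)     # reranker scores
doc_vectors = torch.randn(20, 128)               # stand-in for compressed doc vectors
selected = soft_topk(scores, k=5) @ doc_vectors  # (5, 128), differentiable in scores
selected.sum().backward()                        # gradients flow back to the reranker
print(scores.grad is not None)                   # True
```

At low temperatures `tau` the soft weights approach hard one-hot selections, so generation conditions on (approximately) the top-k compressed documents while the whole pipeline stays trainable with a single loss.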
In this repository, we release our implementation of **CLaRa**, built upon [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF).

### Getting Started
```
├── scripts/                            # Training and evaluation scripts
│   ├── train_pretraining.sh            # Stage 1: Compression pretraining
│   ├── train_instruction_tuning.sh    # Stage 2: Compression instruction tuning
│   ├── train_stage_end_to_end.sh      # Stage 3: End-to-end training
│   └── evaluation_end_to_end.sh       # Evaluation scripts
├── openrlhf/                           # Core training framework
│   ├── models/                         # Model implementations
│   │   └── modeling_clara.py           # CLaRa model definition
│   ├── datasets/                       # Dataset handling
│   │   └── sft_dataset.py              # Training dataset
│   ├── trainer/                        # Training utilities
│   │   └── sft_trainer.py              # SFT trainer
│   └── cli/                            # Command-line interface
│       └── train_sft.py                # Main training script
├── evaluation/                         # Evaluation framework
├── example/                            # Example training data
│   ├── pretrain_data.jsonl
│   ├── instruction_tuning_data.jsonl
│   └── end_to_end_data.jsonl
└── README.md                           # This file
```
Video tutorial on installation (from @Fahd Mirza): https://youtu.be/al2VoAKn8GU?si=Q8bq7QNMaTvcArwa

Video digest (from @Richard Aragon): https://www.youtube.com/watch?v=yRM92mmKNH4
#### 1. Prepare code and environment

Clone the repository and set up the environment:

```bash
# Create conda environment
env=clara
conda create -n $env python=3.10 -y
conda activate $env

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
export PYTHONPATH=/path/to/clara:$PYTHONPATH
```
Key dependencies include:

- PyTorch >= 2.0
- Transformers >= 4.20
- DeepSpeed >= 0.18
- Flash Attention 2
- Accelerate
#### 2. Data preparation

Prepare training data in JSONL format (one JSON object per line). An example record for the pretraining stage:

```json
{
    "data_type": "qa",
    "question": ["Question 1"],
    "answers": ["Answer 1"],
    "docs": ["Document 1"]
}
```
For end-to-end training:

```json
{
    "question": "Single question text",
    "docs": ["Document 1", "Document 2", ...],
    "gold_answer": "Reference answer"
}
```
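Because each file is JSONL, it helps to write and sanity-check records programmatically before training. A minimal sketch, assuming a hypothetical file path and the field names shown above:

```python
import json

# Hypothetical output path; the repo ships similar examples under example/.
path = "my_pretrain_data.jsonl"

records = [
    {
        "data_type": "qa",
        "question": ["Question 1"],
        "answers": ["Answer 1"],
        "docs": ["Document 1"],
    },
]

# Write one JSON object per line.
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Sanity check: every line must parse and carry the expected keys.
with open(path) as f:
    for i, line in enumerate(f):
        rec = json.loads(line)
        assert {"data_type", "question", "answers", "docs"} <= rec.keys(), f"bad record on line {i}"
```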
#### 3. Start training

**Stage 1: Salient Compressor Pretraining (SCP)**

Pre-train the document compressor:

```bash
bash scripts/train_pretraining.sh
```

Key parameters:

- `--compress_rate`: Compression rate (default: 32); see the sketch after this list for what this means in vectors
- `--doc_max_length`: Maximum document length in tokens (default: 256)
- `--stage stage1`: Training stage
- `--mse_loss`: Use an MSE loss to align compressed and original representations
- `--qa_loss`: Use a QA loss for semantic preservation
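To make the compression rate concrete: a document of `--doc_max_length` tokens is represented by roughly `doc_max_length / compress_rate` continuous vectors. A back-of-the-envelope check (our arithmetic, assuming the compressed length is the token count divided by the rate, rounded up):

```python
import math

doc_max_length = 256  # default --doc_max_length
for compress_rate in (4, 16, 32, 64):
    n_vectors = math.ceil(doc_max_length / compress_rate)
    print(f"CR={compress_rate:>2}: {doc_max_length} tokens -> {n_vectors:>2} vectors")
# CR= 4: 256 tokens -> 64 vectors
# CR=16: 256 tokens -> 16 vectors
# CR=32: 256 tokens ->  8 vectors
# CR=64: 256 tokens ->  4 vectors
```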
**Stage 2: Compression Instruction Tuning**

Fine-tune the compressor on instruction-following tasks:

```bash
bash scripts/train_instruction_tuning.sh
```

Key parameters:

- `--pretrain_checkpoint`: Path to the stage 1 checkpoint
- `--stage stage1_2`: Training stage
- `--generation_top_k`: Number of top-ranked documents used for generation (default: 5)
- `--mse_loss`: Use an MSE loss for compression training
- `--do_eval_gen`: Enable generation evaluation
**Stage 3: End-to-End Training**

Fine-tune the model end-to-end with retrieval:

```bash
bash scripts/train_stage_end_to_end.sh
```

Key parameters:

- `--pretrain_checkpoint`: Path to the stage 2 checkpoint
- `--stage stage2`: Training stage
- `--generation_top_k`: Number of top-ranked documents used for generation
- `--do_eval_gen`: Enable generation evaluation
#### 4. Distributed Training

The training scripts support distributed training across multiple nodes and GPUs. Common parameters:

- `--max_len`: Maximum sequence length (default: 2048 for stages 1 and 2, 1024 for stage 3)
- `--train_batch_size`: Global training batch size
- `--micro_train_batch_size`: Micro batch size per GPU for gradient accumulation (see the sketch after this list)
- `--learning_rate`: Learning rate (default: 1e-4 for stages 1 and 2, 5e-6 for stage 3)
- `--max_epochs`: Maximum number of training epochs
- `--zero_stage`: DeepSpeed ZeRO optimization stage (default: 2)
- `--bf16`: Use bfloat16 precision
- `--flash_attn`: Use Flash Attention 2
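The batch-size flags follow the usual DeepSpeed convention: gradient-accumulation steps are derived from the global and micro batch sizes together with the number of GPUs. A quick sanity check of that arithmetic (illustrative values, assuming the standard convention rather than quoting repo defaults):

```python
train_batch_size = 128      # --train_batch_size (global batch per optimizer step)
micro_train_batch_size = 4  # --micro_train_batch_size (per GPU, per forward pass)
world_size = 8              # total number of GPUs across all nodes

# The global batch must divide evenly across GPUs and accumulation steps.
assert train_batch_size % (micro_train_batch_size * world_size) == 0
accum_steps = train_batch_size // (micro_train_batch_size * world_size)
print(accum_steps)          # 4 accumulation steps per optimizer update
```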
### Inference

The CLaRa models can be loaded and used for inference. We provide three models corresponding to the three training stages:

<details>
<summary>Stage 1: Compression Pretraining model (click to expand)</summary>

```python
from transformers import AutoModel

model_path = "path/to/stage1/model"
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True
).to('cuda')

# Example documents: one list of documents per query in the batch
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
# Paraphrase generation does not condition on a question, so pass empty strings
questions = ["" for _ in range(len(documents))]

# Generate a paraphrase from the compressed representations
output = model.generate_from_paraphrase(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated paraphrase:', output[0])
```

</details>
<details>
<summary>Stage 2: Compression Instruction Tuning model (click to expand)</summary>

```python
from transformers import AutoModel

model_path = "path/to/stage2/model"
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True
).to('cuda')

# Example documents and question
documents = [
    [
        "Document 1 content...",
        "Document 2 content...",
        "Document 3 content..."
    ]
]
questions = ["Your question here"]

# Generate an answer from the compressed representations
output = model.generate_from_text(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
```

</details>
<details>
<summary>Stage 3: End-to-End (CLaRa) model (click to expand)</summary>

```python
from transformers import AutoModel

model_path = "path/to/stage3/model"
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True
).to('cuda')

# Example documents and question
# Note: Stage 3 performs retrieval over multiple candidate documents
documents = [
    ["Document 1 content..." for _ in range(20)]  # 20 candidate documents
]
questions = ["Your question here"]

# Generate an answer with retrieval and reranking
# The number of selected documents is set by generation_top_k in config.json
output, topk_indices = model.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64
)
print('Generated answer:', output[0])
print('Top-k selected document indices:', topk_indices)
```

</details>
### Evaluation

The evaluation framework is based on standard RAG benchmarks. Run evaluation as follows.

**End-to-end evaluation:**

```bash
bash scripts/evaluation_end_to_end.sh
```

**Instruction tuning evaluation:**

```bash
bash scripts/evaluation_instruction_tuning.sh
```

Supported datasets (a metric sketch follows the list):

- **HotpotQA**: Multi-hop question answering
- **MuSiQue**: Multi-hop question answering with diverse reasoning
- **2WikiMultiHopQA**: Multi-hop question answering over Wikipedia
- **Natural Questions**: Open-domain question answering
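These benchmarks are conventionally scored with exact match (EM) after SQuAD-style answer normalization. Assuming CLaRa's evaluation follows that convention (see `./evaluation` for the authoritative implementation), a minimal scorer looks like:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(gold) for gold in gold_answers)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # True
```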
### Results

#### Compression Performance

We evaluate our document compressor on four QA datasets (NQ, HotpotQA, MuSiQue, 2WikiMultiHopQA) under two settings: **Normal** (retrieving top-5 documents) and **Oracle** (gold document included). CLaRa consistently outperforms all baselines across compression ratios (CR).

<div align="center">

**Main Results (Mistral-7B, Normal Setting)**

| Model | CR | NQ | HotpotQA | MuSiQue | 2Wiki | Avg |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| AutoCompressor | - | 17.24 | 14.61 | 3.81 | 19.89 | 13.89 |
| XRAG | 128 | 32.35 | 25.16 | 3.64 | 28.79 | 22.48 |
| COCOM | 16 | 24.12 | 21.48 | 3.52 | 24.48 | 18.40 |
| PCC | 16 | 31.38 | 22.29 | 3.43 | 19.47 | 19.14 |
| LLMLingua-2 | 4 | 47.53 | 37.05 | 9.02 | 44.35 | 34.49 |
| PISCO | 16 | 54.39 | 41.94 | 10.09 | 44.88 | 37.83 |
| Mistral-7B w/ retrieval | - | 54.58 | 42.94 | 8.94 | 44.24 | 37.67 |
| **CLaRa (CR=4)** | **4** | **57.05** | **45.09** | **10.34** | **46.94** | **39.86** |
| **CLaRa (CR=16)** | **16** | **55.56** | **43.72** | **10.55** | **46.00** | **38.96** |
| **CLaRa (CR=32)** | **32** | **54.64** | **43.52** | **10.55** | **46.58** | **38.82** |

**Oracle Setting Results (Mistral-7B)**

| Model | CR | NQ | HotpotQA | MuSiQue | 2Wiki | Avg |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| PISCO | 16 | 73.44 | 66.53 | 33.80 | 60.45 | 58.55 |
| Mistral-7B w/ retrieval | - | 71.64 | 70.77 | 45.72 | 68.83 | 64.24 |
| **CLaRa (CR=4)** | **4** | **76.50** | **73.81** | **46.26** | **70.48** | **66.76** |
| **CLaRa (CR=16)** | **16** | **75.48** | **70.79** | **43.15** | **66.16** | **63.90** |
| **CLaRa (CR=32)** | **32** | **73.77** | **69.51** | **38.31** | **64.54** | **61.53** |

</div>
**Key Findings:**

- ✅ CLaRa outperforms PISCO by **+1.13%** (Normal) and **+5.35%** (Oracle) on average
- ✅ CLaRa outperforms LLMLingua-2 by **+5.37%** (Normal) on average
- ✅ CLaRa matches or exceeds the text-based baseline, with a **+2.36%** average gain on Mistral-7B
#### Retrieval Performance

<div align="center">
<img src="figs/main_recall.png" width="80%"/>
</div>

For detailed experimental results and analysis, please refer to our paper.
## Acknowledgments

We sincerely appreciate the following works, which CLaRa builds on:

- Our implementation is built upon the [OpenRLHF framework](https://github.com/OpenRLHF/OpenRLHF).
- Our document compression techniques are inspired by [PISCO-mistral](https://huggingface.co/naver/pisco-mistral).
## Citation

```bibtex
@misc{he2025clarabridgingretrievalgeneration,
      title={CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning},
      author={Jie He and Richard He Bai and Sinead Williamson and Jeff Z. Pan and Navdeep Jaitly and Yizhe Zhang},
      year={2025},
      eprint={2511.18659},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.18659},
}
```