---
layout: default
title: Training Guide
permalink: /training/
---

# Training Guide

This guide covers the three-stage training process in CLaRa.

## Overview

CLaRa uses a three-stage training approach:

1. **Stage 1**: Compression Pretraining
2. **Stage 2**: Compression Instruction Tuning
3. **Stage 3**: End-to-End Fine-tuning (CLaRa)

The stages chain together through checkpoints; worked examples appear at the end of this guide.

## Stage 1: Compression Pretraining

Train the compressor to learn effective document compression.

### Key Parameters

- `--stage stage1`: Training stage identifier
- `--compress_rate`: Compression rate (default: 32)
- `--doc_max_length`: Maximum document length (default: 256)
- `--mse_loss`: Use MSE loss for compression alignment
- `--qa_loss`: Use QA loss for semantic preservation

### Example Command

```bash
bash scripts/train_pretraining.sh
```

### Data Format

**Stage 1 Pretraining Data:**

```json
{
  "data_type": "qa",
  "question": ["Question 1", "Question 2", ...],
  "answers": ["Answer 1", "Answer 2", ...],
  "docs": ["Document 1", "Document 2", ...]
}
```

## Stage 2: Compression Instruction Tuning

Fine-tune the compressor on instruction-following tasks.

### Key Parameters

- `--stage stage1_2`: Training stage identifier
- `--pretrain_checkpoint`: Path to the Stage 1 checkpoint
- `--generation_top_k`: Top-k sampling (default: 5)
- `--mse_loss`: Continue using MSE loss
- `--do_eval_gen`: Enable generation evaluation

### Example Command

```bash
bash scripts/train_instruction_tuning.sh
```

### Data Format

**Stage 2 Instruction Tuning Data:**

```json
{
  "question": "Single question text",
  "docs": ["Document 1", "Document 2", ...],
  "gold_answer": "Reference answer",
  "answer": "Generated answer"
}
```

## Stage 3: End-to-End Training

Jointly train the reranker and generator with retrieval.

### Key Parameters

- `--stage stage2`: Training stage identifier
- `--pretrain_checkpoint`: Path to the Stage 2 checkpoint
- `--generation_top_k`: Top-k sampling for generation
- `--do_eval_gen`: Enable generation evaluation

### Example Command

```bash
bash scripts/train_stage_end_to_end.sh
```

### Data Format

**Stage 3 End-to-End Data:**

```json
{
  "question": "Single question text",
  "docs": ["Document 1", "Document 2", ...],
  "gold_answer": "Reference answer"
}
```

## Distributed Training

All training stages support distributed training across multiple nodes and GPUs.

### Key Parameters

- `--max_len`: Maximum sequence length (2048 for stage1/stage2, 1024 for stage3)
- `--train_batch_size`: Training batch size
- `--micro_train_batch_size`: Micro batch size for gradient accumulation
- `--learning_rate`: Learning rate (1e-4 for stage1/stage2, 5e-6 for stage3)
- `--max_epochs`: Maximum training epochs
- `--zero_stage`: ZeRO optimization stage (default: 2)
- `--bf16`: Use bfloat16 precision
- `--flash_attn`: Use Flash Attention 2

A hedged multi-node launch sketch appears under Worked Examples at the end of this guide.

## Monitoring Training

Training progress is logged via:

- Console output
- Wandb (if configured; see the setup note under Worked Examples)
- Checkpoint files

Checkpoints are saved at the path specified by `--save_path`.
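
## Worked Examples

The examples in this section are illustrative sketches, not the repository's supported interface; the `scripts/*.sh` wrappers shown above remain the canonical way to launch each stage.

### Chaining the Three Stages

The stages connect through checkpoints: each stage's `--save_path` output becomes the next stage's `--pretrain_checkpoint` input. The sketch below assumes a DeepSpeed-style launcher and a hypothetical `train.py` entry point (both assumptions, since the real entry point is wrapped by the shell scripts); the flags themselves are the documented CLaRa parameters.

```bash
# Sketch only: "train.py" and the ckpts/ paths are assumptions, not the
# repository's actual entry point or layout. Adapt to your launcher.

# Stage 1: compression pretraining
deepspeed train.py \
  --stage stage1 \
  --compress_rate 32 --doc_max_length 256 \
  --mse_loss --qa_loss \
  --learning_rate 1e-4 --max_len 2048 \
  --zero_stage 2 --bf16 --flash_attn \
  --save_path ckpts/stage1

# Stage 2: instruction tuning, warm-started from Stage 1
deepspeed train.py \
  --stage stage1_2 \
  --pretrain_checkpoint ckpts/stage1 \
  --generation_top_k 5 \
  --mse_loss --do_eval_gen \
  --learning_rate 1e-4 --max_len 2048 \
  --zero_stage 2 --bf16 --flash_attn \
  --save_path ckpts/stage2

# Stage 3: end-to-end training, warm-started from Stage 2
# (note the stage identifier is "stage2" even though this is Stage 3)
deepspeed train.py \
  --stage stage2 \
  --pretrain_checkpoint ckpts/stage2 \
  --generation_top_k 5 --do_eval_gen \
  --learning_rate 5e-6 --max_len 1024 \
  --zero_stage 2 --bf16 --flash_attn \
  --save_path ckpts/stage3
```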
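
### Multi-Node Launch and Batch Sizing

Assuming the common convention that gradient-accumulation steps are derived as `train_batch_size / (micro_train_batch_size × world size)`, a run with `--train_batch_size 128`, `--micro_train_batch_size 4`, and 16 GPUs accumulates 128 / (4 × 16) = 2 micro-batches per optimizer step. The sketch below uses the standard DeepSpeed runner flags (`--hostfile`, `--num_nodes`, `--num_gpus`) together with the same hypothetical `train.py` entry point as above.

```bash
# Two nodes x eight GPUs = world size 16. "hostfile", "train.py", and the
# paths are assumptions for illustration; the CLaRa flags are documented.
deepspeed --hostfile hostfile --num_nodes 2 --num_gpus 8 train.py \
  --stage stage2 \
  --pretrain_checkpoint ckpts/stage2 \
  --train_batch_size 128 \
  --micro_train_batch_size 4 \
  --learning_rate 5e-6 --max_len 1024 \
  --max_epochs 1 \
  --zero_stage 2 --bf16 --flash_attn \
  --save_path ckpts/stage3
```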
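
### Wandb Setup

Wandb authentication is handled by the standard `wandb` CLI, independently of CLaRa; whether a given training script actually reports to Wandb depends on how it is configured.

```bash
# Standard wandb authentication (one-time); after this, configured runs
# are logged automatically.
pip install wandb
wandb login          # or: export WANDB_API_KEY=<your key>
```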