# DPO Training for Code Analysis
This folder contains a Direct Preference Optimization (DPO) trainer for fine-tuning models on code analysis tasks with preference pairs.
## Overview
DPO training uses preference pairs (chosen/rejected responses) to optimize the model to prefer better outputs over worse ones. This is particularly useful for tasks where multiple candidate responses of varying quality are available.
## Files
- `run_dpo.py` - Main DPO training script
- `config_dpo.yaml` - Configuration file for DPO training
- `f1_score_utils.py` - Utilities for computing F1 scores and creating preference pairs
- `requirements.txt` - Python dependencies
- `dpo_dataset.jsonl` - Sample DPO dataset
## Data Format
DPO requires data in JSON Lines format, one object per line with the following fields (pretty-printed here for readability; a validation sketch follows the example):
```jsonl
{
  "prompt": "##TASK\n<task description>",
  "chosen": "<better response with correct file selections>",
  "rejected": "<worse response with incorrect file selections>",
  "chosen_f1": 1.0,
  "rejected_f1": 0.5
}
```
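Before training, it can be worth sanity-checking the dataset file. A minimal sketch, assuming the field names shown above:

```python
import json

# Verify that every record carries the fields DPO training expects
required = {"prompt", "chosen", "rejected"}
with open("dpo_dataset.jsonl") as f:
    for i, line in enumerate(f, 1):
        record = json.loads(line)  # raises if a line is not valid JSON
        missing = required - record.keys()
        assert not missing, f"line {i} is missing fields: {missing}"
        # the chosen response should outrank the rejected one
        if "chosen_f1" in record and "rejected_f1" in record:
            assert record["chosen_f1"] > record["rejected_f1"], f"line {i}: inverted pair"
```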
### Creating DPO Data from SFT Data
You can use the F1 score utility to create DPO pairs from multiple model generations:
```python
from f1_score_utils import create_dpo_pairs_from_generations

prompt = "##TASK\nAdd webhook support..."
generations = [output1, output2, output3, output4]  # multiple model outputs for the same prompt
ground_truth = "##OUTPUT\n...\n##SELECT\n..."

pairs = create_dpo_pairs_from_generations(
    prompt, generations, ground_truth, min_f1_difference=0.1
)
```
## F1 Score Ranking
The F1 score is computed at the **file level**:
- **Precision**: Correct files / Total predicted files
- **Recall**: Correct files / Total ground truth files
- **F1**: Harmonic mean of precision and recall (a computation sketch follows the example below)
Files are extracted from the `##SELECT` section:
```
##SELECT
crates/router/src/webhooks.rs::process_webhook
crates/common_enums/src/enums.rs::EventClass
<EOS>
```
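The actual implementation lives in `f1_score_utils.py`; for illustration, here is a minimal sketch of the metric above, assuming each `##SELECT` line has the form `path::symbol` and that files are compared by path:

```python
def extract_files(select_block: str) -> set:
    """Collect file paths from a ##SELECT section (assumes 'path::symbol' lines)."""
    files = set()
    for line in select_block.splitlines():
        line = line.strip()
        if not line or line in ("##SELECT", "<EOS>"):
            continue
        files.add(line.split("::")[0])  # keep the file path, drop the symbol
    return files


def file_level_f1(predicted: str, ground_truth: str) -> float:
    """Harmonic mean of file-level precision and recall."""
    pred, gold = extract_files(predicted), extract_files(ground_truth)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)       # correctly predicted files
    precision = tp / len(pred)  # correct files / total predicted files
    recall = tp / len(gold)     # correct files / total ground-truth files
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```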
## Installation
```bash
pip install -r requirements.txt
```
## Usage
### 1. Prepare DPO Dataset
You need to generate multiple outputs for each prompt and rank them by F1 score:
```python
from f1_score_utils import compute_file_level_f1, rank_outputs_by_f1

# outputs: a list of model-generated strings for the same prompt
# Rank candidate outputs from best to worst by file-level F1
ranked = rank_outputs_by_f1(outputs, ground_truth)
for output, f1, metrics in ranked:
    print(f"F1: {f1:.3f} - {metrics['true_positives']} correct files")
```
### 2. Configure Training
Edit `config_dpo.yaml` (a sketch of the relevant keys follows the list below):
- Set `model.repo_id` to your SFT model path
- Adjust `dpo.beta` (temperature parameter, default 0.1)
- Set `dpo.loss_type` (sigmoid, hinge, ipo, kto)
- Configure training hyperparameters
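The authoritative schema is the shipped `config_dpo.yaml`; as orientation only, the keys referenced above might be laid out like this (section names and values are illustrative):

```yaml
model:
  repo_id: "path/to/your-sft-model"  # SFT checkpoint to start from

dpo:
  beta: 0.1             # DPO temperature
  loss_type: "sigmoid"  # one of: sigmoid, hinge, ipo, kto

training:
  learning_rate: 5.0e-5  # lower than typical SFT rates
```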
### 3. Run Training
```bash
python run_dpo.py --config config_dpo.yaml
```
### 4. Merge Adapter (Optional)
Once training is complete, you can merge the LoRA adapter into the base model:
```bash
python run_dpo.py --config config_dpo.yaml --merge-only
```
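The `--merge-only` flag handles the merge for you; conceptually, merging a LoRA adapter with `peft` looks roughly like this sketch (model and adapter paths are illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Attach the trained adapter to the base model, then fold the LoRA
# weights into the base weights so inference needs no adapter
base = AutoModelForCausalLM.from_pretrained("path/to/your-sft-model")
model = PeftModel.from_pretrained(base, "runs/dpo_run_14b_v1/best_adapter")
merged = model.merge_and_unload()
merged.save_pretrained("runs/dpo_run_14b_v1/merged_14b_dpo_lora")
```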
## Configuration
### DPO Parameters
- `beta`: Temperature for the DPO loss; higher values keep the policy closer to the reference model, i.e. less aggressive preference learning (see the objective below)
- `label_smoothing`: Smoothing factor for labels
- `loss_type`: Type of loss function
- `sigmoid`: Standard DPO loss (default)
- `hinge`: Margin-based loss
- `ipo`: Identity Policy Optimization
- `kto`: Kahneman-Tversky Optimization
- `use_reference_model`: Whether to use a frozen reference model
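For reference, `beta` is the β in the standard sigmoid DPO objective from the paper, where π_θ is the policy being trained, π_ref the frozen reference, and y_w/y_l the chosen/rejected responses:

```math
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
```

The other loss types swap out the log-sigmoid term for their respective objectives.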
### Training Tips
1. **Learning Rate**: Use a lower learning rate than for SFT (e.g., 5e-5 vs 2e-4)
2. **Beta**: Start with 0.1, increase for less aggressive learning
3. **Batch Size**: Larger batches are more stable
4. **Data Quality**: Ensure a significant F1 gap between chosen and rejected responses (≥0.1); a quick audit sketch follows
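A quick way to audit an existing dataset for tip 4, assuming the field names from the data format above:

```python
import json

# Flag pairs whose chosen/rejected F1 gap falls below the recommended 0.1
with open("dpo_dataset.jsonl") as f:
    for i, line in enumerate(f, 1):
        pair = json.loads(line)
        gap = pair["chosen_f1"] - pair["rejected_f1"]
        if gap < 0.1:
            print(f"line {i}: weak preference signal (gap={gap:.3f})")
```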
## Output
Training outputs:
- `runs/dpo_run_14b_v1/checkpoints/` - Training checkpoints
- `runs/dpo_run_14b_v1/best_adapter/` - Best adapter weights
- `runs/dpo_run_14b_v1/merged_14b_dpo_lora/` - Merged model
- `runs/dpo_run_14b_v1/logs/` - Training logs (JSONL format; see the inspection sketch below)
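Because the logs are JSONL, they can be inspected with a few lines; a sketch, assuming one JSON object per line (the exact metric keys depend on the trainer):

```python
import json
from pathlib import Path

# Print the most recent record from each JSONL log file
for log_file in Path("runs/dpo_run_14b_v1/logs").glob("*.jsonl"):
    lines = log_file.read_text().splitlines()
    if lines:
        print(log_file.name, "->", json.loads(lines[-1]))
```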
## WandB Integration
Enable experiment tracking in `config_dpo.yaml`:
```yaml
wandb:
  enabled: true
  project: "dpo-training"
  tags: ["dpo-lora", "preference-optimization"]
```
## Example: Generate DPO Data
```python
import json

from f1_score_utils import create_dpo_pairs_from_generations

# Build preference pairs from existing SFT data
with open("instruct_data.jsonl") as f:
    for line in f:
        data = json.loads(line)
        prompt = data["input"]
        ground_truth = data["output"]

        # Generate multiple outputs with your model
        # (generate_multiple_outputs is a placeholder for your sampling code)
        generations = generate_multiple_outputs(prompt, num_samples=4)

        # Create preference pairs ranked by file-level F1
        pairs = create_dpo_pairs_from_generations(
            prompt, generations, ground_truth, min_f1_difference=0.1
        )

        # Append the pairs to the DPO dataset
        with open("dpo_dataset.jsonl", "a") as out:
            for pair in pairs:
                out.write(json.dumps(pair) + "\n")
```
## Troubleshooting
1. **OOM Errors**: Reduce batch size or enable gradient checkpointing
2. **No Improvement**: Check that F1 differences between chosen and rejected pairs are large enough; tune `beta`
3. **Unstable Training**: Lower learning rate, increase warmup ratio
4. **Reference Model Issues**: Set `use_reference_model: false` to use implicit reference
## References
- DPO Paper: [Direct Preference Optimization](https://arxiv.org/abs/2305.18290)
- TRL Library: [HuggingFace TRL](https://github.com/huggingface/trl)