# DPO Training for Code Analysis

This folder contains a Direct Preference Optimization (DPO) trainer for fine-tuning models on code analysis tasks with preference pairs.

## Overview

DPO training uses preference pairs (chosen/rejected responses) to optimize the model to prefer better outputs over worse ones. This is particularly useful for tasks where we have multiple responses with different quality levels.

## Files

- `run_dpo.py` - Main DPO training script
- `config_dpo.yaml` - Configuration file for DPO training
- `f1_score_utils.py` - Utilities for computing F1 scores and creating preference pairs
- `requirements.txt` - Python dependencies
- `dpo_dataset.jsonl` - Sample DPO dataset

## Data Format

DPO requires data in the following format:

```jsonl
{
  "prompt": "##TASK\n",
  "chosen": "",
  "rejected": "",
  "chosen_f1": 1.0,
  "rejected_f1": 0.5
}
```

### Creating DPO Data from SFT Data

You can use the F1 score utility to create DPO pairs from multiple model generations:

```python
from f1_score_utils import create_dpo_pairs_from_generations

prompt = "##TASK\nAdd webhook support..."
generations = [output1, output2, output3, output4]  # Multiple model outputs
ground_truth = "##OUTPUT\n...\n##SELECT\n..."

pairs = create_dpo_pairs_from_generations(
    prompt, generations, ground_truth, min_f1_difference=0.1
)
```

## F1 Score Ranking

The F1 score is computed at the **file level**:

- **Precision**: Correct files / Total predicted files
- **Recall**: Correct files / Total ground truth files
- **F1**: Harmonic mean of precision and recall

Files are extracted from the `##SELECT` section:

```
##SELECT
crates/router/src/webhooks.rs::process_webhook
crates/common_enums/src/enums.rs::EventClass
```
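To make the ranking concrete, the sketch below shows one way file-level F1 can be computed from two `##SELECT` sections. It is an illustrative approximation, not the actual `compute_file_level_f1` implementation in `f1_score_utils.py`; in particular, it assumes one entry per line and compares only the path portion before `::`.

```python
def extract_files(text: str) -> set:
    """Collect file paths listed under the ##SELECT section (path before '::')."""
    files, in_select = set(), False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("##"):
            in_select = (line == "##SELECT")
            continue
        if in_select and line:
            files.add(line.split("::")[0])
    return files


def file_level_f1(predicted: str, ground_truth: str) -> float:
    """Harmonic mean of file-level precision and recall."""
    pred, gold = extract_files(predicted), extract_files(ground_truth)
    if not pred or not gold:
        return 0.0
    true_positives = len(pred & gold)
    precision = true_positives / len(pred)   # correct files / predicted files
    recall = true_positives / len(gold)      # correct files / ground-truth files
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```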
## Installation

```bash
pip install -r requirements.txt
```

## Usage

### 1. Prepare DPO Dataset

You need to generate multiple outputs for each prompt and rank them by F1 score:

```python
from f1_score_utils import compute_file_level_f1, rank_outputs_by_f1

# Rank outputs
ranked = rank_outputs_by_f1(outputs, ground_truth)
for output, f1, metrics in ranked:
    print(f"F1: {f1:.3f} - {metrics['true_positives']} correct files")
```

### 2. Configure Training

Edit `config_dpo.yaml`:

- Set `model.repo_id` to your SFT model path
- Adjust `dpo.beta` (temperature parameter, default 0.1)
- Set `dpo.loss_type` (sigmoid, hinge, ipo, kto)
- Configure training hyperparameters

### 3. Run Training

```bash
python run_dpo.py --config config_dpo.yaml
```

### 4. Merge Adapter (Optional)

If training is complete and you want to merge the adapter:

```bash
python run_dpo.py --config config_dpo.yaml --merge-only
```

## Configuration

### DPO Parameters

- `beta`: Temperature for DPO loss (higher = less aggressive preference learning)
- `label_smoothing`: Smoothing factor for labels
- `loss_type`: Type of loss function
  - `sigmoid`: Standard DPO loss (default)
  - `hinge`: Margin-based loss
  - `ipo`: Identity Policy Optimization
  - `kto`: Kahneman-Tversky Optimization
- `use_reference_model`: Whether to use a frozen reference model

### Training Tips

1. **Learning Rate**: Use a lower LR than for SFT (e.g., 5e-5 vs 2e-4)
2. **Beta**: Start with 0.1; increase for less aggressive learning
3. **Batch Size**: Larger batches are more stable
4. **Data Quality**: Ensure a significant F1 difference between chosen and rejected (≥0.1)

## Output

Training outputs:

- `runs/dpo_run_14b_v1/checkpoints/` - Training checkpoints
- `runs/dpo_run_14b_v1/best_adapter/` - Best adapter weights
- `runs/dpo_run_14b_v1/merged_14b_dpo_lora/` - Merged model
- `runs/dpo_run_14b_v1/logs/` - Training logs (JSONL format)

## WandB Integration

Enable experiment tracking in `config_dpo.yaml`:

```yaml
wandb:
  enabled: true
  project: "dpo-training"
  tags: ["dpo-lora", "preference-optimization"]
```

## Example: Generate DPO Data

```python
import json
from f1_score_utils import compute_file_level_f1, create_dpo_pairs_from_generations

# Load SFT data
with open("instruct_data.jsonl") as f:
    for line in f:
        data = json.loads(line)
        prompt = data["input"]
        ground_truth = data["output"]

        # Generate multiple outputs with your model
        generations = generate_multiple_outputs(prompt, num_samples=4)

        # Create preference pairs
        pairs = create_dpo_pairs_from_generations(
            prompt, generations, ground_truth, min_f1_difference=0.1
        )

        # Save pairs
        with open("dpo_dataset.jsonl", "a") as out:
            for pair in pairs:
                out.write(json.dumps(pair) + "\n")
```

## Troubleshooting

1. **OOM Errors**: Reduce batch size or enable gradient checkpointing
2. **No Improvement**: Check F1 score differences in the data; increase beta
3. **Unstable Training**: Lower the learning rate; increase the warmup ratio
4. **Reference Model Issues**: Set `use_reference_model: false` to use the implicit reference

## References

- DPO Paper: [Direct Preference Optimization](https://arxiv.org/abs/2305.18290)
- TRL Library: [HuggingFace TRL](https://github.com/huggingface/trl)