# DPO Training for Code Analysis

This folder contains a Direct Preference Optimization (DPO) trainer for fine-tuning models on code analysis tasks with preference pairs.

## Overview

DPO training uses preference pairs (chosen/rejected responses) to optimize the model to prefer better outputs over worse ones. This is particularly useful for tasks where multiple responses of differing quality are available for the same prompt.

## Files

- `run_dpo.py` - Main DPO training script
- `config_dpo.yaml` - Configuration file for DPO training
- `f1_score_utils.py` - Utilities for computing F1 scores and creating preference pairs
- `requirements.txt` - Python dependencies
- `dpo_dataset.jsonl` - Sample DPO dataset
## Data Format

DPO requires data in the following format (shown pretty-printed here; in the `.jsonl` file each record occupies a single line):

```json
{
  "prompt": "##TASK\n<task description>",
  "chosen": "<better response with correct file selections>",
  "rejected": "<worse response with incorrect file selections>",
  "chosen_f1": 1.0,
  "rejected_f1": 0.5
}
```
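A quick way to sanity-check a dataset file against this schema is a small validator (the field names come from the format above; the helper itself is just a sketch, not part of `f1_score_utils.py`):

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_dpo_record(line: str) -> dict:
    """Parse one JSONL line and check the required DPO fields."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["chosen"] == record["rejected"]:
        raise ValueError("chosen and rejected must differ")
    return record

# One record, written as a single JSONL line
record = validate_dpo_record(
    '{"prompt": "##TASK\\n...", "chosen": "A", "rejected": "B", '
    '"chosen_f1": 1.0, "rejected_f1": 0.5}'
)
```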
### Creating DPO Data from SFT Data

You can use the F1 score utility to create DPO pairs from multiple model generations:

```python
from f1_score_utils import create_dpo_pairs_from_generations

prompt = "##TASK\nAdd webhook support..."
generations = [output1, output2, output3, output4]  # Multiple model outputs
ground_truth = "##OUTPUT\n...\n##SELECT\n..."

pairs = create_dpo_pairs_from_generations(
    prompt, generations, ground_truth, min_f1_difference=0.1
)
```
## F1 Score Ranking

The F1 score is computed at the **file level**:

- **Precision**: Correct files / Total predicted files
- **Recall**: Correct files / Total ground truth files
- **F1**: Harmonic mean of precision and recall

Files are extracted from the `##SELECT` section:

```
##SELECT
crates/router/src/webhooks.rs::process_webhook
crates/common_enums/src/enums.rs::EventClass
<EOS>
```
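The metric above can be sketched in standalone Python. This is an illustration of the definition, not the code in `f1_score_utils.py`; it assumes the file path is the part of each `##SELECT` entry before `::`:

```python
def extract_files(output: str) -> set[str]:
    """Collect file paths from the ##SELECT section (the part before '::')."""
    files, in_select = set(), False
    for line in output.splitlines():
        line = line.strip()
        if line == "##SELECT":
            in_select = True
        elif in_select:
            if line == "<EOS>" or line.startswith("##"):
                break
            if line:
                files.add(line.split("::")[0])
    return files

def file_level_f1(predicted: str, ground_truth: str) -> float:
    """Harmonic mean of file-level precision and recall."""
    pred, gold = extract_files(predicted), extract_files(ground_truth)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)  # correctly predicted files
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Prediction finds one of the two ground-truth files:
f1 = file_level_f1(
    "##SELECT\ncrates/router/src/webhooks.rs::process_webhook\n<EOS>",
    "##SELECT\ncrates/router/src/webhooks.rs::process_webhook\n"
    "crates/common_enums/src/enums.rs::EventClass\n<EOS>",
)  # precision 1.0, recall 0.5
```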
## Installation

```bash
pip install -r requirements.txt
```
## Usage

### 1. Prepare DPO Dataset

You need to generate multiple outputs for each prompt and rank them by F1 score:

```python
from f1_score_utils import compute_file_level_f1, rank_outputs_by_f1

# Rank outputs
ranked = rank_outputs_by_f1(outputs, ground_truth)
for output, f1, metrics in ranked:
    print(f"F1: {f1:.3f} - {metrics['true_positives']} correct files")
```
### 2. Configure Training

Edit `config_dpo.yaml`:

- Set `model.repo_id` to your SFT model path
- Adjust `dpo.beta` (temperature parameter, default 0.1)
- Set `dpo.loss_type` (sigmoid, hinge, ipo, kto)
- Configure training hyperparameters
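Putting those settings together, a config might look like the fragment below. The keys under `model` and `dpo` are the ones named above; the `training` section and the example model path are illustrative, so match them to your actual `config_dpo.yaml` layout:

```yaml
model:
  repo_id: "runs/sft_run_14b/merged"  # example path to your SFT model

dpo:
  beta: 0.1            # temperature for the DPO loss
  loss_type: sigmoid   # sigmoid | hinge | ipo | kto
  label_smoothing: 0.0
  use_reference_model: true

training:              # hypothetical section name
  learning_rate: 5.0e-5  # lower than typical SFT (see Training Tips)
```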
### 3. Run Training

```bash
python run_dpo.py --config config_dpo.yaml
```

### 4. Merge Adapter (Optional)

Once training has finished, you can merge the adapter into the base model:

```bash
python run_dpo.py --config config_dpo.yaml --merge-only
```
## Configuration

### DPO Parameters

- `beta`: Temperature for the DPO loss (higher = less aggressive preference learning)
- `label_smoothing`: Smoothing factor for labels
- `loss_type`: Type of loss function
  - `sigmoid`: Standard DPO loss (default)
  - `hinge`: Margin-based loss
  - `ipo`: Identity Preference Optimization
  - `kto`: Kahneman-Tversky Optimization
- `use_reference_model`: Whether to use a frozen reference model
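For intuition about `beta`, the standard sigmoid loss on a single preference pair can be written in a few lines. This mirrors the formula from the DPO paper (scalar log-probabilities, no batching), not this repo's implementation:

```python
import math

def dpo_sigmoid_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy already prefers the chosen response over the rejected one,
# the margin is positive and the loss drops below log(2):
loss = dpo_sigmoid_loss(-1.0, -5.0, -2.0, -2.0, beta=0.1)
```

Note how `beta` scales the margin: a larger `beta` makes the loss saturate sooner once the policy's preference matches the data, so updates become less aggressive.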
### Training Tips

1. **Learning Rate**: Use a lower LR than for SFT (e.g., 5e-5 vs 2e-4)
2. **Beta**: Start with 0.1; increase for less aggressive learning
3. **Batch Size**: Larger batches are more stable
4. **Data Quality**: Ensure a significant F1 difference between chosen and rejected (≥0.1)
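Tip 4 can be enforced with a one-line filter over existing pairs (a sketch; `filter_pairs` is a hypothetical helper, not part of `f1_score_utils.py`):

```python
def filter_pairs(pairs: list[dict], min_gap: float = 0.1) -> list[dict]:
    """Keep only pairs whose chosen/rejected F1 gap meets the threshold."""
    return [p for p in pairs if p["chosen_f1"] - p["rejected_f1"] >= min_gap]

pairs = [
    {"chosen_f1": 1.0, "rejected_f1": 0.5},   # kept: gap 0.5
    {"chosen_f1": 0.6, "rejected_f1": 0.55},  # dropped: gap 0.05
]
kept = filter_pairs(pairs)
```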
## Output

Training outputs:

- `runs/dpo_run_14b_v1/checkpoints/` - Training checkpoints
- `runs/dpo_run_14b_v1/best_adapter/` - Best adapter weights
- `runs/dpo_run_14b_v1/merged_14b_dpo_lora/` - Merged model
- `runs/dpo_run_14b_v1/logs/` - Training logs (JSONL format)
## WandB Integration

Enable experiment tracking in `config_dpo.yaml`:

```yaml
wandb:
  enabled: true
  project: "dpo-training"
  tags: ["dpo-lora", "preference-optimization"]
```
## Example: Generate DPO Data

```python
import json
from f1_score_utils import compute_file_level_f1, create_dpo_pairs_from_generations

# Load SFT data
with open("instruct_data.jsonl") as f:
    for line in f:
        data = json.loads(line)
        prompt = data["input"]
        ground_truth = data["output"]

        # Generate multiple outputs with your model
        generations = generate_multiple_outputs(prompt, num_samples=4)

        # Create preference pairs
        pairs = create_dpo_pairs_from_generations(
            prompt, generations, ground_truth, min_f1_difference=0.1
        )

        # Append pairs to the DPO dataset
        with open("dpo_dataset.jsonl", "a") as out:
            for pair in pairs:
                out.write(json.dumps(pair) + "\n")
```
## Troubleshooting

1. **OOM Errors**: Reduce batch size or enable gradient checkpointing
2. **No Improvement**: Check F1 score differences in the data; increase beta
3. **Unstable Training**: Lower the learning rate; increase the warmup ratio
4. **Reference Model Issues**: Set `use_reference_model: false` to use the implicit reference
## References

- DPO Paper: [Direct Preference Optimization](https://arxiv.org/abs/2305.18290)
- TRL Library: [HuggingFace TRL](https://github.com/huggingface/trl)