# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4

In this assignment, we will generate a preference dataset with PairRM and fine-tune a model with DPO. This is a powerful training recipe behind some of the top models on the AlpacaEval leaderboard.

You may use Llama-3.2 1B or Llama-3.2 3B.
## Preference Dataset Collection and DPO Model Training

### Part 1: Dataset Generation and Judge Implementation (40 points)

Create two separate preference datasets using different collection methods:
#### a) LLM Judge-Based Collection (20 points)

- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between local inference on Colab/Lightning Studio or a third-party provider such as Fireworks AI, OpenAI, Together AI, or Groq (Kimi K2)
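One common tactic for consistent judgments is to query the judge twice with the response order swapped and keep only position-invariant verdicts. A minimal sketch (function names and prompt wording are illustrative, not the run's actual judge):

```python
# Hypothetical sketch of a pairwise LLM-judge with position swapping.
def build_judge_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Format a pairwise comparison prompt for an LLM judge."""
    return (
        "You are an impartial judge. Compare the two responses to the "
        "instruction below and answer with exactly 'A' or 'B'.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Better response:"
    )

def judge_consistently(instruction, resp1, resp2, ask_judge):
    """Query the judge twice with the responses swapped; keep the verdict
    only if it is position-invariant, otherwise return None (unreliable)."""
    first = ask_judge(build_judge_prompt(instruction, resp1, resp2))   # 'A' or 'B'
    second = ask_judge(build_judge_prompt(instruction, resp2, resp1))  # swapped order
    if first == "A" and second == "B":
        return resp1
    if first == "B" and second == "A":
        return resp2
    return None  # inconsistent under swap -> discard the pair
```

Here `ask_judge` stands in for whatever local or hosted model call is used; discarding swap-inconsistent verdicts trades dataset size for label reliability.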

#### b) PairRM-Based Collection (20 points)

- Extract 50 instructions from the LIMA dataset
- Generate 5 responses per instruction using the Llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload the dataset to HuggingFace
- Submit the repository link
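Once PairRM (e.g., via llm-blender) has scored the candidate responses, the ranking can be turned into chosen/rejected records. A sketch of one possible pairing scheme, best-vs-rest (the run's actual scheme, which yielded 250 pairs from 50 instructions, may differ):

```python
# Sketch: turn per-response scores (e.g., from PairRM ranking) into
# chosen/rejected preference pairs. Field names are illustrative.
def scores_to_preference_pairs(instruction, responses, scores):
    """Pair the top-scored response (chosen) against every other response
    (rejected), yielding len(responses) - 1 preference records."""
    ranked = sorted(zip(responses, scores), key=lambda rs: rs[1], reverse=True)
    best = ranked[0][0]
    return [
        {"prompt": instruction, "chosen": best, "rejected": resp}
        for resp, _ in ranked[1:]
    ]
```

The resulting list of dicts maps directly onto the `prompt`/`chosen`/`rejected` schema that DPO training expects.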

### Part 2: Model Training and Evaluation (60 points)

#### a) DPO Fine-tuning (40 points)

- Fine-tune Llama-3.2 using the PairRM preference dataset
- Fine-tune Llama-3.2 using the LLM-judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links
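For reference, the kind of configuration involved can be sketched with peft and trl (a configuration sketch only; the values below are illustrative assumptions, not the run's logged parameters, which live in the notebook output):

```python
# Configuration sketch only (illustrative values): these objects would be
# passed to trl's DPOTrainer along with the base model, tokenizer, and the
# preference dataset.
from peft import LoraConfig
from trl import DPOConfig

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

dpo_config = DPOConfig(
    output_dir="llama3-dpo-pairrm",
    beta=0.1,                        # KL trade-off; see the Free Response note
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
```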

#### b) Comparative Analysis (20 points)

- Select 10 novel instructions (not in the training data)
- Generate completions using:
  * Original Llama-3.2
  * DPO fine-tuned model (LLM-judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations
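A results table of this shape can be assembled with pandas; a minimal sketch with placeholder completions (in the actual run these come from generating with each of the three models on the 10 novel instructions):

```python
import pandas as pd

# Placeholder completions; column names mirror the three models compared.
rows = [
    {
        "instruction": "Explain gradient clipping in one paragraph.",
        "base": "Gradient clipping caps the norm of gradients...",
        "dpo_pairrm": "1. Compute the gradient norm...\n2. Rescale...",
        "dpo_llm_judge": "Gradient clipping keeps training stable by...",
    },
]
df = pd.DataFrame(rows)

# Simple quantitative column: word count per completion.
for col in ["base", "dpo_pairrm", "dpo_llm_judge"]:
    df[f"{col}_words"] = df[col].str.split().str.len()
```

Adding per-model metric columns alongside the raw text keeps the quantitative and qualitative evidence in one table.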

Address the following points:

1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement

Grading Criteria for Free Response:

- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style

### Extra Credit: Iterative DPO Implementation and Analysis (30 points)

#### a) Implementation (20 points)

- Implement the iterative DPO algorithm described in "Self-Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters
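The loop from the paper can be outlined as follows; every helper here (`generate`, `self_judge`, `dpo_train`) is a stub standing in for real generation, judging, and training code:

```python
# Schematic of the iterative DPO loop from "Self-Rewarding Language Models":
# each iteration, the current model generates candidates, judges them itself,
# and is DPO-trained on the resulting pairs. All helpers are caller-supplied.
def iterative_dpo(model, prompts, n_iters=2, n_candidates=4,
                  generate=None, self_judge=None, dpo_train=None):
    history = []
    for it in range(1, n_iters + 1):
        pairs = []
        for prompt in prompts:
            candidates = generate(model, prompt, n_candidates)
            scored = sorted(candidates,
                            key=lambda c: self_judge(model, prompt, c),
                            reverse=True)
            if len(scored) >= 2:
                pairs.append({"prompt": prompt,
                              "chosen": scored[0], "rejected": scored[-1]})
        model = dpo_train(model, pairs)   # returns the updated model
        history.append((it, len(pairs)))
    return model, history
```

The `history` list records how many new pairs each iteration produced, matching the per-iteration counts reported below.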

#### b) Comparative Analysis (10 points)

Free Response Question (~250 words): Compare and analyze the performance and behavioral differences among the base Llama-3.2 model, the DPO-PairRM model, and the DPO-LLM-judge model.

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

- Dataset (HF Datasets): https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3
- Judge design doc: llm_judge_design_documentation_20250811_212607.txt (included in artifacts)
- Compute: local GPU

### 1b) PairRM-Based Collection (20 pts)

- Dataset (HF Datasets): https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3
- Spec: 50 LIMA instructions; 5 responses per instruction; 250 preference pairs
### 2a) DPO Fine-tuning (40 pts)

- Base model: meta-llama/Llama-3.2-1B-Instruct
- Adapters (HF Models):
  * PairRM DPO: https://huggingface.co/pyamy/llama3-dpo-pairrm
  * LLM-Judge DPO: https://huggingface.co/pyamy/llama3-dpo-llm-judge
- Training parameters/process: logged in the notebook output (per-step losses; LoRA adapters saved)

### 2b) Comparative Analysis (20 pts)

- Novelty check: 10 evaluation prompts; overlap with training = 0/10
- Results table: evaluation_results.csv (outputs from the base model and both DPO models)
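The novelty check can be reproduced with a simple set lookup (exact match after normalization; function and argument names are illustrative):

```python
# Count how many evaluation prompts also appear among the training
# instructions, after trimming whitespace and lowercasing.
def count_overlap(eval_prompts, train_instructions):
    train_set = {t.strip().lower() for t in train_instructions}
    return sum(p.strip().lower() in train_set for p in eval_prompts)
```

A result of 0 over the 10 prompts confirms the "novel instructions" requirement; a fuzzier check (e.g., n-gram overlap) would be stricter but is not what the 0/10 figure claims.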

Quantitative snapshot (from evaluation_results.csv):

| Model | avg_words | avg_chars | bullet_like_frac |
|---|---|---|---|
| Base | 26.1 | 153.0 | 0.10 |
| DPO-PairRM | 27.3 | 153.0 | 0.30 |
| DPO-LLM-Judge | 26.6 | 153.0 | 0.10 |
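The snapshot columns can be recomputed from raw completions roughly as follows (the exact bullet heuristic used in the run is not documented, so the line-prefix logic here is an assumption):

```python
# Corpus-level style metrics over a list of completion strings.
def snapshot_metrics(completions):
    n = len(completions)
    avg_words = sum(len(c.split()) for c in completions) / n
    avg_chars = sum(len(c) for c in completions) / n
    # A completion counts as "bullet-like" if any line starts with a
    # list marker (assumed heuristic, not the run's documented rule).
    bullet_like = sum(
        any(line.lstrip().startswith(("-", "*", "1.", "2."))
            for line in c.splitlines())
        for c in completions
    ) / n
    return {"avg_words": round(avg_words, 1),
            "avg_chars": round(avg_chars, 1),
            "bullet_like_frac": round(bullet_like, 2)}
```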

Qualitative observation (from the table): DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.

### Extra Credit - Iterative DPO (30 pts)
- Iteration 1: +20 new preference pairs → model ./iterative_dpo_model_iter_1
- Iteration 2: +0 new pairs → model ./iterative_dpo_model_iter_2
- Analysis file: iterative_dpo_analysis.txt

## Free Response

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of "chosen" over "rejected" responses while constraining divergence from the reference model with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From evaluation_results.csv, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating a stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, "instructional" phrasing; DPO-LLM-judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there.

Limitations include small preference sets and automated-judge bias; both can over-reward length and formatting. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and instruction adherence in ways consistent with the active preference signal, without visible degradation on these prompts.
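The objective summarized above is the standard DPO loss (Rafailov et al.), with policy $\pi_\theta$, reference $\pi_{\mathrm{ref}}$, and preferred/dispreferred responses $y_w, y_l$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
      \log \sigma\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Larger β penalizes divergence from the reference more heavily, which is why logging it alongside results matters for reproducibility.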

## Published assets

- Assignment 4 artifacts: https://huggingface.co/pyamy/dpo-assignment-4-artifacts
- PairRM dataset: https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3
- LLM-Judge dataset: https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3
- DPO-PairRM adapters: https://huggingface.co/pyamy/llama3-dpo-pairrm
- DPO-LLM-Judge adapters: https://huggingface.co/pyamy/llama3-dpo-llm-judge
- Colab notebook: https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing

Uploaded from: `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
Upload time (UTC): 2025-08-12T14:57:34Z