pyamy committed · Commit 95cc487 · verified · 1 Parent(s): 360fa61

Update README.md

Files changed (1): README.md (+124 −12)
README.md CHANGED
@@ -1,12 +1,124 @@
- # DPO Assignment 4 — Full Artifacts
-
- All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).
-
- ## Published assets
- - PairRM dataset: https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3
- - LLM Judge dataset: https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3
- - DPO PairRM model: https://huggingface.co/pyamy/llama3-dpo-pairrm
- - DPO LLM-Judge model: https://huggingface.co/pyamy/llama3-dpo-llm-judge
-
- Uploaded from: `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
- Upload time (UTC): 2025-08-12T14:57:34Z

# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine-tuning a model with DPO. This is a powerful training recipe behind some of the top models according to Alpaca Eval. You may use llama-3.2 1B or llama-3.2 3B.

## Preference Dataset Collection and DPO Model Training

### Part 1: Dataset Generation and Judge Implementation (40 points)

Create two separate preference datasets using different collection methods:

a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between local inference on Colab/Lightning Studio or a third-party provider such as Fireworks AI, OpenAI, Together AI, or Groq (Kimi K2)
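The judging setup described above can be sketched as a small pairwise comparator. This is a minimal illustration, assuming a generic `ask_llm` callable standing in for whichever provider or local model is used; the order-swapped second query is one simple way to make judgments more consistent. All names here are illustrative, not this repo's actual code:

```python
# Hedged sketch of a pairwise LLM judge; `ask_llm` is a stand-in for any
# provider call (OpenAI, Together AI, Groq, or a local model).

JUDGE_TEMPLATE = """You are an impartial judge. Compare two responses to the same instruction.

Instruction:
{instruction}

Response A:
{a}

Response B:
{b}

Reply with exactly one letter: A or B."""


def build_judge_prompt(instruction, a, b):
    """Fill the fixed rubric template for one pairwise comparison."""
    return JUDGE_TEMPLATE.format(instruction=instruction, a=a, b=b)


def parse_verdict(raw):
    """Map the judge's raw reply onto 'A'/'B'; None if unparseable."""
    verdict = raw.strip().upper()[:1]
    return verdict if verdict in ("A", "B") else None


def judge_pair(ask_llm, instruction, a, b):
    """Query twice with swapped order to reduce position bias; keep agreements only."""
    first = parse_verdict(ask_llm(build_judge_prompt(instruction, a, b)))
    second = parse_verdict(ask_llm(build_judge_prompt(instruction, b, a)))
    if first == "A" and second == "B":
        return a  # both orderings preferred this response
    if first == "B" and second == "A":
        return b
    return None  # disagreement or parse failure -> discard this pair
```

Discarding pairs on disagreement trades dataset size for label reliability, which is one way to address the consistency requirement above.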

b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the LIMA dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload the dataset to HuggingFace
- Submit the repository link
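The PairRM step above can be sketched as a small ranks-to-pairs helper. This assumes PairRM-style ranks (1 = best) have already been computed (e.g. with the llm-blender package); the pairing rule shown, best response versus each remaining one, is an illustrative choice and may not match the published dataset's exact construction:

```python
# Hedged sketch: converting PairRM-style ranks (1 = best) into
# (prompt, chosen, rejected) records for DPO training.

def ranks_to_pairs(instruction, responses, ranks):
    """Pair the top-ranked response against every other response."""
    order = sorted(range(len(responses)), key=lambda i: ranks[i])
    best = responses[order[0]]
    return [
        {"prompt": instruction, "chosen": best, "rejected": responses[i]}
        for i in order[1:]
    ]
```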

### Part 2: Model Training and Evaluation (60 points)

a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using the PairRM preference dataset
- Fine-tune llama-3.2 using the LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links
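For reference, the objective behind this fine-tuning step can be sketched per preference pair. This is a minimal illustration assuming summed token log-probabilities are already available for both the policy and the frozen reference model; beta = 0.1 is a common default, not necessarily the value used in this run:

```python
import math

# Hedged sketch of the per-pair DPO loss: -log sigmoid of the
# beta-scaled margin between chosen and rejected log-prob shifts.

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At a zero margin the loss is ln 2; it shrinks as the policy shifts probability mass toward the chosen response relative to the reference.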

b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
  * Original llama-3.2
  * DPO fine-tuned model (LLM judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations
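The kind of quantitative observations requested above can come from a few corpus-level style statistics; a minimal sketch, assuming the metric names used in the results below (avg_words, avg_chars, bullet_like_frac) but not necessarily the notebook's exact definitions, e.g. what counts as a bullet-like line:

```python
# Hedged approximation of corpus-level style metrics over model completions.

def style_metrics(completions):
    """Average word/char counts plus the fraction of list-style completions."""
    n = len(completions)
    avg_words = sum(len(c.split()) for c in completions) / n
    avg_chars = sum(len(c) for c in completions) / n
    bullet_like = sum(
        any(line.lstrip().startswith(("-", "*", "1.")) for line in c.splitlines())
        for c in completions
    ) / n
    return {
        "avg_words": round(avg_words, 1),
        "avg_chars": round(avg_chars, 1),
        "bullet_like_frac": round(bullet_like, 2),
    }
```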

Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement

Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style

### Extra Credit: Iterative DPO Implementation and Analysis (30 points)

a) Implementation (20 points)
- Implement the iterative DPO algorithm described in "Self-Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters

b) Comparative Analysis (10 points)
Free Response Question (~250 words): compare and analyze the performance and behavioral differences of the iterative model against the base llama-3.2 model, the DPO-PairRM model, and the DPO-LLM-Judge model.
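The iterative algorithm above can be skeletonized as a short loop with generation, self-judging, and DPO training injected as callables; every name here is a placeholder, not this repo's implementation:

```python
# Skeleton of the iterative DPO loop from "Self-Rewarding Language Models":
# each round samples responses, self-judges them into preference pairs,
# and DPO-trains on the survivors.

def iterative_dpo(model, prompts, generate, self_judge, train_dpo, iterations=2):
    """Run `iterations` rounds; return the final model and pair counts per round."""
    pair_counts = []
    for _ in range(iterations):
        responses = {p: generate(model, p) for p in prompts}
        pairs = [pair for p in prompts for pair in self_judge(model, p, responses[p])]
        pair_counts.append(len(pairs))
        if pairs:  # skip training when no new pairs survive
            model = train_dpo(model, pairs)
    return model, pair_counts
```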

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)
- Dataset (HF Datasets): https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3
- Judge design doc: `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
- Compute: local GPU

### 1b) PairRM-Based Collection (20 pts)
- Dataset (HF Datasets): https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3
- Spec: 50 LIMA instructions; 5 responses per instruction; 250 preference pairs

### 2a) DPO Fine-tuning (40 pts)
- Base model: meta-llama/Llama-3.2-1B-Instruct
- Adapters (HF Models):
  - PairRM DPO: https://huggingface.co/pyamy/llama3-dpo-pairrm
  - LLM-Judge DPO: https://huggingface.co/pyamy/llama3-dpo-llm-judge
- Training parameters/process: logged in notebook output (per-step losses; LoRA adapters saved)

### 2b) Comparative Analysis (20 pts)
- Novelty check: 10 evaluation prompts; overlap with training = 0/10
- Results table: `evaluation_results.csv` (outputs from base + both DPO models)

Quantitative snapshot (from `evaluation_results.csv`):

| Model | avg_words | avg_chars | bullet_like_frac |
|---|---|---|---|
| Base | 26.1 | 153.0 | 0.10 |
| DPO-PairRM | 27.3 | 153.0 | 0.30 |
| DPO-LLM-Judge | 26.6 | 153.0 | 0.10 |

Qualitative observation (from the table): DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.

### Extra Credit - Iterative DPO (30 pts)
- Iteration 1: +20 new preference pairs → model `./iterative_dpo_model_iter_1`
- Iteration 2: +0 new pairs → model `./iterative_dpo_model_iter_2`
- Analysis file: `iterative_dpo_analysis.txt`

## Free Response

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference model with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From evaluation_results.csv, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating a stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there.

Limitations include small preference sets and automated-judge bias; these can over-reward length and formatting. Improvements: log β and other hyperparameters alongside results; add an automatic win rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal, without visible degradation on these prompts.
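A simple automatic win rate of the kind suggested above could look like the following sketch; `judge` is a placeholder for any pairwise judge (human or LLM), and nothing here is the repo's actual code:

```python
# Hedged sketch of a win-rate metric over evaluation prompts;
# ties count as half a win.

def win_rate(prompts, candidate_outs, baseline_outs, judge):
    """Fraction of prompts on which the candidate model beats the baseline."""
    score = 0.0
    for prompt, cand, base in zip(prompts, candidate_outs, baseline_outs):
        # judge returns "candidate", "baseline", or "tie"
        score += {"candidate": 1.0, "tie": 0.5, "baseline": 0.0}[judge(prompt, cand, base)]
    return score / len(prompts)
```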

## Published assets
- Assignment 4 artifacts: https://huggingface.co/pyamy/dpo-assignment-4-artifacts
- PairRM dataset: https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3
- LLM-Judge dataset: https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3
- DPO-PairRM adapters: https://huggingface.co/pyamy/llama3-dpo-pairrm
- DPO-LLM-Judge adapters: https://huggingface.co/pyamy/llama3-dpo-llm-judge
- Colab notebook: https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing

Uploaded from: `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
Upload time (UTC): 2025-08-12T14:57:34Z