Commit 681f5b3 (verified) by pyamy · Parent: 95cc487

Update README.md

Files changed (1): README.md (+70 −38)
# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4 (verbatim prompt)

```
Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.

b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```

---

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU

### 1b) PairRM-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses/instruction; 250 preference pairs
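
PairRM ranks the sampled responses for each instruction; this README does not state how those rankings were turned into preference pairs, so the following is only a generic sketch (the function name and the all-unordered-pairs scheme are my choices, not the notebook's):

```python
from itertools import combinations

def ranks_to_pairs(instruction, candidates, ranks):
    """Convert one instruction's ranked candidates into DPO preference pairs.

    ranks[i] is the rank of candidates[i] (1 = best). Emits one
    (chosen, rejected) record per unordered candidate pair, skipping ties.
    """
    pairs = []
    for i, j in combinations(range(len(candidates)), 2):
        if ranks[i] == ranks[j]:
            continue  # a tie carries no preference signal
        better, worse = (i, j) if ranks[i] < ranks[j] else (j, i)
        pairs.append({
            "prompt": instruction,
            "chosen": candidates[better],
            "rejected": candidates[worse],
        })
    return pairs

# Example: 3 sampled responses ranked by a PairRM-style ranker
pairs = ranks_to_pairs("Explain DPO briefly.", ["resp_a", "resp_b", "resp_c"], [2, 1, 3])
print(len(pairs))  # 3
```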

---

### 2a) DPO Fine-tuning (40 pts)

* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
  * PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
  * LLM-Judge DPO: [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** Logged in notebook output (per-step losses; LoRA adapters saved)

### 2b) Comparative Analysis (20 pts)

* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (saved with outputs from base + both DPO models)
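
The novelty check amounts to a set-membership test over normalized prompts. A minimal sketch, assuming lowercase/whitespace normalization (the notebook's exact normalization isn't shown here):

```python
def prompt_overlap(eval_prompts, train_prompts):
    """Count eval prompts that also appear in training (case/whitespace-insensitive)."""
    norm = lambda s: " ".join(s.lower().split())
    train = {norm(p) for p in train_prompts}
    return sum(1 for p in eval_prompts if norm(p) in train)

train = ["Write a haiku about rain.", "Explain RLHF."]
eval_ = ["Summarize DPO.", "  explain rlhf.  "]  # second one matches after normalization
print(prompt_overlap(eval_, train))  # 1
```

A result of 0 over the 10 evaluation prompts is what the novelty claim above asserts.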

**Quantitative snapshot (from `evaluation_results.csv`):**

| Model         | avg_words | avg_chars | bullet_like_frac |
| ------------- | --------- | --------- | ---------------- |
| Base          | 26.1      | 153.0     | 0.10             |
| DPO-PairRM    | 27.3      | 153.0     | 0.30             |
| DPO-LLM-Judge | 26.6      | 153.0     | 0.10             |
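
The three metrics in the table can be computed with the standard library alone. The definitions below, especially what counts as "bullet-like", are my assumptions rather than the notebook's exact code:

```python
def response_stats(responses):
    """Corpus-level style metrics over a list of model responses (assumed definitions)."""
    n = len(responses)
    avg_words = sum(len(r.split()) for r in responses) / n
    avg_chars = sum(len(r) for r in responses) / n

    def bullet_like(r):
        # a response is "bullet-like" if any line starts with a list marker
        return any(line.lstrip().startswith(("-", "*", "•")) or
                   line.lstrip()[:2].rstrip(".").isdigit()
                   for line in r.splitlines() if line.strip())

    bullet_like_frac = sum(bullet_like(r) for r in responses) / n
    return {"avg_words": round(avg_words, 1),
            "avg_chars": round(avg_chars, 1),
            "bullet_like_frac": round(bullet_like_frac, 2)}

print(response_stats(["Plain answer.", "- step one\n- step two"]))
```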

**Qualitative observation (from table):**
DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.

---

## Extra Credit — Iterative DPO (30 pts)

* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`
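
The two iterations above follow a collect-then-retrain loop. A minimal skeleton with stand-ins injected for generation and training (all names hypothetical; this sketch stops early on a +0 round, whereas the actual run still saved `iter_2`):

```python
def iterative_dpo(model_path, collect_new_pairs, train_dpo, rounds=2):
    """Skeleton of an iterative-DPO loop: collect pairs with the current model, retrain, repeat."""
    history = []
    for i in range(1, rounds + 1):
        new_pairs = collect_new_pairs(model_path)
        history.append((i, len(new_pairs)))
        if not new_pairs:
            break  # no fresh preference signal this round
        model_path = train_dpo(model_path, new_pairs, iteration=i)
    return model_path, history

# Dummy stubs mirroring the run above: +20 pairs, then +0
batches = iter([[{"pair": k} for k in range(20)], []])
path, hist = iterative_dpo(
    "base",
    collect_new_pairs=lambda m: next(batches),
    train_dpo=lambda m, p, iteration: f"./iterative_dpo_model_iter_{iteration}",
)
print(hist)  # [(1, 20), (2, 0)]
```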

---

## Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference model with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes. From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating a stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-Judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I make no claims there. Limitations include small preference sets and automated-judge bias, which can over-reward length and formatting. Improvements: log β and other hyperparameters alongside results; add an automatic win rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal, without visible degradation on these prompts.
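
The objective described above is the standard DPO loss (Rafailov et al.): the policy $\pi_\theta$ is trained to widen the margin between the chosen response $y_w$ and the rejected response $y_l$ relative to the frozen reference $\pi_{\mathrm{ref}}$, with β setting the implicit KL strength:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```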
 
 
 
---

## Published Assets

* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)

---

**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
**Upload time (UTC):** 2025-08-12T14:57:34Z