# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4 (verbatim prompt)

```
Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.

[...]

b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```

---

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU
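The judge prompt and criteria live in the design doc above. As a hedged sketch of how a collection loop can turn pairwise judge verdicts into preference pairs (the order-swap consistency check is an assumption of this illustration, not necessarily what the notebook does):

```python
from typing import Callable, Optional, Tuple

def judge_pair(
    instruction: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str, str], str],
) -> Optional[Tuple[str, str]]:
    """Query the judge with both orderings to reduce position bias.

    `judge(instruction, first, second)` returns "first" or "second".
    Returns (chosen, rejected), or None when the two orderings disagree
    (a positional tie we discard rather than keep as a noisy label).
    """
    verdict_ab = judge(instruction, response_a, response_b)
    verdict_ba = judge(instruction, response_b, response_a)
    a_wins_ab = verdict_ab == "first"       # A won when shown first
    a_wins_ba = verdict_ba == "second"      # A won when shown second
    if a_wins_ab and a_wins_ba:
        return response_a, response_b
    if (not a_wins_ab) and (not a_wins_ba):
        return response_b, response_a
    return None  # inconsistent across orderings -> skip this pair
```

Discarding order-inconsistent verdicts trades dataset size for label quality, which matters with only 150 judge pairs.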

### 1b) PairRM-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses/instruction; 250 preference pairs
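250 pairs over 50 instructions works out to 5 pairs per instruction, so the notebook evidently keeps a subset of the C(5,2) = 10 possible pairs per instruction. The helper below is an illustrative all-pairs version of turning PairRM-style ranks into DPO rows (the function name and row schema are assumptions, not the notebook's actual code):

```python
from itertools import combinations
from typing import Dict, List, Sequence

def pairs_from_ranking(
    instruction: str,
    candidates: Sequence[str],
    ranks: Sequence[int],  # ranks[i] == 1 is best, as PairRM-style rankers emit
) -> List[Dict[str, str]]:
    """Turn one instruction's ranked candidates into DPO preference rows.

    Every pair of candidates with distinct ranks yields one
    {"prompt", "chosen", "rejected"} row; rank ties are skipped.
    """
    rows = []
    for i, j in combinations(range(len(candidates)), 2):
        if ranks[i] == ranks[j]:
            continue  # no preference signal between tied candidates
        better, worse = (i, j) if ranks[i] < ranks[j] else (j, i)
        rows.append(
            {
                "prompt": instruction,
                "chosen": candidates[better],
                "rejected": candidates[worse],
            }
        )
    return rows
```

A sparser scheme (e.g. best-vs-rest, or sampling 5 of the 10 combinations) would match the stated 250-pair total.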

---

### 2a) DPO Fine-tuning (40 pts)

* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
  * PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
  * LLM-Judge DPO: [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** Logged in notebook output (per-step losses; LoRA adapters saved)
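The logged hyperparameters are in the notebook output. As a minimal sketch of the per-example objective DPO optimizes (β = 0.1 here is a commonly used default and an assumption, since β was not recorded for this run):

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    The margin is log p(chosen) - log p(rejected); beta scales how hard the
    policy is pushed relative to the frozen reference model (the implicit
    KL constraint mentioned in the free response below).
    """
    margin = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy's chosen/rejected margin equals the reference's, the loss sits at log 2; widening the margin on "chosen" drives it toward zero.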

### 2b) Comparative Analysis (20 pts)

* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (saved with outputs from base + both DPO models)

**Quantitative snapshot (from `evaluation_results.csv`):**

| Model         | avg_words | avg_chars | bullet_like_frac |
| ------------- | --------- | --------- | ---------------- |
| Base          | 26.1      | 153.0     | 0.10             |
| DPO-PairRM    | 27.3      | 153.0     | 0.30             |
| DPO-LLM-Judge | 26.6      | 153.0     | 0.10             |

**Qualitative observation (from the table):**
DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.
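Metrics of this kind can be recomputed from the per-model response columns of `evaluation_results.csv` in a few lines; the bullet-detection regex below is an illustrative heuristic and may differ from the notebook's exact rule:

```python
import re
import statistics
from typing import Dict, Sequence

# lines opening with "-", "*", "•", or "1." / "1)" count as list markers
_BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s")

def style_metrics(responses: Sequence[str]) -> Dict[str, float]:
    """Corpus-level style stats matching the snapshot columns above."""
    def bullet_like(text: str) -> bool:
        # a response is "bullet-like" if any of its lines opens with a marker
        return any(_BULLET.match(line) for line in text.splitlines())

    return {
        "avg_words": statistics.mean(len(r.split()) for r in responses),
        "avg_chars": statistics.mean(len(r) for r in responses),
        "bullet_like_frac": sum(bullet_like(r) for r in responses) / len(responses),
    }
```

Note that identical `avg_chars` across all three models in the snapshot suggests the CSV may store truncated outputs; worth verifying before leaning on that column.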

---

## Extra Credit — Iterative DPO (30 pts)

* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`
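A sketch of the iterate-collect-train control flow, including the early stop that iteration 2 hit when no new pairs were found. All names here are hypothetical; the real loop lives in the notebook:

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (chosen, rejected)

def iterative_dpo(
    model: str,
    prompts: List[str],
    rounds: int,
    collect_pairs: Callable[[str, List[str]], List[Pair]],
    train: Callable[[str, List[Pair]], str],
) -> str:
    """Run up to `rounds` collect-then-train iterations.

    Stops early when a round yields no new preference pairs, since
    training on an empty set would leave the model unchanged.
    """
    current = model
    for _ in range(1, rounds + 1):
        pairs = collect_pairs(current, prompts)
        if not pairs:
            break  # no new preference signal -> converged for this setup
        current = train(current, pairs)  # e.g. writes ./iterative_dpo_model_iter_N
    return current
```

With injected `collect_pairs` / `train` callables the loop is testable without GPUs, which is how the assertions below exercise it.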

---

## Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes. From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-Judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there. Limitations include small preference sets and automated-judge bias; these can over-reward length/format. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal, without visible degradation on these prompts.

---

## Published Assets

* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)

---

**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`

**Upload time (UTC):** 2025-08-12T14:57:34Z