# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4 (verbatim prompt)

```
Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.

[...]

b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```

---

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU
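The judge prompt and criteria live in the design doc above. As a hedged sketch of how a collection loop can turn pairwise judge verdicts into preference pairs (the order-swap consistency check is an assumption of this illustration, not necessarily what the notebook does):

```python
from typing import Callable, Optional, Tuple

def judge_pair(
    instruction: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str, str], str],
) -> Optional[Tuple[str, str]]:
    """Query the judge with both orderings to reduce position bias.

    `judge(instruction, first, second)` returns "first" or "second".
    Returns (chosen, rejected), or None when the two orderings disagree
    (a positional tie we discard rather than keep as a noisy label).
    """
    verdict_ab = judge(instruction, response_a, response_b)
    verdict_ba = judge(instruction, response_b, response_a)
    a_wins_ab = verdict_ab == "first"       # A won when shown first
    a_wins_ba = verdict_ba == "second"      # A won when shown second
    if a_wins_ab and a_wins_ba:
        return response_a, response_b
    if (not a_wins_ab) and (not a_wins_ba):
        return response_b, response_a
    return None  # inconsistent across orderings -> skip this pair
```

Discarding order-inconsistent verdicts trades dataset size for label quality, which matters with only 150 judge pairs.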

### 1b) PairRM-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses/instruction; 250 preference pairs
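250 pairs over 50 instructions works out to 5 pairs per instruction, so the notebook evidently keeps a subset of the C(5,2) = 10 possible pairs per instruction. The helper below is an illustrative all-pairs version of turning PairRM-style ranks into DPO rows (the function name and row schema are assumptions, not the notebook's actual code):

```python
from itertools import combinations
from typing import Dict, List, Sequence

def pairs_from_ranking(
    instruction: str,
    candidates: Sequence[str],
    ranks: Sequence[int],  # ranks[i] == 1 is best, as PairRM-style rankers emit
) -> List[Dict[str, str]]:
    """Turn one instruction's ranked candidates into DPO preference rows.

    Every pair of candidates with distinct ranks yields one
    {"prompt", "chosen", "rejected"} row; rank ties are skipped.
    """
    rows = []
    for i, j in combinations(range(len(candidates)), 2):
        if ranks[i] == ranks[j]:
            continue  # no preference signal between tied candidates
        better, worse = (i, j) if ranks[i] < ranks[j] else (j, i)
        rows.append(
            {
                "prompt": instruction,
                "chosen": candidates[better],
                "rejected": candidates[worse],
            }
        )
    return rows
```

A sparser scheme (e.g. best-vs-rest, or sampling 5 of the 10 combinations) would match the stated 250-pair total.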

---

### 2a) DPO Fine-tuning (40 pts)

* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
  * PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
  * LLM-Judge DPO: [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** Logged in notebook output (per-step losses; LoRA adapters saved)
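The logged hyperparameters are in the notebook output. As a minimal sketch of the per-example objective DPO optimizes (β = 0.1 here is a commonly used default and an assumption, since β was not recorded for this run):

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    The margin is log p(chosen) - log p(rejected); beta scales how hard the
    policy is pushed relative to the frozen reference model (the implicit
    KL constraint mentioned in the free response below).
    """
    margin = (policy_chosen_logp - policy_rejected_logp) - (
        ref_chosen_logp - ref_rejected_logp
    )
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy's chosen/rejected margin equals the reference's, the loss sits at log 2; widening the margin on "chosen" drives it toward zero.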

### 2b) Comparative Analysis (20 pts)

* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (saved with outputs from base + both DPO models)

**Quantitative snapshot (from `evaluation_results.csv`):**

| Model         | avg_words | avg_chars | bullet_like_frac |
| ------------- | --------- | --------- | ---------------- |
| Base          | 26.1      | 153.0     | 0.10             |
| DPO-PairRM    | 27.3      | 153.0     | 0.30             |
| DPO-LLM-Judge | 26.6      | 153.0     | 0.10             |

**Qualitative observation (from the table):**
DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.
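Metrics of this kind can be recomputed from the per-model response columns of `evaluation_results.csv` in a few lines; the bullet-detection regex below is an illustrative heuristic and may differ from the notebook's exact rule:

```python
import re
import statistics
from typing import Dict, Sequence

# lines opening with "-", "*", "•", or "1." / "1)" count as list markers
_BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s")

def style_metrics(responses: Sequence[str]) -> Dict[str, float]:
    """Corpus-level style stats matching the snapshot columns above."""
    def bullet_like(text: str) -> bool:
        # a response is "bullet-like" if any of its lines opens with a marker
        return any(_BULLET.match(line) for line in text.splitlines())

    return {
        "avg_words": statistics.mean(len(r.split()) for r in responses),
        "avg_chars": statistics.mean(len(r) for r in responses),
        "bullet_like_frac": sum(bullet_like(r) for r in responses) / len(responses),
    }
```

Note that identical `avg_chars` across all three models in the snapshot suggests the CSV may store truncated outputs; worth verifying before leaning on that column.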

---

## Extra Credit — Iterative DPO (30 pts)

* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`
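A sketch of the iterate-collect-train control flow, including the early stop that iteration 2 hit when no new pairs were found. All names here are hypothetical; the real loop lives in the notebook:

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (chosen, rejected)

def iterative_dpo(
    model: str,
    prompts: List[str],
    rounds: int,
    collect_pairs: Callable[[str, List[str]], List[Pair]],
    train: Callable[[str, List[Pair]], str],
) -> str:
    """Run up to `rounds` collect-then-train iterations.

    Stops early when a round yields no new preference pairs, since
    training on an empty set would leave the model unchanged.
    """
    current = model
    for _ in range(1, rounds + 1):
        pairs = collect_pairs(current, prompts)
        if not pairs:
            break  # no new preference signal -> converged for this setup
        current = train(current, pairs)  # e.g. writes ./iterative_dpo_model_iter_N
    return current
```

With injected `collect_pairs` / `train` callables the loop is testable without GPUs, which is how the assertions below exercise it.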

---

## Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes. From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-Judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there. Limitations include small preference sets and automated-judge bias; these can over-reward length/format. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal, without visible degradation on these prompts.

---

## Published Assets

* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)

---

**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`

**Upload time (UTC):** 2025-08-12T14:57:34Z