# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4

In this assignment, we will generate a preference dataset with PairRM and fine-tune a model with DPO. This is a powerful training recipe behind some of the top models on the AlpacaEval leaderboard.

You may use Llama-3.2 1B or Llama-3.2 3B.
## Preference Dataset Collection and DPO Model Training

### Part 1: Dataset Generation and Judge Implementation (40 points)

Create two separate preference datasets using different collection methods:
#### a) LLM Judge-Based Collection (20 points)

- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between local inference on Colab/Lightning Studio or a third-party provider such as Fireworks AI, OpenAI, Together AI, or Groq (Kimi K2)
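One common tactic for consistent judgments is to query the judge twice with the response order swapped and keep only position-invariant verdicts. A minimal sketch (function names and prompt wording are illustrative, not the run's actual judge):

```python
# Hypothetical sketch of a pairwise LLM-judge with position swapping.
def build_judge_prompt(instruction: str, response_a: str, response_b: str) -> str:
    """Format a pairwise comparison prompt for an LLM judge."""
    return (
        "You are an impartial judge. Compare the two responses to the "
        "instruction below and answer with exactly 'A' or 'B'.\n\n"
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Better response:"
    )

def judge_consistently(instruction, resp1, resp2, ask_judge):
    """Query the judge twice with the responses swapped; keep the verdict
    only if it is position-invariant, otherwise return None (unreliable)."""
    first = ask_judge(build_judge_prompt(instruction, resp1, resp2))   # 'A' or 'B'
    second = ask_judge(build_judge_prompt(instruction, resp2, resp1))  # swapped order
    if first == "A" and second == "B":
        return resp1
    if first == "B" and second == "A":
        return resp2
    return None  # inconsistent under swap -> discard the pair
```

Here `ask_judge` stands in for whatever local or hosted model call is used; discarding swap-inconsistent verdicts trades dataset size for label reliability.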

#### b) PairRM-Based Collection (20 points)

- Extract 50 instructions from the LIMA dataset
- Generate 5 responses per instruction using the Llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload the dataset to HuggingFace
- Submit the repository link
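Once PairRM (e.g., via llm-blender) has scored the candidate responses, the ranking can be turned into chosen/rejected records. A sketch of one possible pairing scheme, best-vs-rest (the run's actual scheme, which yielded 250 pairs from 50 instructions, may differ):

```python
# Sketch: turn per-response scores (e.g., from PairRM ranking) into
# chosen/rejected preference pairs. Field names are illustrative.
def scores_to_preference_pairs(instruction, responses, scores):
    """Pair the top-scored response (chosen) against every other response
    (rejected), yielding len(responses) - 1 preference records."""
    ranked = sorted(zip(responses, scores), key=lambda rs: rs[1], reverse=True)
    best = ranked[0][0]
    return [
        {"prompt": instruction, "chosen": best, "rejected": resp}
        for resp, _ in ranked[1:]
    ]
```

The resulting list of dicts maps directly onto the `prompt`/`chosen`/`rejected` schema that DPO training expects.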

### Part 2: Model Training and Evaluation (60 points)

#### a) DPO Fine-tuning (40 points)

- Fine-tune Llama-3.2 using the PairRM preference dataset
- Fine-tune Llama-3.2 using the LLM-judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links
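For reference, the kind of configuration involved can be sketched with peft and trl (a configuration sketch only; the values below are illustrative assumptions, not the run's logged parameters, which live in the notebook output):

```python
# Configuration sketch only (illustrative values): these objects would be
# passed to trl's DPOTrainer along with the base model, tokenizer, and the
# preference dataset.
from peft import LoraConfig
from trl import DPOConfig

peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

dpo_config = DPOConfig(
    output_dir="llama3-dpo-pairrm",
    beta=0.1,                        # KL trade-off; see the Free Response note
    per_device_train_batch_size=2,
    num_train_epochs=1,
)
```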

#### b) Comparative Analysis (20 points)

- Select 10 novel instructions (not in the training data)
- Generate completions using:
  * Original Llama-3.2
  * DPO fine-tuned model (LLM-judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations
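A results table of this shape can be assembled with pandas; a minimal sketch with placeholder completions (in the actual run these come from generating with each of the three models on the 10 novel instructions):

```python
import pandas as pd

# Placeholder completions; column names mirror the three models compared.
rows = [
    {
        "instruction": "Explain gradient clipping in one paragraph.",
        "base": "Gradient clipping caps the norm of gradients...",
        "dpo_pairrm": "1. Compute the gradient norm...\n2. Rescale...",
        "dpo_llm_judge": "Gradient clipping keeps training stable by...",
    },
]
df = pd.DataFrame(rows)

# Simple quantitative column: word count per completion.
for col in ["base", "dpo_pairrm", "dpo_llm_judge"]:
    df[f"{col}_words"] = df[col].str.split().str.len()
```

Adding per-model metric columns alongside the raw text keeps the quantitative and qualitative evidence in one table.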

Address the following points:

1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement

Grading Criteria for Free Response:

- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style

### Extra Credit: Iterative DPO Implementation and Analysis (30 points)

#### a) Implementation (20 points)

- Implement the iterative DPO algorithm described in "Self-Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters
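The loop from the paper can be outlined as follows; every helper here (`generate`, `self_judge`, `dpo_train`) is a stub standing in for real generation, judging, and training code:

```python
# Schematic of the iterative DPO loop from "Self-Rewarding Language Models":
# each iteration, the current model generates candidates, judges them itself,
# and is DPO-trained on the resulting pairs. All helpers are caller-supplied.
def iterative_dpo(model, prompts, n_iters=2, n_candidates=4,
                  generate=None, self_judge=None, dpo_train=None):
    history = []
    for it in range(1, n_iters + 1):
        pairs = []
        for prompt in prompts:
            candidates = generate(model, prompt, n_candidates)
            scored = sorted(candidates,
                            key=lambda c: self_judge(model, prompt, c),
                            reverse=True)
            if len(scored) >= 2:
                pairs.append({"prompt": prompt,
                              "chosen": scored[0], "rejected": scored[-1]})
        model = dpo_train(model, pairs)   # returns the updated model
        history.append((it, len(pairs)))
    return model, history
```

The `history` list records how many new pairs each iteration produced, matching the per-iteration counts reported below.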

#### b) Comparative Analysis (10 points)

Free Response Question (~250 words): Compare and analyze the performance and behavioral differences among the base Llama-3.2 model, the DPO-PairRM model, and the DPO-LLM-judge model.

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

- Dataset (HF Datasets): https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3
- Judge design doc: llm_judge_design_documentation_20250811_212607.txt (included in artifacts)
- Compute: local GPU

### 1b) PairRM-Based Collection (20 pts)

- Dataset (HF Datasets): https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3
- Spec: 50 LIMA instructions; 5 responses per instruction; 250 preference pairs
### 2a) DPO Fine-tuning (40 pts)

- Base model: meta-llama/Llama-3.2-1B-Instruct
- Adapters (HF Models):
  * PairRM DPO: https://huggingface.co/pyamy/llama3-dpo-pairrm
  * LLM-Judge DPO: https://huggingface.co/pyamy/llama3-dpo-llm-judge
- Training parameters/process: logged in the notebook output (per-step losses; LoRA adapters saved)

### 2b) Comparative Analysis (20 pts)

- Novelty check: 10 evaluation prompts; overlap with training = 0/10
- Results table: evaluation_results.csv (outputs from the base model and both DPO models)
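The novelty check can be reproduced with a simple set lookup (exact match after normalization; function and argument names are illustrative):

```python
# Count how many evaluation prompts also appear among the training
# instructions, after trimming whitespace and lowercasing.
def count_overlap(eval_prompts, train_instructions):
    train_set = {t.strip().lower() for t in train_instructions}
    return sum(p.strip().lower() in train_set for p in eval_prompts)
```

A result of 0 over the 10 prompts confirms the "novel instructions" requirement; a fuzzier check (e.g., n-gram overlap) would be stricter but is not what the 0/10 figure claims.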

Quantitative snapshot (from evaluation_results.csv):

| Model | avg_words | avg_chars | bullet_like_frac |
|---|---|---|---|
| Base | 26.1 | 153.0 | 0.10 |
| DPO-PairRM | 27.3 | 153.0 | 0.30 |
| DPO-LLM-Judge | 26.6 | 153.0 | 0.10 |
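The snapshot columns can be recomputed from raw completions roughly as follows (the exact bullet heuristic used in the run is not documented, so the line-prefix logic here is an assumption):

```python
# Corpus-level style metrics over a list of completion strings.
def snapshot_metrics(completions):
    n = len(completions)
    avg_words = sum(len(c.split()) for c in completions) / n
    avg_chars = sum(len(c) for c in completions) / n
    # A completion counts as "bullet-like" if any line starts with a
    # list marker (assumed heuristic, not the run's documented rule).
    bullet_like = sum(
        any(line.lstrip().startswith(("-", "*", "1.", "2."))
            for line in c.splitlines())
        for c in completions
    ) / n
    return {"avg_words": round(avg_words, 1),
            "avg_chars": round(avg_chars, 1),
            "bullet_like_frac": round(bullet_like, 2)}
```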

Qualitative observation (from the table): DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.

### Extra Credit - Iterative DPO (30 pts)
- Iteration 1: +20 new preference pairs → model ./iterative_dpo_model_iter_1
- Iteration 2: +0 new pairs → model ./iterative_dpo_model_iter_2
- Analysis file: iterative_dpo_analysis.txt

## Free Response

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of "chosen" over "rejected" responses while constraining divergence from the reference model with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From evaluation_results.csv, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating a stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, "instructional" phrasing; DPO-LLM-judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there.

Limitations include small preference sets and automated-judge bias; both can over-reward length and formatting. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and instruction adherence in ways consistent with the active preference signal, without visible degradation on these prompts.
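The objective summarized above is the standard DPO loss (Rafailov et al.), with policy $\pi_\theta$, reference $\pi_{\mathrm{ref}}$, and preferred/dispreferred responses $y_w, y_l$:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[
      \log \sigma\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Larger β penalizes divergence from the reference more heavily, which is why logging it alongside results matters for reproducibility.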

## Published assets

- Assignment 4 artifacts: https://huggingface.co/pyamy/dpo-assignment-4-artifacts
- PairRM dataset: https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3
- LLM-Judge dataset: https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3
- DPO-PairRM adapters: https://huggingface.co/pyamy/llama3-dpo-pairrm
- DPO-LLM-Judge adapters: https://huggingface.co/pyamy/llama3-dpo-llm-judge
- Colab notebook: https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing

Uploaded from: `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
Upload time (UTC): 2025-08-12T14:57:34Z