# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4 (verbatim prompt)

```
Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval. You may use llama-3.2 1B or llama-3.2 3B.

Preference Dataset Collection and DPO Model Training

Part 1: Dataset Generation and Judge Implementation (40 points)
Create two separate preference datasets using different collection methods:

a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai/groq (kimi k2)

b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link

Part 2: Model Training and Evaluation (60 points)

a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links

b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
  * Original llama-3.2
  * DPO fine-tuned model (LLM judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations

Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement

Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style

Extra Credit: Iterative DPO Implementation and Analysis (30 points)

a) Implementation (20 points)
- Implement the iterative DPO algorithm as described in "Self Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters

b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```

---

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU

### 1b) PairRM-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses per instruction; 250 preference pairs

---

### 2a) DPO Fine-tuning (40 pts)

* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
  * PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
  * LLM-Judge DPO:
    [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** logged in the notebook output (per-step losses; LoRA adapters saved)

### 2b) Comparative Analysis (20 pts)

* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (outputs from the base model and both DPO models)

**Quantitative snapshot (from `evaluation_results.csv`):**

| Model         | avg_words | avg_chars | bullet_like_frac |
| ------------- | --------- | --------- | ---------------- |
| Base          | 26.1      | 153.0     | 0.10             |
| DPO-PairRM    | 27.3      | 153.0     | 0.30             |
| DPO-LLM-Judge | 26.6      | 153.0     | 0.10             |

**Qualitative observation (from the table):** DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.

---

## Extra Credit — Iterative DPO (30 pts)

* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`

---

## Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO increases the log-probability margin of "chosen" over "rejected" responses relative to a reference model, with an implicit KL constraint keeping the policy close to that reference (β controls the trade-off; β was not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes. From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating a stronger structural bias from PairRM preferences.
Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, "instructional" phrasing, while DPO-LLM-judge remains more conversational yet still adheres to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I make no claims there. Limitations include the small preference sets and automated-judge bias, both of which can over-reward length and formatting. Improvements: log β and the other hyperparameters alongside results; add an automatic win rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement the length/format metrics; and broaden preference diversity (e.g., more instructions or an ensemble of judges). Overall, DPO nudges structure and instruction adherence in ways consistent with the active preference signal, without visible degradation on these prompts.

---

## All Links

* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)

---

**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
**Upload time (UTC):** 2025-08-12T14:57:34Z

---
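**DPO loss (illustrative sketch):** per preference pair, the DPO objective summarized in the free response reduces to a logistic loss on the policy-vs-reference log-probability margin. The snippet below is a minimal scalar illustration, not the training code from the notebook; the default β is a placeholder, since β was not reported in this run.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    The margin is the policy's chosen-vs-rejected log-prob gap minus the
    reference model's gap; beta trades off preference fit against drift
    from the reference (beta here is a placeholder value).
    """
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is 0 and the loss is log 2; widening the chosen-vs-rejected gap beyond the reference's drives the loss toward 0.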
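**Style metrics (illustrative sketch):** the `avg_words`, `avg_chars`, and `bullet_like_frac` columns of the quantitative snapshot can be computed along these lines. The bullet-marker regex is an assumption; the notebook's actual heuristic may differ.

```python
import re

# Lines starting with "-", "*", "•", or "1." / "1)" count as list-style.
# This marker set is an assumption; the notebook's heuristic may differ.
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def is_bullet_like(text: str) -> bool:
    """True if any line of a response starts with a list marker."""
    return any(BULLET_RE.match(line) for line in text.splitlines())

def style_stats(responses: list[str]) -> dict[str, float]:
    """Corpus-level style metrics for one model's responses."""
    n = len(responses)
    return {
        "avg_words": sum(len(r.split()) for r in responses) / n,
        "avg_chars": sum(len(r) for r in responses) / n,
        "bullet_like_frac": sum(is_bullet_like(r) for r in responses) / n,
    }
```

Applied per model to the completions stored in `evaluation_results.csv`, this yields one row of the snapshot table.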