# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4 (verbatim prompt)
```
Assignment 4
In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.
You may use llama-3.2 1B or llama-3.2 3B.
Preference Dataset Collection and DPO Model Training
Part 1: Dataset Generation and Judge Implementation (40 points)
Create two separate preference datasets using different collection methods:
a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai/groq (kimi k2)
b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link
Part 2: Model Training and Evaluation (60 points)
a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links
b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
  * Original llama-3.2
  * DPO fine-tuned model (LLM judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations
Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement
Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style
Extra Credit: Iterative DPO Implementation and Analysis (30 points)
a) Implementation (20 points)
- Implement the iterative DPO algorithm as described in "Self Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters
b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```
---

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU
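The consistency safeguard documented in the judge design file can be sketched as a position-swap check: query the judge with the responses in both orders and keep only verdicts that survive the swap. The function names and the `judge_fn` stub below are hypothetical illustrations, not the actual implementation:

```python
# Hypothetical sketch of a position-swap consistency check for an LLM judge.
# judge_fn stands in for a real model call; it returns "A" or "B".

def swap_consistent_preference(judge_fn, instruction, resp_1, resp_2):
    """Query the judge twice with the responses in both orders; keep the
    verdict only if it is stable under the swap, otherwise return None."""
    first = judge_fn(instruction, resp_1, resp_2)   # resp_1 shown as "A"
    second = judge_fn(instruction, resp_2, resp_1)  # resp_1 now shown as "B"
    if first == "A" and second == "B":
        return resp_1  # resp_1 preferred under both orderings
    if first == "B" and second == "A":
        return resp_2
    return None  # position-biased or inconsistent verdict: discard the pair

# Toy judge that always prefers the longer response (order-independent).
toy_judge = lambda instr, a, b: "A" if len(a) >= len(b) else "B"
chosen = swap_consistent_preference(toy_judge, "Explain DPO.", "short", "a much longer answer")
```

Discarding swap-inconsistent verdicts trades dataset size for label reliability, which matters when the judge model is small enough to show position bias.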
### 1b) PairRM-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses per instruction; 250 preference pairs
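PairRM produces a ranking over the sampled candidates; a minimal sketch of turning per-instruction rankings into `(chosen, rejected)` pairs follows. The rank values are illustrative, and the exact pairing scheme that yielded 250 pairs in the run is not specified here; this version emits all pairwise comparisons:

```python
# Sketch: convert per-instruction candidate rankings into DPO preference pairs.
# `ranks` are illustrative; in the real pipeline they would come from PairRM
# over the 5 responses sampled per instruction.
from itertools import combinations

def ranks_to_pairs(instruction, responses, ranks):
    """ranks[i] is the rank of responses[i] (1 = best). Emit one
    (chosen, rejected) record for every candidate pair with a clear winner."""
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        if ranks[i] == ranks[j]:
            continue  # ties carry no preference signal
        chosen, rejected = (i, j) if ranks[i] < ranks[j] else (j, i)
        pairs.append({"prompt": instruction,
                      "chosen": responses[chosen],
                      "rejected": responses[rejected]})
    return pairs

demo = ranks_to_pairs("Summarize LIMA.", ["r1", "r2", "r3", "r4", "r5"],
                      [2, 1, 5, 3, 4])  # 5 candidates -> C(5,2) = 10 pairs
```

All-pairs extraction gives C(5,2) = 10 pairs per instruction; the uploaded dataset's 250 pairs over 50 instructions (5 per instruction) imply a sparser selection from this set.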
---

### 2a) DPO Fine-tuning (40 pts)

* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
  * PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
  * LLM-Judge DPO: [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** logged in the notebook output (per-step losses; LoRA adapters saved)
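For reference, a configuration sketch of the TRL + PEFT setup this kind of run typically uses. The hyperparameter values below are assumptions, not the run's logged settings, and argument names vary across `trl` versions:

```python
# Configuration sketch (NOT the run's actual hyperparameters) for DPO
# fine-tuning with TRL + PEFT; exact argument names depend on the trl version.
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="llama3-dpo-pairrm",
    beta=0.1,                        # KL trade-off; the run's value is not logged here
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    num_train_epochs=1,
    logging_steps=10,                # produces the per-step losses noted above
)

# trainer = DPOTrainer(model=model, args=args, train_dataset=pref_dataset,
#                      processing_class=tokenizer, peft_config=lora)
# trainer.train()                    # the LoRA adapters are what get pushed to the Hub
```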
### 2b) Comparative Analysis (20 pts)

* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (outputs from the base model and both DPO models)
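The 0/10 novelty check can be sketched as a set-membership count after light normalization (the helper names and toy data are hypothetical):

```python
# Sketch of the novelty check: count evaluation prompts that also appear
# among the training instructions after whitespace/case normalization.

def normalize(text):
    return " ".join(text.lower().split())

def overlap_count(eval_prompts, train_instructions):
    train = {normalize(t) for t in train_instructions}
    return sum(normalize(p) in train for p in eval_prompts)

# Toy data (hypothetical): one overlapping prompt out of three.
n = overlap_count(["Explain DPO.", "What is LoRA?", "Write a haiku."],
                  ["explain   dpo.", "summarize LIMA."])
```

Exact-match counting is deliberately conservative; near-duplicates (paraphrases) would need an embedding-similarity check instead.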
**Quantitative snapshot (from `evaluation_results.csv`):**

| Model         | avg_words | avg_chars | bullet_like_frac |
| ------------- | --------- | --------- | ---------------- |
| Base          | 26.1      | 153.0     | 0.10             |
| DPO-PairRM    | 27.3      | 153.0     | 0.30             |
| DPO-LLM-Judge | 26.6      | 153.0     | 0.10             |
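A sketch of how columns like these could be computed from the CSV with pandas; the column name `response` and the bullet-detection regex are assumptions, not the notebook's actual code:

```python
# Sketch: derive avg_words / avg_chars / bullet_like_frac from a response
# column (column name and regex are assumptions, not the run's exact code).
import pandas as pd

def style_metrics(df, text_col="response"):
    words = df[text_col].str.split().str.len()
    bullet_like = df[text_col].str.strip().str.match(r"^(\d+\.|[-*•])")
    return {"avg_words": round(words.mean(), 1),
            "avg_chars": round(df[text_col].str.len().mean(), 1),
            "bullet_like_frac": round(bullet_like.mean(), 2)}

demo = pd.DataFrame({"response": ["1. step one", "plain sentence answer",
                                  "- a bullet", "another plain answer"]})
m = style_metrics(demo)
```

`bullet_like_frac` here only counts responses that *open* with a list marker, which is one reasonable operationalization of "bullet-like" output.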
**Qualitative observation (from the table):**

DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while still following instructions.
---

## Extra Credit — Iterative DPO (30 pts)

* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`
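The loop structure behind these iterations (in the Self-Rewarding Language Models style) can be sketched as follows; `generate`, `self_judge`, and `dpo_train` are hypothetical stand-ins for the real sampling, judging, and training steps:

```python
# Structural sketch of the iterative DPO loop: sample -> self-judge -> retrain.
# generate, self_judge, and dpo_train are stand-ins, not the real steps.

def iterative_dpo(model, prompts, generate, self_judge, dpo_train, iterations=2):
    history = []
    for it in range(1, iterations + 1):
        new_pairs = []
        for prompt in prompts:
            candidates = generate(model, prompt)          # sample N responses
            pair = self_judge(model, prompt, candidates)  # (chosen, rejected) or None
            if pair is not None:
                new_pairs.append(pair)
        if new_pairs:                                     # iteration 2 above added 0 pairs
            model = dpo_train(model, new_pairs)
        history.append((it, len(new_pairs)))
    return model, history

# Toy stubs: the judge only yields a pair the first time it sees a prompt,
# mimicking the "+20 then +0" pattern observed in the run.
seen = set()
gen = lambda m, p: [p + " v1", p + " v2"]
judge = lambda m, p, c: None if p in seen else (seen.add(p) or (c[0], c[1]))
train = lambda m, pairs: m + 1
final, hist = iterative_dpo(0, ["a", "b"], gen, judge, train)
```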
---

## Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of "chosen" over "rejected" responses while constraining divergence from the reference model with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to the base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating a stronger structural bias from PairRM preferences. Qualitatively, DPO-PairRM tends toward stepwise, "instructional" phrasing, while DPO-LLM-Judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I make no claims there.

Limitations include the small preference sets and automated-judge bias, both of which can over-reward length and formatting. Improvements: log β and the other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement the length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and instruction adherence in ways consistent with the active preference signal, without visible degradation on these prompts.
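The β/KL trade-off mentioned above can be made concrete with a minimal numeric sketch of the per-pair DPO loss; β = 0.1 here is illustrative, not the run's (unreported) value:

```python
# Numeric sketch of the per-pair DPO loss:
# loss = -log sigmoid(beta * ((log pi(y_c|x) - log pi_ref(y_c|x))
#                           - (log pi(y_r|x) - log pi_ref(y_r|x))))
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When policy and reference agree, the margin is 0 and loss = -log(0.5).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen log-prob relative to the reference lowers the loss.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

A larger β sharpens the sigmoid, rewarding smaller margins but pulling the policy further from the reference per update, which is why logging β alongside results matters for reproducibility.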
---

## All Links

* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)

---

**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
**Upload time (UTC):** 2025-08-12T14:57:34Z