# DPO Assignment 4 — Full Artifacts
This repository contains all local artifacts from my run: the on-disk preference datasets, the DPO LoRA adapters, the CSV/TXT outputs, and the notebook.
## Assignment 4 (verbatim prompt)
```
Assignment 4
In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.
You may use llama-3.2 1B or llama-3.2 3B.
Preference Dataset Collection and DPO Model Training
Part 1: Dataset Generation and Judge Implementation (40 points)
Create two separate preference datasets using different collection methods:
a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai/groq (kimi k2)
b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link
Part 2: Model Training and Evaluation (60 points)
a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links
b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
* Original llama-3.2
* DPO fine-tuned model (LLM judge dataset)
* DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations
Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement
Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style
Extra Credit: Iterative DPO Implementation and Analysis (30 points)
a) Implementation (20 points)
- Implement the iterative DPO algorithm as described in "Self Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
* Implementation details
* Training parameters
b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```
---
## Submission Links by Requirement
### 1a) LLM Judge-Based Collection (20 pts)
* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU inference (judging-call sketch below)
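
Below is a minimal sketch of the pairwise judging call behind this kind of collection, run with local inference. The judge model name and prompt text here are illustrative assumptions, and `judge_pair` is a hypothetical helper; the prompt actually used is documented in `llm_judge_design_documentation_20250811_212607.txt`.

```python
# Minimal sketch of a pairwise LLM judge run with local inference. The judge model
# and prompt below are assumptions, not necessarily those used in this run; the
# real prompt is in llm_judge_design_documentation_20250811_212607.txt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # assumption: any capable instruct model
tok = AutoTokenizer.from_pretrained(JUDGE_MODEL)
judge = AutoModelForCausalLM.from_pretrained(
    JUDGE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

SYSTEM = (
    "You are an impartial judge. Given an instruction and two responses (A and B), "
    "answer with the single letter A or B for the response that better follows the "
    "instruction. Judge helpfulness, correctness, and clarity; ignore length and order."
)

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' for the preferred response (hypothetical helper)."""
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Instruction:\n{instruction}\n\n"
                                    f"Response A:\n{response_a}\n\n"
                                    f"Response B:\n{response_b}\n\nBetter response:"},
    ]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(judge.device)
    out = judge.generate(inputs, max_new_tokens=3, do_sample=False)  # greedy for consistency
    reply = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
    return reply.strip()[:1].upper()
```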
### 1b) PairRM-Based Collection (20 pts)
* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses per instruction; 250 preference pairs (PairRM ranking sketch below)
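
A minimal sketch of the PairRM ranking step, assuming the `llm-blender` package; the best-vs-worst pairing shown is illustrative and not necessarily the exact pairing strategy used to reach 250 pairs.

```python
# Minimal sketch of PairRM preference-pair construction, assuming the llm-blender
# package (pip install llm-blender). Best-vs-worst per instruction is illustrative;
# other pairing strategies yield more pairs per prompt.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # loads the PairRM pairwise ranking model

def build_preference_pairs(instructions, candidates):
    """instructions: list[str]; candidates: list[list[str]] (e.g. 5 per instruction)."""
    ranks = blender.rank(instructions, candidates, return_scores=False, batch_size=8)
    pairs = []
    for instr, cands, r in zip(instructions, candidates, ranks):
        r = list(r)
        best = cands[r.index(min(r))]    # rank 1 = most preferred by PairRM
        worst = cands[r.index(max(r))]   # largest rank = least preferred
        pairs.append({"prompt": instr, "chosen": best, "rejected": worst})
    return pairs
```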
---
### 2a) DPO Fine-tuning (40 pts)
* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
* PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* LLM-Judge DPO: [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** Logged in the notebook output (per-step losses; LoRA adapters saved); a minimal TRL/PEFT setup sketch follows below
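
The sketch below shows the shape of such a TRL + PEFT run. The hyperparameter values are placeholders rather than the ones actually used (see the notebook log), and the dataset is assumed to expose `prompt`/`chosen`/`rejected` columns.

```python
# Minimal sketch of DPO fine-tuning with TRL + PEFT. Hyperparameters are placeholders,
# not the values from this run, and the dataset is assumed to have
# "prompt"/"chosen"/"rejected" columns.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

train_ds = load_dataset("pyamy/dpo-pairrm-preferences-llama3", split="train")

peft_config = LoraConfig(  # LoRA adapter; only these low-rank weights are trained
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = DPOConfig(
    output_dir="llama3-dpo-pairrm",
    beta=0.1,                        # KL trade-off; placeholder value
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,      # older TRL versions take tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("llama3-dpo-pairrm")  # saves the PEFT adapter
```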
### 2b) Comparative Analysis (20 pts)
* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (saved with outputs from base + both DPO models)
**Quantitative snapshot (from `evaluation_results.csv`):**
| Model | avg\_words | avg\_chars | bullet\_like\_frac |
| ------------- | ---------- | ---------- | ------------------ |
| Base | 26.1 | 153.0 | 0.10 |
| DPO-PairRM | 27.3 | 153.0 | 0.30 |
| DPO-LLM-Judge | 26.6 | 153.0 | 0.10 |
**Qualitative observation (from table):**
DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.
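
The quantitative snapshot above can be recomputed from `evaluation_results.csv` with something like the sketch below; the column names in `model_cols` are assumptions about the CSV layout, not guaranteed to match the actual file.

```python
# Minimal sketch for recomputing the snapshot metrics from evaluation_results.csv.
# The column names in model_cols are assumptions about the CSV layout.
import re
import pandas as pd

df = pd.read_csv("evaluation_results.csv")
model_cols = {
    "Base": "base_output",                  # hypothetical column names
    "DPO-PairRM": "dpo_pairrm_output",
    "DPO-LLM-Judge": "dpo_llm_judge_output",
}

def is_bullet_like(text: str) -> bool:
    """Heuristic: any line starting with -, *, •, or '1.' counts as list-style."""
    return any(re.match(r"\s*([-*•]|\d+\.)\s", line) for line in str(text).splitlines())

rows = []
for name, col in model_cols.items():
    texts = df[col].astype(str)
    rows.append({
        "Model": name,
        "avg_words": texts.str.split().str.len().mean(),
        "avg_chars": texts.str.len().mean(),
        "bullet_like_frac": texts.map(is_bullet_like).mean(),
    })
print(pd.DataFrame(rows).round(2))
```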
---
## Extra Credit — Iterative DPO (30 pts)
* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`
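
For context, here is a minimal sketch of one round of the iterative loop, in the spirit of "Self-Rewarding Language Models"; `generate_responses`, `judge_pair`, and `run_dpo` are hypothetical helpers (sampling, pairwise judging, and a DPOTrainer wrapper), not functions from this repo.

```python
# Minimal sketch of one iterative-DPO round. generate_responses, judge_pair, and
# run_dpo are hypothetical helpers, not functions from this repo.
def iterative_dpo_round(model, prompts, generate_responses, judge_pair, run_dpo, n_candidates=4):
    new_pairs = []
    for prompt in prompts:
        # 1) Sample several candidates from the current policy.
        candidates = generate_responses(model, prompt, n=n_candidates)
        # 2) Score each candidate by how many pairwise judgments it wins.
        wins = [sum(judge_pair(prompt, c, other) == "A"
                    for other in candidates if other is not c)
                for c in candidates]
        ranked = [c for _, c in sorted(zip(wins, candidates), key=lambda t: t[0])]
        # 3) Keep the judge's best and worst as a new preference pair.
        new_pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    # 4) Run another DPO pass on the fresh pairs and return the updated policy.
    return run_dpo(model, new_pairs)
```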
---
## Free Response (\~250 words)
This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-judge remains more conversational while still adhering to the prompts.

Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there. Limitations include small preference sets and automated-judge bias; these can over-reward length/format. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal without visible degradation on these prompts.
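
For reference, the DPO objective the write-up refers to (Rafailov et al., 2023), with $y_w$/$y_l$ the chosen/rejected responses, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ scaling the implicit KL penalty:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$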
---
## All Links
* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1\_vgdQph7H0kO\_Vx\_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)
---
**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
**Upload time (UTC):** 2025-08-12T14:57:34Z
---