# DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

## Assignment 4 (verbatim prompt)

```
Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval. You may use llama-3.2 1B or llama-3.2 3B.

Preference Dataset Collection and DPO Model Training

Part 1: Dataset Generation and Judge Implementation (40 points)
Create two separate preference datasets using different collection methods:

a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai/groq (kimi k2)

b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link

Part 2: Model Training and Evaluation (60 points)

a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links

b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
  * Original llama-3.2
  * DPO fine-tuned model (LLM judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations

Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement

Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style

Extra Credit: Iterative DPO Implementation and Analysis (30 points)

a) Implementation (20 points)
- Implement the iterative DPO algorithm as described in "Self Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters

b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```

---

## Submission Links by Requirement

### 1a) LLM Judge-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU

### 1b) PairRM-Based Collection (20 pts)

* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses per instruction; 250 preference pairs

---

### 2a) DPO Fine-tuning (40 pts)

* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
  * PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
  * LLM-Judge DPO:
    [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** logged in the notebook output (per-step losses; LoRA adapters saved)

### 2b) Comparative Analysis (20 pts)

* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (outputs from the base model and both DPO models)

**Quantitative snapshot (from `evaluation_results.csv`):**

| Model         | avg_words | avg_chars | bullet_like_frac |
| ------------- | --------- | --------- | ---------------- |
| Base          | 26.1      | 153.0     | 0.10             |
| DPO-PairRM    | 27.3      | 153.0     | 0.30             |
| DPO-LLM-Judge | 26.6      | 153.0     | 0.10             |

**Qualitative observation (from the table):** DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.

---

## Extra Credit — Iterative DPO (30 pts)

* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`

---

## Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO increases the log-probability margin of "chosen" over "rejected" responses relative to a reference model, with an implicit KL constraint keeping the policy close to that reference (β controls the trade-off; β was not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes. From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating a stronger structural bias from PairRM preferences.
Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, "instructional" phrasing, while DPO-LLM-judge remains more conversational yet still adheres to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I make no claims there. Limitations include the small preference sets and automated-judge bias, both of which can over-reward length and formatting. Improvements: log β and the other hyperparameters alongside results; add an automatic win rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement the length/format metrics; and broaden preference diversity (e.g., more instructions or an ensemble of judges). Overall, DPO nudges structure and instruction adherence in ways consistent with the active preference signal, without visible degradation on these prompts.

---

## All Links

* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)

---

**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
**Upload time (UTC):** 2025-08-12T14:57:34Z

---
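**DPO loss (illustrative sketch):** per preference pair, the DPO objective summarized in the free response reduces to a logistic loss on the policy-vs-reference log-probability margin. The snippet below is a minimal scalar illustration, not the training code from the notebook; the default β is a placeholder, since β was not reported in this run.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * margin).

    The margin is the policy's chosen-vs-rejected log-prob gap minus the
    reference model's gap; beta trades off preference fit against drift
    from the reference (beta here is a placeholder value).
    """
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference the margin is 0 and the loss is log 2; widening the chosen-vs-rejected gap beyond the reference's drives the loss toward 0.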
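**Style metrics (illustrative sketch):** the `avg_words`, `avg_chars`, and `bullet_like_frac` columns of the quantitative snapshot can be computed along these lines. The bullet-marker regex is an assumption; the notebook's actual heuristic may differ.

```python
import re

# Lines starting with "-", "*", "•", or "1." / "1)" count as list-style.
# This marker set is an assumption; the notebook's heuristic may differ.
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def is_bullet_like(text: str) -> bool:
    """True if any line of a response starts with a list marker."""
    return any(BULLET_RE.match(line) for line in text.splitlines())

def style_stats(responses: list[str]) -> dict[str, float]:
    """Corpus-level style metrics for one model's responses."""
    n = len(responses)
    return {
        "avg_words": sum(len(r.split()) for r in responses) / n,
        "avg_chars": sum(len(r) for r in responses) / n,
        "bullet_like_frac": sum(is_bullet_like(r) for r in responses) / n,
    }
```

Applied per model to the completions stored in `evaluation_results.csv`, this yields one row of the snapshot table.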