# DPO Assignment 4 — Full Artifacts
This repository contains all local artifacts from my run: the on-disk preference datasets, the DPO LoRA adapters, the CSV/TXT outputs, and the notebook.
## Assignment 4 (verbatim prompt)
```
Assignment 4
In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.
You may use llama-3.2 1B or llama-3.2 3B.
Preference Dataset Collection and DPO Model Training
Part 1: Dataset Generation and Judge Implementation (40 points)
Create two separate preference datasets using different collection methods:
a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai/groq (kimi k2)
b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link
Part 2: Model Training and Evaluation (60 points)
a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links
b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
* Original llama-3.2
* DPO fine-tuned model (LLM judge dataset)
* DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations
Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement
Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style
Extra Credit: Iterative DPO Implementation and Analysis (30 points)
a) Implementation (20 points)
- Implement the iterative DPO algorithm as described in "Self Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
* Implementation details
* Training parameters
b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model
```
---
## Submission Links by Requirement
### 1a) LLM Judge-Based Collection (20 pts)
* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **Judge design doc filename:** `llm_judge_design_documentation_20250811_212607.txt` (included in artifacts)
* **Compute:** Local GPU inference (judging-call sketch below)
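
Below is a minimal sketch of the pairwise judging call behind this kind of collection, run with local inference. The judge model name and prompt text here are illustrative assumptions, and `judge_pair` is a hypothetical helper; the prompt actually used is documented in `llm_judge_design_documentation_20250811_212607.txt`.

```python
# Minimal sketch of a pairwise LLM judge run with local inference. The judge model
# and prompt below are assumptions, not necessarily those used in this run; the
# real prompt is in llm_judge_design_documentation_20250811_212607.txt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # assumption: any capable instruct model
tok = AutoTokenizer.from_pretrained(JUDGE_MODEL)
judge = AutoModelForCausalLM.from_pretrained(
    JUDGE_MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

SYSTEM = (
    "You are an impartial judge. Given an instruction and two responses (A and B), "
    "answer with the single letter A or B for the response that better follows the "
    "instruction. Judge helpfulness, correctness, and clarity; ignore length and order."
)

def judge_pair(instruction: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B' for the preferred response (hypothetical helper)."""
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Instruction:\n{instruction}\n\n"
                                    f"Response A:\n{response_a}\n\n"
                                    f"Response B:\n{response_b}\n\nBetter response:"},
    ]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(judge.device)
    out = judge.generate(inputs, max_new_tokens=3, do_sample=False)  # greedy for consistency
    reply = tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)
    return reply.strip()[:1].upper()
```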
### 1b) PairRM-Based Collection (20 pts)
* **Dataset (HF Datasets):** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **Spec:** 50 LIMA instructions; 5 responses per instruction; 250 preference pairs (PairRM ranking sketch below)
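
A minimal sketch of the PairRM ranking step, assuming the `llm-blender` package; the best-vs-worst pairing shown is illustrative and not necessarily the exact pairing strategy used to reach 250 pairs.

```python
# Minimal sketch of PairRM preference-pair construction, assuming the llm-blender
# package (pip install llm-blender). Best-vs-worst per instruction is illustrative;
# other pairing strategies yield more pairs per prompt.
import llm_blender

blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")  # loads the PairRM pairwise ranking model

def build_preference_pairs(instructions, candidates):
    """instructions: list[str]; candidates: list[list[str]] (e.g. 5 per instruction)."""
    ranks = blender.rank(instructions, candidates, return_scores=False, batch_size=8)
    pairs = []
    for instr, cands, r in zip(instructions, candidates, ranks):
        r = list(r)
        best = cands[r.index(min(r))]    # rank 1 = most preferred by PairRM
        worst = cands[r.index(max(r))]   # largest rank = least preferred
        pairs.append({"prompt": instr, "chosen": best, "rejected": worst})
    return pairs
```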
---
### 2a) DPO Fine-tuning (40 pts)
* **Base model:** `meta-llama/Llama-3.2-1B-Instruct`
* **Adapters (HF Models):**
* PairRM DPO: [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* LLM-Judge DPO: [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Training parameters/process:** Logged in the notebook output (per-step losses; LoRA adapters saved); a minimal TRL/PEFT setup sketch follows below
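
The sketch below shows the shape of such a TRL + PEFT run. The hyperparameter values are placeholders rather than the ones actually used (see the notebook log), and the dataset is assumed to expose `prompt`/`chosen`/`rejected` columns.

```python
# Minimal sketch of DPO fine-tuning with TRL + PEFT. Hyperparameters are placeholders,
# not the values from this run, and the dataset is assumed to have
# "prompt"/"chosen"/"rejected" columns.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "meta-llama/Llama-3.2-1B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

train_ds = load_dataset("pyamy/dpo-pairrm-preferences-llama3", split="train")

peft_config = LoraConfig(  # LoRA adapter; only these low-rank weights are trained
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = DPOConfig(
    output_dir="llama3-dpo-pairrm",
    beta=0.1,                        # KL trade-off; placeholder value
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,      # older TRL versions take tokenizer= instead
    peft_config=peft_config,
)
trainer.train()
trainer.save_model("llama3-dpo-pairrm")  # saves the PEFT adapter
```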
### 2b) Comparative Analysis (20 pts)
* **Novelty check:** 10 evaluation prompts; **overlap with training = 0/10**
* **Results table:** `evaluation_results.csv` (saved with outputs from base + both DPO models)
**Quantitative snapshot (from `evaluation_results.csv`):**
| Model | avg\_words | avg\_chars | bullet\_like\_frac |
| ------------- | ---------- | ---------- | ------------------ |
| Base | 26.1 | 153.0 | 0.10 |
| DPO-PairRM | 27.3 | 153.0 | 0.30 |
| DPO-LLM-Judge | 26.6 | 153.0 | 0.10 |
**Qualitative observation (from table):**
DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.
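
The quantitative snapshot above can be recomputed from `evaluation_results.csv` with something like the sketch below; the column names in `model_cols` are assumptions about the CSV layout, not guaranteed to match the actual file.

```python
# Minimal sketch for recomputing the snapshot metrics from evaluation_results.csv.
# The column names in model_cols are assumptions about the CSV layout.
import re
import pandas as pd

df = pd.read_csv("evaluation_results.csv")
model_cols = {
    "Base": "base_output",                  # hypothetical column names
    "DPO-PairRM": "dpo_pairrm_output",
    "DPO-LLM-Judge": "dpo_llm_judge_output",
}

def is_bullet_like(text: str) -> bool:
    """Heuristic: any line starting with -, *, •, or '1.' counts as list-style."""
    return any(re.match(r"\s*([-*•]|\d+\.)\s", line) for line in str(text).splitlines())

rows = []
for name, col in model_cols.items():
    texts = df[col].astype(str)
    rows.append({
        "Model": name,
        "avg_words": texts.str.split().str.len().mean(),
        "avg_chars": texts.str.len().mean(),
        "bullet_like_frac": texts.map(is_bullet_like).mean(),
    })
print(pd.DataFrame(rows).round(2))
```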
---
## Extra Credit — Iterative DPO (30 pts)
* **Iteration 1:** +20 new preference pairs → model `./iterative_dpo_model_iter_1`
* **Iteration 2:** +0 new pairs → model `./iterative_dpo_model_iter_2`
* **Analysis file:** `iterative_dpo_analysis.txt`
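
For context, here is a minimal sketch of one round of the iterative loop, in the spirit of "Self-Rewarding Language Models"; `generate_responses`, `judge_pair`, and `run_dpo` are hypothetical helpers (sampling, pairwise judging, and a DPOTrainer wrapper), not functions from this repo.

```python
# Minimal sketch of one iterative-DPO round. generate_responses, judge_pair, and
# run_dpo are hypothetical helpers, not functions from this repo.
def iterative_dpo_round(model, prompts, generate_responses, judge_pair, run_dpo, n_candidates=4):
    new_pairs = []
    for prompt in prompts:
        # 1) Sample several candidates from the current policy.
        candidates = generate_responses(model, prompt, n=n_candidates)
        # 2) Score each candidate by how many pairwise judgments it wins.
        wins = [sum(judge_pair(prompt, c, other) == "A"
                    for other in candidates if other is not c)
                for c in candidates]
        ranked = [c for _, c in sorted(zip(wins, candidates), key=lambda t: t[0])]
        # 3) Keep the judge's best and worst as a new preference pair.
        new_pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    # 4) Run another DPO pass on the fresh pairs and return the updated policy.
    return run_dpo(model, new_pairs)
```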
---
## Free Response (\~250 words)
This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From `evaluation_results.csv`, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-judge remains more conversational while still adhering to the prompts.

Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there. Limitations include small preference sets and automated-judge bias; these can over-reward length/format. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM-judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal without visible degradation on these prompts.
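
For reference, the DPO objective the write-up refers to (Rafailov et al., 2023), with $y_w$/$y_l$ the chosen/rejected responses, $\pi_{\mathrm{ref}}$ the frozen reference model, and $\beta$ scaling the implicit KL penalty:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
$$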
---
## All Links
* **Assignment 4 artifacts:** [https://huggingface.co/pyamy/dpo-assignment-4-artifacts](https://huggingface.co/pyamy/dpo-assignment-4-artifacts)
* **PairRM dataset:** [https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-pairrm-preferences-llama3)
* **LLM-Judge dataset:** [https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3](https://huggingface.co/datasets/pyamy/dpo-llm-judge-preferences-llama3)
* **DPO-PairRM adapters:** [https://huggingface.co/pyamy/llama3-dpo-pairrm](https://huggingface.co/pyamy/llama3-dpo-pairrm)
* **DPO-LLM-Judge adapters:** [https://huggingface.co/pyamy/llama3-dpo-llm-judge](https://huggingface.co/pyamy/llama3-dpo-llm-judge)
* **Colab notebook:** [https://colab.research.google.com/drive/1\_vgdQph7H0kO\_Vx\_DF4q9sPwdN8xtYvS?usp=sharing](https://colab.research.google.com/drive/1_vgdQph7H0kO_Vx_DF4q9sPwdN8xtYvS?usp=sharing)
---
**Uploaded from:** `f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final`
**Upload time (UTC):** 2025-08-12T14:57:34Z
---