DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

Assignment 4 (verbatim prompt)

Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.
You may use llama-3.2 1B or llama-3.2 3B.

Preference Dataset Collection and DPO Model Training

Part 1: Dataset Generation and Judge Implementation (40 points)
Create two separate preference datasets using different collection methods:

a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai/groq (kimi k2)

b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link

Part 2: Model Training and Evaluation (60 points)

a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links

b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
  * Original llama-3.2
  * DPO fine-tuned model (LLM judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations

Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement

Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style


Extra Credit: Iterative DPO Implementation and Analysis (30 points)

a) Implementation (20 points)
- Implement the iterative DPO algorithm as described in "Self Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters

b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model

Submission Links by Requirement

1a) LLM Judge-Based Collection (20 pts)

1b) PairRM-Based Collection (20 pts)


2a) DPO Fine-tuning (40 pts)

2b) Comparative Analysis (20 pts)

  • Novelty check: 10 evaluation prompts; overlap with training = 0/10
  • Results table: evaluation_results.csv (saved with outputs from base + both DPO models)
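The 0/10 overlap figure above can be reproduced with a simple normalized-string comparison between the evaluation prompts and the training instructions. This is an illustrative sketch; the function names and example data are hypothetical, not taken from the notebook:

```python
# Sketch: verify evaluation prompts don't appear in the training data.
# Names and example data are illustrative assumptions, not the repo's actual code.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't mask overlap."""
    return " ".join(text.lower().split())

def overlap_count(eval_prompts, train_instructions):
    """Count evaluation prompts that appear verbatim (after normalization) in training."""
    train_set = {normalize(t) for t in train_instructions}
    return sum(1 for p in eval_prompts if normalize(p) in train_set)

eval_prompts = ["Explain DPO in one paragraph.", "Write a haiku about GPUs."]
train_instructions = ["Summarize the LIMA paper.", "explain  dpo in one paragraph."]

print(f"overlap = {overlap_count(eval_prompts, train_instructions)}/{len(eval_prompts)}")
# prints: overlap = 1/2
```

For a real novelty check one might also test for near-duplicates (e.g. token overlap) rather than exact matches, since paraphrased training prompts would slip past this exact-match version.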

Quantitative snapshot (from evaluation_results.csv):

| Model | avg_words | avg_chars | bullet_like_frac |
|---|---|---|---|
| Base | 26.1 | 153.0 | 0.10 |
| DPO-PairRM | 27.3 | 153.0 | 0.30 |
| DPO-LLM-Judge | 26.6 | 153.0 | 0.10 |
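The three columns can be derived from the raw completions with pandas. A minimal sketch, assuming `evaluation_results.csv` holds one row per completion with `model` and `response` columns (the column names and example rows are assumptions, not the file's actual schema):

```python
# Sketch of how the snapshot metrics could be computed with pandas.
# Column names ("model", "response") and the toy rows are assumptions.
import pandas as pd

def bullet_like(text: str) -> bool:
    """Flag responses where any line starts with a list marker or a number."""
    return any(
        line.lstrip().startswith(("-", "*", "•")) or line.lstrip()[:2].rstrip(".").isdigit()
        for line in text.splitlines()
    )

# Stand-in for pd.read_csv("evaluation_results.csv")
df = pd.DataFrame({
    "model": ["Base", "Base", "DPO-PairRM"],
    "response": ["Short answer.", "Another reply here.", "- step one\n- step two"],
})

summary = df.assign(
    avg_words=df["response"].str.split().str.len(),
    avg_chars=df["response"].str.len(),
    bullet_like_frac=df["response"].map(bullet_like),
).groupby("model")[["avg_words", "avg_chars", "bullet_like_frac"]].mean()

print(summary)
```

Averaging the boolean `bullet_like` column per model yields the fraction of list-style responses, which is the `bullet_like_frac` reported in the table.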

Qualitative observation (from table): DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.


Extra Credit — Iterative DPO (30 pts)

  • Iteration 1: +20 new preference pairs → model ./iterative_dpo_model_iter_1
  • Iteration 2: +0 new pairs → model ./iterative_dpo_model_iter_2
  • Analysis file: iterative_dpo_analysis.txt
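Each iteration's new preference pairs can be formed by ranking the model's sampled responses with the judge and pairing the best against the worst per instruction, in the spirit of the self-rewarding recipe. A hedged sketch with hypothetical data shapes (not the notebook's actual code):

```python
# Sketch: turn per-instruction judge scores into DPO preference pairs
# (best vs. worst). Data shapes and example scores are assumptions.

def make_preference_pairs(scored):
    """scored: dict mapping instruction -> list of (response, score) tuples."""
    pairs = []
    for instruction, candidates in scored.items():
        if len(candidates) < 2:
            continue  # need at least two responses to form a pair
        ranked = sorted(candidates, key=lambda rs: rs[1], reverse=True)
        best, worst = ranked[0][0], ranked[-1][0]
        if best != worst:  # skip degenerate pairs with identical text
            pairs.append({"prompt": instruction, "chosen": best, "rejected": worst})
    return pairs

scored = {
    "Explain DPO briefly.": [
        ("long rambling answer", 0.2),
        ("clear two-sentence answer", 0.9),
        ("off-topic answer", 0.1),
    ],
}
print(make_preference_pairs(scored))
```

A loop like this explains the iteration counts above: when the judge finds no usable best/worst gap (or no new instructions are sampled), an iteration can contribute zero new pairs, as happened in Iteration 2.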

Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From evaluation_results.csv, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there.

Limitations include small preference sets and automated-judge bias; these can over-reward length/format. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal without visible degradation on these prompts.
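For reference, the β trade-off described in the free response is the one in the standard DPO objective (Rafailov et al.), where σ is the logistic function, π_θ the policy, and π_ref the frozen reference model:

```
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Larger β keeps the fine-tuned policy closer to the reference; since β was not logged in this run, no specific value is claimed here.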


All Links


Uploaded from: f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final
Upload time (UTC): 2025-08-12T14:57:34Z

