DPO Assignment 4 — Full Artifacts

All local artifacts from my run (datasets on disk, DPO adapters, CSV/TXT outputs, and the notebook).

Assignment 4 (verbatim prompt)

Assignment 4

In this assignment, we will be generating a preference dataset with PairRM and fine tuning a model with DPO. This is a powerful training recipe that is behind some of the top models according to Alpaca Eval.
You may use llama-3.2 1B or llama-3.2 3B.

Preference Dataset Collection and DPO Model Training

Part 1: Dataset Generation and Judge Implementation (40 points)
Create two separate preference datasets using different collection methods:

a) LLM Judge-Based Collection (20 points)
- Implement an LLM-based judge system
- Document your reasoning for the judge's prompt design
- Explain how you ensure consistent and reliable preference judgments
- Include examples of the judge's evaluation process
- You can choose between using local inference on Colab/Lightning studio or a 3rd party provider like fireworks ai/openai/together ai/groq (kimi k2)

b) PairRM-Based Collection (20 points)
- Extract 50 instructions from the Lima dataset
- Generate 5 responses per instruction using the llama-3.2 chat template
- Apply PairRM to create preference pairs
- Upload dataset to HuggingFace
- Submit repository link

Part 2: Model Training and Evaluation (60 points)

a) DPO Fine-tuning (40 points)
- Fine-tune llama-3.2 using PairRM preference dataset
- Fine-tune llama-3.2 using LLM Judge preference dataset
- Document training parameters and process
- Upload PEFT adapters to HuggingFace
- Submit repository links

b) Comparative Analysis (20 points)
- Select 10 novel instructions (not in training data)
- Generate completions using:
  * Original llama-3.2
  * DPO fine-tuned model (LLM judge dataset)
  * DPO fine-tuned model (PairRM dataset)
- Present results in a pandas DataFrame
- Analyze and compare the quality of completions
- Include quantitative and qualitative observations

Address the following points:
1. Qualitative differences in model outputs
2. Training stability across iterations
3. Computational efficiency considerations
4. Potential limitations and failure modes
5. Suggestions for improvement

Grading Criteria for Free Response:
- Depth of technical understanding
- Critical analysis of results
- Clear articulation of observations
- Original insights and suggestions
- Proper technical writing style


Extra Credit: Iterative DPO Implementation and Analysis (30 points)

a) Implementation (20 points)
- Implement the iterative DPO algorithm as described in "Self Rewarding Language Models"
- Train multiple iterations of the model (minimum 2 iterations)
- Document:
  * Implementation details
  * Training parameters

b) Comparative Analysis (10 points)
Free Response Question (~250 words)
Compare and analyze the performance and behavioral differences against the base llama-3.2 model, the DPO-PairRM model, and DPO-LLM-judge model

Submission Links by Requirement

1a) LLM Judge-Based Collection (20 pts)

1b) PairRM-Based Collection (20 pts)


2a) DPO Fine-tuning (40 pts)

2b) Comparative Analysis (20 pts)

  • Novelty check: 10 evaluation prompts; overlap with training = 0/10
  • Results table: evaluation_results.csv (saved with outputs from base + both DPO models)
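The 0/10 overlap figure above can be reproduced with a simple normalized-string comparison between the evaluation prompts and the training instructions. This is an illustrative sketch; the function names and example data are hypothetical, not taken from the notebook:

```python
# Sketch: verify evaluation prompts don't appear in the training data.
# Names and example data are illustrative assumptions, not the repo's actual code.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't mask overlap."""
    return " ".join(text.lower().split())

def overlap_count(eval_prompts, train_instructions):
    """Count evaluation prompts that appear verbatim (after normalization) in training."""
    train_set = {normalize(t) for t in train_instructions}
    return sum(1 for p in eval_prompts if normalize(p) in train_set)

eval_prompts = ["Explain DPO in one paragraph.", "Write a haiku about GPUs."]
train_instructions = ["Summarize the LIMA paper.", "explain  dpo in one paragraph."]

print(f"overlap = {overlap_count(eval_prompts, train_instructions)}/{len(eval_prompts)}")
# prints: overlap = 1/2
```

For a real novelty check one might also test for near-duplicates (e.g. token overlap) rather than exact matches, since paraphrased training prompts would slip past this exact-match version.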

Quantitative snapshot (from evaluation_results.csv):

| Model | avg_words | avg_chars | bullet_like_frac |
|---|---|---|---|
| Base | 26.1 | 153.0 | 0.10 |
| DPO-PairRM | 27.3 | 153.0 | 0.30 |
| DPO-LLM-Judge | 26.6 | 153.0 | 0.10 |
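The three columns can be derived from the raw completions with pandas. A minimal sketch, assuming `evaluation_results.csv` holds one row per completion with `model` and `response` columns (the column names and example rows are assumptions, not the file's actual schema):

```python
# Sketch of how the snapshot metrics could be computed with pandas.
# Column names ("model", "response") and the toy rows are assumptions.
import pandas as pd

def bullet_like(text: str) -> bool:
    """Flag responses where any line starts with a list marker or a number."""
    return any(
        line.lstrip().startswith(("-", "*", "•")) or line.lstrip()[:2].rstrip(".").isdigit()
        for line in text.splitlines()
    )

# Stand-in for pd.read_csv("evaluation_results.csv")
df = pd.DataFrame({
    "model": ["Base", "Base", "DPO-PairRM"],
    "response": ["Short answer.", "Another reply here.", "- step one\n- step two"],
})

summary = df.assign(
    avg_words=df["response"].str.split().str.len(),
    avg_chars=df["response"].str.len(),
    bullet_like_frac=df["response"].map(bullet_like),
).groupby("model")[["avg_words", "avg_chars", "bullet_like_frac"]].mean()

print(summary)
```

Averaging the boolean `bullet_like` column per model yields the fraction of list-style responses, which is the `bullet_like_frac` reported in the table.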

Qualitative observation (from table): DPO-PairRM tends to produce more stepwise, list-style answers; DPO-LLM-Judge remains more conversational while adhering to instructions.


Extra Credit — Iterative DPO (30 pts)

  • Iteration 1: +20 new preference pairs → model ./iterative_dpo_model_iter_1
  • Iteration 2: +0 new pairs → model ./iterative_dpo_model_iter_2
  • Analysis file: iterative_dpo_analysis.txt
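Each iteration's new preference pairs can be formed by ranking the model's sampled responses with the judge and pairing the best against the worst per instruction, in the spirit of the self-rewarding recipe. A hedged sketch with hypothetical data shapes (not the notebook's actual code):

```python
# Sketch: turn per-instruction judge scores into DPO preference pairs
# (best vs. worst). Data shapes and example scores are assumptions.

def make_preference_pairs(scored):
    """scored: dict mapping instruction -> list of (response, score) tuples."""
    pairs = []
    for instruction, candidates in scored.items():
        if len(candidates) < 2:
            continue  # need at least two responses to form a pair
        ranked = sorted(candidates, key=lambda rs: rs[1], reverse=True)
        best, worst = ranked[0][0], ranked[-1][0]
        if best != worst:  # skip degenerate pairs with identical text
            pairs.append({"prompt": instruction, "chosen": best, "rejected": worst})
    return pairs

scored = {
    "Explain DPO briefly.": [
        ("long rambling answer", 0.2),
        ("clear two-sentence answer", 0.9),
        ("off-topic answer", 0.1),
    ],
}
print(make_preference_pairs(scored))
```

A loop like this explains the iteration counts above: when the judge finds no usable best/worst gap (or no new instructions are sampled), an iteration can contribute zero new pairs, as happened in Iteration 2.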

Free Response (~250 words)

This assignment applies Direct Preference Optimization (DPO) to Llama-3.2-1B-Instruct using two preference sources: PairRM (250 pairs) and an LLM-judge dataset (150 pairs). DPO optimizes the log-odds of “chosen” over “rejected” responses while constraining divergence from the reference with a KL term (β controls that trade-off; not reported here). Evaluation on 10 novel prompts (0/10 overlap with training) compares the base model with both DPO fine-tunes.

From evaluation_results.csv, corpus-level statistics show a small style shift after DPO: average words per response increase for the DPO models relative to base, and list-style formatting rises notably for DPO-PairRM (higher bullet-like fraction), indicating stronger structural bias from PairRM preferences. Qualitatively (inspecting the table), DPO-PairRM tends toward stepwise, “instructional” phrasing; DPO-LLM-judge remains more conversational while still adhering to the prompts. Training stability and runtime were not re-measured in this run (existing models were reused), so I avoid claims there.

Limitations include small preference sets and automated-judge bias; these can over-reward length/format. Improvements: log β and other hyperparameters alongside results; add an automatic win-rate over the 10 prompts (e.g., a simple LLM judge sweep) to complement length/format metrics; and broaden preference diversity (e.g., more instructions or ensemble judges). Overall, DPO nudges structure and adherence in ways consistent with the active preference signal without visible degradation on these prompts.
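For reference, the β trade-off described in the free response is the one in the standard DPO objective (Rafailov et al.), where σ is the logistic function, π_θ the policy, and π_ref the frozen reference model:

```
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```

Larger β keeps the fine-tuned policy closer to the reference; since β was not logged in this run, no specific value is claimed here.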


All Links


Uploaded from: f:\Northeastern 2024-2025\INFO7374\Assignment 4\Final
Upload time (UTC): 2025-08-12T14:57:34Z

