---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---
# Model Card for Phi-2_DPO_M3_Base
A LoRA-finetuned variant of **microsoft/phi-2** targeting STEM multiple-choice question answering (MCQA). The model was first trained with SFT on mixed STEM MCQA datasets, then aligned via DPO using human preference data (EPFL exam MCQAs). This **Base** checkpoint is the standard (non-quantized) version intended for highest fidelity before any 4/8-bit compression.
## Model Details
### Model Description
This model adapts Phi-2 (≈2.78B params, 2,048 context length) for MCQA, especially STEM. Training used LoRA adapters (rank=16, α=16, dropout=0.05) with the TRL library for SFT and DPO; checkpoints focus on adapter weights for compactness. This Base release loads in full precision (fp16/bf16 capable) and is recommended for evaluation and further finetuning.
* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA assistant
* **Language(s) (NLP):** English (training/eval datasets primarily EN)
* **License:** MIT (per repository)
* **Finetuned from model:** microsoft/phi-2
### Model Sources
* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”
## Uses
### Direct Use
* MCQA answering for STEM and general knowledge benchmarks (e.g., MMLU, OpenBookQA).
* Educational assistants/tutors for multiple-choice reasoning with short explanation-then-answer prompts.
### Out-of-Scope Use
* High-stakes domains (medical, legal, safety-critical) without human oversight.
* The model may underperform on generative tasks outside the MCQA chat format (e.g., long-form proofs).
* Any use that violates exam integrity or leaks copyrighted/confidential test content.
## Bias, Risks, and Limitations
* **STEM difficulty:** Performance on harder math/science MCQA can hover near chance on some sets, indicating limited reliability for difficult reasoning.
* **Alignment drift:** DPO after SFT can affect strict letter-only answer formatting; the model may add extra content or follow-ups.
* **Data risk:** Exam-derived prompts/answers may raise confidentiality/fairness concerns if reused exams are included.
### Recommendations
* Keep a human in the loop for grading/teaching.
* Prefer balanced MCQA data; use explicit “### Question / ### Explanation / ### Answer” formatting to stabilize outputs.
* Add guardrails to discourage cheating or policy-violating requests.
## How to Get Started with the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base"  # replace with your Hub ID

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # or torch.bfloat16 on supported hardware
    device_map="auto",
)

# Prompt in the training format: question, short explanation, then the answer header.
prompt = "### Question: What is 2+2?\n### Explanation: Add the integers.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
```
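If the Hub repository hosts only the LoRA adapter weights rather than a merged checkpoint (the card notes that checkpoints focus on adapter weights), the adapter can instead be attached to the base model with PEFT. The snippet below is a minimal sketch under that assumption.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_id = "microsoft/phi-2"
adapter_id = "ShAIkespear/Phi-2_DPO_M3_Base"  # replace with your Hub ID

tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)  # attaches the LoRA adapter
```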
## Training Details
### Training Data
SFT used a mix of MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K, plus balanced/shuffled merged MCQA sets. DPO used HelpSteer and a student-curated EPFL preference dataset (~20–30k pairs; subsets of both were used for SFT and DPO). Items longer than 512 tokens were dropped, and large datasets were clipped to 20k samples. Example split (sketched below): train 50%, test_overfit 25%, test_comparison 10%, test_quantization 15% (the quantization split is retained for comparability, even though this is the Base model).
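A minimal sketch of the 50/25/10/15 split described above, assuming the merged MCQA data sits in a hypothetical local JSONL file:

```python
from datasets import load_dataset

# Hypothetical merged MCQA file; split fractions follow the card (50/25/10/15).
ds = load_dataset("json", data_files="merged_mcqa.jsonl", split="train").shuffle(seed=42)
n = len(ds)
train             = ds.select(range(0, int(0.50 * n)))
test_overfit      = ds.select(range(int(0.50 * n), int(0.75 * n)))
test_comparison   = ds.select(range(int(0.75 * n), int(0.85 * n)))
test_quantization = ds.select(range(int(0.85 * n), n))
```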
### Training Procedure
#### Preprocessing
All datasets were converted to a unified MCQA schema:
* **SFT format:** `id, subject, question, answer/answer_text, choices`
* **DPO format:** `prompt, rejected, chosen`

Prompts used a structured header: `### Question ... ### Explanation ... ### Answer` (an example record of each format is shown below).
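As an illustration only (the question and answer texts below are invented), one record in each schema might look like:

```python
# Hypothetical SFT record in the unified MCQA schema.
sft_record = {
    "id": "openbookqa-0001",
    "subject": "physics",
    "question": "Which force keeps planets in orbit around the Sun?",
    "choices": ["A. Friction", "B. Gravity", "C. Magnetism", "D. Tension"],
    "answer": "B",
    "answer_text": "Gravity",
}

# Hypothetical DPO record: the preferred completion keeps the structured
# Explanation/Answer format, the rejected one does not.
dpo_record = {
    "prompt": "### Question: Which force keeps planets in orbit around the Sun?\n### Explanation:",
    "chosen": " Gravity supplies the centripetal force.\n### Answer: B",
    "rejected": " Planets just keep moving on their own.",
}
```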
#### Training Hyperparameters
* **Regime:** Mixed precision typical for TRL (fp16/bf16 depending on hardware); LoRA rank 16, α 16, dropout 0.05.
* **Batch sizes:** SFT train/eval = 4; DPO = 1 (to avoid OOM).
* **Learning rate:** 1e-5 for public datasets; 1e-4 for EPFL data; cosine schedule with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA, Transformers (a configuration sketch follows below).
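A minimal configuration sketch matching the values listed above; argument names beyond the reported hyperparameters (output path, warmup ratio) are assumptions. These objects would be passed to TRL's SFTTrainer / DPOTrainer together with the base model, tokenizer, and the formatted datasets.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter configuration reported on the card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Illustrative SFT arguments; DPO reused the same scheme with batch size 1.
sft_args = TrainingArguments(
    output_dir="phi2-mcqa-sft",        # hypothetical output path
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=1e-5,                # 1e-4 was used for the EPFL data
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # warmup fraction is an assumption
    fp16=True,                         # or bf16=True, depending on hardware
)
```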
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Per-dataset held-out test sets (following the splits above), plus MMLU converted to the SFT schema.
#### Factors
Task domain (math vs. general science vs. open-domain), data balancing, and SFT→DPO ordering.
#### Metrics
MCQA accuracy; DPO pairwise preference accuracy.
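MCQA accuracy here is plain exact match over the predicted option letter; a minimal sketch, assuming generations follow the `### Answer:` header used in the prompts:

```python
import re

def extract_letter(generation: str):
    """Return the first option letter that follows the '### Answer:' header, if any."""
    match = re.search(r"### Answer:\s*([A-E])", generation)
    return match.group(1) if match else None

def mcqa_accuracy(generations, gold_letters):
    """Fraction of items whose extracted letter matches the gold letter."""
    correct = sum(extract_letter(g) == gold for g, gold in zip(generations, gold_letters))
    return correct / len(gold_letters)
```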
### Results
Across ablations, the **balanced-then-DPO** configuration (M3) performed best overall on the team’s benchmark suite. The Base model serves as the reference for subsequent quantized variants.
#### Summary
* Balanced MCQA SFT improved robustness.
* DPO on EPFL preferences improved alignment and EPFL-style accuracy.
* Use this Base checkpoint when you prioritize maximum fidelity or plan additional finetuning; switch to quantized variants for memory-constrained inference.
## Technical Specifications
### Model Architecture and Objective
Phi-2 transformer decoder LM (≈2.78B params) with next-token prediction objective; LoRA adapters for parameter-efficient finetuning; DPO for preference alignment.
#### Software
Hugging Face TRL, PEFT/LoRA, Transformers.
## Glossary
* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning with gold answers.
* **DPO:** Direct Preference Optimization (pairwise preference alignment).
* **LoRA:** Low-Rank Adaptation for parameter-efficient finetuning.