---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---

# Model Card for Phi-2_DPO_M3_Base

A LoRA-finetuned variant of **microsoft/phi-2** targeting STEM multiple-choice question answering (MCQA). The model was first trained with SFT on mixed STEM MCQA datasets, then aligned via DPO using human preference data (EPFL exam MCQAs). This **Base** checkpoint is the standard (non-quantized) version intended for highest fidelity before any 4/8-bit compression.

## Model Details

### Model Description

This model adapts Phi-2 (≈2.78B params, 2,048-token context length) for MCQA, especially in STEM domains. Training used LoRA adapters (rank = 16, α = 16, dropout = 0.05) with the TRL library for SFT and DPO; checkpoints focus on adapter weights for compactness. This Base release loads in full precision (fp16/bf16 capable) and is recommended for evaluation and further finetuning.

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA assistant
* **Language(s) (NLP):** English (training/eval datasets primarily EN)
* **License:** MIT (per repository)
* **Finetuned from model:** microsoft/phi-2

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

## Uses

### Direct Use

* MCQA answering for STEM and general-knowledge benchmarks (e.g., MMLU, OpenBookQA).
* Educational assistants/tutors for multiple-choice reasoning with short explanation-then-answer prompts.

### Out-of-Scope Use

* High-stakes domains (medical, legal, safety-critical) without human oversight.
* Generative tasks outside the MCQA chat format (e.g., long-form proofs), where the model may underperform.
* Any use that violates exam integrity or leaks copyrighted/confidential test content.

## Bias, Risks, and Limitations

* **STEM difficulty:** Performance on harder math/science MCQA can hover near chance on some sets, indicating limited reliability for difficult reasoning.
* **Alignment drift:** DPO after SFT can affect strict letter-only answer formatting; the model may add extra content or follow-ups.
* **Data risk:** Exam-derived prompts/answers may raise confidentiality/fairness concerns if reused exams are included.

### Recommendations

* Keep a human in the loop for grading/teaching.
* Prefer balanced MCQA data; use explicit “### Question / ### Explanation / ### Answer” formatting to stabilize outputs.
* Add guardrails to discourage cheating or policy-violating requests.

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base"  # replace with your Hub ID
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "### Question: What is 2+2?\n### Explanation: Add the integers.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
```

## Training Details

### Training Data

Mixed SFT on MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K, plus balanced/shuffled merged MCQA sets; DPO on HelpSteer and a student-curated EPFL preference dataset (~20–30k pairs; subsets used for SFT/DPO). Long items (>512 tokens) were dropped, and large datasets were clipped to 20k samples.

Example split: train 50%, test_overfit 25%, test_comparison 10%, test_quantization 15% (the quantization split is retained for comparability, even though this is the Base, non-quantized model).
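The filtering, clipping, and four-way split can be reproduced along these lines. The sketch below is illustrative rather than the released preprocessing script: it assumes the Hugging Face `datasets` library, and the field name (`question`), tokenizer choice, and seed are placeholders.

```python
# Minimal sketch of the preprocessing above; field names, tokenizer, and seed
# are illustrative. The exact scripts live in the project repository.
from datasets import Dataset
from transformers import AutoTokenizer

MAX_TOKENS = 512      # long items (> 512 tokens) are dropped
MAX_SAMPLES = 20_000  # large datasets are clipped to 20k samples

def clip_and_filter(ds: Dataset, tokenizer) -> Dataset:
    ds = ds.filter(lambda ex: len(tokenizer(ex["question"])["input_ids"]) <= MAX_TOKENS)
    return ds.select(range(min(len(ds), MAX_SAMPLES)))

def four_way_split(ds: Dataset, seed: int = 42) -> dict:
    """50/25/10/15 split: train / test_overfit / test_comparison / test_quantization."""
    a = ds.train_test_split(test_size=0.5, seed=seed)          # 50% train / 50% held out
    b = a["test"].train_test_split(test_size=0.5, seed=seed)   # 25% / 25%
    c = b["test"].train_test_split(test_size=0.6, seed=seed)   # 10% / 15%
    return {
        "train": a["train"],
        "test_overfit": b["train"],
        "test_comparison": c["train"],
        "test_quantization": c["test"],
    }

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
toy = Dataset.from_dict({"question": [f"Placeholder question {i}?" for i in range(40)]})
splits = four_way_split(clip_and_filter(toy, tok))
print({k: len(v) for k, v in splits.items()})
```

The 0.5 / 0.5 / 0.6 test sizes recover the 50/25/10/15 proportions by splitting the held-out half twice.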
### Training Procedure

#### Preprocessing

Unified MCQA schema. SFT format: `id, subject, question, answer/answer_text, choices`. DPO format: `prompt, rejected, chosen`. Prompts used a structured header:

`### Question ... ### Explanation ... ### Answer`

#### Training Hyperparameters

* **Regime:** Mixed precision typical for TRL (fp16/bf16 depending on hardware); LoRA rank 16, α 16, dropout 0.05.
* **Batch sizes:** SFT train/eval = 4; DPO = 1 (to avoid OOM).
* **Learning rate:** 1e-5 for public datasets; 1e-4 for EPFL data; cosine schedule with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA, Transformers.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Per-dataset held-out test sets (following the splits described above), plus MMLU converted to the SFT schema.

#### Factors

Task domain (math vs. general science vs. open-domain), data balancing, and SFT→DPO ordering.

#### Metrics

MCQA accuracy; DPO pairwise preference accuracy.

### Results

Across ablations, the **balanced-then-DPO** configuration (M3) performed best overall on the team’s benchmark suite. The Base model serves as the reference for subsequent quantized variants.

#### Summary

* Balanced MCQA SFT improved robustness.
* DPO on EPFL preferences improved alignment and EPFL-style accuracy.
* Use this Base checkpoint when you prioritize maximum fidelity or plan additional finetuning; switch to quantized variants for memory-constrained inference.

## Technical Specifications

### Model Architecture and Objective

Phi-2 transformer decoder LM (≈2.78B params) with a next-token prediction objective; LoRA adapters for parameter-efficient finetuning; DPO for preference alignment.

#### Software

Hugging Face TRL, PEFT/LoRA, Transformers.

## Glossary

* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning with gold answers.
* **DPO:** Direct Preference Optimization (pairwise preference alignment).
* **LoRA:** Low-Rank Adaptation for parameter-efficient finetuning.
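## Training Sketch (Illustrative)

For orientation, the sketch below shows how the LoRA + DPO stage described under Training Hyperparameters could be wired up with TRL. It assumes recent `transformers`, `peft`, and `trl` releases (older TRL versions pass the tokenizer via `tokenizer=` instead of `processing_class=`); the toy preference pair, output directory, and warmup ratio are placeholders, so treat this as a starting point rather than the exact recipe behind this checkpoint.

```python
# Hypothetical sketch, not the released training script; hyperparameters mirror the card.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure a pad token for batching
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA settings from the card: rank 16, alpha 16, dropout 0.05.
peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

# DPO expects prompt/chosen/rejected columns; this toy pair stands in for the
# student-curated EPFL preference data.
pairs = Dataset.from_dict({
    "prompt":   ["### Question: What is 2+2?\n### Explanation:"],
    "chosen":   [" Adding the two integers gives 4.\n### Answer: 4"],
    "rejected": [" The answer is 5.\n### Answer: 5"],
})

args = DPOConfig(
    output_dir="phi2-dpo-sketch",
    per_device_train_batch_size=1,   # DPO batch size 1, as noted above
    learning_rate=1e-4,              # rate used for the EPFL preference data
    lr_scheduler_type="cosine",      # cosine schedule with warmup
    warmup_ratio=0.1,                # warmup amount not specified in the card
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```

The preceding SFT stage would use TRL's `SFTTrainer` analogously, with the batch size of 4 and the 1e-5 learning rate noted above for the public datasets.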