---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---

# Model Card for Phi-2_DPO_M3_Base

A LoRA-finetuned variant of **microsoft/phi-2** targeting STEM multiple-choice question answering (MCQA). The model was first trained with SFT on mixed STEM MCQA datasets, then aligned via DPO on human preference data (EPFL exam MCQAs). This **Base** checkpoint is the standard, non-quantized version, intended for highest fidelity before any 4-/8-bit compression.

## Model Details

### Model Description

This model adapts Phi-2 (≈2.78B parameters, 2,048-token context) for MCQA, especially in STEM. Training used LoRA adapters (rank = 16, α = 16, dropout = 0.05) with the TRL library for SFT and DPO; checkpoints contain only the adapter weights for compactness. This Base release loads in full precision (fp16/bf16 capable) and is recommended for evaluation and further finetuning.

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA assistant
* **Language(s) (NLP):** English (training/eval datasets primarily EN)
* **License:** MIT (per repository)
* **Finetuned from model:** microsoft/phi-2

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkerspear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

## Uses

### Direct Use

* MCQA answering on STEM and general-knowledge benchmarks (e.g., MMLU, OpenBookQA).
* Educational assistants/tutors for multiple-choice reasoning with short explanation-then-answer prompts.

### Out-of-Scope Use

* High-stakes domains (medical, legal, safety-critical) without human oversight.
* Generative tasks outside the MCQA chat format (e.g., long-form proofs), where the model may underperform.
* Any use that violates exam integrity or leaks copyrighted/confidential test content.

## Bias, Risks, and Limitations

* **STEM difficulty:** Accuracy on harder math/science MCQA can hover near chance on some sets, indicating limited reliability for difficult reasoning.
* **Alignment drift:** Running DPO after SFT can weaken strict letter-only answer formatting; the model may add extra content or follow-up questions.
* **Data risk:** Exam-derived prompts/answers may raise confidentiality and fairness concerns if reused exams are included.

### Recommendations

* Keep a human in the loop for grading/teaching.
* Prefer balanced MCQA data; use explicit “### Question / ### Explanation / ### Answer” formatting to stabilize outputs.
* Add guardrails to discourage cheating or policy-violating requests.

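The structured prompt format recommended above can be produced with a small helper. A minimal sketch; `format_mcqa_prompt` is a hypothetical name, not part of the released code:

```python
def format_mcqa_prompt(question: str, choices: list[str], explanation: str = "") -> str:
    """Render one MCQA item with lettered options and the structured header."""
    letters = "ABCDEFGH"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return (
        f"### Question: {question}\n{options}\n"
        f"### Explanation: {explanation}\n"
        f"### Answer:"
    )

prompt = format_mcqa_prompt("What is 2+2?", ["3", "4", "5"])
print(prompt)
```

Ending the prompt at `### Answer:` leaves the model to complete only the option letter, which is what stabilizes the output format.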
## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base"  # replace with your Hub ID

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16/bf16 both work for this Base checkpoint
    device_map="auto",
)

prompt = "### Question: What is 2+2?\n### Explanation: Add the integers.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
```

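Since the model may append extra tokens after the option letter, grading code should parse the completion rather than compare raw strings. A minimal sketch, assuming the output follows the `### Answer:` header; `extract_answer_letter` is a hypothetical helper:

```python
import re
from typing import Optional

def extract_answer_letter(generated: str, num_choices: int = 4) -> Optional[str]:
    """Return the first standalone option letter after '### Answer:', if any."""
    valid = "ABCDEFGH"[:num_choices]
    tail = generated.split("### Answer:")[-1]
    match = re.search(rf"\b([{valid}])\b", tail)
    return match.group(1) if match else None

print(extract_answer_letter("### Question: ...\n### Answer: B. Because 2+2=4"))  # B
```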
## Training Details

### Training Data

SFT used a mix of MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K, plus balanced/shuffled merged MCQA sets; DPO used HelpSteer and a student-curated EPFL preference dataset (~20–30k pairs, with subsets used for SFT and DPO). Items longer than 512 tokens were dropped, and large datasets were clipped to 20k samples. Example split: train 50%, test_overfit 25%, test_comparison 10%, test_quantization 15% (the quantization split is retained for comparability, even though this is the Base model).

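The example split above can be sketched as a shuffle followed by cumulative cuts; a minimal illustration (the fractions come from this card, the helper name is hypothetical):

```python
import random

def split_dataset(items, seed=0):
    """Split into 50% train, 25% test_overfit, 10% test_comparison, 15% test_quantization."""
    rng = random.Random(seed)
    items = items[:]  # avoid mutating the caller's list
    rng.shuffle(items)
    n = len(items)
    cuts = [int(n * f) for f in (0.50, 0.75, 0.85)]  # cumulative fractions
    return {
        "train": items[:cuts[0]],
        "test_overfit": items[cuts[0]:cuts[1]],
        "test_comparison": items[cuts[1]:cuts[2]],
        "test_quantization": items[cuts[2]:],
    }

splits = split_dataset(list(range(1000)))
print({k: len(v) for k, v in splits.items()})
# {'train': 500, 'test_overfit': 250, 'test_comparison': 100, 'test_quantization': 150}
```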
### Training Procedure

#### Preprocessing

All datasets were converted to a unified MCQA schema.
SFT format: `id, subject, question, answer/answer_text, choices`.
DPO format: `prompt, rejected, chosen`.
Prompts used a structured header:
`### Question ... ### Explanation ... ### Answer`

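To make the two schemas concrete, here is a minimal sketch of mapping one unified record to SFT text and a DPO triple; the helper names and the exact field rendering are assumptions, not taken from the released code:

```python
def to_sft_text(record: dict) -> str:
    """Render a unified-schema record as one SFT training example."""
    return (
        f"### Question: {record['question']}\n"
        f"### Explanation: {record.get('answer_text', '')}\n"
        f"### Answer: {record['answer']}"
    )

def to_dpo_pair(record: dict, wrong_answer: str) -> dict:
    """Build a (prompt, chosen, rejected) triple from a record and a distractor."""
    prompt = f"### Question: {record['question']}\n### Answer:"
    return {"prompt": prompt, "chosen": f" {record['answer']}", "rejected": f" {wrong_answer}"}

record = {"id": "ex1", "subject": "math", "question": "What is 2+2?",
          "answer": "B", "answer_text": "Add the integers.", "choices": ["3", "4", "5"]}
print(to_dpo_pair(record, "A"))
```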
#### Training Hyperparameters

* **Regime:** Mixed precision typical for TRL (fp16/bf16 depending on hardware); LoRA rank 16, α 16, dropout 0.05.
* **Batch sizes:** SFT train/eval = 4; DPO = 1 (to avoid OOM).
* **Learning rate:** 1e-5 for public datasets; 1e-4 for EPFL data; cosine schedule with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA, Transformers.
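The hyperparameters above map onto PEFT/TRL configuration objects roughly as follows. This is a sketch only: the warmup fraction, output directory, and any arguments not listed in this card are assumptions, not values from the released training scripts.

```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA adapter settings from this card: rank 16, alpha 16, dropout 0.05
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# DPO stage settings from this card: batch size 1, cosine schedule with warmup
dpo_cfg = DPOConfig(
    per_device_train_batch_size=1,  # kept at 1 to avoid OOM
    learning_rate=1e-5,             # 1e-5 for public datasets, 1e-4 for EPFL data
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # assumed warmup fraction
    output_dir="phi2-dpo",          # assumed output path
)
```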

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Per-dataset held-out test sets (following the splits above), plus MMLU converted to the SFT schema.

#### Factors

Task domain (math vs. general science vs. open-domain), data balancing, and SFT→DPO ordering.

#### Metrics

MCQA accuracy; DPO pairwise preference accuracy.

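The two metrics are straightforward to compute; a minimal sketch with hypothetical function names:

```python
def mcqa_accuracy(preds, golds):
    """Fraction of items where the predicted option letter matches the gold letter."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def dpo_preference_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    pairs = list(zip(chosen_scores, rejected_scores))
    return sum(c > r for c, r in pairs) / len(pairs)

print(mcqa_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))  # 0.75
print(dpo_preference_accuracy([1.2, 0.3, 0.9], [0.4, 0.8, 0.1]))
```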
### Results

Across ablations, the **balanced-then-DPO** configuration (M3) performed best overall on the team’s benchmark suite. The Base model serves as the reference for the subsequent quantized variants.

#### Summary

* Balanced MCQA SFT improved robustness.
* DPO on EPFL preferences improved alignment and EPFL-style accuracy.
* Use this Base checkpoint when you prioritize maximum fidelity or plan additional finetuning; switch to the quantized variants for memory-constrained inference.

## Technical Specifications

### Model Architecture and Objective

Phi-2 transformer decoder LM (≈2.78B parameters) with a next-token prediction objective; LoRA adapters for parameter-efficient finetuning; DPO for preference alignment.

#### Software

Hugging Face TRL, PEFT/LoRA, Transformers.

## Glossary

* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning with gold answers.
* **DPO:** Direct Preference Optimization (pairwise preference alignment).
* **LoRA:** Low-Rank Adaptation for parameter-efficient finetuning.