---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---

# Model Card for Phi-2_DPO_M3_Base

A LoRA-finetuned variant of **microsoft/phi-2** targeting STEM multiple-choice question answering (MCQA). The model was first trained with SFT on mixed STEM MCQA datasets, then aligned via DPO using human preference data (EPFL exam MCQAs). This **Base** checkpoint is the standard (non-quantized) version intended for highest fidelity before any 4/8-bit compression.

## Model Details

### Model Description

This model adapts Phi-2 (≈2.78B params, 2,048-token context length) for MCQA, especially in STEM domains. Training used LoRA adapters (rank = 16, α = 16, dropout = 0.05) with the TRL library for SFT and DPO; checkpoints focus on adapter weights for compactness. This Base release loads in full precision (fp16/bf16 capable) and is recommended for evaluation and further finetuning.

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal decoder-only LM (Phi-2) with LoRA adapters; DPO-aligned MCQA assistant
* **Language(s) (NLP):** English (training/eval datasets primarily EN)
* **License:** MIT (per repository)
* **Finetuned from model:** microsoft/phi-2

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** “ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”

## Uses

### Direct Use

* MCQA answering for STEM and general-knowledge benchmarks (e.g., MMLU, OpenBookQA).
* Educational assistants/tutors for multiple-choice reasoning with short explanation-then-answer prompts.

### Out-of-Scope Use

* High-stakes domains (medical, legal, safety-critical) without human oversight.
* Generative tasks outside the MCQA chat format (e.g., long-form proofs), where the model may underperform.
* Any use that violates exam integrity or leaks copyrighted/confidential test content.

## Bias, Risks, and Limitations

* **STEM difficulty:** Performance on harder math/science MCQA can hover near chance on some sets, indicating limited reliability for difficult reasoning.
* **Alignment drift:** DPO after SFT can affect strict letter-only answer formatting; the model may add extra content or follow-ups.
* **Data risk:** Exam-derived prompts/answers may raise confidentiality/fairness concerns if reused exams are included.

### Recommendations

* Keep a human in the loop for grading/teaching.
* Prefer balanced MCQA data; use explicit “### Question / ### Explanation / ### Answer” formatting to stabilize outputs.
* Add guardrails to discourage cheating or policy-violating requests.

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base"  # replace with your Hub ID
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "### Question: What is 2+2?\n### Explanation: Add the integers.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(tok.decode(out[0], skip_special_tokens=True))
```

## Training Details

### Training Data

Mixed SFT on MathQA, OpenBookQA, ScienceQA, and TAL-SCQ5K, plus balanced/shuffled merged MCQA sets; DPO on HelpSteer and a student-curated EPFL preference dataset (~20–30k pairs; subsets used for SFT/DPO). Long items (>512 tokens) were dropped, and large datasets were clipped to 20k samples.

Example split: train 50%, test_overfit 25%, test_comparison 10%, test_quantization 15% (the quantization split is retained for comparability, even though this is the Base, non-quantized model).
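The filtering, clipping, and four-way split can be reproduced along these lines. The sketch below is illustrative rather than the released preprocessing script: it assumes the Hugging Face `datasets` library, and the field name (`question`), tokenizer choice, and seed are placeholders.

```python
# Minimal sketch of the preprocessing above; field names, tokenizer, and seed
# are illustrative. The exact scripts live in the project repository.
from datasets import Dataset
from transformers import AutoTokenizer

MAX_TOKENS = 512      # long items (> 512 tokens) are dropped
MAX_SAMPLES = 20_000  # large datasets are clipped to 20k samples

def clip_and_filter(ds: Dataset, tokenizer) -> Dataset:
    ds = ds.filter(lambda ex: len(tokenizer(ex["question"])["input_ids"]) <= MAX_TOKENS)
    return ds.select(range(min(len(ds), MAX_SAMPLES)))

def four_way_split(ds: Dataset, seed: int = 42) -> dict:
    """50/25/10/15 split: train / test_overfit / test_comparison / test_quantization."""
    a = ds.train_test_split(test_size=0.5, seed=seed)          # 50% train / 50% held out
    b = a["test"].train_test_split(test_size=0.5, seed=seed)   # 25% / 25%
    c = b["test"].train_test_split(test_size=0.6, seed=seed)   # 10% / 15%
    return {
        "train": a["train"],
        "test_overfit": b["train"],
        "test_comparison": c["train"],
        "test_quantization": c["test"],
    }

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
toy = Dataset.from_dict({"question": [f"Placeholder question {i}?" for i in range(40)]})
splits = four_way_split(clip_and_filter(toy, tok))
print({k: len(v) for k, v in splits.items()})
```

The 0.5 / 0.5 / 0.6 test sizes recover the 50/25/10/15 proportions by splitting the held-out half twice.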
### Training Procedure

#### Preprocessing

Unified MCQA schema. SFT format: `id, subject, question, answer/answer_text, choices`. DPO format: `prompt, rejected, chosen`. Prompts used a structured header:

`### Question ... ### Explanation ... ### Answer`

#### Training Hyperparameters

* **Regime:** Mixed precision typical for TRL (fp16/bf16 depending on hardware); LoRA rank 16, α 16, dropout 0.05.
* **Batch sizes:** SFT train/eval = 4; DPO = 1 (to avoid OOM).
* **Learning rate:** 1e-5 for public datasets; 1e-4 for EPFL data; cosine schedule with warmup.
* **Frameworks:** Hugging Face TRL + PEFT/LoRA, Transformers.

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Per-dataset held-out test sets (following the splits described above), plus MMLU converted to the SFT schema.

#### Factors

Task domain (math vs. general science vs. open-domain), data balancing, and SFT→DPO ordering.

#### Metrics

MCQA accuracy; DPO pairwise preference accuracy.

### Results

Across ablations, the **balanced-then-DPO** configuration (M3) performed best overall on the team’s benchmark suite. The Base model serves as the reference for subsequent quantized variants.

#### Summary

* Balanced MCQA SFT improved robustness.
* DPO on EPFL preferences improved alignment and EPFL-style accuracy.
* Use this Base checkpoint when you prioritize maximum fidelity or plan additional finetuning; switch to quantized variants for memory-constrained inference.

## Technical Specifications

### Model Architecture and Objective

Phi-2 transformer decoder LM (≈2.78B params) with a next-token prediction objective; LoRA adapters for parameter-efficient finetuning; DPO for preference alignment.

#### Software

Hugging Face TRL, PEFT/LoRA, Transformers.

## Glossary

* **MCQA:** Multiple-choice question answering.
* **SFT:** Supervised finetuning with gold answers.
* **DPO:** Direct Preference Optimization (pairwise preference alignment).
* **LoRA:** Low-Rank Adaptation for parameter-efficient finetuning.
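## Training Sketch (Illustrative)

For orientation, the sketch below shows how the LoRA + DPO stage described under Training Hyperparameters could be wired up with TRL. It assumes recent `transformers`, `peft`, and `trl` releases (older TRL versions pass the tokenizer via `tokenizer=` instead of `processing_class=`); the toy preference pair, output directory, and warmup ratio are placeholders, so treat this as a starting point rather than the exact recipe behind this checkpoint.

```python
# Hypothetical sketch, not the released training script; hyperparameters mirror the card.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # ensure a pad token for batching
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA settings from the card: rank 16, alpha 16, dropout 0.05.
peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

# DPO expects prompt/chosen/rejected columns; this toy pair stands in for the
# student-curated EPFL preference data.
pairs = Dataset.from_dict({
    "prompt":   ["### Question: What is 2+2?\n### Explanation:"],
    "chosen":   [" Adding the two integers gives 4.\n### Answer: 4"],
    "rejected": [" The answer is 5.\n### Answer: 5"],
})

args = DPOConfig(
    output_dir="phi2-dpo-sketch",
    per_device_train_batch_size=1,   # DPO batch size 1, as noted above
    learning_rate=1e-4,              # rate used for the EPFL preference data
    lr_scheduler_type="cosine",      # cosine schedule with warmup
    warmup_ratio=0.1,                # warmup amount not specified in the card
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=pairs,
    processing_class=tokenizer,      # `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```

The preceding SFT stage would use TRL's `SFTTrainer` analogously, with the batch size of 4 and the 1e-5 learning rate noted above for the public datasets.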