---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---

# Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt

A **LoRA-finetuned** and **Direct Preference Optimization (DPO)**-aligned variant of **microsoft/phi-2**, specialized for **multiple-choice question answering (MCQA)** with an emphasis on **STEM and general knowledge** domains.

This model is the *alternative base configuration* of the final **M3 (balanced-then-DPO)** training pipeline from the *ShAIkespear* project. Its weights are not 8-bit quantized, which preserves fidelity and makes it the preferred starting point for further fine-tuning.

---

## Model Details

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal LM (Phi-2) with LoRA adapters; DPO-aligned
* **Languages:** English
* **License:** MIT
* **Finetuned from:** microsoft/phi-2

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** *“ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”*

---

## Uses

### Direct Use

* MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
* Alignment research: comparing DPO training setups (Base vs. Quantized).
* A **high-fidelity reference checkpoint** for quantized and downstream variants.

### Out-of-Scope Use

* High-stakes or safety-critical applications (medical, legal, policy).
* Generative tasks outside multiple-choice reasoning.
* Automated exam solving, or any use that risks leaking confidential data.

---

## Bias, Risks, and Limitations

* **Domain bias:** Stronger on factual MCQA, weaker on advanced reasoning tasks.
* **Answer drift:** May occasionally produce verbose or follow-up answers without explicit formatting.
* **Data source risks:** EPFL-derived preferences may encode narrow style biases.

### Recommendations

* Maintain the structured prompt format:

```
### Question ...
### Explanation ...
### Answer:
```

* Keep human supervision in any educational or grading use.
* Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.
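
The structured prompt format recommended above can be assembled with a small helper. This is a minimal sketch rather than the project's own code: `build_prompt` is a hypothetical name, and the colon after each section marker follows the quick-start example on this card.

```python
def build_prompt(question: str, explanation: str) -> str:
    """Assemble a prompt in the card's Question / Explanation / Answer layout.

    Hypothetical helper: the section markers come from the card; the function
    name and signature are assumptions made for illustration.
    """
    return "\n".join(
        [
            f"### Question: {question.strip()}",
            f"### Explanation: {explanation.strip()}",
            # End on a bare answer header so the model continues with the answer.
            "### Answer:",
        ]
    )


prompt = build_prompt(
    "Which element has the chemical symbol 'O'?",
    "The symbol 'O' represents this essential gas.",
)
print(prompt)
```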

---

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"

# Load tokenizer and model; device_map="auto" requires the accelerate package.
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Use the structured Question / Explanation / Answer prompt format.
prompt = "### Question: Which element has the chemical symbol 'O'?\n### Explanation: The symbol 'O' represents this essential gas.\n### Answer:"
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Generate a short completion without tracking gradients.
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=15)

# The decoded text contains the prompt followed by the generated answer.
print(tok.decode(out[0], skip_special_tokens=True))
```
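
Because the decoded string includes the prompt, the answer has to be separated from it before scoring or display. The sketch below shows one way to do that in plain Python. `extract_answer` is a hypothetical helper, and it assumes the first `### Answer:` marker in the text is the one from the prompt, so that any follow-up question the model appends is ignored.

```python
def extract_answer(decoded: str, marker: str = "### Answer:") -> str:
    """Return the first non-empty line that follows the first answer marker."""
    # partition splits on the first occurrence, i.e. the marker from the prompt.
    _, found, tail = decoded.partition(marker)
    if not found:
        return ""  # marker missing: nothing to extract
    for line in tail.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Illustrative string only, not real model output.
sample = (
    "### Question: 2 + 2 = ?\n"
    "### Explanation: Basic arithmetic.\n"
    "### Answer: 4\n"
    "### Question: an unrequested follow-up"
)
print(extract_answer(sample))
```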

---

## Training Details

### Training Data

* **SFT stage:** Balanced MCQA mix of MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
* **DPO stage:** Human preference pairs (EPFL exams plus public feedback datasets such as HelpSteer).
* **Schema:** Unified “### Question / ### Explanation / ### Answer” format.
* **Filtering:** Examples of at most 512 tokens, with balanced sample caps (~20k per dataset).
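
The filtering step can be pictured as a length filter followed by a per-dataset cap. The sketch below is an illustration under stated assumptions, not the project's preprocessing code: `filter_and_cap` is a hypothetical name, uniform random subsampling is assumed (the card does not say how the cap was applied), and whitespace splitting stands in for the real tokenizer.

```python
import random
from typing import Callable, Dict, List

MAX_TOKENS = 512          # length limit stated on the card
CAP_PER_DATASET = 20_000  # approximate per-dataset cap stated on the card


def filter_and_cap(
    datasets: Dict[str, List[str]],
    count_tokens: Callable[[str], int],
    max_tokens: int = MAX_TOKENS,
    cap: int = CAP_PER_DATASET,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Drop over-long examples, then subsample each dataset down to `cap`."""
    rng = random.Random(seed)
    result = {}
    for name, examples in datasets.items():
        kept = [ex for ex in examples if count_tokens(ex) <= max_tokens]
        if len(kept) > cap:
            kept = rng.sample(kept, cap)  # assumption: uniform random subsampling
        result[name] = kept
    return result


# Toy data; whitespace splitting stands in for the real tokenizer.
toy = {
    "a": ["short question"] * 5 + ["word " * 600],
    "b": ["another short one"] * 2,
}
balanced = filter_and_cap(toy, lambda s: len(s.split()), cap=3)
print({name: len(rows) for name, rows in balanced.items()})  # {'a': 3, 'b': 2}
```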

### Training Procedure

* **Pipeline:** SFT → DPO (M3 configuration).
* **LoRA parameters:** rank = 16, α = 16, dropout = 0.05.
* **Batch sizes:** SFT = 4; DPO = 1.
* **Learning rates:** 1e-5 (public datasets) / 1e-4 (EPFL data).
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face Transformers + TRL + PEFT (LoRA).
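
To make the "cosine with warmup" scheduler concrete, the sketch below computes the learning rate at a given step: a linear ramp up to the peak, then a half-cosine decay. It is written in plain Python for illustration and is not the training code. The peak rate of 1e-5 comes from the list above, while the warmup length, the total step count, and the decay to exactly zero are assumptions, since the card does not state them.

```python
import math


def cosine_with_warmup(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Learning rate at `step`: linear warmup to `peak_lr`, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# peak_lr is from the card; the step counts are made up for illustration.
for s in (0, 50, 100, 550, 1000):
    print(s, cosine_with_warmup(s, total_steps=1000, warmup_steps=100, peak_lr=1e-5))
```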

---

## Evaluation Summary

* **Configuration:** *M3 Base (Alt)* is the unquantized reference model for the quantized 8-bit variant.
* **Performance:** The balanced dataset improves cross-domain consistency; DPO improves answer formatting and style alignment.
* **Accuracy:** Similar to the quantized model (~0.61 MMLU avg.), slightly higher on reasoning subtasks.
* **Use case:** Experimentation, evaluation, or further domain-specific fine-tuning.
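
The card does not describe the scoring script behind the ~0.61 figure. As a rough illustration of how an MCQA accuracy of this kind is typically computed, the sketch below uses exact match on the answer label. `mcqa_accuracy` is a hypothetical helper, and the case- and whitespace-insensitive comparison is an assumption.

```python
from typing import Sequence


def mcqa_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions matching the gold labels, ignoring case and padding."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy labels, not results from this model.
score = mcqa_accuracy(["A", "c ", "B", "D"], ["A", "C", "D", "D"])
print(score)  # 0.75
```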

---

## Technical Specifications

* **Architecture:** Phi-2 (~2.78B parameters), decoder-only transformer.
* **Objective:** SFT next-token prediction + DPO preference alignment.
* **Precision:** Unquantized weights (fp16/bf16); no 8-bit quantization.
* **Software:** Hugging Face Transformers, TRL, PEFT.
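
For readers unfamiliar with the DPO part of the objective, the sketch below evaluates the standard DPO loss for a single preference pair from sequence log-probabilities under the policy and a frozen reference model. It is a plain-Python illustration, not the TRL implementation used in training, and `beta = 0.1` is an assumed value because the card does not report the one used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumption: not stated on the card
) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * reward margin))."""
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Made-up log-probabilities: the policy prefers the chosen answer more than the reference does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```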

---

## Glossary

* **MCQA:** Multiple-Choice Question Answering
* **SFT:** Supervised Fine-Tuning
* **DPO:** Direct Preference Optimization
* **LoRA:** Low-Rank Adaptation
* **Alt (Alternative):** Internal naming for the alternate full-precision checkpoint variant of M3