---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---
# Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt
A **LoRA-finetuned** and **Direct Preference Optimization (DPO)**–aligned variant of **microsoft/phi-2**, specialized for **multiple-choice question answering (MCQA)** with an emphasis on **STEM and general knowledge** domains.
This model is the *alternative base configuration* of the final **M3 (balanced-then-DPO)** training pipeline from the *ShAIkespear* project. It keeps full-precision weights (no 8-bit quantization), making it the preferred checkpoint for evaluation and further fine-tuning.
---
## Model Details
* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal LM (Phi-2) with LoRA adapters; DPO-aligned
* **Languages:** English
* **License:** MIT
* **Finetuned from:** microsoft/phi-2
### Model Sources
* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** *“ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”*
---
## Uses
### Direct Use
* MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
* Alignment research — comparison between DPO training setups (Base vs. Quantized).
* As a **high-fidelity reference checkpoint** for quantized and downstream variants.
### Out-of-Scope Use
* High-stakes or safety-critical applications (medical, legal, policy).
* Generative tasks outside multiple-choice reasoning.
* Misuse for automated exam solving or for leaking confidential data.
---
## Bias, Risks, and Limitations
* **Domain bias:** Stronger on factual MCQA, weaker on advanced reasoning tasks.
* **Answer drift:** May produce verbose or follow-up answers when the prompt lacks explicit formatting.
* **Data source risks:** EPFL-derived preferences may encode narrow style biases.
### Recommendations
* Maintain the structured prompt format:
```
### Question ...
### Explanation ...
### Answer:
```
* Keep human supervision in any educational or grading use.
* Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.
---
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Full-precision checkpoint; fp16 halves GPU memory with little quality loss.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Follow the "### Question / ### Explanation / ### Answer" schema used in training.
prompt = (
    "### Question: Which element has the chemical symbol 'O'?\n"
    "### Explanation: The symbol 'O' represents this essential gas.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```
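If generations drift past the answer line (see the limitation above), one option is to treat the newline token as an end-of-sequence marker. This continues from the snippet above, reusing its `tok`, `model`, and `inputs`; the newline-stop trick is a suggestion, not part of the released model's interface.
```python
# Stop at the first newline so only the answer line is generated
# (mitigates the "answer drift" noted under Limitations).
newline_id = tok.encode("\n")[0]
out = model.generate(**inputs, max_new_tokens=15, eos_token_id=newline_id)
print(tok.decode(out[0], skip_special_tokens=True))
```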
---
## Training Details
### Training Data
* **SFT stage:** Balanced MCQA mix — MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
* **DPO stage:** Human preference pairs (EPFL exams + public feedback datasets like HelpSteer).
* **Schema:** Unified “### Question / ### Explanation / ### Answer” format.
* **Filtering:** ≤512 tokens, balanced sample caps (~20k per dataset).
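For illustration, the schema and length filter can be reproduced with a short preprocessing helper. This is a sketch, not the project's actual pipeline; the field names (`question`, `explanation`, `answer`) and helper names are assumptions.
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2", use_fast=True)

def format_example(ex: dict) -> str:
    # Unified "### Question / ### Explanation / ### Answer" schema
    # (field names here are assumed, not taken from the released datasets).
    return (
        f"### Question: {ex['question']}\n"
        f"### Explanation: {ex['explanation']}\n"
        f"### Answer: {ex['answer']}"
    )

def keep(ex: dict, max_tokens: int = 512) -> bool:
    # Drop samples longer than 512 tokens, per the filtering rule above.
    return len(tok(format_example(ex))["input_ids"]) <= max_tokens
```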
### Training Procedure
* **Pipeline:** SFT → DPO (M3 configuration).
* **LoRA parameters:** rank = 16, α = 16, dropout = 0.05.
* **Batch sizes:** SFT = 4; DPO = 1.
* **Learning rates:** 1e-5 (public) / 1e-4 (EPFL).
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face Transformers + TRL + PEFT (LoRA).
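A minimal configuration sketch of the hyperparameters above, using PEFT and TRL. Exact trainer arguments differ across TRL versions, and the `target_modules`, `warmup_ratio`, and `beta` values are assumptions not stated in this card.
```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA adapters as described above (rank 16, alpha 16, dropout 0.05).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed for Phi-2
)

# DPO-stage settings; warmup_ratio and beta are illustrative defaults.
dpo_config = DPOConfig(
    per_device_train_batch_size=1,  # DPO batch size from the card
    learning_rate=1e-5,             # 1e-5 (public) / 1e-4 (EPFL)
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # assumed
    beta=0.1,                       # assumed DPO temperature
)
```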
---
## Evaluation Summary
* **Configuration:** *M3 Base (Alt)* is the unquantized reference for the 8-bit quantized variant.
* **Performance:** The balanced training mix improves cross-domain consistency; DPO improves answer formatting and style alignment.
* **Accuracy:** Comparable to the quantized model (~0.61 MMLU avg.), with slightly higher scores on reasoning subtasks.
* **Use case:** For experimentation, evaluation, or further domain-specific fine-tuning.
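MCQA accuracy of the kind reported above can be scored by parsing the option letter that follows `### Answer:` in each generation. The helper below is a hypothetical sketch, not the project's evaluation harness; `extract_choice` and the toy data are assumptions.
```python
import re

def extract_choice(generation: str) -> str | None:
    # Grab the first A-D option letter after the "### Answer:" marker.
    m = re.search(r"### Answer:\s*\(?([A-D])\b", generation)
    return m.group(1) if m else None

# Toy data: two model generations scored against gold labels.
generations = ["### Question: ...\n### Answer: B", "### Answer: (C) because ..."]
gold = ["B", "D"]
preds = [extract_choice(g) for g in generations]
print(preds, sum(p == g for p, g in zip(preds, gold)) / len(gold))  # ['B', 'C'] 0.5
```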
---
## Technical Specifications
* **Architecture:** Phi-2 (~2.78B parameters), decoder-only transformer.
* **Objective:** SFT next-token prediction + DPO preference alignment.
* **Precision:** Full precision (fp16/bf16).
* **Software:** Hugging Face Transformers, TRL, PEFT.
---
## Glossary
* **MCQA:** Multiple-Choice Question Answering
* **SFT:** Supervised Finetuning
* **DPO:** Direct Preference Optimization
* **LoRA:** Low-Rank Adaptation
* **Alt (Alternative):** Internal naming for the alternate full-precision checkpoint variant of M3