---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---
# Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt
A **LoRA-finetuned** and **Direct Preference Optimization (DPO)**–aligned variant of **microsoft/phi-2**, specialized for **multiple-choice question answering (MCQA)** with an emphasis on **STEM and general knowledge** domains.
This model is the *alternative base configuration* of the final **M3 (balanced-then-DPO)** training pipeline from the *ShAIkespear* project. It keeps full-precision weights (no 8-bit quantization), making it the preferred checkpoint for evaluation and further fine-tuning.
---
## Model Details
* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal LM (Phi-2) with LoRA adapters; DPO-aligned
* **Languages:** English
* **License:** MIT
* **Finetuned from:** microsoft/phi-2
### Model Sources
* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** *“ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”*
---
## Uses
### Direct Use
* MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
* Alignment research — comparison between DPO training setups (Base vs. Quantized).
* As a **high-fidelity reference checkpoint** for quantized and downstream variants.
### Out-of-Scope Use
* High-stakes or safety-critical applications (medical, legal, policy).
* Generative tasks outside multiple-choice reasoning.
* Misuse for automated exam solving or for leaking confidential data.
---
## Bias, Risks, and Limitations
* **Domain bias:** Stronger on factual MCQA, weaker on advanced reasoning tasks.
* **Answer drift:** May produce verbose or follow-up answers when the prompt lacks explicit formatting.
* **Data source risks:** EPFL-derived preferences may encode narrow style biases.
### Recommendations
* Maintain the structured prompt format:
```
### Question ...
### Explanation ...
### Answer:
```
* Keep human supervision in any educational or grading use.
* Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.
---
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Full-precision checkpoint; fp16 halves GPU memory with little quality loss.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Follow the "### Question / ### Explanation / ### Answer" schema used in training.
prompt = (
    "### Question: Which element has the chemical symbol 'O'?\n"
    "### Explanation: The symbol 'O' represents this essential gas.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```
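If generations drift past the answer line (see the limitation above), one option is to treat the newline token as an end-of-sequence marker. This continues from the snippet above, reusing its `tok`, `model`, and `inputs`; the newline-stop trick is a suggestion, not part of the released model's interface.
```python
# Stop at the first newline so only the answer line is generated
# (mitigates the "answer drift" noted under Limitations).
newline_id = tok.encode("\n")[0]
out = model.generate(**inputs, max_new_tokens=15, eos_token_id=newline_id)
print(tok.decode(out[0], skip_special_tokens=True))
```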
---
## Training Details
### Training Data
* **SFT stage:** Balanced MCQA mix — MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
* **DPO stage:** Human preference pairs (EPFL exams + public feedback datasets like HelpSteer).
* **Schema:** Unified “### Question / ### Explanation / ### Answer” format.
* **Filtering:** ≤512 tokens, balanced sample caps (~20k per dataset).
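For illustration, the schema and length filter can be reproduced with a short preprocessing helper. This is a sketch, not the project's actual pipeline; the field names (`question`, `explanation`, `answer`) and helper names are assumptions.
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2", use_fast=True)

def format_example(ex: dict) -> str:
    # Unified "### Question / ### Explanation / ### Answer" schema
    # (field names here are assumed, not taken from the released datasets).
    return (
        f"### Question: {ex['question']}\n"
        f"### Explanation: {ex['explanation']}\n"
        f"### Answer: {ex['answer']}"
    )

def keep(ex: dict, max_tokens: int = 512) -> bool:
    # Drop samples longer than 512 tokens, per the filtering rule above.
    return len(tok(format_example(ex))["input_ids"]) <= max_tokens
```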
### Training Procedure
* **Pipeline:** SFT → DPO (M3 configuration).
* **LoRA parameters:** rank = 16, α = 16, dropout = 0.05.
* **Batch sizes:** SFT = 4; DPO = 1.
* **Learning rates:** 1e-5 (public) / 1e-4 (EPFL).
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face Transformers + TRL + PEFT (LoRA).
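A minimal configuration sketch of the hyperparameters above, using PEFT and TRL. Exact trainer arguments differ across TRL versions, and the `target_modules`, `warmup_ratio`, and `beta` values are assumptions not stated in this card.
```python
from peft import LoraConfig
from trl import DPOConfig

# LoRA adapters as described above (rank 16, alpha 16, dropout 0.05).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed for Phi-2
)

# DPO-stage settings; warmup_ratio and beta are illustrative defaults.
dpo_config = DPOConfig(
    per_device_train_batch_size=1,  # DPO batch size from the card
    learning_rate=1e-5,             # 1e-5 (public) / 1e-4 (EPFL)
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,               # assumed
    beta=0.1,                       # assumed DPO temperature
)
```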
---
## Evaluation Summary
* **Configuration:** *M3 Base (Alt)* is the unquantized reference for the 8-bit quantized variant.
* **Performance:** The balanced training mix improves cross-domain consistency; DPO improves answer formatting and style alignment.
* **Accuracy:** Comparable to the quantized model (~0.61 MMLU avg.), with slightly higher scores on reasoning subtasks.
* **Use case:** For experimentation, evaluation, or further domain-specific fine-tuning.
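MCQA accuracy of the kind reported above can be scored by parsing the option letter that follows `### Answer:` in each generation. The helper below is a hypothetical sketch, not the project's evaluation harness; `extract_choice` and the toy data are assumptions.
```python
import re

def extract_choice(generation: str) -> str | None:
    # Grab the first A-D option letter after the "### Answer:" marker.
    m = re.search(r"### Answer:\s*\(?([A-D])\b", generation)
    return m.group(1) if m else None

# Toy data: two model generations scored against gold labels.
generations = ["### Question: ...\n### Answer: B", "### Answer: (C) because ..."]
gold = ["B", "D"]
preds = [extract_choice(g) for g in generations]
print(preds, sum(p == g for p, g in zip(preds, gold)) / len(gold))  # ['B', 'C'] 0.5
```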
---
## Technical Specifications
* **Architecture:** Phi-2 (~2.78B parameters), decoder-only transformer.
* **Objective:** SFT next-token prediction + DPO preference alignment.
* **Precision:** Full precision (fp16/bf16).
* **Software:** Hugging Face Transformers, TRL, PEFT.
---
## Glossary
* **MCQA:** Multiple-Choice Question Answering
* **SFT:** Supervised Finetuning
* **DPO:** Direct Preference Optimization
* **LoRA:** Low-Rank Adaptation
* **Alt (Alternative):** Internal naming for the alternate full-precision checkpoint variant of M3