---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---

# Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt

A **LoRA-finetuned** and **Direct Preference Optimization (DPO)**–aligned variant of **microsoft/phi-2**, specialized for **multiple-choice question answering (MCQA)** with an emphasis on **STEM and general knowledge** domains.
This model is the *alternative base configuration* of the final **M3 (balanced-then-DPO)** training pipeline from the *ShAIkespear* project. Unlike its 8-bit quantized counterpart, it keeps full precision, making it the preferred checkpoint for evaluation and further fine-tuning.

---

## Model Details

* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal LM (Phi-2) with LoRA adapters; DPO-aligned
* **Languages:** English
* **License:** MIT
* **Finetuned from:** microsoft/phi-2

### Model Sources

* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** *“ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”*

---

## Uses

### Direct Use

* MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
* Alignment research — comparison between DPO training setups (Base vs. Quantized).
* As a **high-fidelity reference checkpoint** for quantized and downstream variants.

### Out-of-Scope Use

* High-stakes or safety-critical applications (medical, legal, policy).
* Generative tasks outside multiple-choice reasoning.
* Automated exam solving or extraction of confidential data.

---

## Bias, Risks, and Limitations

* **Domain bias:** Stronger on factual MCQA, weaker on advanced reasoning tasks.
* **Answer drift:** May occasionally produce verbose or follow-up answers when the prompt lacks the explicit answer format.
* **Data source risks:** EPFL-derived preferences may encode narrow style biases.

### Recommendations

* Maintain the structured prompt format (a minimal helper is sketched after this list):

  ```
  ### Question ...
  ### Explanation ...
  ### Answer:
  ```
* Keep human supervision in any educational or grading use.
* Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.
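
For programmatic use, a small helper along these lines keeps prompts consistent with the schema above (the function name and defaults are illustrative, not part of the released code):

```python
# Illustrative helper; not part of the released repository.
def build_mcqa_prompt(question: str, explanation: str = "") -> str:
    """Assemble the structured MCQA prompt the model expects."""
    return (
        f"### Question: {question}\n"
        f"### Explanation: {explanation}\n"
        "### Answer:"
    )
```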

---

## How to Get Started

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
# Half precision matches the card's fp16/bf16 spec;
# device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Use the structured MCQA prompt format the model was trained on.
prompt = (
    "### Question: Which element has the chemical symbol 'O'?\n"
    "### Explanation: The symbol 'O' represents this essential gas.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15)
print(tok.decode(out[0], skip_special_tokens=True))
```

---

## Training Details

### Training Data

* **SFT stage:** Balanced MCQA mix — MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
* **DPO stage:** Human preference pairs (EPFL exams + public feedback datasets like HelpSteer).
* **Schema:** Unified “### Question / ### Explanation / ### Answer” format.
* **Filtering:** ≤512 tokens, balanced sample caps (~20k per dataset); a sketch of both rules follows this list.
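
As a rough illustration of the reported filtering rules (the sample fields, tokenizer choice, and function below are assumptions, not the project's actual preprocessing code):

```python
from collections import defaultdict
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")

def filter_and_cap(samples, max_tokens=512, cap_per_dataset=20_000):
    """Drop over-length samples and cap each source dataset, as reported."""
    counts, kept = defaultdict(int), []
    for s in samples:  # assumed fields: {"text": ..., "dataset": ...}
        if len(tok(s["text"])["input_ids"]) > max_tokens:
            continue
        if counts[s["dataset"]] >= cap_per_dataset:
            continue
        counts[s["dataset"]] += 1
        kept.append(s)
    return kept
```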

### Training Procedure

* **Pipeline:** SFT → DPO (M3 configuration); a sketch of the DPO stage follows this list.
* **LoRA parameters:** rank = 16, α = 16, dropout = 0.05.
* **Batch sizes:** SFT = 4; DPO = 1.
* **Learning rates:** 1e-5 (public datasets) / 1e-4 (EPFL data).
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face Transformers + TRL + PEFT (LoRA).
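
A minimal sketch of how these settings map onto TRL and PEFT, assuming a recent TRL release (`DPOConfig` and the `processing_class` argument) and a placeholder preference dataset; this is not the project's actual training script:

```python
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "microsoft/phi-2"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA as reported: rank 16, alpha 16, dropout 0.05.
# Target modules are left to PEFT's per-architecture defaults.
peft_cfg = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.05,
                      task_type="CAUSAL_LM")

# Placeholder preference pairs in TRL's prompt/chosen/rejected format.
pairs = Dataset.from_dict({
    "prompt":   ["### Question: ...\n### Answer:"],
    "chosen":   [" B"],
    "rejected": [" D"],
})

args = DPOConfig(
    output_dir="phi2-dpo-m3",
    per_device_train_batch_size=1,   # DPO batch size = 1, as reported
    learning_rate=1e-5,              # 1e-4 was used for the EPFL data
    lr_scheduler_type="cosine",      # cosine schedule with warmup
    warmup_ratio=0.1,                # warmup fraction not reported; assumed
)

trainer = DPOTrainer(model=model, args=args, train_dataset=pairs,
                     processing_class=tok, peft_config=peft_cfg)
trainer.train()
```

The preceding SFT stage would presumably reuse the same LoRA configuration with batch size 4.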

---

## Evaluation Summary

* **Configuration:** *M3 Base (Alt)* is the unquantized reference model for the quantized 8-bit variant.
* **Performance:** Balanced dataset improves cross-domain consistency; DPO enhances answer formatting and style alignment.
* **Accuracy:** Similar to the quantized model (~0.61 MMLU avg.), slightly higher on reasoning subtasks.
* **Use case:** For experimentation, evaluation, or further domain-specific fine-tuning.

---

## Technical Specifications

* **Architecture:** Phi-2 (~2.78B parameters), decoder-only transformer.
* **Objective:** SFT next-token prediction + DPO preference alignment.
* **Precision:** Full precision (fp16/bf16).
* **Software:** Hugging Face Transformers, TRL, PEFT.

---

## Glossary

* **MCQA:** Multiple-Choice Question Answering
* **SFT:** Supervised Finetuning
* **DPO:** Direct Preference Optimization
* **LoRA:** Low-Rank Adaptation
* **Alt (Alternative):** Internal naming for the alternate full-precision checkpoint variant of M3