---
library_name: transformers
license: mit
language:
- en
base_model:
- microsoft/phi-2
---
# Model Card for ShAIkespear/Phi-2_DPO_M3_Base_Alt
A **LoRA-finetuned** and **Direct Preference Optimization (DPO)**–aligned variant of **microsoft/phi-2**, specialized for **multiple-choice question answering (MCQA)** with an emphasis on **STEM and general knowledge** domains.
This model is the *alternative base configuration* of the final **M3 (balanced-then-DPO)** training pipeline from the *ShAIkespear* project. It is kept in full precision (no 8-bit quantization) for highest fidelity and for further fine-tuning.
---
## Model Details
* **Developed by:** ShAIkespear team
* **Shared by:** ShAIkespear team
* **Model type:** Causal LM (Phi-2) with LoRA adapters; DPO-aligned
* **Languages:** English
* **License:** MIT
* **Finetuned from:** microsoft/phi-2
### Model Sources
* **Repository:** [2.8B-Phi-2-LLM-QA](https://github.com/EricSaikali/2.8B-Phi-2-LLM-QA)
* **Report:** *“ShAIkespear – How to replace TAs: A comprehensive study on letting LLMs answer your questions”*
---
## Uses
### Direct Use
* MCQA and educational Q&A (MMLU, OpenBookQA, ScienceQA).
* Alignment research comparing DPO training setups (base vs. quantized).
* As a **high-fidelity reference checkpoint** for quantized and downstream variants.
### Out-of-Scope Use
* High-stakes or safety-critical applications (medical, legal, policy).
* Generative tasks outside multiple-choice reasoning.
* Automated exam-solving, or any use that could leak confidential data.
---
## Bias, Risks, and Limitations
* **Domain bias:** Stronger on factual MCQA, weaker on advanced reasoning tasks.
* **Answer drift:** May produce verbose or trailing follow-up text when prompts omit the explicit answer format.
* **Data source risks:** EPFL-derived preferences may encode narrow style biases.
### Recommendations
* Maintain the structured prompt format (a minimal builder sketch follows this list):
```
### Question ...
### Explanation ...
### Answer:
```
* Keep human supervision in any educational or grading use.
* Prefer this full-precision model for fine-tuning or evaluation; use quantized versions for deployment.
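As a reference, here is a minimal prompt-builder sketch. `build_mcqa_prompt` is a hypothetical helper (not part of the released code), and the lettering of answer options is an assumption, not confirmed by this card:

```python
# Hypothetical helper illustrating the expected prompt schema;
# option lettering ("A) ...") is an assumption, not confirmed by the card.
def build_mcqa_prompt(question: str, choices: list[str], explanation: str = "") -> str:
    lettered = "\n".join(f"{chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        f"### Question: {question}\n{lettered}\n"
        f"### Explanation: {explanation}\n"
        "### Answer:"
    )

print(build_mcqa_prompt(
    "Which element has the chemical symbol 'O'?",
    ["Osmium", "Oxygen", "Gold", "Oganesson"],
))
```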
---
## How to Get Started
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "ShAIkespear/Phi-2_DPO_M3_Base_Alt"

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # full-precision checkpoint; fp16 is fine for inference
    device_map="auto",
)

# Use the same "### Question / ### Explanation / ### Answer" schema as training.
prompt = (
    "### Question: Which element has the chemical symbol 'O'?\n"
    "### Explanation: The symbol 'O' represents this essential gas.\n"
    "### Answer:"
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=15, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```
---
## Training Details
### Training Data
* **SFT stage:** Balanced MCQA mix — MathQA, OpenBookQA, ScienceQA, TAL-SCQ5K, and EPFL question sets.
* **DPO stage:** Human preference pairs (EPFL exams + public feedback datasets like HelpSteer).
* **Schema:** Unified “### Question / ### Explanation / ### Answer” format (illustrated in the sketch after this list).
* **Filtering:** ≤512 tokens, balanced sample caps (~20k per dataset).
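The schema and the length filter can be illustrated with a short sketch; the dataset field names (`question`, `explanation`, `answer`) are assumptions, not the project's actual column names:

```python
# Sketch of the unified schema plus the 512-token cap; the field names
# ("question", "explanation", "answer") are assumptions.
def to_schema(ex: dict) -> str:
    return (
        f"### Question: {ex['question']}\n"
        f"### Explanation: {ex['explanation']}\n"
        f"### Answer: {ex['answer']}"
    )

def keep(ex: dict, tok, max_tokens: int = 512) -> bool:
    # Drop any example whose rendered text exceeds the token budget.
    return len(tok(to_schema(ex))["input_ids"]) <= max_tokens
```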
### Training Procedure
* **Pipeline:** SFT → DPO (M3 configuration); a configuration sketch follows this list.
* **LoRA parameters:** rank = 16, α = 16, dropout = 0.05.
* **Batch sizes:** SFT = 4; DPO = 1.
* **Learning rates:** 1e-5 (public) / 1e-4 (EPFL).
* **Scheduler:** Cosine with warmup.
* **Frameworks:** Hugging Face Transformers + TRL + PEFT (LoRA).
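A minimal sketch of the stated hyperparameters in PEFT/TRL terms. `target_modules` and `beta` are assumptions (the card does not state them), and TRL argument names vary across versions:

```python
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

# LoRA settings from this card; target_modules is an assumption for Phi-2.
peft_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed
)

# DPO-stage settings from this card; beta=0.1 is the TRL default,
# not confirmed by the card.
args = DPOConfig(
    output_dir="phi2-dpo-m3",
    per_device_train_batch_size=1,
    learning_rate=1e-5,   # 1e-5 for public data; 1e-4 was used for EPFL data
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,     # warmup is stated; the ratio is an assumption
    beta=0.1,
)

# In recent TRL versions (older ones use tokenizer= instead of processing_class=):
# trainer = DPOTrainer(model, args=args, train_dataset=prefs,
#                      processing_class=tok, peft_config=peft_cfg)
```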
---
## Evaluation Summary
* **Configuration:** *M3 Base (Alt)* is the unquantized reference checkpoint for the 8-bit quantized variant.
* **Performance:** Balanced dataset improves cross-domain consistency; DPO enhances answer formatting and style alignment.
* **Accuracy:** Comparable to the quantized model (~0.61 MMLU average), slightly higher on reasoning subtasks.
* **Use case:** Experimentation, evaluation, or further domain-specific fine-tuning; a generic MCQA scoring sketch follows this list.
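For context, a common way to score MCQA checkpoints is to compare per-option log-likelihoods. The sketch below is generic and not necessarily the project's evaluation harness:

```python
import torch

@torch.no_grad()
def score_option(prompt: str, option: str, model, tok) -> float:
    """Sum of log-probabilities of the option tokens given the prompt."""
    # Note: tokenizing prompt and option together is approximate at the
    # boundary; adequate for a sketch.
    full = tok(prompt + " " + option, return_tensors="pt").to(model.device)
    n_prompt = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    ids = full["input_ids"][0]
    logp = torch.log_softmax(model(**full).logits[0, :-1], dim=-1)
    # The token at position i is predicted by the logits at position i - 1.
    idx = torch.arange(n_prompt - 1, ids.shape[0] - 1, device=ids.device)
    return logp[idx, ids[n_prompt:]].sum().item()

# Pick the option the model assigns the highest likelihood:
# best = max(options, key=lambda o: score_option(prompt, o, model, tok))
```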
---
## Technical Specifications
* **Architecture:** Phi-2 (~2.78B parameters), decoder-only transformer.
* **Objective:** SFT next-token prediction + DPO preference alignment.
* **Precision:** Full precision (fp16/bf16).
* **Software:** Hugging Face Transformers, TRL, PEFT.
---
## Glossary
* **MCQA:** Multiple-Choice Question Answering
* **SFT:** Supervised Finetuning
* **DPO:** Direct Preference Optimization
* **LoRA:** Low-Rank Adaptation
* **Alt (Alternative):** Internal name for the alternate full-precision checkpoint variant of M3