About and Licence
README.md
CHANGED
@@ -1,3 +1,211 @@
---
license: apache-2.0
base_model: microsoft/phi-2
pipeline_tag: text-generation
datasets:
- yahma/alpaca-cleaned
- rajpurkar/squad_v2
language:
- en
tags:
- phi-2
- qlora
- chat
- chatml
- conversational
- english
- instruction-following
- nlp
- text-generation
- alpaca
- squad
- bitsandbytes
- fastapi
- peft
- transformers
- adhafajp
- zero-ai
---

# Chat Model "Zero" (Phi-2 2.7B + QLoRA Adapter)

This repository contains the **QLoRA adapter** for **"Zero"**, an instruction-following AI assistant fine-tuned from [`microsoft/phi-2`](https://huggingface.co/microsoft/phi-2).

The adapter is the core component of a **full-stack MLOps portfolio project** that covers the path from **fine-tuning** to **production-ready deployment**.

- **Model Adapter:** [adhafajp/phi2-qlora-zero-chat](https://huggingface.co/adhafajp/phi2-qlora-zero-chat)
- **Full FastAPI Project (Main Portfolio):** [GitHub: ZeroChat](https://github.com/adhafajp/ZeroChat)

---

## Project Overview

**Zero** is a fast, memory-efficient conversational model tuned for reasoning, instruction-following, and question-answering tasks.

### Key Features
- **Fine-tuned with QLoRA:** efficient, low-resource adaptation of Phi-2.
- **Backend:** asynchronous **FastAPI** server with streaming responses.
- **Frontend:** interactive chat interface built with **HTML**, **TailwindCSS**, and **JavaScript**, receiving tokens via Server-Sent Events.
- **Deployment-ready:** lightweight and easy to containerize.
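
The streaming path described above (the server emits tokens, the browser consumes Server-Sent Events) can be sketched without any web framework. The snippet below is a minimal illustration and not the ZeroChat implementation: `format_sse` and `stream_tokens` are hypothetical helper names, and the `[DONE]` end marker is an assumed convention.

```python
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Events message.

    Each line of `data` gets its own `data:` field, and a blank line
    terminates the message, as the SSE wire format requires.
    """
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    return "\n".join(lines) + "\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE message per generated token, then an end marker."""
    for token in tokens:
        yield format_sse(token)
    # "[DONE]" is an assumed sentinel, not something the SSE format defines.
    yield format_sse("[DONE]", event="end")


for message in stream_tokens(["Hello", ",", " world"]):
    print(repr(message))
```

In a FastAPI app, a generator like `stream_tokens` would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`, and the browser would read it with `EventSource` or a streaming `fetch`.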

---

## Training Details

| Component | Description |
|-----------|-------------|
| **Base Model** | `microsoft/phi-2` |
| **Method** | QLoRA (Quantized LoRA fine-tuning) |
| **Language** | English only |
| **Precision** | 4-bit (NF4) |
| **Frameworks** | `transformers`, `peft`, `bitsandbytes`, `fastapi` |
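
To give a rough sense of why 4-bit NF4 loading matters for a 2.7B-parameter base model, the arithmetic below estimates weight storage only. It ignores quantization constants, activations, the KV cache, and the LoRA weights, and the parameter count is the rounded figure from the model name, so the results are order-of-magnitude estimates rather than measurements.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for model weights in GiB (weights only)."""
    return num_params * bits_per_param / 8 / 2**30


# Rounded parameter count, taken from the "2.7B" in the model name (assumption).
PHI2_PARAMS = 2.7e9

fp16_gib = weight_memory_gib(PHI2_PARAMS, 16)
nf4_gib = weight_memory_gib(PHI2_PARAMS, 4)

print(f"fp16 weights: ~{fp16_gib:.2f} GiB")
print(f"NF4 weights:  ~{nf4_gib:.2f} GiB")
print(f"reduction:    {fp16_gib / nf4_gib:.0f}x")
```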

### Dataset Composition
The adapter was trained on a curated blend of English datasets:
- **alpaca_cleaned:** general-purpose instruction-following
- **squad_v2:** question answering and reading comprehension
- **custom_persona (283 samples):** gives *Zero* its distinct assistant identity

---

## Integration Example

A complete **local deployment example** (FastAPI backend + chat frontend) is available in the main project repository:
[**GitHub: ZeroChat**](https://github.com/adhafajp/ZeroChat)

That repository shows how to integrate this adapter with:
- A FastAPI inference server (supports streaming responses)
- A lightweight HTML/Tailwind chat UI
- A simple local setup and environment configuration for experimentation or portfolio demonstration

---

## Training Phases Summary

Fine-tuning went through several experimental stages:

| Phase | Summary | Runtime |
|-------|---------|---------|
| **1A** | Initial fine-tune, canceled because validation results showed overfitting | 11h 50m |
| **1B** | Full 2-epoch fine-tune on Alpaca + SQuADv2 + persona (main baseline) | 5d 11h 50m |
| **1C** | Short re-train on a reduced subset (underfit) | 19h |
| **1D / 1D-A / 1E** | Refinement attempts with packing and oversampling | ~3d total |
| **1F** | Final adapter re-train starting from **1B** (expanded persona dataset, balanced oversampling) | 1d 5h |
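
Phase 1F mentions balanced oversampling of the small persona set. The training script is not part of this card, so the following is only a generic sketch of the idea: repeat the minority dataset until it makes up a chosen share of the mix. The function name, the 5% target, and the toy data are illustrative assumptions, not the values used for Zero.

```python
import math
import random
from typing import List, Sequence


def oversample_to_share(
    majority: Sequence[str],
    minority: Sequence[str],
    target_share: float,
    seed: int = 0,
) -> List[str]:
    """Repeat `minority` so it is at least `target_share` of the combined set."""
    if not minority:
        raise ValueError("minority must not be empty")
    if not 0 < target_share < 1:
        raise ValueError("target_share must be between 0 and 1")
    # Solve m / (m + len(majority)) >= target_share for the minority count m.
    needed = math.ceil(target_share * len(majority) / (1 - target_share))
    repeats = max(1, math.ceil(needed / len(minority)))
    mixed = list(majority) + list(minority) * repeats
    # Shuffle with a fixed seed so the mix is reproducible.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy data standing in for the instruction and persona datasets.
instructions = [f"instruction-{i}" for i in range(1000)]
persona = [f"persona-{i}" for i in range(10)]
mixed = oversample_to_share(instructions, persona, target_share=0.05)
share = sum(x.startswith("persona") for x in mixed) / len(mixed)
print(f"{len(mixed)} samples, persona share {share:.3f}")
```

Repeating whole copies of the small set keeps every persona example equally represented, at the cost of overshooting the target share slightly.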

The released adapter corresponds to **Phase 1F**, which achieved balanced performance across **instruction-following**, **reasoning**, and **identity consistency**.

W&B Log (Phase 1F):
[wandb.ai/VoidNova/phi-2-2.7B_qlora_alpaca-51.8k_identity-model-232_squadv2-15k/runs/bpju3d09](https://wandb.ai/VoidNova/phi-2-2.7B_qlora_alpaca-51.8k_identity-model-232_squadv2-15k/runs/bpju3d09?nw=nwuseradhafajp)

---

## How to Use

> **Note:** This is a **LoRA adapter**, not a full model.
> You must load the base model (`microsoft/phi-2`) and apply this adapter on top of it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

adapter_path = "adhafajp/phi2-qlora-zero-chat"
base_model_path = "microsoft/phi-2"

# 4-bit NF4 quantization with double quantization; compute runs in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

print(f"Loading base model from: {base_model_path}")
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# The tokenizer is loaded from the adapter repo, not from the base model
print(f"Loading tokenizer from: {adapter_path}")
tokenizer = AutoTokenizer.from_pretrained(
    adapter_path,
    trust_remote_code=True,
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Match the embedding matrix to the tokenizer's vocabulary before applying the adapter
base_model.resize_token_embeddings(len(tokenizer))

print(f"Applying QLoRA adapter from: {adapter_path}...")
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

print("Model is ready to use!")

# --- INFERENCE EXAMPLE ---

DEFAULT_SYSTEM = "You are Zero, a helpful assistant."
PROMPT_FORMAT = """<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""

instruction = "What is QLoRA and how does it work?"
prompt_text = PROMPT_FORMAT.format(
    system_prompt=DEFAULT_SYSTEM,
    instruction=instruction,
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
prompt_token_count = inputs["input_ids"].shape[1]

print(f"\nGenerating response for: '{instruction}'")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=768,
        repetition_penalty=1.1,
        do_sample=False,  # greedy decoding
        eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        pad_token_id=tokenizer.pad_token_id,
    )

# Keep only the newly generated tokens, dropping the prompt
generated_tokens = outputs[0][prompt_token_count:]
generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=False)

# Truncate at whichever stop marker appears first
cut_index = len(generated_text)
for stop_token in ["<|endoftext|>", "<|im_end|>"]:
    if stop_token in generated_text:
        cut_index = min(cut_index, generated_text.index(stop_token))

final_answer = generated_text[:cut_index].strip()

print(f"Model response:\n{final_answer}")
```
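
The example above hard-codes a single user turn. For a chat with history, the same ChatML layout can be generated from a message list. This is a small sketch that mirrors `PROMPT_FORMAT`; `build_chatml_prompt` is a hypothetical helper, and this card does not state whether the adapter was trained on multi-turn conversations, so longer histories may behave differently from single-turn prompts.

```python
from typing import Dict, List

DEFAULT_SYSTEM = "You are Zero, a helpful assistant."


def build_chatml_prompt(
    messages: List[Dict[str, str]],
    system_prompt: str = DEFAULT_SYSTEM,
) -> str:
    """Render a ChatML prompt that ends with an open assistant turn."""
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>\n"]
    for message in messages:
        role = message["role"]
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role!r}")
        content = message["content"]
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    # Leave the assistant turn open so the model writes the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Who are you?"}])
print(prompt)
```

For a single user message this produces the same string as `PROMPT_FORMAT.format(...)`, so it can be passed to the tokenizer in the same way.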

---

### Example Prompts
- "Who are you?"
- "How can I be successful?"

---

### Example with RAG Context
"CONTEXT:---Zinc is an essential mineral perceived by the public today as being of ''exceptional biologic and public health importance'', especially regarding prenatal and postnatal development. Zinc deficiency affects about two billion people in the developing world and is associated with many diseases. In children it causes growth retardation, delayed sexual maturation, infection susceptibility, and diarrhea. Enzymes with a zinc atom in the reactive center are widespread in biochemistry, such as alcohol dehydrogenase in humans. Consumption of excess zinc can cause ataxia, lethargy and copper deficiency.---QUESTION:How many people are affected by zinc deficiency?"
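
The RAG example packs retrieved text and the question into a single user message, delimited by `CONTEXT:---` and `---QUESTION:`. A small helper can assemble that string. `build_rag_instruction` is a hypothetical name, and joining several passages with a space is an assumption, since the card only shows a single passage.

```python
from typing import Sequence


def build_rag_instruction(passages: Sequence[str], question: str) -> str:
    """Format retrieved passages and a question in the CONTEXT/QUESTION layout."""
    if not passages:
        raise ValueError("at least one context passage is required")
    # Joining passages with a single space is an assumption for illustration.
    context = " ".join(p.strip() for p in passages)
    return f"CONTEXT:---{context}---QUESTION:{question.strip()}"


instruction = build_rag_instruction(
    ["Zinc deficiency affects about two billion people in the developing world."],
    "How many people are affected by zinc deficiency?",
)
print(instruction)
```

The resulting string goes into the user turn of the ChatML prompt, in place of a plain instruction.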

## Acknowledgements & Licenses

This project builds on several open-source contributions:

* **Base Model:** This work is a fine-tuned adapter for `microsoft/phi-2`. The `phi-2` model is licensed under the **MIT License**.
  * `Copyright (c) 2023 Microsoft`
* **Libraries:** This project uses `transformers` and `peft` by Hugging Face, `bitsandbytes`, and `torch` by PyTorch. These libraries are released under permissive open-source licenses such as Apache 2.0, MIT, and BSD.
* **This Adapter & Code:** The original code in this repository and the adapter weights are licensed under the **Apache 2.0 License**.