---
license: mit
base_model: microsoft/phi-2
pipeline_tag: text-generation
datasets:
- yahma/alpaca-cleaned
- rajpurkar/squad_v2
language:
- en
tags:
- phi-2
- qlora
- chat
- chatml
- conversational
- english
- instruction-following
- nlp
- text-generation
- alpaca
- squad
- bitsandbytes
- fastapi
- peft
- transformers
- adhafajp
- zero-ai
---

# 💬 Chat Model "Zero" (Phi-2 2.7B + QLoRA Adapter)

This repository contains the **QLoRA adapter** for creating **"Zero"**, a specialized instruction-following AI assistant fine-tuned from [`microsoft/phi-2`](https://huggingface.co/microsoft/phi-2).

This model serves as the **core component** of a full-stack **AI engineering and MLOps workflow project**, covering the complete lifecycle from **fine-tuning (with W&B tracking)** to **local inference and system integration** using FastAPI.

- 🧩 **Model Adapter:** [adhafajp/phi2-qlora-zero-chat](https://huggingface.co/adhafajp/phi2-qlora-zero-chat)
- βš™οΈ **Full FastAPI Project (Main Portfolio):** [GitHub – ZeroChat](https://github.com/adhafajp/ZeroChat)

---

## 🚀 Project Overview

**Zero** is designed as a lightweight, memory-efficient conversational model optimized for reasoning, instruction-following, and question-answering tasks.

### Key Features:
- 🧠 **Fine-tuned using QLoRA:** efficient, low-resource adaptation of Phi-2
- ⚙️ **Backend:** Asynchronous **FastAPI** inference server with streaming responses
- 💬 **Frontend:** Interactive chat interface built with **HTML**, **TailwindCSS**, and **JavaScript**, receiving streamed responses via Server-Sent Events
- 🔍 **Experiment tracking:** Integrated **Weights & Biases (W&B)** logging for training runs
- 🔐 **Local deployment-ready:** Lightweight, easily containerized for offline use

---

## 🧩 Training Details

| Component | Description |
|------------|-------------|
| **Base Model** | `microsoft/phi-2` |
| **Method** | QLoRA (Quantized LoRA Fine-Tuning) |
| **Language** | English only |
| **Precision** | 4-bit (NF4) |
| **Optimizer** | Paged AdamW 8-bit |
| **Frameworks** | `transformers`, `peft`, `bitsandbytes`, `fastapi` |

### Dataset Composition
The adapter was trained on a curated blend of English datasets:
- **alpaca_cleaned** → general-purpose instruction-following
- **squad_v2** → question answering and reading comprehension
- **custom_persona (283 samples)** → gives *Zero* its distinct assistant identity

---

## 🖥️ Training Hardware
Fine-tuning was performed entirely on a consumer-grade laptop:
- **Laptop:** Acer Nitro V15
- **GPU:** NVIDIA RTX 2050 Mobile (4 GB VRAM)
- **CPU:** Intel Core i5-13420H
- **RAM:** 16 GB
- **Quantization:** 4-bit NF4
- **Strategy:** Low VRAM setup using gradient accumulation, packing, and LoRA adapters

This demonstrates that Phi-2 can be fine-tuned effectively even on low-VRAM devices.

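As a rough illustration of why 4-bit quantization matters on a 4 GB GPU, the sketch below estimates the memory taken by the Phi-2 weights alone. It assumes the nominal 2.7B parameter count and ignores quantization constants, LoRA parameters, optimizer state, and activations, so real usage will be higher. The function name `weight_memory_gb` is illustrative only. Under these assumptions, the weights take about 5.4 GB in float16 and about 1.35 GB at 4 bits.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter count taken from the model name (assumption: exactly 2.7e9).
PHI2_PARAMS = 2.7e9

fp16_gb = weight_memory_gb(PHI2_PARAMS, 16)  # half-precision baseline
nf4_gb = weight_memory_gb(PHI2_PARAMS, 4)    # 4-bit quantized weights

print(f"float16 weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:   ~{nf4_gb:.2f} GB")
```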
---

## 🔧 Integration Example

A complete **local deployment example** (FastAPI backend + chat frontend) is available at the main project repository:
👉 [**GitHub – ZeroChat**](https://github.com/adhafajp/ZeroChat)

The ZeroChat repository demonstrates how to integrate this adapter with:
- 🔹 A FastAPI inference server (supports streaming responses)
- 🔹 A lightweight HTML/Tailwind chat UI
- 🔹 Simple local setup and environment configuration for experimentation or portfolio demonstration

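The streaming side of such an integration can be sketched with the standard library alone. This is a minimal illustration of Server-Sent Events framing, not the ZeroChat implementation: `fake_token_stream`, `sse_event`, and `sse_stream` are hypothetical names, the JSON payload shape is an assumption, and the token source is a stand-in for a real model generator. In FastAPI, an async generator like `sse_stream` is commonly served by passing it to `StreamingResponse` with `media_type="text/event-stream"`.

```python
import asyncio
import json
from typing import AsyncIterator, Iterable, List


async def fake_token_stream(tokens: Iterable[str]) -> AsyncIterator[str]:
    """Stand-in for a model's streaming generator (hypothetical)."""
    for token in tokens:
        await asyncio.sleep(0)  # hand control back to the event loop between tokens
        yield token


def sse_event(data: dict) -> str:
    """Frame one JSON payload as a Server-Sent Events message."""
    return f"data: {json.dumps(data)}\n\n"


async def sse_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token in an SSE message, then send a final 'done' marker."""
    async for token in tokens:
        yield sse_event({"token": token})
    yield sse_event({"done": True})


async def collect(stream: AsyncIterator[str]) -> List[str]:
    """Gather every chunk of a stream into a list (used here for the demo)."""
    return [chunk async for chunk in stream]


events = asyncio.run(collect(sse_stream(fake_token_stream(["Hello", ",", " world"]))))
print("".join(events))
```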
---

## 📈 Training Phases Summary

Fine-tuning was carried out as a multi-stage experiment.

#### Stage 1:
| Phase | Summary | Runtime |
|--------|----------|----------|
| **1A** | Initial fine-tune (canceled due to overfitting) | 11h 50m |
| **1B** | Full 2-epoch fine-tune on Alpaca + SQuADv2 + persona | 5d 11h 50m |
| **1C** | Small re-train (underfit) | 19h |
| **1D / 1D-A / 1E** | Refinement attempts with packing & oversampling | ~3d total |
| **1F** | Final adapter re-train from **1B** (expanded persona dataset, balanced oversampling) | 1d 5h |

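The oversampling mentioned in these phases can be illustrated with a small calculation. This is a sketch under stated assumptions, not the procedure actually used: the function `oversample_factor` and the 5% target share are hypothetical, and the dataset sizes other than the 283 persona samples are taken from the Phase 1F W&B project name (about 51.8k Alpaca and 15k SQuAD v2 examples). It finds the smallest whole number of repeats that lifts a small dataset to a chosen share of the training mix.

```python
import math


def oversample_factor(small: int, others: int, target_fraction: float) -> int:
    """Smallest repeat count k such that small*k / (small*k + others) >= target_fraction."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    # Solve small*k >= target_fraction * (small*k + others) for k.
    needed_samples = target_fraction * others / (1.0 - target_fraction)
    return max(1, math.ceil(needed_samples / small))


persona_size = 283             # persona sample count stated in this card
other_size = 51_800 + 15_000   # approximate sizes read from the W&B project name
target_share = 0.05            # hypothetical target share, for illustration only

repeats = oversample_factor(persona_size, other_size, target_share)
share = persona_size * repeats / (persona_size * repeats + other_size)
print(f"repeat persona set {repeats}x -> {share:.1%} of the mix")
```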
#### Stage 2:

After gathering insights from the initial experiments (1A-1F), fine-tuning was restarted from scratch on the base model. Applying those lessons, the new run achieved better and more balanced performance in just 1d 21h.
The adapter released in this repository is the result of this final, optimized training.

| Phase | Summary | Runtime |
|--------|----------|----------|
| **1** | Fine-tuned again from scratch (from the base model), applying the insights from the Stage 1 experiments. | 1d 21h |

📊 **W&B Log (Phase 1F):** [wandb.ai/VoidNova/.../](https://wandb.ai/VoidNova/phi-2-2.7B_qlora_alpaca-51.8k_identity-model-232_squadv2-15k/)

📊 **W&B Log (Final):** [wandb.ai/VoidNova/.../runs/rx5fih5v](https://wandb.ai/VoidNova/phi-2_qlora_ZeroChat/)

---

## 🧠 How to Use

> ⚠️ This is a **LoRA adapter**, not a full model.  
> You must load the base model (`microsoft/phi-2`) and apply this adapter on top of it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

adapter_path = "adhafajp/phi2-qlora-zero-chat"
base_model_path = "microsoft/phi-2"

# Quantization configuration: 4-bit NF4 weights with float16 compute
compute_dtype = torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)

print(f"Loading base model from: {base_model_path}")
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

print(f"Loading tokenizer from: {adapter_path}")
tokenizer = AutoTokenizer.from_pretrained(
    adapter_path,
    trust_remote_code=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

base_model.resize_token_embeddings(len(tokenizer))

print(f"Applying QLoRA adapter from: {adapter_path}...")
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

print("Model is ready to use!")

# --- INFERENCE EXAMPLE ---

DEFAULT_SYSTEM = "You are Zero, a helpful assistant."
PROMPT_FORMAT = """<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
"""

instruction = "What is QLoRA and how does it work?"
prompt_text = PROMPT_FORMAT.format(
    system_prompt=DEFAULT_SYSTEM,
    instruction=instruction
)

inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
prompt_token_count = inputs["input_ids"].shape[1]

print(f"\nGenerating response for: '{instruction}'")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=768,
        repetition_penalty=1.1,
        do_sample=False,
        eos_token_id=tokenizer.convert_tokens_to_ids("<|endoftext|>"),
        pad_token_id=tokenizer.pad_token_id,
    )

generated_tokens = outputs[0][prompt_token_count:]
generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=False)

cut_index = len(generated_text)
for stop_token in ["<|endoftext|>", "<|im_end|>"]:
    if stop_token in generated_text:
        cut_index = min(cut_index, generated_text.index(stop_token))

final_answer = generated_text[:cut_index].strip()

print(f"Model response:\n{final_answer}")
```
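For chat-style use, the single-turn template above can be generalized into a small prompt builder. This is a sketch, not part of the released code: `build_chatml_prompt` is a hypothetical helper, and it assumes that multi-turn conversations follow the same ChatML layout as `PROMPT_FORMAT`, which the card does not confirm. For a single user message it produces exactly the string that `PROMPT_FORMAT.format(...)` would.

```python
from typing import Dict, List

DEFAULT_SYSTEM = "You are Zero, a helpful assistant."


def build_chatml_prompt(
    messages: List[Dict[str, str]],
    system_prompt: str = DEFAULT_SYSTEM,
) -> str:
    """Render a system prompt plus a list of {'role', 'content'} messages as ChatML."""
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>\n"]
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt_text = build_chatml_prompt(
    [{"role": "user", "content": "What is QLoRA and how does it work?"}]
)
print(prompt_text)
```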
---

### 🪶 Example Prompts
- "Who are you?"
- "How can I be successful?"

---

### 🧠 Example with RAG Context
"CONTEXT:---Zinc is an essential mineral perceived by the public today as being of ''exceptional biologic and public health importance'', especially regarding prenatal and postnatal development. Zinc deficiency affects about two billion people in the developing world and is associated with many diseases. In children it causes growth retardation, delayed sexual maturation, infection susceptibility, and diarrhea. Enzymes with a zinc atom in the reactive center are widespread in biochemistry, such as alcohol dehydrogenase in humans. Consumption of excess zinc can cause ataxia, lethargy and copper deficiency.---QUESTION:How many people are affected by zinc deficiency?"

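A retrieval-augmented instruction in this layout can be assembled with a small helper. This is a sketch that assumes the `CONTEXT:---…---QUESTION:…` layout shown in the example above is the one the adapter expects; `build_rag_instruction` is a hypothetical name, and retrieving the context passage is out of scope here.

```python
def build_rag_instruction(context: str, question: str) -> str:
    """Combine a retrieved passage and a question into one instruction string."""
    # Layout copied from the example prompt in this card (assumed, not documented).
    return f"CONTEXT:---{context.strip()}---QUESTION:{question.strip()}"


context = (
    "Zinc deficiency affects about two billion people in the developing world "
    "and is associated with many diseases."
)
question = "How many people are affected by zinc deficiency?"

instruction = build_rag_instruction(context, question)
print(instruction)
```

The resulting string can then be passed as the `instruction` value when formatting the ChatML prompt shown in the usage section.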

## Acknowledgements & License

This project builds upon several outstanding open-source contributions:

* **Base Model:** This adapter is fine-tuned from [`microsoft/phi-2`](https://huggingface.co/microsoft/phi-2), licensed under the **MIT License**.  
  *Copyright (c) 2023 Microsoft.*

* **Libraries:** Powered by `transformers`, `peft`, and `bitsandbytes` from Hugging Face 🤗, as well as `torch` from PyTorch, all permissively licensed (Apache 2.0, MIT, or BSD-style).

* **This Adapter & Code:** Released under the **MIT License**.  
  You are free to use, modify, and distribute it with proper attribution.