File size: 6,112 Bytes

---
license: mit
base_model: microsoft/Phi-3-mini-4k-instruct
datasets:
- b-mc2/sql-create-context
tags:
- peft
- qlora
- text-to-sql
- phi-3
---

# Enhanced QLoRA Adapter for Phi-3-mini: A Technical SQL Assistant (2 Epochs)

This repository contains an improved, high-performance QLoRA adapter for the `microsoft/Phi-3-mini-4k-instruct` model.
This version has been fine-tuned for **two full epochs** on a Text-to-SQL task, resulting in enhanced performance and reliability compared to single-epoch versions.

The model is designed to function as a technical assistant, capable of generating accurate SQL queries from natural language questions based on a provided database schema.
This project was developed for an engineering and deployment course, with a focus on creating a robust, reproducible, and practical AI artifact.

## Key Improvements in This Version

-   **Enhanced Reliability:** Training for two epochs has significantly improved the model's ability to consistently adhere to the required chat template format, reducing parsing errors in production.
-   **Maintained Accuracy:** The model maintains its high accuracy in generating syntactically correct and logically sound SQL queries.
-   **Robust Loading:** The usage instructions below follow best practices to ensure reliable loading across different environments.

## How to Use

First, ensure you have a compatible environment by installing these specific library versions:
```bash
pip install transformers==4.38.2 peft==0.10.0 accelerate==0.28.0 bitsandbytes==0.43.0 torch
```

The following code provides the most robust method for loading and running inference with this adapter.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# --- 1. Configuration ---
base_model_id = "microsoft/Phi-3-mini-4k-instruct"
# IMPORTANT: Replace with your new model's ID on the Hugging Face Hub
adapter_id = "YourUsername/YourNewModelName" 

# --- 2. Load the Quantized Base Model ---
# This is required to fit the model in memory-constrained environments like Colab
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# --- 3. Load and Apply the LoRA Adapter ---
model = PeftModel.from_pretrained(base_model, adapter_id)
print("Successfully loaded quantized base model and applied adapter.")

# --- 4. Prepare for Inference ---
context = "CREATE TABLE employees (name VARCHAR, department VARCHAR, salary INTEGER)"
question = "What are the names of employees in the 'Engineering' department with a salary over 80000?"

prompt = f"""<|user|>
Given the database schema:
{context}

Generate the SQL query for the following request:
{question}<|end|>
<|assistant|>
"""

# --- 5. Generate the Response ---
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=False)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
generated_sql = generated_text.split("<|assistant|>")[-1].strip()

print(f"\nGenerated SQL: {generated_sql}")
# Expected output: SELECT name FROM employees WHERE department = 'Engineering' AND salary > 80000

```

## Training Procedure
### Dataset

The model was fine-tuned on a 10,000-sample subset of the b-mc2/sql-create-context dataset, split 90/10 for training and validation.

### Fine-tuning Configuration (QLoRA)

* **Quantization:** 4-bit NormalFloat (NF4) with `bfloat16` compute dtype.
* **LoRA Rank (`r`):** 8
* **LoRA Alpha (`lora_alpha`):** 16
* **Target Modules:** All linear layers in the Phi-3 architecture (`q_proj`, `k_proj`, `v_proj`, `o_proj`, etc.).

### Training Hyperparameters

* **Learning Rate:** 2e-4
* **Epochs: 2**
* **Effective Batch Size:** 8
* **Optimizer:** Paged AdamW (32-bit)
* **LR Scheduler:** Cosine


## Evaluation and Results

Qualitative evaluation on a held-out test set confirms that the model consistently generates correct SQL queries.
The extended training to two epochs has successfully addressed the primary limitation of the single-epoch version: inconsistent formatting.
This model now reliably generates the `<|assistant|>` token, making it more suitable for automated parsing and deployment.

## Deployment & Optimization Considerations

For deployment in a production environment, consider the following optimizations:

1. **Merge Adapter Weights:** Before deploying, merge the adapter weights into the base model to create a single, solid model.
This eliminates the overhead of dynamically applying the adapter during inference and can improve performance.

```python
# After loading the model and adapter:
merged_model = model.merge_and_unload()
# Use 'merged_model' for all subsequent 'generate' calls.
```

2. **Further Quantization:** For CPU-based deployment or even more efficient GPU usage, the merged model can be further quantized into formats like **GGUF** (for use with `llama.cpp`) or **AWQ/GPTQ**.
3. **API Serving:** Wrap the model in a high-performance web server like FastAPI or use a dedicated LLM serving framework like vLLM for optimal throughput and batching.


## Limitations and Responsible AI

* **Generalization:** The model is specialized for the Text-to-SQL task and the schema styles seen in its training data. It may not perform well on highly complex or esoteric SQL dialects.
* **Security:** This model is a proof-of-concept and has not been hardened against SQL injection attacks. All generated SQL should be treated as untrusted input and must be sanitized or executed in a sandboxed, read-only environment.
* **Bias:** The training data is the source of the model's knowledge. Any biases present in the sql-create-context dataset may be reflected in the model's outputs.