---
license: mit
base_model: microsoft/Phi-3-mini-4k-instruct
datasets:
- b-mc2/sql-create-context
tags:
- peft
- qlora
- text-to-sql
- phi-3
---

# Enhanced QLoRA Adapter for Phi-3-mini: A Technical SQL Assistant (2 Epochs)

This repository contains an improved, high-performance QLoRA adapter for the `microsoft/Phi-3-mini-4k-instruct` model. This version has been fine-tuned for **two full epochs** on a Text-to-SQL task, resulting in enhanced performance and reliability compared to single-epoch versions.

The model is designed to function as a technical assistant, capable of generating accurate SQL queries from natural language questions based on a provided database schema. This project was developed for an engineering and deployment course, with a focus on creating a robust, reproducible, and practical AI artifact.

## Key Improvements in This Version

- **Enhanced Reliability:** Training for two epochs has significantly improved the model's ability to consistently adhere to the required chat template format, reducing parsing errors in production.
- **Maintained Accuracy:** The model maintains its high accuracy in generating syntactically correct and logically sound SQL queries.
- **Robust Loading:** The usage instructions below follow best practices to ensure reliable loading across different environments.

## How to Use

First, ensure you have a compatible environment by installing these specific library versions:

```bash
pip install transformers==4.38.2 peft==0.10.0 accelerate==0.28.0 bitsandbytes==0.43.0 torch
```

The following code provides the most robust method for loading and running inference with this adapter.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# --- 1. Configuration ---
base_model_id = "microsoft/Phi-3-mini-4k-instruct"
# IMPORTANT: Replace with your new model's ID on the Hugging Face Hub
adapter_id = "YourUsername/YourNewModelName"

# --- 2. Load the Quantized Base Model ---
# 4-bit quantization is required to fit the model in memory-constrained
# environments like Colab.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# --- 3. Load and Apply the LoRA Adapter ---
model = PeftModel.from_pretrained(base_model, adapter_id)
print("Successfully loaded quantized base model and applied adapter.")

# --- 4. Prepare for Inference ---
context = "CREATE TABLE employees (name VARCHAR, department VARCHAR, salary INTEGER)"
question = "What are the names of employees in the 'Engineering' department with a salary over 80000?"

prompt = f"""<|user|>
Given the database schema:
{context}

Generate the SQL query for the following request:
{question}<|end|>
<|assistant|>
"""

# --- 5. Generate the Response ---
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=False)

# Decode only the newly generated tokens. Splitting the full decoded string on
# "<|assistant|>" is unreliable here, because skip_special_tokens=True strips
# that token from the decoded output.
generated_sql = tokenizer.decode(
    outputs[0][input_ids.shape[-1]:], skip_special_tokens=True
).strip()

print(f"\nGenerated SQL: {generated_sql}")
# Expected output: SELECT name FROM employees WHERE department = 'Engineering' AND salary > 80000
```

## Training Procedure

### Dataset

The model was fine-tuned on a 10,000-sample subset of the `b-mc2/sql-create-context` dataset, split 90/10 for training and validation.

### Fine-tuning Configuration (QLoRA)

* **Quantization:** 4-bit NormalFloat (NF4) with `bfloat16` compute dtype.
* **LoRA Rank (`r`):** 8
* **LoRA Alpha (`lora_alpha`):** 16
* **Target Modules:** All linear layers in the Phi-3 architecture (`q_proj`, `k_proj`, `v_proj`, `o_proj`, etc.).

### Training Hyperparameters

* **Learning Rate:** 2e-4
* **Epochs:** 2
* **Effective Batch Size:** 8
* **Optimizer:** Paged AdamW (32-bit)
* **LR Scheduler:** Cosine
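
For readers who want to reproduce the run, here is a minimal sketch of how the settings in the two sections above might map onto `peft` and `transformers` configuration objects. It is not the exact training script: the `lora_dropout` value, the `output_dir`, the per-device/accumulation split of the effective batch size, and the use of `target_modules="all-linear"` are illustrative assumptions; only the values listed above come from the actual run.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the "Fine-tuning Configuration (QLoRA)" section above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules="all-linear",  # targets every linear layer, per the card
    lora_dropout=0.05,            # assumption: dropout is not stated on this card
    bias="none",
    task_type="CAUSAL_LM",
)

# Trainer settings from the "Training Hyperparameters" section above. The
# effective batch size of 8 is expressed here as 2 x 4; the actual split used
# during training is not stated on this card.
training_args = TrainingArguments(
    output_dir="./phi3-sql-qlora",    # hypothetical output path
    learning_rate=2e-4,
    num_train_epochs=2,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    lr_scheduler_type="cosine",
    bf16=True,
)
```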
## Evaluation and Results

Qualitative evaluation on a held-out test set confirms that the model consistently generates correct SQL queries. The extended training to two epochs has successfully addressed the primary limitation of the single-epoch version: inconsistent formatting. This model now reliably generates the `<|assistant|>` token, making it more suitable for automated parsing and deployment.

## Deployment & Optimization Considerations

For deployment in a production environment, consider the following optimizations:

1. **Merge Adapter Weights:** Before deploying, merge the adapter weights into the base model to create a single, standalone model. This eliminates the overhead of dynamically applying the adapter during inference and can improve performance.

   ```python
   # After loading the model and adapter:
   # Note: merging into a 4-bit quantized base can be lossy; a common practice
   # is to reload the base model in fp16/bf16 before merging.
   merged_model = model.merge_and_unload()
   # Use 'merged_model' for all subsequent 'generate' calls.
   ```

2. **Further Quantization:** For CPU-based deployment or even more efficient GPU usage, the merged model can be further quantized into formats like **GGUF** (for use with `llama.cpp`) or **AWQ/GPTQ**.

3. **API Serving:** Wrap the model in a high-performance web server like FastAPI, or use a dedicated LLM serving framework such as vLLM for optimal throughput and batching.

## Limitations and Responsible AI

* **Generalization:** The model is specialized for the Text-to-SQL task and the schema styles seen in its training data. It may not perform well on highly complex queries or esoteric SQL dialects.
* **Security:** This model is a proof of concept and has not been hardened against SQL injection attacks. All generated SQL should be treated as untrusted input and must be sanitized or executed in a sandboxed, read-only environment (see the sketch after this list).
* **Bias:** The training data is the source of the model's knowledge. Any biases present in the `sql-create-context` dataset may be reflected in the model's outputs.
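
To make the read-only recommendation above concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The database path, helper name, and example query are hypothetical; a production database would need its own equivalent safeguard, such as a dedicated read-only role.

```python
import sqlite3

def run_generated_sql_readonly(db_path: str, generated_sql: str):
    """Execute model-generated SQL against a read-only SQLite connection.

    Opening the database with mode=ro means any INSERT/UPDATE/DELETE/DDL the
    model might emit fails with an OperationalError instead of mutating data.
    """
    # The "file:" URI with mode=ro asks SQLite for a read-only connection.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(generated_sql).fetchall()
    finally:
        conn.close()

# Hypothetical usage with the query from the inference example above:
# rows = run_generated_sql_readonly(
#     "company.db",
#     "SELECT name FROM employees WHERE department = 'Engineering' AND salary > 80000",
# )
```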