--- base_model: unsloth/qwen2.5-coder-14b-instruct-bnb-4bit library_name: peft pipeline_tag: text-generation tags: - base_model:adapter:unsloth/qwen2.5-coder-14b-instruct-bnb-4bit - lora - sft - transformers - trl - unsloth - text-to-sql - education - local-first --- # LFED SQL Assistant — Qwen2.5-Coder-14B-LoRA A LoRA adapter that turns plain-English school-data questions into read-only DuckDB SQL queries. Built for the **Local First Education Data Framework (LFED)**, a local-first analytics assistant for school administrators. - **Live demo:** https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED - **GGUF (local/llama.cpp):** https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf - **Project:** https://huggingface.co/spaces/build-small-hackathon/Kasualdad_LFED --- ## Model Details - **Developer:** build-small-hackathon (HF Build Small Hackathon, Chapter One: Backyard AI) - **Base model:** [`unsloth/qwen2.5-coder-14b-instruct-bnb-4bit`](https://huggingface.co/unsloth/qwen2.5-coder-14b-instruct-bnb-4bit) - **Model type:** Causal LM, text-to-SQL, instruction-tuned - **Language:** English (questions and SQL) - **License:** Apache 2.0 - **Fine-tuning method:** Unsloth QLoRA (r=32, α=32, target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`) - **Training data:** 27,859 synthetic NL→SQL pairs over a 5-table education schema - **Adapter size:** ~551 MB - **Framework versions:** PEFT 0.19.1, transformers, Unsloth, TRL --- ## Intended Use This adapter is designed for a single downstream task: **converting natural-language questions about school data into safe, read-only SQL**. ### Direct use - Drop the adapter onto the base model for inference in the LFED Gradio app. - Run locally with the matching GGUF or with `transformers + PEFT`. ### Suitable questions - "How many students were chronically absent at Lincoln Elementary in 2023-2024?" - "What is the suspension rate by race/ethnicity at Washington Middle?" - "Show the average GPA for English learners vs non-English learners." - "Which school has the highest enrollment growth since 2021?" ### Out-of-scope use - Not a general chatbot or coding assistant. - Not trained on real student PII; the demo uses synthetic seed data only. - Not suitable for arbitrary SQL dialects beyond DuckDB-compatible queries. - Should not be used for write operations; the execution guard allows only `SELECT` statements. --- ## Training Details ### Training data - **Source:** synthetic data generated from hand-written templates, augmented with Gretel, and rephrased for natural-language variety. - **Schema:** 5 tables — `students`, `enrollment`, `attendance`, `discipline`, `grades`. - **Coverage:** single-table aggregations, joins, filtering by school/year/grade, subgroup comparisons, ranking, and simple rates/percentages. - **Size:** 27,859 question→SQL pairs. - **Format:** each example contains a `question` and a `sql` field. - **Data generation scripts:** `modal_train/generate_synthetic_v2.py`, `modal_train/augment_gretel.py`, `modal_train/rephrase_pairs.py` in the project repo. ### Training procedure | Setting | Value | |---|---| | Optimizer | AdamW (Unsloth default) | | Learning rate | 1e-4 | | LR scheduler | cosine | | Warmup steps | 10 | | Batch size | 4 | | Gradient accumulation | 4 | | Epochs | 2 | | LoRA r | 32 | | LoRA α | 32 | | LoRA dropout | 0 | | Target modules | all linear layers | | Quantization | 4-bit (bnb NF4) | | Max sequence length | 2048 | | Trainer | SFTTrainer (TRL) | | Packing | False | | Hardware | Modal A10G | Training completed on 2026-06-10. ### Outputs | Artifact | Location | |---|---| | This LoRA adapter | `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora` | | Merged GGUF Q4_K_M | `build-small-hackathon/lfed-qwen2.5-coder-14b-sql-gguf` | | Training code | `modal_train/` in the LFED project repo | --- ## Evaluation ### Approach Evaluation is currently manual: a bank of 15 real-world-style queries spanning attendance, discipline, grades, enrollment, and equity comparisons is run through the LFED demo UI. Each query is scored on: 1. Correctness — does the answer match the expected aggregation/join? 2. SQL quality — is the generated SQL valid, safe, and readable? 3. UX — is the summary + table useful? 4. Latency — does the query complete within a reasonable time? ### Known limitations - The model is fine-tuned on synthetic data; real-world schema variations require additional prompting or fine-tuning. - It occasionally needs explicit school name and school year in the question to produce the most reliable query. - Complex multi-step reasoning (e.g., "students who improved GPA across consecutive years") can be brittle. - Percentage/rate formatting is handled by the downstream app, not the model; the model may return either 0–1 proportions or already-scaled percentages. - No formal academic benchmark evaluation has been run. --- ## How to Use ### With transformers + PEFT (HF Space path) ```python import torch from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel base_id = "unsloth/qwen2.5-coder-14b-instruct-bnb-4bit" adapter_id = "build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora" tokenizer = AutoTokenizer.from_pretrained(base_id) model = AutoModelForCausalLM.from_pretrained( base_id, torch_dtype=torch.float16, device_map="auto", ) model = PeftModel.from_pretrained(model, adapter_id, torch_device="cpu") prompt = """You are an assistant that converts school-data questions into DuckDB SQL. Schema: - students(student_id, school_name, grade_level, gender, race_ethnicity, english_learner, special_education, economically_disadvantaged) - attendance(student_id, school_name, school_year, absence_count, is_chronically_absent) Question: How many chronically absent students at Lincoln Elementary in 2023-2024? SQL:""" inputs = tokenizer(prompt, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.0) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ### With llama.cpp (local-first path) Use the matching GGUF: ```bash llama-cli \ -m lfed-qwen2.5-coder-14b-sql-gguf/ggml-model-q4_k_m.gguf \ -p "Question: How many chronically absent students at Lincoln Elementary in 2023-2024?\nSQL:" \ -n 128 --temp 0.0 ``` Or run the full LFED app locally: ```bash git checkout -b product local-llamacpp-v1 python3.12 -m venv .venv && source .venv/bin/activate pip install -r requirements.txt python app.py ``` --- ## Bias, Risks, and Limitations - **Synthetic data:** the training data is generated from templates and rephrased. Demographic patterns in the seed data do not represent any real population; they exist to exercise joins and filters. - **No PII handling training:** the model has no special safeguards around personally identifiable information because the schema uses anonymized `student_id`s only. - **Read-only enforcement is app-level:** the adapter itself will emit any SQL-like text; the downstream `data_engine.py` validator enforces `SELECT`-only and forbidden-token rules. - **Hallucinated columns/tables:** the model may occasionally reference a plausible-sounding column that does not exist. The execution guard catches these via schema-aware `EXPLAIN` validation. - **Numeric accuracy:** the model writes the SQL; percentage interpretation depends on the application layer. Users should verify rates and percentages against their own conventions. --- ## Environmental Impact Estimated training energy use on a Modal A10G for ~2 epochs: - **Hardware type:** NVIDIA A10G - **Training time:** approximately 1–2 hours - **Cloud provider:** Modal - **Region:** likely US-east (Modal default) - **Carbon emitted:** not precisely measured; rough estimate using [ML CO2 Impact calculator](https://mlco2.github.io/impact#compute) is on the order of tens to low-hundreds of grams of CO2eq. --- ## Citation If you use this model, please cite the base model and the LFED project: **BibTeX:** ```bibtex @misc{lfed_sql_adapter, title={Local First Education Data Framework: A Qwen2.5-Coder-14B LoRA Adapter for School-Data Text-to-SQL}, author={build-small-hackathon}, year={2026}, howpublished={\url{https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora}} } ``` **APA:** build-small-hackathon. (2026). *Local First Education Data Framework: A Qwen2.5-Coder-14B LoRA adapter for school-data text-to-SQL*. Hugging Face. https://huggingface.co/build-small-hackathon/lfed-qwen2.5-coder-14b-sql-lora --- ## Framework versions - PEFT 0.19.1