--- language: en license: apache-2.0 tags: - text2sql - sql - structured-data - natural-language-to-sql - mistral - qlora - lora - peft - transformers - huggingface - streamlit - evaluation - spider datasets: - b-mc2/sql-create-context library_name: transformers pipeline_tag: text-generation base_model: mistralai/Mistral-7B-Instruct-v0.1 --- # Analytics Copilot (Text-to-SQL) – Mistral-7B QLoRA This repository contains a **Text-to-SQL** model built by fine-tuning **`mistralai/Mistral-7B-Instruct-v0.1`** with **QLoRA** on the **`b-mc2/sql-create-context`** dataset, plus an evaluation pipeline and a Streamlit UI for interactive usage. The model’s goal is to convert a **natural-language question** and a concrete **database schema** (as `CREATE TABLE` DDL) into a **single SQL query** that answers the question. > **Note:** This model card documents the *adapter* (QLoRA) or fine-tuned model > released from the Analytics Copilot Text-to-SQL project. It assumes the > underlying base model is `mistralai/Mistral-7B-Instruct-v0.1` and that > training was run using the public **`b-mc2/sql-create-context`** dataset. --- ## Model Summary - **Task:** Text-to-SQL (natural-language questions → SQL queries) - **Base model:** `mistralai/Mistral-7B-Instruct-v0.1` - **Fine-tuning method:** QLoRA (4-bit) with LoRA adapters - **Libraries:** `transformers`, `peft`, `trl`, `unsloth`, `bitsandbytes` - **Primary training data:** `b-mc2/sql-create-context` - **Evaluation datasets:** - Internal: processed val split from `b-mc2/sql-create-context` - External: Spider dev (via `xlangai/spider` + `richardr1126/spider-schema`) - **Input:** Schema (`CREATE TABLE` context) + natural-language question - **Output:** A single SQL query string - **Usage:** Mainly via Hugging Face Inference Endpoints + LoRA adapters, or by loading the adapter with `transformers` + `peft`. --- ## Intended Use and Limitations ### Intended Use This model is intended as a **developer-facing Text-to-SQL assistant**. Typical uses include: - Helping analysts and engineers generate SQL queries from natural language when they: - Already know the schema (or can paste it). - Want to prototype queries quickly. - Powering a **Text-to-SQL copilot UI**, e.g., the included Streamlit app: - Paste database schema (DDL) into a text area. - Ask a question in natural language. - Get suggested SQL as a starting point. - Serving as a **research / teaching artifact**: - Demonstrates how to fine-tune an open LLM with QLoRA for Text-to-SQL. - Provides a reproducible evaluation pipeline on a public dataset. ### Out of Scope / Misuse The model is **not** intended for: - Direct, unsupervised execution against **production databases**: - SQL may be syntactically valid but semantically off. - The model is not aware of performance / cost implications. - Use as a general-purpose chatbot: - It is trained specifically on schema + question → SQL. - Generating **arbitrary SQL** without schema: - It is strongly conditioned on explicit schema context. - High-stakes domains: - Healthcare, finance, safety-critical environments, or any domain where incorrect queries can cause harm or large financial loss. ### Limitations - **Hallucinations:** Despite having schema context, the model can: - Refer to non-existent tables/columns. - Misinterpret relationships between tables. - **No automatic execution safety:** - The training objective does not enforce read-only queries. - You must wrap the model in a strict execution layer (e.g., allow only `SELECT`, enforce limits, static analysis). - **Domain coverage:** - Training is driven by `b-mc2/sql-create-context` and Spider; behavior on very different schemas or DB engines may degrade. - **Locale and language:** - Primarily English; performance on non-English questions is untested. You should treat generated SQL as **suggestions** that require human review before execution. --- ## Model Details ### Architecture - **Base architecture:** Mistral-7B (decoder-only Transformer) - **Base model:** `mistralai/Mistral-7B-Instruct-v0.1` - Licensed under **Apache 2.0**. - **Fine-tuning method:** QLoRA (Low-Rank Adapters with 4-bit quantized base) - **Adapter mechanism:** LoRA adapters (PEFT / Unsloth) Typical QLoRA configuration (as used in the training script/notebook): - `lora_r`: 16 - `lora_alpha`: 16 - `lora_dropout`: 0.0 - `max_seq_length`: 2048 - 4-bit quantization with bitsandbytes: - `bnb_4bit_quant_type = "nf4"` - `bnb_4bit_compute_dtype = "float16"` (on CUDA) - `bnb_4bit_use_double_quant = True` ### Training Configuration (QLoRA) The project defines a `TrainingConfig` dataclass with the following key fields: - `base_model` (str): e.g. `"mistralai/Mistral-7B-Instruct-v0.1"` - `max_steps` (int): e.g. 500 - `per_device_train_batch_size` (int): typically small (e.g. 1) - `gradient_accumulation_steps` (int): e.g. 8 (to achieve an effective batch size) - `learning_rate` (float): e.g. `2e-4` - `warmup_steps` (int): e.g. 50 - `weight_decay` (float): typically `0.0` for QLoRA - `max_seq_length` (int): e.g. 2048 - `lora_r` (int): e.g. 16 - `lora_alpha` (int): e.g. 16 - `lora_dropout` (float): e.g. 0.0 - `seed` (int): e.g. 42 These values are exposed via the CLI script: ```bash python scripts/train_qlora.py \ --train_path data/processed/train.jsonl \ --val_path data/processed/val.jsonl \ --base_model mistralai/Mistral-7B-Instruct-v0.1 \ --output_dir outputs/ \ --max_steps 500 \ --per_device_train_batch_size 1 \ --gradient_accumulation_steps 8 \ --learning_rate 2e-4 \ --warmup_steps 50 \ --weight_decay 0.0 \ --max_seq_length 2048 \ --lora_r 16 \ --lora_alpha 16 \ --lora_dropout 0.0 \ --seed 42 ``` ## Data and Preprocessing ### Primary Training Dataset: `b-mc2/sql-create-context` - **Name:** `b-mc2/sql-create-context` - **Source:** Hugging Face Datasets - **Dataset page:** https://huggingface.co/datasets/b-mc2/sql-create-context **Fields:** - `question` – natural language question from the user - `context` – schema context as one or more `CREATE TABLE` statements - `answer` – gold SQL query **Example (conceptual):** { "question": "How many heads of the departments are older than 56?", "context": "CREATE TABLE head (age INTEGER)", "answer": "SELECT COUNT(*) FROM head WHERE age > 56" } Please refer to the dataset page on Hugging Face for licensing and further details. This model inherits any legal constraints from both the base model and this dataset. --- ### Train / Validation Split The dataset only provides a `train` split. The project creates its own train/validation split using: - `datasets.Dataset.train_test_split` with: - `test_size = val_ratio` (default: `0.08`) - `seed = 42` Renames: - `train` → final training split - `test` → final validation split This yields: - `data/processed/train.jsonl` – training examples - `data/processed/val.jsonl` – validation examples --- ### Instruction-Tuning Format (Alpaca-style JSONL) Each processed example has: - `id` – e.g. `"sqlcc-train-000001"` - `instruction` – static instruction text - `input` – formatted schema + question - `output` – normalized SQL query - `source` – `"b-mc2/sql-create-context"` - `meta` – metadata (original split, row index, seed, etc.) **Example:** { "id": "sqlcc-train-000001", "instruction": "Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema.", "input": "### Schema:\nCREATE TABLE head (age INTEGER)\n\n### Question:\nHow many heads of the departments are older than 56 ?", "output": "SELECT COUNT(*) FROM head WHERE age > 56", "source": "b-mc2/sql-create-context", "meta": { "original_split": "train", "row": 0, "split": "train", "val_ratio": 0.08, "seed": 42, "from_local_input": false } } --- ### Instruction Text The instruction is fixed: Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema. --- ### Input Formatting `input` is constructed as: ### Schema: ### Question: This is implemented in `text2sql.data_prep.build_input_text`. --- ### SQL Normalization The dataset builder applies light normalization to the answer: - Strip leading/trailing whitespace - Collapse runs of whitespace into a single space This is implemented as `text2sql.data_prep.normalize_sql`. --- ## Training Procedure ### Prompt Format for Training To build the final training text, the project uses a simple prompt template: ### Instruction: ### Input: ### Response: This template is defined as `PROMPT_TEMPLATE` in `src/text2sql/training/formatting.py`, and filled via: from text2sql.training.formatting import build_prompt prompt = build_prompt(instruction, input_text) # Final training text is: prompt + output_sql `output_sql` is normalized SQL, optionally further cleaned with `ensure_sql_only` when used at inference time. --- ### Optimization - Optimizer & scheduler are provided by `trl.SFTTrainer` / `transformers`. - Mixed precision (e.g. bf16/fp16) is enabled when supported. - LoRA adapters are applied to a subset of projection layers; typical choices include attention and MLP projections (see training code for exact `target_modules`). --- ### Hardware Intended to run on a single modern GPU (e.g., A10, A100, L4) with ≥16GB VRAM using 4-bit quantization. The CLI script has: - `--dry_run` mode (no model load; checks dataset & formatting). - `--smoke` mode (lightweight config check; on CPU-only machines it skips loading the full model). --- ### Outputs After a full run you should obtain: - `outputs/adapters/` – LoRA adapter weights / config - `outputs/run_meta.json` – training config, data paths, etc. - `outputs/metrics.json` – training/eval metrics as reported by the trainer These artifacts can be published to the Hub via the helper script `scripts/publish_to_hub.py`. --- ## Evaluation The project provides a dedicated evaluation pipeline for both internal and external validation. --- ### Metrics All evaluation flows share the same core metrics, implemented in `src/text2sql/eval/metrics.py`: #### Exact Match (EM) (normalized SQL) Uses `normalize_sql`: - Strip whitespace - Remove trailing semicolons - Collapse whitespace runs Checks exact string equality between normalized prediction and gold SQL. #### No-values Exact Match Uses `normalize_sql_no_values`: - Normalize SQL as above - Replace single-quoted string literals with a placeholder (`'__STR__'`) - Replace numeric literals (integers/decimals) with a placeholder (`__NUM__`) Captures structural equality even when literal values differ. #### SQL parse success rate Uses `sqlglot.parse_one` to parse the predicted SQL. Fraction of predictions that parse successfully. #### Schema adherence - Parses the `CREATE TABLE` context with `sqlglot` to recover: - Tables and columns - Parses predicted SQL and extracts table/column references - A prediction is schema-adherent if all references exist in the schema. Metrics are aggregated as: { "n_examples": ..., "exact_match": {"count": ..., "rate": ...}, "no_values_em": {"count": ..., "rate": ...}, "parse_success": {"count": ..., "rate": ...}, "schema_adherence": {"count": ..., "rate": ...} // optional } **Important:** At the time of writing, this model card does not include specific numeric metrics. After you run `scripts/evaluate_internal.py` and `scripts/evaluate_spider_external.py`, you should update this section with actual results from: - `reports/eval_internal.json` / `.md` - `reports/eval_spider.json` / `.md` --- ### Internal Evaluation (b-mc2/sql-create-context val) **Input:** `data/processed/val.jsonl` (same format as training) **Script:** python scripts/evaluate_internal.py \ --val_path data/processed/val.jsonl \ --base_model mistralai/Mistral-7B-Instruct-v0.1 \ --adapter_dir /path/to/outputs/adapters \ --device auto \ --max_examples 200 \ --temperature 0.0 \ --top_p 0.9 \ --max_new_tokens 256 \ --out_dir reports/ **Notes:** - `--device auto` chooses GPU when available. - 4-bit quantization is enabled by default on CUDA; configurable via: - `--load_in_4bit` / `--no_load_in_4bit` - `--bnb_4bit_quant_type`, `--bnb_4bit_compute_dtype`, etc. - `--smoke` runs a small subset; on CPU-only environments it falls back to mock mode (gold SQL as prediction) to exercise the metrics without loading the model. **Outputs:** - `reports/eval_internal.json` - `reports/eval_internal.md` --- ### External Validation (Spider dev) **Datasets:** - Examples: `xlangai/spider` (split: `validation`) - Schema helper: `richardr1126/spider-schema` (contains create_table_context) - License note: `richardr1126/spider-schema` is licensed under **CC BY-SA 4.0**. Spider is used only for evaluation, not training. **Prompt format:** ### Schema: ### Question: Instruction text is the same as training. Prompts are constructed with the same formatter used for training (via helper functions in `text2sql.eval.spider`). **Script:** python scripts/evaluate_spider_external.py \ --base_model mistralai/Mistral-7B-Instruct-v0.1 \ --adapter_dir /path/to/outputs/adapters \ --device auto \ --spider_source xlangai/spider \ --schema_source richardr1126/spider-schema \ --spider_split validation \ --max_examples 200 \ --temperature 0.0 \ --top_p 0.9 \ --max_new_tokens 256 \ --out_dir reports/ **Outputs:** - `reports/eval_spider.json` - `reports/eval_spider.md` The same metrics (EM, no-values EM, parse success, schema adherence) are computed, but note: - This is not a full reproduction of official Spider evaluation (which includes component matching, execution metrics, etc.). - It is a lightweight proxy for cross-domain Text-to-SQL quality. --- ### Mock / Offline Modes Both evaluation scripts have `--mock` modes: - Use small fixtures from `tests/fixtures/` - Treat gold SQL as predictions - Avoid network / heavy model loads Ideal for CI and offline smoketests. --- ## Inference and Deployment ### Basic Usage with Hugging Face Transformers (Adapters) Assuming this repo provides a LoRA adapter that you can load on top of `mistralai/Mistral-7B-Instruct-v0.1`: from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.1" ADAPTER_REPO = "your-username/analytics-copilot-text2sql-mistral7b-qlora" tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL) base_model = AutoModelForCausalLM.from_pretrained( BASE_MODEL, load_in_4bit=True, device_map="auto", ) model = PeftModel.from_pretrained(base_model, ADAPTER_REPO) schema = """CREATE TABLE orders ( id INTEGER PRIMARY KEY, customer_id INTEGER, amount NUMERIC, created_at TIMESTAMP );""" question = "Total order amount per customer for the last 7 days." instruction = ( "Write a SQL query that answers the user's question using ONLY " "the tables and columns provided in the schema." ) input_text = f"### Schema:\n{schema}\n\n### Question:\n{question}" prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n" inputs = tokenizer(prompt, return_tensors="pt").to(model.device) output_ids = model.generate( **inputs, max_new_tokens=256, temperature=0.0, ) raw_text = tokenizer.decode(output_ids[0], skip_special_tokens=True) # Optionally, post-process with the project’s SQL cleaner: # from text2sql.training.formatting import ensure_sql_only # sql = ensure_sql_only(raw_text) print(raw_text) --- ### Inference Endpoints + Multi-LoRA (Recommended for Production) If you host the base model in a Hugging Face Inference Endpoint with a Multi-LoRA configuration (via `LORA_ADAPTERS`), you can select this adapter at inference time by `adapter_id`. Example environment for TGI: LORA_ADAPTERS='[ {"id": "text2sql-qlora", "source": "your-username/analytics-copilot-text2sql-mistral7b-qlora"} ]' Then in Python: from huggingface_hub import InferenceClient ENDPOINT_URL = "https://your-endpoint-1234.us-east-1.aws.endpoints.huggingface.cloud" HF_TOKEN = "hf_your_token_here" client = InferenceClient(base_url=ENDPOINT_URL, api_key=HF_TOKEN) schema = """CREATE TABLE orders ( id INTEGER PRIMARY KEY, customer_id INTEGER, amount NUMERIC, created_at TIMESTAMP );""" question = "Total order amount per customer for the last 7 days." prompt = f"""### Schema: {schema} ### Question: {question} Return only the SQL query.""" response = client.post( json={ "inputs": prompt, "parameters": { "adapter_id": "text2sql-qlora", "max_new_tokens": 256, "temperature": 0.0, }, } ) print(response) --- ### Streamlit UI The accompanying repo includes a Streamlit app (`app/streamlit_app.py`) that: - Runs on Streamlit Community Cloud or locally. - Calls a Hugging Face Inference Endpoint or router via `InferenceClient`. - Reads config from Streamlit secrets or environment: - `HF_TOKEN` - `HF_ENDPOINT_URL` + `HF_ADAPTER_ID` (preferred, TGI endpoint + adapter) - Or `HF_MODEL_ID` + `HF_PROVIDER` (router-based fallback, for merged models) - Optionally uses an OpenAI fallback model when HF inference fails. Deployment instructions are documented in `docs/deploy_streamlit_cloud.md`. --- ## Ethical Considerations and Risks ### Data and Bias The training data (`b-mc2/sql-create-context`) may contain: - Synthetic or curated schemas and questions - Biases in naming conventions, example queries, or tasks The base model (`Mistral-7B-Instruct`) is trained on large-scale web and other data. It inherits any demographic, cultural, and representational biases present in those sources. As a result: - The model can produce SQL that, if combined with biased downstream usage (e.g., unfair filtering in a user database), may exacerbate existing biases. - The model is not aware of ethical / legal constraints around data access; it will happily generate queries that might retrieve sensitive fields (e.g., emails, PII) if such columns exist in the schema. --- ### Safety and Security Generated SQL may contain: - Expensive operations (full table scans on large tables) - Potentially unsafe patterns (e.g., missing `LIMIT`, cross joins) The model does not perform: - Access control - Row-level security - SQL injection detection You must implement: - A strict execution sandbox: - Allow only `SELECT` (no `INSERT`, `UPDATE`, `DELETE`, `DROP`, etc.) - Enforce timeouts and row limits - Appropriate logging and review of executed queries --- ### Human Oversight Always: - Present generated SQL to users for review - Encourage edits and manual validation - Provide clear warnings that the system is a copilot, not an oracle --- ### Environmental Impact Training details vary depending on your hardware and hyperparameters, but in general: - QLoRA + 4-bit quantization significantly reduces compute and memory compared to full fine-tuning: - Fewer GPU-hours - Lower VRAM requirements - The example configuration (7B model, QLoRA, moderate steps) is designed to fit on commodity cloud GPUs (e.g., single A10/A100-class instance). To be transparent, you should log and publish: - GPU type and count - Total training time - Number of runs and restarts --- ## How to Cite If you use this model or the underlying codebase in a research project or production system, please consider citing: - The base model authors: Mistral AI (`mistralai/Mistral-7B-Instruct-v0.1`) - The training dataset: `b-mc2/sql-create-context` (see dataset page for citation) - This project (replace with your own reference): Analytics Copilot (Text-to-SQL) – Mistral-7B QLoRA, GitHub: https://github.com/brej-29/analytics-copilot-text2sql You may also add a BibTeX entry, for example: @misc{analytics_copilot_text2sql, title = {Analytics Copilot (Text-to-SQL) -- Mistral-7B QLoRA}, author = {Your Name}, year = {2026}, howpublished = {\url{https://github.com/brej-29/analytics-copilot-text2sql}}, note = {Text-to-SQL fine-tuning of Mistral-7B using QLoRA on b-mc2/sql-create-context} } --- ## Changelog - **Initial adapter / model card:** - QLoRA fine-tuning on `b-mc2/sql-create-context` - Internal and external evaluation pipelines implemented - Streamlit UI for remote inference via Hugging Face Inference