File size: 21,183 Bytes
2b6326d 1b3a99a 2b6326d 1b3a99a c056dbc 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a 2b6326d 1b3a99a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 |
---
language: en
license: apache-2.0
tags:
- text2sql
- sql
- structured-data
- natural-language-to-sql
- mistral
- qlora
- lora
- peft
- transformers
- huggingface
- streamlit
- evaluation
- spider
datasets:
- b-mc2/sql-create-context
library_name: transformers
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-Instruct-v0.1
---
# Analytics Copilot (Text-to-SQL) – Mistral-7B QLoRA
This repository contains a **Text-to-SQL** model built by fine-tuning
**`mistralai/Mistral-7B-Instruct-v0.1`** with **QLoRA** on the
**`b-mc2/sql-create-context`** dataset, plus an evaluation pipeline and a
Streamlit UI for interactive usage.
The model’s goal is to convert a **natural-language question** and a concrete
**database schema** (as `CREATE TABLE` DDL) into a **single SQL query** that
answers the question.
> **Note:** This model card documents the *adapter* (QLoRA) or fine-tuned model
> released from the Analytics Copilot Text-to-SQL project. It assumes the
> underlying base model is `mistralai/Mistral-7B-Instruct-v0.1` and that
> training was run using the public **`b-mc2/sql-create-context`** dataset.
---
## Model Summary
- **Task:** Text-to-SQL (natural-language questions → SQL queries)
- **Base model:** `mistralai/Mistral-7B-Instruct-v0.1`
- **Fine-tuning method:** QLoRA (4-bit) with LoRA adapters
- **Libraries:** `transformers`, `peft`, `trl`, `unsloth`, `bitsandbytes`
- **Primary training data:** `b-mc2/sql-create-context`
- **Evaluation datasets:**
- Internal: processed val split from `b-mc2/sql-create-context`
- External: Spider dev (via `xlangai/spider` + `richardr1126/spider-schema`)
- **Input:** Schema (`CREATE TABLE` context) + natural-language question
- **Output:** A single SQL query string
- **Usage:** Mainly via Hugging Face Inference Endpoints + LoRA adapters, or
by loading the adapter with `transformers` + `peft`.
---
## Intended Use and Limitations
### Intended Use
This model is intended as a **developer-facing Text-to-SQL assistant**. Typical
uses include:
- Helping analysts and engineers generate SQL queries from natural language
when they:
- Already know the schema (or can paste it).
- Want to prototype queries quickly.
- Powering a **Text-to-SQL copilot UI**, e.g., the included Streamlit app:
- Paste database schema (DDL) into a text area.
- Ask a question in natural language.
- Get suggested SQL as a starting point.
- Serving as a **research / teaching artifact**:
- Demonstrates how to fine-tune an open LLM with QLoRA for Text-to-SQL.
- Provides a reproducible evaluation pipeline on a public dataset.
### Out of Scope / Misuse
The model is **not** intended for:
- Direct, unsupervised execution against **production databases**:
- SQL may be syntactically valid but semantically off.
- The model is not aware of performance / cost implications.
- Use as a general-purpose chatbot:
- It is trained specifically on schema + question → SQL.
- Generating **arbitrary SQL** without schema:
- It is strongly conditioned on explicit schema context.
- High-stakes domains:
- Healthcare, finance, safety-critical environments, or any domain where
incorrect queries can cause harm or large financial loss.
### Limitations
- **Hallucinations:** Despite having schema context, the model can:
- Refer to non-existent tables/columns.
- Misinterpret relationships between tables.
- **No automatic execution safety:**
- The training objective does not enforce read-only queries.
- You must wrap the model in a strict execution layer (e.g., allow only
`SELECT`, enforce limits, static analysis).
- **Domain coverage:**
- Training is driven by `b-mc2/sql-create-context` and Spider; behavior on
very different schemas or DB engines may degrade.
- **Locale and language:**
- Primarily English; performance on non-English questions is untested.
You should treat generated SQL as **suggestions** that require human review
before execution.
---
## Model Details
### Architecture
- **Base architecture:** Mistral-7B (decoder-only Transformer)
- **Base model:** `mistralai/Mistral-7B-Instruct-v0.1`
- Licensed under **Apache 2.0**.
- **Fine-tuning method:** QLoRA (Low-Rank Adapters with 4-bit quantized base)
- **Adapter mechanism:** LoRA adapters (PEFT / Unsloth)
Typical QLoRA configuration (as used in the training script/notebook):
- `lora_r`: 16
- `lora_alpha`: 16
- `lora_dropout`: 0.0
- `max_seq_length`: 2048
- 4-bit quantization with bitsandbytes:
- `bnb_4bit_quant_type = "nf4"`
- `bnb_4bit_compute_dtype = "float16"` (on CUDA)
- `bnb_4bit_use_double_quant = True`
### Training Configuration (QLoRA)
The project defines a `TrainingConfig` dataclass with the following key fields:
- `base_model` (str): e.g. `"mistralai/Mistral-7B-Instruct-v0.1"`
- `max_steps` (int): e.g. 500
- `per_device_train_batch_size` (int): typically small (e.g. 1)
- `gradient_accumulation_steps` (int): e.g. 8 (to achieve an effective batch size)
- `learning_rate` (float): e.g. `2e-4`
- `warmup_steps` (int): e.g. 50
- `weight_decay` (float): typically `0.0` for QLoRA
- `max_seq_length` (int): e.g. 2048
- `lora_r` (int): e.g. 16
- `lora_alpha` (int): e.g. 16
- `lora_dropout` (float): e.g. 0.0
- `seed` (int): e.g. 42
These values are exposed via the CLI script:
```bash
python scripts/train_qlora.py \
--train_path data/processed/train.jsonl \
--val_path data/processed/val.jsonl \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--output_dir outputs/ \
--max_steps 500 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--learning_rate 2e-4 \
--warmup_steps 50 \
--weight_decay 0.0 \
--max_seq_length 2048 \
--lora_r 16 \
--lora_alpha 16 \
--lora_dropout 0.0 \
--seed 42
```
## Data and Preprocessing
### Primary Training Dataset: `b-mc2/sql-create-context`
- **Name:** `b-mc2/sql-create-context`
- **Source:** Hugging Face Datasets
- **Dataset page:** https://huggingface.co/datasets/b-mc2/sql-create-context
**Fields:**
- `question` – natural language question from the user
- `context` – schema context as one or more `CREATE TABLE` statements
- `answer` – gold SQL query
**Example (conceptual):**
{
"question": "How many heads of the departments are older than 56?",
"context": "CREATE TABLE head (age INTEGER)",
"answer": "SELECT COUNT(*) FROM head WHERE age > 56"
}
Please refer to the dataset page on Hugging Face for licensing and further details. This model inherits any legal constraints from both the base model and this dataset.
---
### Train / Validation Split
The dataset only provides a `train` split. The project creates its own train/validation split using:
- `datasets.Dataset.train_test_split` with:
- `test_size = val_ratio` (default: `0.08`)
- `seed = 42`
Renames:
- `train` → final training split
- `test` → final validation split
This yields:
- `data/processed/train.jsonl` – training examples
- `data/processed/val.jsonl` – validation examples
---
### Instruction-Tuning Format (Alpaca-style JSONL)
Each processed example has:
- `id` – e.g. `"sqlcc-train-000001"`
- `instruction` – static instruction text
- `input` – formatted schema + question
- `output` – normalized SQL query
- `source` – `"b-mc2/sql-create-context"`
- `meta` – metadata (original split, row index, seed, etc.)
**Example:**
{
"id": "sqlcc-train-000001",
"instruction": "Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema.",
"input": "### Schema:\nCREATE TABLE head (age INTEGER)\n\n### Question:\nHow many heads of the departments are older than 56 ?",
"output": "SELECT COUNT(*) FROM head WHERE age > 56",
"source": "b-mc2/sql-create-context",
"meta": {
"original_split": "train",
"row": 0,
"split": "train",
"val_ratio": 0.08,
"seed": 42,
"from_local_input": false
}
}
---
### Instruction Text
The instruction is fixed:
Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema.
---
### Input Formatting
`input` is constructed as:
### Schema:
<CREATE TABLE ...>
### Question:
<question text>
This is implemented in `text2sql.data_prep.build_input_text`.
---
### SQL Normalization
The dataset builder applies light normalization to the answer:
- Strip leading/trailing whitespace
- Collapse runs of whitespace into a single space
This is implemented as `text2sql.data_prep.normalize_sql`.
---
## Training Procedure
### Prompt Format for Training
To build the final training text, the project uses a simple prompt template:
### Instruction:
<instruction>
### Input:
<input>
### Response:
This template is defined as `PROMPT_TEMPLATE` in `src/text2sql/training/formatting.py`, and filled via:
from text2sql.training.formatting import build_prompt
prompt = build_prompt(instruction, input_text)
# Final training text is: prompt + output_sql
`output_sql` is normalized SQL, optionally further cleaned with `ensure_sql_only` when used at inference time.
---
### Optimization
- Optimizer & scheduler are provided by `trl.SFTTrainer` / `transformers`.
- Mixed precision (e.g. bf16/fp16) is enabled when supported.
- LoRA adapters are applied to a subset of projection layers; typical choices include attention and MLP projections (see training code for exact `target_modules`).
---
### Hardware
Intended to run on a single modern GPU (e.g., A10, A100, L4) with ≥16GB VRAM using 4-bit quantization.
The CLI script has:
- `--dry_run` mode (no model load; checks dataset & formatting).
- `--smoke` mode (lightweight config check; on CPU-only machines it skips loading the full model).
---
### Outputs
After a full run you should obtain:
- `outputs/adapters/` – LoRA adapter weights / config
- `outputs/run_meta.json` – training config, data paths, etc.
- `outputs/metrics.json` – training/eval metrics as reported by the trainer
These artifacts can be published to the Hub via the helper script `scripts/publish_to_hub.py`.
---
## Evaluation
The project provides a dedicated evaluation pipeline for both internal and external validation.
---
### Metrics
All evaluation flows share the same core metrics, implemented in `src/text2sql/eval/metrics.py`:
#### Exact Match (EM) (normalized SQL)
Uses `normalize_sql`:
- Strip whitespace
- Remove trailing semicolons
- Collapse whitespace runs
Checks exact string equality between normalized prediction and gold SQL.
#### No-values Exact Match
Uses `normalize_sql_no_values`:
- Normalize SQL as above
- Replace single-quoted string literals with a placeholder (`'__STR__'`)
- Replace numeric literals (integers/decimals) with a placeholder (`__NUM__`)
Captures structural equality even when literal values differ.
#### SQL parse success rate
Uses `sqlglot.parse_one` to parse the predicted SQL.
Fraction of predictions that parse successfully.
#### Schema adherence
- Parses the `CREATE TABLE` context with `sqlglot` to recover:
- Tables and columns
- Parses predicted SQL and extracts table/column references
- A prediction is schema-adherent if all references exist in the schema.
Metrics are aggregated as:
{
"n_examples": ...,
"exact_match": {"count": ..., "rate": ...},
"no_values_em": {"count": ..., "rate": ...},
"parse_success": {"count": ..., "rate": ...},
"schema_adherence": {"count": ..., "rate": ...} // optional
}
**Important:** At the time of writing, this model card does not include specific numeric metrics. After you run `scripts/evaluate_internal.py` and `scripts/evaluate_spider_external.py`, you should update this section with actual results from:
- `reports/eval_internal.json` / `.md`
- `reports/eval_spider.json` / `.md`
---
### Internal Evaluation (b-mc2/sql-create-context val)
**Input:**
`data/processed/val.jsonl` (same format as training)
**Script:**
python scripts/evaluate_internal.py \
--val_path data/processed/val.jsonl \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--max_examples 200 \
--temperature 0.0 \
--top_p 0.9 \
--max_new_tokens 256 \
--out_dir reports/
**Notes:**
- `--device auto` chooses GPU when available.
- 4-bit quantization is enabled by default on CUDA; configurable via:
- `--load_in_4bit` / `--no_load_in_4bit`
- `--bnb_4bit_quant_type`, `--bnb_4bit_compute_dtype`, etc.
- `--smoke` runs a small subset; on CPU-only environments it falls back to mock mode (gold SQL as prediction) to exercise the metrics without loading the model.
**Outputs:**
- `reports/eval_internal.json`
- `reports/eval_internal.md`
---
### External Validation (Spider dev)
**Datasets:**
- Examples: `xlangai/spider` (split: `validation`)
- Schema helper: `richardr1126/spider-schema` (contains create_table_context)
- License note: `richardr1126/spider-schema` is licensed under **CC BY-SA 4.0**. Spider is used only for evaluation, not training.
**Prompt format:**
### Schema:
<create_table_context>
### Question:
<Spider question>
Instruction text is the same as training. Prompts are constructed with the same formatter used for training (via helper functions in `text2sql.eval.spider`).
**Script:**
python scripts/evaluate_spider_external.py \
--base_model mistralai/Mistral-7B-Instruct-v0.1 \
--adapter_dir /path/to/outputs/adapters \
--device auto \
--spider_source xlangai/spider \
--schema_source richardr1126/spider-schema \
--spider_split validation \
--max_examples 200 \
--temperature 0.0 \
--top_p 0.9 \
--max_new_tokens 256 \
--out_dir reports/
**Outputs:**
- `reports/eval_spider.json`
- `reports/eval_spider.md`
The same metrics (EM, no-values EM, parse success, schema adherence) are computed, but note:
- This is not a full reproduction of official Spider evaluation (which includes component matching, execution metrics, etc.).
- It is a lightweight proxy for cross-domain Text-to-SQL quality.
---
### Mock / Offline Modes
Both evaluation scripts have `--mock` modes:
- Use small fixtures from `tests/fixtures/`
- Treat gold SQL as predictions
- Avoid network / heavy model loads
Ideal for CI and offline smoketests.
---
## Inference and Deployment
### Basic Usage with Hugging Face Transformers (Adapters)
Assuming this repo provides a LoRA adapter that you can load on top of `mistralai/Mistral-7B-Instruct-v0.1`:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.1"
ADAPTER_REPO = "your-username/analytics-copilot-text2sql-mistral7b-qlora"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base_model = AutoModelForCausalLM.from_pretrained(
BASE_MODEL,
load_in_4bit=True,
device_map="auto",
)
model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
schema = """CREATE TABLE orders (
id INTEGER PRIMARY KEY,
customer_id INTEGER,
amount NUMERIC,
created_at TIMESTAMP
);"""
question = "Total order amount per customer for the last 7 days."
instruction = (
"Write a SQL query that answers the user's question using ONLY "
"the tables and columns provided in the schema."
)
input_text = f"### Schema:\n{schema}\n\n### Question:\n{question}"
prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.0,
)
raw_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# Optionally, post-process with the project’s SQL cleaner:
# from text2sql.training.formatting import ensure_sql_only
# sql = ensure_sql_only(raw_text)
print(raw_text)
---
### Inference Endpoints + Multi-LoRA (Recommended for Production)
If you host the base model in a Hugging Face Inference Endpoint with a Multi-LoRA configuration (via `LORA_ADAPTERS`), you can select this adapter at inference time by `adapter_id`.
Example environment for TGI:
LORA_ADAPTERS='[
{"id": "text2sql-qlora", "source": "your-username/analytics-copilot-text2sql-mistral7b-qlora"}
]'
Then in Python:
from huggingface_hub import InferenceClient
ENDPOINT_URL = "https://your-endpoint-1234.us-east-1.aws.endpoints.huggingface.cloud"
HF_TOKEN = "hf_your_token_here"
client = InferenceClient(base_url=ENDPOINT_URL, api_key=HF_TOKEN)
schema = """CREATE TABLE orders (
id INTEGER PRIMARY KEY,
customer_id INTEGER,
amount NUMERIC,
created_at TIMESTAMP
);"""
question = "Total order amount per customer for the last 7 days."
prompt = f"""### Schema:
{schema}
### Question:
{question}
Return only the SQL query."""
response = client.post(
json={
"inputs": prompt,
"parameters": {
"adapter_id": "text2sql-qlora",
"max_new_tokens": 256,
"temperature": 0.0,
},
}
)
print(response)
---
### Streamlit UI
The accompanying repo includes a Streamlit app (`app/streamlit_app.py`) that:
- Runs on Streamlit Community Cloud or locally.
- Calls a Hugging Face Inference Endpoint or router via `InferenceClient`.
- Reads config from Streamlit secrets or environment:
- `HF_TOKEN`
- `HF_ENDPOINT_URL` + `HF_ADAPTER_ID` (preferred, TGI endpoint + adapter)
- Or `HF_MODEL_ID` + `HF_PROVIDER` (router-based fallback, for merged models)
- Optionally uses an OpenAI fallback model when HF inference fails.
Deployment instructions are documented in `docs/deploy_streamlit_cloud.md`.
---
## Ethical Considerations and Risks
### Data and Bias
The training data (`b-mc2/sql-create-context`) may contain:
- Synthetic or curated schemas and questions
- Biases in naming conventions, example queries, or tasks
The base model (`Mistral-7B-Instruct`) is trained on large-scale web and other data. It inherits any demographic, cultural, and representational biases present in those sources.
As a result:
- The model can produce SQL that, if combined with biased downstream usage (e.g., unfair filtering in a user database), may exacerbate existing biases.
- The model is not aware of ethical / legal constraints around data access; it will happily generate queries that might retrieve sensitive fields (e.g., emails, PII) if such columns exist in the schema.
---
### Safety and Security
Generated SQL may contain:
- Expensive operations (full table scans on large tables)
- Potentially unsafe patterns (e.g., missing `LIMIT`, cross joins)
The model does not perform:
- Access control
- Row-level security
- SQL injection detection
You must implement:
- A strict execution sandbox:
- Allow only `SELECT` (no `INSERT`, `UPDATE`, `DELETE`, `DROP`, etc.)
- Enforce timeouts and row limits
- Appropriate logging and review of executed queries
---
### Human Oversight
Always:
- Present generated SQL to users for review
- Encourage edits and manual validation
- Provide clear warnings that the system is a copilot, not an oracle
---
### Environmental Impact
Training details vary depending on your hardware and hyperparameters, but in general:
- QLoRA + 4-bit quantization significantly reduces compute and memory compared to full fine-tuning:
- Fewer GPU-hours
- Lower VRAM requirements
- The example configuration (7B model, QLoRA, moderate steps) is designed to fit on commodity cloud GPUs (e.g., single A10/A100-class instance).
To be transparent, you should log and publish:
- GPU type and count
- Total training time
- Number of runs and restarts
---
## How to Cite
If you use this model or the underlying codebase in a research project or production system, please consider citing:
- The base model authors: Mistral AI (`mistralai/Mistral-7B-Instruct-v0.1`)
- The training dataset: `b-mc2/sql-create-context` (see dataset page for citation)
- This project (replace with your own reference):
Analytics Copilot (Text-to-SQL) – Mistral-7B QLoRA,
GitHub: https://github.com/brej-29/analytics-copilot-text2sql
You may also add a BibTeX entry, for example:
@misc{analytics_copilot_text2sql,
title = {Analytics Copilot (Text-to-SQL) -- Mistral-7B QLoRA},
author = {Your Name},
year = {2026},
howpublished = {\url{https://github.com/brej-29/analytics-copilot-text2sql}},
note = {Text-to-SQL fine-tuning of Mistral-7B using QLoRA on b-mc2/sql-create-context}
}
---
## Changelog
- **Initial adapter / model card:**
- QLoRA fine-tuning on `b-mc2/sql-create-context`
- Internal and external evaluation pipelines implemented
- Streamlit UI for remote inference via Hugging Face Inference
|