---
language: en
license: apache-2.0
tags:
  - text2sql
  - sql
  - structured-data
  - natural-language-to-sql
  - mistral
  - qlora
  - lora
  - peft
  - transformers
  - huggingface
  - streamlit
  - evaluation
  - spider
datasets:
  - b-mc2/sql-create-context
library_name: transformers
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-Instruct-v0.1
---

# Analytics Copilot (Text-to-SQL) – Mistral-7B QLoRA

This repository contains a **Text-to-SQL** model built by fine-tuning
**`mistralai/Mistral-7B-Instruct-v0.1`** with **QLoRA** on the
**`b-mc2/sql-create-context`** dataset, plus an evaluation pipeline and a
Streamlit UI for interactive usage.

The model’s goal is to convert a **natural-language question** and a concrete
**database schema** (as `CREATE TABLE` DDL) into a **single SQL query** that
answers the question.

> **Note:** This model card documents the *adapter* (QLoRA) or fine-tuned model
> released from the Analytics Copilot Text-to-SQL project. It assumes the
> underlying base model is `mistralai/Mistral-7B-Instruct-v0.1` and that
> training was run using the public **`b-mc2/sql-create-context`** dataset.

---

## Model Summary

- **Task:** Text-to-SQL (natural-language questions → SQL queries)
- **Base model:** `mistralai/Mistral-7B-Instruct-v0.1`
- **Fine-tuning method:** QLoRA (4-bit) with LoRA adapters
- **Libraries:** `transformers`, `peft`, `trl`, `unsloth`, `bitsandbytes`
- **Primary training data:** `b-mc2/sql-create-context`
- **Evaluation datasets:**
  - Internal: processed val split from `b-mc2/sql-create-context`
  - External: Spider dev (via `xlangai/spider` + `richardr1126/spider-schema`)
- **Input:** Schema (`CREATE TABLE` context) + natural-language question
- **Output:** A single SQL query string
- **Usage:** Mainly via Hugging Face Inference Endpoints + LoRA adapters, or
  by loading the adapter with `transformers` + `peft`.

---

## Intended Use and Limitations

### Intended Use

This model is intended as a **developer-facing Text-to-SQL assistant**. Typical
uses include:

- Helping analysts and engineers generate SQL queries from natural language
  when they:
  - Already know the schema (or can paste it).
  - Want to prototype queries quickly.
- Powering a **Text-to-SQL copilot UI**, e.g., the included Streamlit app:
  - Paste database schema (DDL) into a text area.
  - Ask a question in natural language.
  - Get suggested SQL as a starting point.
- Serving as a **research / teaching artifact**:
  - Demonstrates how to fine-tune an open LLM with QLoRA for Text-to-SQL.
  - Provides a reproducible evaluation pipeline on a public dataset.

### Out of Scope / Misuse

The model is **not** intended for:

- Direct, unsupervised execution against **production databases**:
  - SQL may be syntactically valid but semantically off.
  - The model is not aware of performance / cost implications.
- Use as a general-purpose chatbot:
  - It is trained specifically on schema + question → SQL.
- Generating **arbitrary SQL** without schema:
  - It is strongly conditioned on explicit schema context.
- High-stakes domains:
  - Healthcare, finance, safety-critical environments, or any domain where
    incorrect queries can cause harm or large financial loss.

### Limitations

- **Hallucinations:** Despite having schema context, the model can:
  - Refer to non-existent tables/columns.
  - Misinterpret relationships between tables.
- **No automatic execution safety:**
  - The training objective does not enforce read-only queries.
  - You must wrap the model in a strict execution layer (e.g., allow only
    `SELECT`, enforce limits, static analysis).
- **Domain coverage:**
  - Training is driven by `b-mc2/sql-create-context` and Spider; behavior on
    very different schemas or DB engines may degrade.
- **Locale and language:**
  - Primarily English; performance on non-English questions is untested.

You should treat generated SQL as **suggestions** that require human review
before execution.

---

## Model Details

### Architecture

- **Base architecture:** Mistral-7B (decoder-only Transformer)
- **Base model:** `mistralai/Mistral-7B-Instruct-v0.1`
  - Licensed under **Apache 2.0**.
- **Fine-tuning method:** QLoRA (Low-Rank Adapters with 4-bit quantized base)
- **Adapter mechanism:** LoRA adapters (PEFT / Unsloth)

Typical QLoRA configuration (as used in the training script/notebook):

- `lora_r`: 16
- `lora_alpha`: 16
- `lora_dropout`: 0.0
- `max_seq_length`: 2048
- 4-bit quantization with bitsandbytes:
  - `bnb_4bit_quant_type = "nf4"`
  - `bnb_4bit_compute_dtype = "float16"` (on CUDA)
  - `bnb_4bit_use_double_quant = True`
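
These settings correspond to a standard bitsandbytes 4-bit setup. Below is a minimal sketch of the equivalent `transformers` quantization config; the actual training path (which uses Unsloth) may wire this up differently:

    import torch
    from transformers import BitsAndBytesConfig

    # NF4 4-bit quantization with double quantization and fp16 compute, per the list above.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )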

### Training Configuration (QLoRA)

The project defines a `TrainingConfig` dataclass with the following key fields:

- `base_model` (str): e.g. `"mistralai/Mistral-7B-Instruct-v0.1"`
- `max_steps` (int): e.g. 500
- `per_device_train_batch_size` (int): typically small (e.g. 1)
- `gradient_accumulation_steps` (int): e.g. 8 (with a per-device batch size of 1, this gives an effective batch size of 8)
- `learning_rate` (float): e.g. `2e-4`
- `warmup_steps` (int): e.g. 50
- `weight_decay` (float): typically `0.0` for QLoRA
- `max_seq_length` (int): e.g. 2048
- `lora_r` (int): e.g. 16
- `lora_alpha` (int): e.g. 16
- `lora_dropout` (float): e.g. 0.0
- `seed` (int): e.g. 42
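
For orientation, a minimal sketch of such a dataclass is shown below; the field names follow the list above, and the defaults are illustrative rather than the exact values in the repository:

    from dataclasses import dataclass

    @dataclass
    class TrainingConfig:
        # Illustrative defaults; see scripts/train_qlora.py for the authoritative ones.
        base_model: str = "mistralai/Mistral-7B-Instruct-v0.1"
        max_steps: int = 500
        per_device_train_batch_size: int = 1
        gradient_accumulation_steps: int = 8
        learning_rate: float = 2e-4
        warmup_steps: int = 50
        weight_decay: float = 0.0
        max_seq_length: int = 2048
        lora_r: int = 16
        lora_alpha: int = 16
        lora_dropout: float = 0.0
        seed: int = 42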

These values are exposed via the CLI script:

```bash
python scripts/train_qlora.py \
  --train_path data/processed/train.jsonl \
  --val_path data/processed/val.jsonl \
  --base_model mistralai/Mistral-7B-Instruct-v0.1 \
  --output_dir outputs/ \
  --max_steps 500 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --learning_rate 2e-4 \
  --warmup_steps 50 \
  --weight_decay 0.0 \
  --max_seq_length 2048 \
  --lora_r 16 \
  --lora_alpha 16 \
  --lora_dropout 0.0 \
  --seed 42
```

## Data and Preprocessing

### Primary Training Dataset: `b-mc2/sql-create-context`

- **Name:** `b-mc2/sql-create-context`  
- **Source:** Hugging Face Datasets  
- **Dataset page:** https://huggingface.co/datasets/b-mc2/sql-create-context

**Fields:**
- `question` – natural language question from the user  
- `context` – schema context as one or more `CREATE TABLE` statements  
- `answer` – gold SQL query  

**Example (conceptual):**

    {
      "question": "How many heads of the departments are older than 56?",
      "context": "CREATE TABLE head (age INTEGER)",
      "answer": "SELECT COUNT(*) FROM head WHERE age > 56"
    }

Please refer to the dataset page on Hugging Face for licensing and further details. This model inherits any legal constraints from both the base model and this dataset.

---

### Train / Validation Split

The dataset provides only a `train` split. The project creates its own train/validation split using:

- `datasets.Dataset.train_test_split` with:
  - `test_size = val_ratio` (default: `0.08`)
  - `seed = 42`

The resulting splits are renamed:
- `train` → final training split  
- `test` → final validation split  

This yields:
- `data/processed/train.jsonl` – training examples  
- `data/processed/val.jsonl` – validation examples  
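
A minimal sketch of this split step using the standard `datasets` API (the repository's actual builder lives in `text2sql.data_prep`):

    from datasets import load_dataset

    # The public dataset ships only a "train" split.
    raw = load_dataset("b-mc2/sql-create-context", split="train")

    # Carve out a validation set with the ratio and seed described above.
    splits = raw.train_test_split(test_size=0.08, seed=42)
    train_ds, val_ds = splits["train"], splits["test"]

    # Each row is then converted to the Alpaca-style JSONL format described below
    # and written to data/processed/train.jsonl and data/processed/val.jsonl.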

---

### Instruction-Tuning Format (Alpaca-style JSONL)

Each processed example has:

- `id` – e.g. `"sqlcc-train-000001"`
- `instruction` – static instruction text
- `input` – formatted schema + question
- `output` – normalized SQL query
- `source` – `"b-mc2/sql-create-context"`
- `meta` – metadata (original split, row index, seed, etc.)

**Example:**

    {
      "id": "sqlcc-train-000001",
      "instruction": "Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema.",
      "input": "### Schema:\nCREATE TABLE head (age INTEGER)\n\n### Question:\nHow many heads of the departments are older than 56 ?",
      "output": "SELECT COUNT(*) FROM head WHERE age > 56",
      "source": "b-mc2/sql-create-context",
      "meta": {
        "original_split": "train",
        "row": 0,
        "split": "train",
        "val_ratio": 0.08,
        "seed": 42,
        "from_local_input": false
      }
    }

---

### Instruction Text

The instruction is fixed:

    Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema.

---

### Input Formatting

`input` is constructed as:

    ### Schema:
    <CREATE TABLE ...>

    ### Question:
    <question text>

This is implemented in `text2sql.data_prep.build_input_text`.
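
A sketch of what such a helper might look like, mirroring the layout above (not necessarily the repository's exact implementation):

    def build_input_text(context: str, question: str) -> str:
        # Combine the CREATE TABLE context and the question in the documented layout.
        return f"### Schema:\n{context.strip()}\n\n### Question:\n{question.strip()}"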

---

### SQL Normalization

The dataset builder applies light normalization to the answer:

- Strip leading/trailing whitespace  
- Collapse runs of whitespace into a single space  

This is implemented as `text2sql.data_prep.normalize_sql`.
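
A minimal sketch of this normalization (the repository's version may differ in detail):

    import re

    def normalize_sql(sql: str) -> str:
        # Trim the query and collapse any run of whitespace into a single space.
        return re.sub(r"\s+", " ", sql.strip())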

---

## Training Procedure

### Prompt Format for Training

To build the final training text, the project uses a simple prompt template:

    ### Instruction:
    <instruction>

    ### Input:
    <input>

    ### Response:

This template is defined as `PROMPT_TEMPLATE` in `src/text2sql/training/formatting.py`, and filled via:

    from text2sql.training.formatting import build_prompt

    prompt = build_prompt(instruction, input_text)
    # Final training text is: prompt + output_sql

`output_sql` is normalized SQL, optionally further cleaned with `ensure_sql_only` when used at inference time.
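
The exact behavior of `ensure_sql_only` is defined in the repository; a rough sketch of this kind of post-processing, assuming the generation may echo the prompt or wrap SQL in code fences, could look like:

    def ensure_sql_only(generated: str) -> str:
        # Keep only the text after the response marker, if the prompt was echoed.
        text = generated.split("### Response:")[-1]
        # Strip markdown code fences and surrounding whitespace.
        text = text.replace("```sql", "").replace("```", "").strip()
        # Keep only the first statement if the model keeps generating.
        return text.split(";")[0].strip()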

---

### Optimization

- Optimizer & scheduler are provided by `trl.SFTTrainer` / `transformers`.
- Mixed precision (e.g. bf16/fp16) is enabled when supported.
- LoRA adapters are applied to a subset of projection layers; typical choices include attention and MLP projections (see training code for exact `target_modules`).
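
For reference, a typical PEFT LoRA configuration for a Mistral-style model is sketched below; the module names are common choices, not necessarily the exact `target_modules` used in this project:

    from peft import LoraConfig

    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.0,
        bias="none",
        task_type="CAUSAL_LM",
        # Common attention + MLP projections for Mistral-style models (illustrative).
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    )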

---

### Hardware

Intended to run on a single modern GPU (e.g., A10, A100, L4) with ≥16GB VRAM using 4-bit quantization.

The CLI script has:
- `--dry_run` mode (no model load; checks dataset & formatting).
- `--smoke` mode (lightweight config check; on CPU-only machines it skips loading the full model).

---

### Outputs

After a full run you should obtain:

- `outputs/adapters/` – LoRA adapter weights / config
- `outputs/run_meta.json` – training config, data paths, etc.
- `outputs/metrics.json` – training/eval metrics as reported by the trainer

These artifacts can be published to the Hub via the helper script `scripts/publish_to_hub.py`.

---

## Evaluation

The project provides a dedicated evaluation pipeline for both internal and external validation.

---

### Metrics

All evaluation flows share the same core metrics, implemented in `src/text2sql/eval/metrics.py`:

#### Exact Match (EM) (normalized SQL)

Uses `normalize_sql`:
- Strip whitespace
- Remove trailing semicolons
- Collapse whitespace runs

Checks exact string equality between normalized prediction and gold SQL.

#### No-values Exact Match

Uses `normalize_sql_no_values`:
- Normalize SQL as above
- Replace single-quoted string literals with a placeholder (`'__STR__'`)
- Replace numeric literals (integers/decimals) with a placeholder (`__NUM__`)

Captures structural equality even when literal values differ.

#### SQL parse success rate

Uses `sqlglot.parse_one` to parse the predicted SQL.  
Fraction of predictions that parse successfully.

#### Schema adherence

- Parses the `CREATE TABLE` context with `sqlglot` to recover:
  - Tables and columns
- Parses predicted SQL and extracts table/column references
- A prediction is schema-adherent if all references exist in the schema.

Metrics are aggregated as:

    {
      "n_examples": ...,
      "exact_match": {"count": ..., "rate": ...},
      "no_values_em": {"count": ..., "rate": ...},
      "parse_success": {"count": ..., "rate": ...},
      "schema_adherence": {"count": ..., "rate": ...}  // optional
    }
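
A compact sketch of how these checks can be computed with `sqlglot` and simple regex-based normalizers (the authoritative implementation is `src/text2sql/eval/metrics.py`):

    import re
    import sqlglot

    def normalize_sql(sql: str) -> str:
        # Trim, drop trailing semicolons, collapse whitespace runs.
        return re.sub(r"\s+", " ", sql.strip().rstrip(";").strip())

    def normalize_sql_no_values(sql: str) -> str:
        # Replace literals so only the query structure is compared.
        sql = normalize_sql(sql)
        sql = re.sub(r"'[^']*'", "'__STR__'", sql)
        return re.sub(r"\b\d+(?:\.\d+)?\b", "__NUM__", sql)

    def parses(sql: str) -> bool:
        try:
            sqlglot.parse_one(sql)
            return True
        except Exception:
            return False

    def score_one(pred: str, gold: str) -> dict:
        return {
            "exact_match": normalize_sql(pred) == normalize_sql(gold),
            "no_values_em": normalize_sql_no_values(pred) == normalize_sql_no_values(gold),
            "parse_success": parses(pred),
        }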

**Important:** At the time of writing, this model card does not include specific numeric metrics. After you run `scripts/evaluate_internal.py` and `scripts/evaluate_spider_external.py`, you should update this section with actual results from:

- `reports/eval_internal.json` / `.md`
- `reports/eval_spider.json` / `.md`

---

### Internal Evaluation (b-mc2/sql-create-context val)

**Input:**  
`data/processed/val.jsonl` (same format as training)

**Script:**

    python scripts/evaluate_internal.py \
      --val_path data/processed/val.jsonl \
      --base_model mistralai/Mistral-7B-Instruct-v0.1 \
      --adapter_dir /path/to/outputs/adapters \
      --device auto \
      --max_examples 200 \
      --temperature 0.0 \
      --top_p 0.9 \
      --max_new_tokens 256 \
      --out_dir reports/

**Notes:**
- `--device auto` chooses GPU when available.
- 4-bit quantization is enabled by default on CUDA; configurable via:
  - `--load_in_4bit` / `--no_load_in_4bit`
  - `--bnb_4bit_quant_type`, `--bnb_4bit_compute_dtype`, etc.
- `--smoke` runs a small subset; on CPU-only environments it falls back to mock mode (gold SQL as prediction) to exercise the metrics without loading the model.

**Outputs:**
- `reports/eval_internal.json`
- `reports/eval_internal.md`

---

### External Validation (Spider dev)

**Datasets:**
- Examples: `xlangai/spider` (split: `validation`)
- Schema helper: `richardr1126/spider-schema` (contains create_table_context)
- License note: `richardr1126/spider-schema` is licensed under **CC BY-SA 4.0**. Spider is used only for evaluation, not training.

**Prompt format:**

    ### Schema:
    <create_table_context>

    ### Question:
    <Spider question>

The instruction text is the same as in training, and prompts are constructed with the same formatter used for training (via helper functions in `text2sql.eval.spider`).

**Script:**

    python scripts/evaluate_spider_external.py \
      --base_model mistralai/Mistral-7B-Instruct-v0.1 \
      --adapter_dir /path/to/outputs/adapters \
      --device auto \
      --spider_source xlangai/spider \
      --schema_source richardr1126/spider-schema \
      --spider_split validation \
      --max_examples 200 \
      --temperature 0.0 \
      --top_p 0.9 \
      --max_new_tokens 256 \
      --out_dir reports/

**Outputs:**
- `reports/eval_spider.json`
- `reports/eval_spider.md`

The same metrics (EM, no-values EM, parse success, schema adherence) are computed, but note:
- This is not a full reproduction of official Spider evaluation (which includes component matching, execution metrics, etc.).
- It is a lightweight proxy for cross-domain Text-to-SQL quality.

---

### Mock / Offline Modes

Both evaluation scripts have `--mock` modes:
- Use small fixtures from `tests/fixtures/`
- Treat gold SQL as predictions
- Avoid network access and heavy model loads

These modes are ideal for CI and offline smoke tests.

---

## Inference and Deployment

### Basic Usage with Hugging Face Transformers (Adapters)

Assuming this repo provides a LoRA adapter that you can load on top of `mistralai/Mistral-7B-Instruct-v0.1`:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.1"
    ADAPTER_REPO = "your-username/analytics-copilot-text2sql-mistral7b-qlora"

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base_model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        load_in_4bit=True,
        device_map="auto",
    )

    model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)

    schema = """CREATE TABLE orders (
      id INTEGER PRIMARY KEY,
      customer_id INTEGER,
      amount NUMERIC,
      created_at TIMESTAMP
    );"""

    question = "Total order amount per customer for the last 7 days."

    instruction = (
        "Write a SQL query that answers the user's question using ONLY "
        "the tables and columns provided in the schema."
    )
    input_text = f"### Schema:\n{schema}\n\n### Question:\n{question}"

    prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.0,
    )
    raw_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Optionally, post-process with the project’s SQL cleaner:
    # from text2sql.training.formatting import ensure_sql_only
    # sql = ensure_sql_only(raw_text)
    print(raw_text)

---

### Inference Endpoints + Multi-LoRA (Recommended for Production)

If you host the base model in a Hugging Face Inference Endpoint with a Multi-LoRA configuration (via `LORA_ADAPTERS`), you can select this adapter at inference time via its `adapter_id`.

Example environment for TGI:

    LORA_ADAPTERS='[
      {"id": "text2sql-qlora", "source": "your-username/analytics-copilot-text2sql-mistral7b-qlora"}
    ]'

Then in Python:

    from huggingface_hub import InferenceClient

    ENDPOINT_URL = "https://your-endpoint-1234.us-east-1.aws.endpoints.huggingface.cloud"
    HF_TOKEN = "hf_your_token_here"

    client = InferenceClient(base_url=ENDPOINT_URL, api_key=HF_TOKEN)

    schema = """CREATE TABLE orders (
      id INTEGER PRIMARY KEY,
      customer_id INTEGER,
      amount NUMERIC,
      created_at TIMESTAMP
    );"""

    question = "Total order amount per customer for the last 7 days."

    prompt = f"""### Schema:
    {schema}

    ### Question:
    {question}

    Return only the SQL query."""

    response = client.post(
        json={
            "inputs": prompt,
            "parameters": {
                "adapter_id": "text2sql-qlora",
                "max_new_tokens": 256,
                "temperature": 0.0,
            },
        }
    )

    print(response)

---

### Streamlit UI

The accompanying repo includes a Streamlit app (`app/streamlit_app.py`) that:

- Runs on Streamlit Community Cloud or locally.
- Calls a Hugging Face Inference Endpoint or router via `InferenceClient`.
- Reads config from Streamlit secrets or environment:
  - `HF_TOKEN`
  - `HF_ENDPOINT_URL` + `HF_ADAPTER_ID` (preferred, TGI endpoint + adapter)
  - Or `HF_MODEL_ID` + `HF_PROVIDER` (router-based fallback, for merged models)
- Optionally uses an OpenAI fallback model when HF inference fails.
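
As an illustration, configuration lookup inside the app might follow this pattern (a sketch assuming the plain key names listed above, with `st.secrets` taking precedence over environment variables):

    import os
    import streamlit as st

    def get_conf(name: str, default: str | None = None) -> str | None:
        # Prefer Streamlit secrets, fall back to environment variables.
        return st.secrets.get(name, os.environ.get(name, default))

    hf_token = get_conf("HF_TOKEN")
    endpoint_url = get_conf("HF_ENDPOINT_URL")
    adapter_id = get_conf("HF_ADAPTER_ID")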

Deployment instructions are documented in `docs/deploy_streamlit_cloud.md`.

---

## Ethical Considerations and Risks

### Data and Bias

The training data (`b-mc2/sql-create-context`) may contain:
- Synthetic or curated schemas and questions
- Biases in naming conventions, example queries, or tasks

The base model (`Mistral-7B-Instruct`) is trained on large-scale web and other data. It inherits any demographic, cultural, and representational biases present in those sources.

As a result:
- The model can produce SQL that, if combined with biased downstream usage (e.g., unfair filtering in a user database), may exacerbate existing biases.
- The model is not aware of ethical / legal constraints around data access; it will happily generate queries that might retrieve sensitive fields (e.g., emails, PII) if such columns exist in the schema.

---

### Safety and Security

Generated SQL may contain:
- Expensive operations (full table scans on large tables)
- Potentially unsafe patterns (e.g., missing `LIMIT`, cross joins)

The model does not perform:
- Access control
- Row-level security
- SQL injection detection

You must implement:
- A strict execution sandbox:
  - Allow only `SELECT` (no `INSERT`, `UPDATE`, `DELETE`, `DROP`, etc.)
  - Enforce timeouts and row limits
- Appropriate logging and review of executed queries
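
As one illustration, a simple pre-execution guard with `sqlglot` can reject anything that is not a single `SELECT` statement. This is a sketch, not a complete security layer:

    import sqlglot
    from sqlglot import exp

    def is_safe_select(sql: str) -> bool:
        # Reject inputs that fail to parse or contain multiple statements.
        try:
            statements = sqlglot.parse(sql)
        except Exception:
            return False
        if len(statements) != 1:
            return False
        # Allow only plain SELECT queries (no INSERT/UPDATE/DELETE/DDL).
        return isinstance(statements[0], exp.Select)

    assert is_safe_select("SELECT COUNT(*) FROM head WHERE age > 56")
    assert not is_safe_select("DROP TABLE head")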

---

### Human Oversight

Always:
- Present generated SQL to users for review
- Encourage edits and manual validation
- Provide clear warnings that the system is a copilot, not an oracle

---

### Environmental Impact

Training details vary depending on your hardware and hyperparameters, but in general:

- QLoRA + 4-bit quantization significantly reduces compute and memory compared to full fine-tuning:
  - Fewer GPU-hours
  - Lower VRAM requirements
- The example configuration (7B model, QLoRA, moderate steps) is designed to fit on commodity cloud GPUs (e.g., single A10/A100-class instance).

To be transparent, you should log and publish:
- GPU type and count
- Total training time
- Number of runs and restarts

---

## How to Cite

If you use this model or the underlying codebase in a research project or production system, please consider citing:

- The base model authors: Mistral AI (`mistralai/Mistral-7B-Instruct-v0.1`)
- The training dataset: `b-mc2/sql-create-context` (see dataset page for citation)
- This project (replace with your own reference):  
  Analytics Copilot (Text-to-SQL) – Mistral-7B QLoRA,  
  GitHub: https://github.com/brej-29/analytics-copilot-text2sql

You may also add a BibTeX entry, for example:

    @misc{analytics_copilot_text2sql,
      title        = {Analytics Copilot (Text-to-SQL) -- Mistral-7B QLoRA},
      author       = {Your Name},
      year         = {2026},
      howpublished = {\url{https://github.com/brej-29/analytics-copilot-text2sql}},
      note         = {Text-to-SQL fine-tuning of Mistral-7B using QLoRA on b-mc2/sql-create-context}
    }

---

## Changelog

- **Initial adapter / model card:**
  - QLoRA fine-tuning on `b-mc2/sql-create-context`
  - Internal and external evaluation pipelines implemented
  - Streamlit UI for remote inference via Hugging Face Inference Endpoints