BrejBala committed on
Commit
1b3a99a
·
verified ·
1 Parent(s): 2b6326d

Update README.md

  1. README.md +579 -109
README.md CHANGED
@@ -1,210 +1,680 @@
1
  ---
2
- base_model: unsloth/mistral-7b-instruct-v0.1-bnb-4bit
3
- library_name: peft
4
- pipeline_tag: text-generation
5
  tags:
6
- - base_model:adapter:unsloth/mistral-7b-instruct-v0.1-bnb-4bit
7
- - lora
8
- - sft
9
- - transformers
10
- - trl
11
- - unsloth
 
 
 
 
 
 
 
 
 
 
 
 
12
  ---
13
 
14
- # Model Card for Model ID
15
 
16
- <!-- Provide a quick summary of what the model is/does. -->
 
 
 
17
 
 
 
 
18
 
19
 
20
  ## Model Details
21
 
22
- ### Model Description
23
 
24
- <!-- Provide a longer summary of what this model is. -->
25
 
 
26
 
 
27
 
28
- - **Developed by:** [More Information Needed]
29
- - **Funded by [optional]:** [More Information Needed]
30
- - **Shared by [optional]:** [More Information Needed]
31
- - **Model type:** [More Information Needed]
32
- - **Language(s) (NLP):** [More Information Needed]
33
- - **License:** [More Information Needed]
34
- - **Finetuned from model [optional]:** [More Information Needed]
35
 
36
- ### Model Sources [optional]
 
 
37
 
38
- <!-- Provide the basic links for the model. -->
 
 
39
 
40
- - **Repository:** [More Information Needed]
41
- - **Paper [optional]:** [More Information Needed]
42
- - **Demo [optional]:** [More Information Needed]
43
 
44
- ## Uses
45
 
46
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
47
 
48
- ### Direct Use
49
 
50
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
51
 
52
- [More Information Needed]
53
 
54
- ### Downstream Use [optional]
55
 
56
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
57
 
58
- [More Information Needed]
59
 
60
- ### Out-of-Scope Use
 
61
 
62
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
 
63
 
64
- [More Information Needed]
65
 
66
- ## Bias, Risks, and Limitations
 
 
67
 
68
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
69
 
70
- [More Information Needed]
 
71
 
72
- ### Recommendations
73
 
74
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
75
 
76
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
77
 
78
- ## How to Get Started with the Model
79
 
80
- Use the code below to get started with the model.
81
 
82
- [More Information Needed]
 
83
 
84
- ## Training Details
 
85
 
86
- ### Training Data
87
 
88
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
89
 
90
- [More Information Needed]
91
 
92
- ### Training Procedure
 
93
 
94
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
95
 
96
- #### Preprocessing [optional]
97
 
98
- [More Information Needed]
99
 
 
 
 
100
 
101
- #### Training Hyperparameters
102
 
103
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
104
 
105
- #### Speeds, Sizes, Times [optional]
106
 
107
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
 
108
 
109
- [More Information Needed]
 
 
 
 
 
 
 
 
 
 
 
 
110
 
111
  ## Evaluation
112
 
113
- <!-- This section describes the evaluation protocols and provides the results. -->
114
 
115
- ### Testing Data, Factors & Metrics
116
 
117
- #### Testing Data
 
 
 
118
 
119
- <!-- This should link to a Dataset Card if possible. -->
120
 
121
- [More Information Needed]
 
 
 
 
 
 
122
 
123
- #### Factors
124
 
125
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
126
 
127
- [More Information Needed]
 
 
 
 
 
128
 
129
- #### Metrics
130
 
131
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
 
 
 
 
 
 
 
 
 
132
 
133
- [More Information Needed]
 
 
 
 
 
134
 
135
- ### Results
 
 
136
 
137
- [More Information Needed]
138
 
139
- #### Summary
140
 
 
 
 
 
141
 
 
142
 
143
- ## Model Examination [optional]
 
144
 
145
- <!-- Relevant interpretability work for the model goes here -->
 
146
 
147
- [More Information Needed]
148
 
149
- ## Environmental Impact
150
 
151
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
 
 
 
 
 
 
 
 
 
 
152
 
153
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
 
154
 
155
- - **Hardware Type:** [More Information Needed]
156
- - **Hours used:** [More Information Needed]
157
- - **Cloud Provider:** [More Information Needed]
158
- - **Compute Region:** [More Information Needed]
159
- - **Carbon Emitted:** [More Information Needed]
160
 
161
- ## Technical Specifications [optional]
162
 
163
- ### Model Architecture and Objective
164
 
165
- [More Information Needed]
 
 
 
 
166
 
167
- ### Compute Infrastructure
168
 
169
- [More Information Needed]
170
 
171
- #### Hardware
172
 
173
- [More Information Needed]
174
 
175
- #### Software
 
176
 
177
- [More Information Needed]
 
178
 
179
- ## Citation [optional]
 
 
 
 
 
180
 
181
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
182
 
183
- **BibTeX:**
 
 
 
 
 
184
 
185
- [More Information Needed]
186
 
187
- **APA:**
188
 
189
- [More Information Needed]
190
 
191
- ## Glossary [optional]
192
 
193
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
194
 
195
- [More Information Needed]
 
 
196
 
197
- ## More Information [optional]
198
 
199
- [More Information Needed]
200
 
201
- ## Model Card Authors [optional]
 
202
 
203
- [More Information Needed]
204
 
205
- ## Model Card Contact
206
 
207
- [More Information Needed]
208
- ### Framework versions
209
 
210
- - PEFT 0.18.0
 
 
 
 
1
  ---
2
+ language: en
3
+ license: apache-2.0
 
4
  tags:
5
+ - text2sql
6
+ - sql
7
+ - structured-data
8
+ - natural-language-to-sql
9
+ - mistral
10
+ - qlora
11
+ - lora
12
+ - peft
13
+ - transformers
14
+ - huggingface
15
+ - streamlit
16
+ - evaluation
17
+ - spider
18
+ datasets:
19
+ - b-mc2/sql-create-context
20
+ library_name: transformers
21
+ pipeline_tag: text-generation
22
+ base_model: mistralai/Mistral-7B-Instruct-v0.1
23
  ---
24
 
25
+ # Analytics Copilot (Text-to-SQL) Mistral-7B QLoRA
26
 
27
+ This repository contains a **Text-to-SQL** model built by fine-tuning
28
+ **`mistralai/Mistral-7B-Instruct-v0.1`** with **QLoRA** on the
29
+ **`b-mc2/sql-create-context`** dataset, plus an evaluation pipeline and a
30
+ Streamlit UI for interactive usage.
31
 
32
+ The model’s goal is to convert a **natural-language question** and a concrete
33
+ **database schema** (as `CREATE TABLE` DDL) into a **single SQL query** that
34
+ answers the question.
35
 
36
+ > **Note:** This model card documents the *adapter* (QLoRA) or fine-tuned model
37
+ > released from the Analytics Copilot Text-to-SQL project. It assumes the
38
+ > underlying base model is `mistralai/Mistral-7B-Instruct-v0.1` and that
39
+ > training was run using the public **`b-mc2/sql-create-context`** dataset.
40
+
41
+ ---
42
+
43
+ ## Model Summary
44
+
45
+ - **Task:** Text-to-SQL (natural-language questions → SQL queries)
46
+ - **Base model:** `mistralai/Mistral-7B-Instruct-v0.1`
47
+ - **Fine-tuning method:** QLoRA (4-bit) with LoRA adapters
48
+ - **Libraries:** `transformers`, `peft`, `trl`, `unsloth`, `bitsandbytes`
49
+ - **Primary training data:** `b-mc2/sql-create-context`
50
+ - **Evaluation datasets:**
51
+ - Internal: processed val split from `b-mc2/sql-create-context`
52
+ - External: Spider dev (via `xlangai/spider` + `richardr1126/spider-schema`)
53
+ - **Input:** Schema (`CREATE TABLE` context) + natural-language question
54
+ - **Output:** A single SQL query string
55
+ - **Usage:** Mainly via Hugging Face Inference Endpoints + LoRA adapters, or
56
+ by loading the adapter with `transformers` + `peft`.
57
+
58
+ ---
59
+
60
+ ## Intended Use and Limitations
61
+
62
+ ### Intended Use
63
+
64
+ This model is intended as a **developer-facing Text-to-SQL assistant**. Typical
65
+ uses include:
66
+
67
+ - Helping analysts and engineers generate SQL queries from natural language
68
+ when they:
69
+ - Already know the schema (or can paste it).
70
+ - Want to prototype queries quickly.
71
+ - Powering a **Text-to-SQL copilot UI**, e.g., the included Streamlit app:
72
+ - Paste database schema (DDL) into a text area.
73
+ - Ask a question in natural language.
74
+ - Get suggested SQL as a starting point.
75
+ - Serving as a **research / teaching artifact**:
76
+ - Demonstrates how to fine-tune an open LLM with QLoRA for Text-to-SQL.
77
+ - Provides a reproducible evaluation pipeline on a public dataset.
78
+
79
+ ### Out of Scope / Misuse
80
+
81
+ The model is **not** intended for:
82
+
83
+ - Direct, unsupervised execution against **production databases**:
84
+ - SQL may be syntactically valid but semantically off.
85
+ - The model is not aware of performance / cost implications.
86
+ - Use as a general-purpose chatbot:
87
+ - It is trained specifically on schema + question → SQL.
88
+ - Generating **arbitrary SQL** without schema:
89
+ - It is strongly conditioned on explicit schema context.
90
+ - High-stakes domains:
91
+ - Healthcare, finance, safety-critical environments, or any domain where
92
+ incorrect queries can cause harm or large financial loss.
93
+
94
+ ### Limitations
95
+
96
+ - **Hallucinations:** Despite having schema context, the model can:
97
+ - Refer to non-existent tables/columns.
98
+ - Misinterpret relationships between tables.
99
+ - **No automatic execution safety:**
100
+ - The training objective does not enforce read-only queries.
101
+ - You must wrap the model in a strict execution layer (e.g., allow only
102
+ `SELECT`, enforce limits, static analysis).
103
+ - **Domain coverage:**
104
+ - Training uses only `b-mc2/sql-create-context` (Spider is evaluation-only); behavior on
105
+ very different schemas or DB engines may degrade.
106
+ - **Locale and language:**
107
+ - Primarily English; performance on non-English questions is untested.
108
+
109
+ You should treat generated SQL as **suggestions** that require human review
110
+ before execution.
111
+
112
+ ---
113
 
114
  ## Model Details
115
 
116
+ ### Architecture
117
+
118
+ - **Base architecture:** Mistral-7B (decoder-only Transformer)
119
+ - **Base model:** `mistralai/Mistral-7B-Instruct-v0.1`
120
+ - Licensed under **Apache 2.0**.
121
+ - **Fine-tuning method:** QLoRA (Low-Rank Adapters with 4-bit quantized base)
122
+ - **Adapter mechanism:** LoRA adapters (PEFT / Unsloth)
123
+
124
+ Typical QLoRA configuration (as used in the training script/notebook):
125
+
126
+ - `lora_r`: 16
127
+ - `lora_alpha`: 16
128
+ - `lora_dropout`: 0.0
129
+ - `max_seq_length`: 2048
130
+ - 4-bit quantization with bitsandbytes:
131
+ - `bnb_4bit_quant_type = "nf4"`
132
+ - `bnb_4bit_compute_dtype = "float16"` (on CUDA)
133
+ - `bnb_4bit_use_double_quant = True`
134
+
135
+ ### Training Configuration (QLoRA)
136
+
137
+ The project defines a `TrainingConfig` dataclass with the following key fields:
138
+
139
+ - `base_model` (str): e.g. `"mistralai/Mistral-7B-Instruct-v0.1"`
140
+ - `max_steps` (int): e.g. 500
141
+ - `per_device_train_batch_size` (int): typically small (e.g. 1)
142
+ - `gradient_accumulation_steps` (int): e.g. 8 (with a per-device batch size of 1, this gives an effective batch size of 8)
143
+ - `learning_rate` (float): e.g. `2e-4`
144
+ - `warmup_steps` (int): e.g. 50
145
+ - `weight_decay` (float): typically `0.0` for QLoRA
146
+ - `max_seq_length` (int): e.g. 2048
147
+ - `lora_r` (int): e.g. 16
148
+ - `lora_alpha` (int): e.g. 16
149
+ - `lora_dropout` (float): e.g. 0.0
150
+ - `seed` (int): e.g. 42
151
+
152
+ These values are exposed via the CLI script:
153
+
154
+ ```bash
155
+ python scripts/train_qlora.py \
156
+ --train_path data/processed/train.jsonl \
157
+ --val_path data/processed/val.jsonl \
158
+ --base_model mistralai/Mistral-7B-Instruct-v0.1 \
159
+ --output_dir outputs/ \
160
+ --max_steps 500 \
161
+ --per_device_train_batch_size 1 \
162
+ --gradient_accumulation_steps 8 \
163
+ --learning_rate 2e-4 \
164
+ --warmup_steps 50 \
165
+ --weight_decay 0.0 \
166
+ --max_seq_length 2048 \
167
+ --lora_r 16 \
168
+ --lora_alpha 16 \
169
+ --lora_dropout 0.0 \
170
+ --seed 42
171
+ ```
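The fields listed above could be modelled roughly as follows. This is a minimal sketch, not the project's actual `TrainingConfig`: the defaults simply mirror the example values in this card, and the `effective_batch_size` property is a hypothetical helper added for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass
class TrainingConfig:
    """Illustrative mirror of the fields described in this card.

    Defaults are the card's example values, not verified project defaults.
    """

    base_model: str = "mistralai/Mistral-7B-Instruct-v0.1"
    max_steps: int = 500
    per_device_train_batch_size: int = 1
    gradient_accumulation_steps: int = 8
    learning_rate: float = 2e-4
    warmup_steps: int = 50
    weight_decay: float = 0.0
    max_seq_length: int = 2048
    lora_r: int = 16
    lora_alpha: int = 16
    lora_dropout: float = 0.0
    seed: int = 42

    @property
    def effective_batch_size(self) -> int:
        # Hypothetical helper: examples consumed per optimizer step on one device.
        return self.per_device_train_batch_size * self.gradient_accumulation_steps


config = TrainingConfig()
print(config.effective_batch_size)
print(asdict(config)["learning_rate"])
```

Keeping the configuration in a dataclass makes it easy to dump the exact settings of a run with `asdict`, for example into a run-metadata file.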
172
+
173
+ ## Data and Preprocessing
174
+
175
+ ### Primary Training Dataset: `b-mc2/sql-create-context`
176
+
177
+ - **Name:** `b-mc2/sql-create-context`
178
+ - **Source:** Hugging Face Datasets
179
+ - **Dataset page:** https://huggingface.co/datasets/b-mc2/sql-create-context
180
+
181
+ **Fields:**
182
+ - `question` – natural language question from the user
183
+ - `context` – schema context as one or more `CREATE TABLE` statements
184
+ - `answer` – gold SQL query
185
+
186
+ **Example (conceptual):**
187
+
188
+ {
189
+ "question": "How many heads of the departments are older than 56?",
190
+ "context": "CREATE TABLE head (age INTEGER)",
191
+ "answer": "SELECT COUNT(*) FROM head WHERE age > 56"
192
+ }
193
+
194
+ Please refer to the dataset page on Hugging Face for licensing and further details. This model inherits any legal constraints from both the base model and this dataset.
195
 
196
+ ---
197
 
198
+ ### Train / Validation Split
199
 
200
+ The dataset only provides a `train` split. The project creates its own train/validation split using:
201
 
202
+ - `datasets.Dataset.train_test_split` with:
203
+ - `test_size = val_ratio` (default: `0.08`)
204
+ - `seed = 42`
 
 
 
 
205
 
206
+ Renames:
207
+ - `train` → final training split
208
+ - `test` → final validation split
209
 
210
+ This yields:
211
+ - `data/processed/train.jsonl` – training examples
212
+ - `data/processed/val.jsonl` – validation examples
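The idea of a seeded hold-out split can be sketched with the standard library alone. This is a stand-in for illustration, not the project's code: it does not reproduce the exact partition that `datasets.Dataset.train_test_split` yields, and the function names and the rounding rule are assumptions.

```python
import json
import math
import random


def split_train_val(rows, val_ratio=0.08, seed=42):
    """Shuffle row indices deterministically and hold out a validation share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(rows) * val_ratio)  # rounding rule is an assumption
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_idx]
    val = [row for i, row in enumerate(rows) if i in val_idx]
    return train, val


def to_jsonl(rows):
    """Serialize rows as JSON Lines text (one JSON object per line)."""
    return "\n".join(json.dumps(row) for row in rows)


# Toy rows shaped like the dataset's question/context/answer fields.
rows = [
    {"question": f"q{i}", "context": "CREATE TABLE t (x INTEGER)", "answer": "SELECT x FROM t"}
    for i in range(100)
]
train_rows, val_rows = split_train_val(rows)
print(len(train_rows), len(val_rows))
```

Fixing the seed means the same rows land in the validation file on every run, which keeps evaluation numbers comparable across training runs.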
213
 
214
+ ---
 
 
215
 
216
+ ### Instruction-Tuning Format (Alpaca-style JSONL)
217
+
218
+ Each processed example has:
219
+
220
+ - `id` – e.g. `"sqlcc-train-000001"`
221
+ - `instruction` – static instruction text
222
+ - `input` – formatted schema + question
223
+ - `output` – normalized SQL query
224
+ - `source` – `"b-mc2/sql-create-context"`
225
+ - `meta` – metadata (original split, row index, seed, etc.)
226
+
227
+ **Example:**
228
+
229
+ {
230
+ "id": "sqlcc-train-000001",
231
+ "instruction": "Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema.",
232
+ "input": "### Schema:\nCREATE TABLE head (age INTEGER)\n\n### Question:\nHow many heads of the departments are older than 56 ?",
233
+ "output": "SELECT COUNT(*) FROM head WHERE age > 56",
234
+ "source": "b-mc2/sql-create-context",
235
+ "meta": {
236
+ "original_split": "train",
237
+ "row": 0,
238
+ "split": "train",
239
+ "val_ratio": 0.08,
240
+ "seed": 42,
241
+ "from_local_input": false
242
+ }
243
+ }
244
 
245
+ ---
246
 
247
+ ### Instruction Text
248
 
249
+ The instruction is fixed:
250
 
251
+ Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema.
252
 
253
+ ---
254
 
255
+ ### Input Formatting
256
 
257
+ `input` is constructed as:
258
 
259
+ ### Schema:
260
+ <CREATE TABLE ...>
261
 
262
+ ### Question:
263
+ <question text>
264
 
265
+ This is implemented in `text2sql.data_prep.build_input_text`.
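A minimal sketch of what such a formatter might look like is given here. The function name comes from this card, but the signature and the whitespace stripping are assumptions rather than the project's actual implementation.

```python
def build_input_text(context: str, question: str) -> str:
    """Join schema DDL and question under the two section headers shown above."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return f"### Schema:\n{context.strip()}\n\n### Question:\n{question.strip()}"


example_input = build_input_text(
    "CREATE TABLE head (age INTEGER)",
    "How many heads of the departments are older than 56 ?",
)
print(example_input)
```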
266
 
267
+ ---
268
+
269
+ ### SQL Normalization
270
 
271
+ The dataset builder applies light normalization to the answer:
272
 
273
+ - Strip leading/trailing whitespace
274
+ - Collapse runs of whitespace into a single space
275
 
276
+ This is implemented as `text2sql.data_prep.normalize_sql`.
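The two normalization steps can be sketched in a few lines. This is an illustrative version under the assumption that a single regular expression is used; the project's function may differ in detail.

```python
import re


def normalize_sql(sql: str) -> str:
    """Strip leading/trailing whitespace and collapse whitespace runs to one space."""
    return re.sub(r"\s+", " ", sql.strip())


normalized = normalize_sql("  SELECT COUNT(*)\n  FROM head\tWHERE age > 56  ")
print(normalized)
```

Note that a regex-based collapse like this one also rewrites whitespace inside quoted string literals, which is usually acceptable for light normalization but not strictly semantics-preserving.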
277
 
278
+ ---
279
 
280
+ ## Training Procedure
281
 
282
+ ### Prompt Format for Training
283
 
284
+ To build the final training text, the project uses a simple prompt template:
285
 
286
+ ### Instruction:
287
+ <instruction>
288
 
289
+ ### Input:
290
+ <input>
291
 
292
+ ### Response:
293
 
294
+ This template is defined as `PROMPT_TEMPLATE` in `src/text2sql/training/formatting.py`, and filled via:
295
 
296
+ from text2sql.training.formatting import build_prompt
297
 
298
+ prompt = build_prompt(instruction, input_text)
299
+ # Final training text is: prompt + output_sql
300
 
301
+ `output_sql` is the normalized gold SQL. At inference time, the generated text can additionally be cleaned with `ensure_sql_only`.
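For illustration, the template and helper could look like the sketch below. The names `PROMPT_TEMPLATE` and `build_prompt` come from this card; the exact template string and the function signature are assumptions inferred from the format shown above.

```python
# Assumed template text, reconstructed from the format described in this card.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input_text}\n\n"
    "### Response:\n"
)


def build_prompt(instruction: str, input_text: str) -> str:
    """Fill the template; the model is expected to continue after '### Response:'."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input_text=input_text)


instruction = (
    "Write a SQL query that answers the user's question using ONLY "
    "the tables and columns provided in the schema."
)
input_text = "### Schema:\nCREATE TABLE head (age INTEGER)\n\n### Question:\nHow many heads of the departments are older than 56 ?"
output_sql = "SELECT COUNT(*) FROM head WHERE age > 56"

# Training text is the prompt followed directly by the target SQL.
training_text = build_prompt(instruction, input_text) + output_sql
print(training_text)
```

Using the same helper for training and inference keeps the prompt byte-for-byte identical in both settings, which matters for a model fine-tuned on a fixed template.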
302
 
303
+ ---
304
 
305
+ ### Optimization
306
 
307
+ - Optimizer & scheduler are provided by `trl.SFTTrainer` / `transformers`.
308
+ - Mixed precision (e.g. bf16/fp16) is enabled when supported.
309
+ - LoRA adapters are applied to a subset of projection layers; typical choices include attention and MLP projections (see training code for exact `target_modules`).
310
 
311
+ ---
312
 
313
+ ### Hardware
314
 
315
+ Intended to run on a single modern GPU (e.g., A10, A100, L4) with ≥16GB VRAM using 4-bit quantization.
316
 
317
+ The CLI script has:
318
+ - `--dry_run` mode (no model load; checks dataset & formatting).
319
+ - `--smoke` mode (lightweight config check; on CPU-only machines it skips loading the full model).
320
 
321
+ ---
322
+
323
+ ### Outputs
324
+
325
+ After a full run you should obtain:
326
+
327
+ - `outputs/adapters/` – LoRA adapter weights / config
328
+ - `outputs/run_meta.json` – training config, data paths, etc.
329
+ - `outputs/metrics.json` – training/eval metrics as reported by the trainer
330
+
331
+ These artifacts can be published to the Hub via the helper script `scripts/publish_to_hub.py`.
332
+
333
+ ---
334
 
335
  ## Evaluation
336
 
337
+ The project provides a dedicated evaluation pipeline for both internal and external validation.
338
+
339
+ ---
340
+
341
+ ### Metrics
342
+
343
+ All evaluation flows share the same core metrics, implemented in `src/text2sql/eval/metrics.py`:
344
+
345
+ #### Exact Match (EM) (normalized SQL)
346
+
347
+ Uses `normalize_sql`:
348
+ - Strip whitespace
349
+ - Remove trailing semicolons
350
+ - Collapse whitespace runs
351
+
+ The metric then checks exact string equality between the normalized prediction and gold SQL.
352
+
353
+ #### No-values Exact Match
354
+
355
+ Uses `normalize_sql_no_values`:
356
+ - Normalize SQL as above
357
+ - Replace single-quoted string literals with a placeholder (`'__STR__'`)
358
+ - Replace numeric literals (integers/decimals) with a placeholder (`__NUM__`)
359
+
+ This captures structural equality even when literal values differ.
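Both exact-match variants can be sketched with regular expressions. The function names follow this card, but the patterns are assumptions made for illustration and may not match the project's `metrics.py`.

```python
import re


def normalize_sql(sql: str) -> str:
    """Strip whitespace, drop trailing semicolons, collapse whitespace runs."""
    sql = re.sub(r"[;\s]+$", "", sql.strip())
    return re.sub(r"\s+", " ", sql)


def normalize_sql_no_values(sql: str) -> str:
    """Additionally mask string and numeric literals with placeholders."""
    sql = normalize_sql(sql)
    # Single-quoted strings, allowing doubled quotes ('') as an escape.
    sql = re.sub(r"'(?:[^']|'')*'", "'__STR__'", sql)
    # Standalone integers/decimals; digits inside identifiers such as t1 are kept.
    sql = re.sub(r"\b\d+(?:\.\d+)?\b", "__NUM__", sql)
    return sql


def exact_match(pred: str, gold: str) -> bool:
    return normalize_sql(pred) == normalize_sql(gold)


def no_values_exact_match(pred: str, gold: str) -> bool:
    return normalize_sql_no_values(pred) == normalize_sql_no_values(gold)


pred = "SELECT name FROM head WHERE age > 60 AND city = 'Rome';"
gold = "SELECT name FROM head  WHERE age > 56 AND city = 'Paris'"
print(exact_match(pred, gold), no_values_exact_match(pred, gold))
```

In this example the two queries differ only in their literals, so the strict metric rejects the prediction while the no-values variant accepts it.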
360
+
361
+ #### SQL parse success rate
362
+
363
+ Uses `sqlglot.parse_one` to parse the predicted SQL.
364
+ The metric is the fraction of predictions that parse successfully.
365
 
366
+ #### Schema adherence
367
 
368
+ - Parses the `CREATE TABLE` context with `sqlglot` to recover:
369
+ - Tables and columns
370
+ - Parses predicted SQL and extracts table/column references
371
+ - A prediction is schema-adherent if all references exist in the schema.
372
 
373
+ Metrics are aggregated as:
374
 
375
+ {
376
+ "n_examples": ...,
377
+ "exact_match": {"count": ..., "rate": ...},
378
+ "no_values_em": {"count": ..., "rate": ...},
379
+ "parse_success": {"count": ..., "rate": ...},
380
+ "schema_adherence": {"count": ..., "rate": ...} // optional
381
+ }
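The aggregation into this shape can be sketched as below. The function name `aggregate_metrics` and its input format (one list of per-example booleans per metric) are hypothetical, chosen only to illustrate how counts and rates relate.

```python
from typing import Dict, List


def aggregate_metrics(flags_by_metric: Dict[str, List[bool]]) -> dict:
    """Turn per-example pass/fail flags into {"count", "rate"} summaries."""
    lengths = {len(flags) for flags in flags_by_metric.values()}
    if len(lengths) > 1:
        raise ValueError("all metrics must cover the same examples")
    n_examples = lengths.pop() if lengths else 0
    report: dict = {"n_examples": n_examples}
    for name, flags in flags_by_metric.items():
        count = sum(bool(flag) for flag in flags)
        # Guard against division by zero when no examples were evaluated.
        report[name] = {"count": count, "rate": count / n_examples if n_examples else 0.0}
    return report


report = aggregate_metrics(
    {
        "exact_match": [True, False, True, True],
        "no_values_em": [True, True, True, True],
        "parse_success": [True, True, True, False],
    }
)
print(report)
```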
382
 
383
+ **Important:** At the time of writing, this model card does not include specific numeric metrics. After you run `scripts/evaluate_internal.py` and `scripts/evaluate_spider_external.py`, you should update this section with actual results from:
384
 
385
+ - `reports/eval_internal.json` / `.md`
386
+ - `reports/eval_spider.json` / `.md`
387
 
388
+ ---
389
+
390
+ ### Internal Evaluation (b-mc2/sql-create-context val)
391
+
392
+ **Input:**
393
+ `data/processed/val.jsonl` (same format as training)
394
 
395
+ **Script:**
396
 
397
+ python scripts/evaluate_internal.py \
398
+ --val_path data/processed/val.jsonl \
399
+ --base_model mistralai/Mistral-7B-Instruct-v0.1 \
400
+ --adapter_dir /path/to/outputs/adapters \
401
+ --device auto \
402
+ --max_examples 200 \
403
+ --temperature 0.0 \
404
+ --top_p 0.9 \
405
+ --max_new_tokens 256 \
406
+ --out_dir reports/
407
 
408
+ **Notes:**
409
+ - `--device auto` chooses GPU when available.
410
+ - 4-bit quantization is enabled by default on CUDA; configurable via:
411
+ - `--load_in_4bit` / `--no_load_in_4bit`
412
+ - `--bnb_4bit_quant_type`, `--bnb_4bit_compute_dtype`, etc.
413
+ - `--smoke` runs a small subset; on CPU-only environments it falls back to mock mode (gold SQL as prediction) to exercise the metrics without loading the model.
414
 
415
+ **Outputs:**
416
+ - `reports/eval_internal.json`
417
+ - `reports/eval_internal.md`
418
 
419
+ ---
420
 
421
+ ### External Validation (Spider dev)
422
 
423
+ **Datasets:**
424
+ - Examples: `xlangai/spider` (split: `validation`)
425
+ - Schema helper: `richardr1126/spider-schema` (contains `create_table_context`)
426
+ - License note: `richardr1126/spider-schema` is licensed under **CC BY-SA 4.0**. Spider is used only for evaluation, not training.
427
 
428
+ **Prompt format:**
429
 
430
+ ### Schema:
431
+ <create_table_context>
432
 
433
+ ### Question:
434
+ <Spider question>
435
 
436
+ Instruction text is the same as training. Prompts are constructed with the same formatter used for training (via helper functions in `text2sql.eval.spider`).
437
 
438
+ **Script:**
439
 
440
+ python scripts/evaluate_spider_external.py \
441
+ --base_model mistralai/Mistral-7B-Instruct-v0.1 \
442
+ --adapter_dir /path/to/outputs/adapters \
443
+ --device auto \
444
+ --spider_source xlangai/spider \
445
+ --schema_source richardr1126/spider-schema \
446
+ --spider_split validation \
447
+ --max_examples 200 \
448
+ --temperature 0.0 \
449
+ --top_p 0.9 \
450
+ --max_new_tokens 256 \
451
+ --out_dir reports/
452
 
453
+ **Outputs:**
454
+ - `reports/eval_spider.json`
455
+ - `reports/eval_spider.md`
456
 
457
+ The same metrics (EM, no-values EM, parse success, schema adherence) are computed, but note:
458
+ - This is not a full reproduction of official Spider evaluation (which includes component matching, execution metrics, etc.).
459
+ - It is a lightweight proxy for cross-domain Text-to-SQL quality.
 
 
460
 
461
+ ---
462
 
463
+ ### Mock / Offline Modes
464
 
465
+ Both evaluation scripts have `--mock` modes:
466
+ - Use small fixtures from `tests/fixtures/`
467
+ - Treat gold SQL as predictions
468
+ - Avoid network / heavy model loads
469
+
+ These modes are ideal for CI and offline smoke tests.
470
 
471
+ ---
472
 
473
+ ## Inference and Deployment
474
 
475
+ ### Basic Usage with Hugging Face Transformers (Adapters)
476
 
477
+ Assuming this repo provides a LoRA adapter that you can load on top of `mistralai/Mistral-7B-Instruct-v0.1`:
478
 
479
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
480
+ from peft import PeftModel
481
 
482
+ BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.1"
483
+ ADAPTER_REPO = "your-username/analytics-copilot-text2sql-mistral7b-qlora"
484
 
485
+ tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
486
+ base_model = AutoModelForCausalLM.from_pretrained(
487
+     BASE_MODEL,
488
+     quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
489
+     device_map="auto",
490
+ )
491
 
492
+ model = PeftModel.from_pretrained(base_model, ADAPTER_REPO)
493
 
494
+ schema = """CREATE TABLE orders (
495
+ id INTEGER PRIMARY KEY,
496
+ customer_id INTEGER,
497
+ amount NUMERIC,
498
+ created_at TIMESTAMP
499
+ );"""
500
 
501
+ question = "Total order amount per customer for the last 7 days."
502
 
503
+ instruction = (
504
+ "Write a SQL query that answers the user's question using ONLY "
505
+ "the tables and columns provided in the schema."
506
+ )
507
+ input_text = f"### Schema:\n{schema}\n\n### Question:\n{question}"
508
+
509
+ prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n"
510
+
511
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
512
+ output_ids = model.generate(
513
+ **inputs,
514
+ max_new_tokens=256,
515
+ do_sample=False,  # greedy (deterministic) decoding; `temperature` only applies when sampling
516
+ )
517
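+ # Decode only the newly generated tokens (generate() returns prompt + completion).
+ raw_text = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)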
518
+
519
+ # Optionally, post-process with the project’s SQL cleaner:
520
+ # from text2sql.training.formatting import ensure_sql_only
521
+ # sql = ensure_sql_only(raw_text)
522
+ print(raw_text)
523
+
524
+ ---
525
 
526
+ ### Inference Endpoints + Multi-LoRA (Recommended for Production)
527
 
528
+ If you host the base model in a Hugging Face Inference Endpoint with a Multi-LoRA configuration (via `LORA_ADAPTERS`), you can select this adapter at inference time by `adapter_id`.
529
 
530
+ Example environment for TGI:
531
 
532
+ LORA_ADAPTERS='[
533
+ {"id": "text2sql-qlora", "source": "your-username/analytics-copilot-text2sql-mistral7b-qlora"}
534
+ ]'
535
 
536
+ Then in Python:
537
 
538
+ from huggingface_hub import InferenceClient
539
 
540
+ ENDPOINT_URL = "https://your-endpoint-1234.us-east-1.aws.endpoints.huggingface.cloud"
541
+ HF_TOKEN = "hf_your_token_here"
542
 
543
+ client = InferenceClient(base_url=ENDPOINT_URL, api_key=HF_TOKEN)
544
 
545
+ schema = """CREATE TABLE orders (
546
+ id INTEGER PRIMARY KEY,
547
+ customer_id INTEGER,
548
+ amount NUMERIC,
549
+ created_at TIMESTAMP
550
+ );"""
551
+
552
+ question = "Total order amount per customer for the last 7 days."
553
+
554
+ prompt = f"""### Schema:
555
+ {schema}
556
+
557
+ ### Question:
558
+ {question}
559
+
560
+ Return only the SQL query."""
561
+
562
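+ # Use the typed text_generation API; the raw client.post() call is
562
+ # deprecated in recent huggingface_hub releases.
563
+ response = client.text_generation(
564
+     prompt,
565
+     adapter_id="text2sql-qlora",  # must match an "id" in LORA_ADAPTERS
566
+     max_new_tokens=256,
567
+     # Greedy decoding: TGI requires a strictly positive temperature,
568
+     # so temperature=0.0 is replaced by do_sample=False.
569
+     do_sample=False,
570
+ )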
572
+
573
+ print(response)
574
+
575
+ ---
576
+
577
+ ### Streamlit UI
578
+
579
+ The accompanying repo includes a Streamlit app (`app/streamlit_app.py`) that:
580
+
581
+ - Runs on Streamlit Community Cloud or locally.
582
+ - Calls a Hugging Face Inference Endpoint or router via `InferenceClient`.
583
+ - Reads config from Streamlit secrets or environment:
584
+ - `HF_TOKEN`
585
+ - `HF_ENDPOINT_URL` + `HF_ADAPTER_ID` (preferred, TGI endpoint + adapter)
586
+ - Or `HF_MODEL_ID` + `HF_PROVIDER` (router-based fallback, for merged models)
587
+ - Optionally uses an OpenAI fallback model when HF inference fails.
588
+
589
+ Deployment instructions are documented in `docs/deploy_streamlit_cloud.md`.
590
+
591
+ ---
592
+
593
+ ## Ethical Considerations and Risks
594
+
595
+ ### Data and Bias
596
+
597
+ The training data (`b-mc2/sql-create-context`) may contain:
598
+ - Synthetic or curated schemas and questions
599
+ - Biases in naming conventions, example queries, or tasks
600
+
601
+ The base model (`Mistral-7B-Instruct`) is trained on large-scale web and other data. It inherits any demographic, cultural, and representational biases present in those sources.
602
+
603
+ As a result:
604
+ - The model can produce SQL that, if combined with biased downstream usage (e.g., unfair filtering in a user database), may exacerbate existing biases.
605
+ - The model is not aware of ethical or legal constraints on data access; it will generate queries that retrieve sensitive fields (e.g., emails, PII) whenever such columns exist in the schema.
606
+
607
+ ---
608
+
609
+ ### Safety and Security
610
+
611
+ Generated SQL may contain:
612
+ - Expensive operations (full table scans on large tables)
613
+ - Potentially unsafe patterns (e.g., missing `LIMIT`, cross joins)
614
+
615
+ The model does not perform:
616
+ - Access control
617
+ - Row-level security
618
+ - SQL injection detection
619
+
620
+ You must implement:
621
+ - A strict execution sandbox:
622
+ - Allow only `SELECT` (no `INSERT`, `UPDATE`, `DELETE`, `DROP`, etc.)
623
+ - Enforce timeouts and row limits
624
+ - Appropriate logging and review of executed queries
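As a rough illustration of the first two points, a pre-execution guard might look like the sketch below. It is a deliberately naive, keyword-based check with hypothetical names (`guard_select`, `FORBIDDEN`), not part of this project, and it is no substitute for read-only database credentials and server-side timeouts.

```python
import re

# Write/DDL keywords that should never appear in a read-only query.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|GRANT|REVOKE|MERGE)\b",
    re.IGNORECASE,
)


def guard_select(sql: str, max_rows: int = 1000) -> str:
    """Return a row-limited SELECT, or raise ValueError for anything else."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"(?i)^(SELECT|WITH)\b", statement):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("statement contains a forbidden keyword")
    # Append a LIMIT when none is present; an existing LIMIT is left unchanged.
    if not re.search(r"(?i)\bLIMIT\s+\d+\b", statement):
        statement = f"{statement} LIMIT {max_rows}"
    return statement


print(guard_select("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;"))
```

The check is intentionally conservative: it also rejects harmless queries whose string literals happen to contain a semicolon or a forbidden keyword, which is usually the safer failure mode for generated SQL.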
625
+
626
+ ---
627
+
628
+ ### Human Oversight
629
+
630
+ Always:
631
+ - Present generated SQL to users for review
632
+ - Encourage edits and manual validation
633
+ - Provide clear warnings that the system is a copilot, not an oracle
634
+
635
+ ---
636
+
637
+ ### Environmental Impact
638
+
639
+ Training details vary depending on your hardware and hyperparameters, but in general:
640
+
641
+ - QLoRA + 4-bit quantization significantly reduces compute and memory compared to full fine-tuning:
642
+ - Fewer GPU-hours
643
+ - Lower VRAM requirements
644
+ - The example configuration (7B model, QLoRA, moderate steps) is designed to fit on commodity cloud GPUs (e.g., single A10/A100-class instance).
645
+
646
+ To be transparent, you should log and publish:
647
+ - GPU type and count
648
+ - Total training time
649
+ - Number of runs and restarts
650
+
651
+ ---
652
+
653
+ ## How to Cite
654
+
655
+ If you use this model or the underlying codebase in a research project or production system, please consider citing:
656
+
657
+ - The base model authors: Mistral AI (`mistralai/Mistral-7B-Instruct-v0.1`)
658
+ - The training dataset: `b-mc2/sql-create-context` (see dataset page for citation)
659
+ - This project (replace with your own reference):
660
+ Analytics Copilot (Text-to-SQL) – Mistral-7B QLoRA,
661
+ GitHub: https://github.com/brej-29/analytics-copilot-text2sql
662
+
663
+ You may also add a BibTeX entry, for example:
664
+
665
+ @misc{analytics_copilot_text2sql,
666
+ title = {Analytics Copilot (Text-to-SQL) -- Mistral-7B QLoRA},
667
+ author = {Your Name},
668
+ year = {2026},
669
+ howpublished = {\url{https://github.com/brej-29/analytics-copilot-text2sql}},
670
+ note = {Text-to-SQL fine-tuning of Mistral-7B using QLoRA on b-mc2/sql-create-context}
671
+ }
672
+
673
+ ---
674
 
675
+ ## Changelog
 
676
 
677
+ - **Initial adapter / model card:**
678
+ - QLoRA fine-tuning on `b-mc2/sql-create-context`
679
+ - Internal and external evaluation pipelines implemented
680
+ - Streamlit UI for remote inference via Hugging Face Inference