|
|
--- |
|
|
language: |
|
|
- en |
|
|
base_model: mistralai/Mistral-Nemo-Base-2407 |
|
|
tags: |
|
|
- text-to-sql |
|
|
- mistral-nemo |
|
|
- spider |
|
|
- peft |
|
|
- qlora |
|
|
metrics: |
|
|
- execution_accuracy |
|
|
- exact_match |
|
|
model_creator: NBAmine |
|
|
pipeline_tag: text-generation |
|
|
datasets: |
|
|
- gretelai/synthetic_text_to_sql |
|
|
- xlangai/spider |
|
|
- NBAmine/xlangai-spider-with-context |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# Mistral-Nemo-12B-Text-to-SQL |
|
|
|
|
|
[GitHub Repository](https://github.com/NBAmine/Nemo-text-to-sql)
|
|
|
|
|
|
|
|
## Model Overview |
|
|
This is the full-precision (BF16), merged version of a **Mistral-Nemo-12B** model fine-tuned with parameter-efficient methods for high-performance **Text-to-SQL** generation. It is the result of merging LoRA adapters, trained via a two-phase curriculum-learning strategy, back into the base weights.
|
|
|
|
|
It is designed to serve as the "Source of Truth" for further optimizations (like AWQ or GGUF) and represents the peak predictive performance of the training pipeline before any quantization-related drift. |
|
|
|
|
|
- **Base Model:** `mistralai/Mistral-Nemo-Base-2407` |
|
|
- **Primary Task:** Natural Language to SQL generation with DDL context. |
|
|
- **Output Format:** Standalone SQL queries compatible with standard SQL engines. |
|
|
|
|
|
## Training Methodology |
|
|
The model was developed using an MLOps pipeline on dual NVIDIA T4 GPUs on Kaggle.
|
|
|
|
|
### 1. Curriculum Learning Strategy |
|
|
The model underwent a two-stage training process: |
|
|
- **Phase 1 (Syntactic Alignment):** Focused on SQL syntax, basic keywords, and simple schema mapping. |
|
|
- **Phase 2 (Logical Alignment):** Introduced complex reasoning tasks including multiple `JOIN` operations, nested subqueries, and set operations (`UNION`, `INTERSECT`). |
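
The jump in difficulty between the two phases can be illustrated with made-up queries against a toy Spider-style schema (these examples are hypothetical and not drawn from the actual training data):

```python
import sqlite3

# Hypothetical Phase 1 vs. Phase 2 targets (illustrative only).
phase_1 = "SELECT name FROM singer WHERE age > 30"
phase_2 = """
SELECT s.name FROM singer s
JOIN concert_singer cs ON s.singer_id = cs.singer_id
GROUP BY s.singer_id
HAVING COUNT(*) > 2
UNION
SELECT name FROM singer WHERE country = 'France'
"""

# Sanity-check both against a toy schema; execute() raises on malformed SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (singer_id INTEGER, name TEXT, age INTEGER, country TEXT)")
db.execute("CREATE TABLE concert_singer (concert_id INTEGER, singer_id INTEGER)")
for q in (phase_1, phase_2):
    db.execute(q)
```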
|
|
|
|
|
### 2. Fine-Tuning Details |
|
|
- **Technique:** QLoRA (Rank 16, Alpha 32) |
|
|
- **Quantization (during training):** 4-bit NF4 |
|
|
- **Optimizer:** Paged AdamW 8-bit |
|
|
- **Hardware:** 2x NVIDIA T4 (Kaggle). |
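
The hyperparameters above can be collected into a configuration sketch. The values are the ones reported in this card; the field names merely mirror common `peft`/`bitsandbytes` argument names and are assumptions, not the exact training script:

```python
# QLoRA fine-tuning configuration as described above.
# Values are from this card; key names follow common peft/bitsandbytes
# conventions (assumed).
qlora_config = {
    "r": 16,                       # LoRA rank
    "lora_alpha": 32,              # LoRA scaling factor
    "load_in_4bit": True,          # base weights quantized during training
    "bnb_4bit_quant_type": "nf4",  # 4-bit NormalFloat quantization
    "optim": "paged_adamw_8bit",   # Paged AdamW 8-bit optimizer
}

# Effective LoRA scaling applied to adapter outputs is alpha / r.
scaling = qlora_config["lora_alpha"] / qlora_config["r"]  # 2.0
```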
|
|
|
|
|
## Evaluation Results |
|
|
Evaluated on the **Spider** validation set: |
|
|
- **Execution Accuracy (EX):** **69.5%** |
|
|
- **Exact Match (EM):** 61.2% |
|
|
- **Max Context Length:** 2048 tokens |
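
The gap between the two metrics comes from their definitions: exact match compares query text, while execution accuracy compares result sets, so syntactically different but semantically equivalent queries still count as correct. A minimal sketch of that distinction (using `sqlite3`; this is not the official Spider evaluation harness):

```python
import sqlite3

def execution_match(db, pred_sql, gold_sql):
    """Execution accuracy: predicted and gold queries match if they
    return the same multiset of rows (order-insensitive)."""
    pred = db.execute(pred_sql).fetchall()
    gold = db.execute(gold_sql).fetchall()
    return sorted(map(str, pred)) == sorted(map(str, gold))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
db.executemany("INSERT INTO singer VALUES (?, ?, ?)",
               [(1, "A", 30), (2, "B", 25)])

# Different syntax, same result set: correct under EX, wrong under EM.
print(execution_match(db,
    "SELECT name FROM singer WHERE age < 28",
    "SELECT name FROM singer WHERE age BETWEEN 0 AND 27"))  # → True
```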
|
|
|
|
|
## Architecture Specs |
|
|
The merged weights utilize the standard Mistral-Nemo 12B architecture: |
|
|
- **Parameters:** 12.2B |
|
|
- **Layers:** 40 |
|
|
- **Attention:** Grouped Query Attention (GQA) with 8 KV heads. |
|
|
- **Vocabulary Size:** 128k (Tekken Tokenizer) |
|
|
- **VRAM Requirements:** ~24GB for inference in BF16/FP16. |
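
The ~24GB figure follows directly from the parameter count: BF16/FP16 stores each weight in 2 bytes, so the weights alone occupy about 24.4 GB (activations and the KV cache add more on top):

```python
# Back-of-the-envelope check of the BF16 VRAM figure.
params = 12.2e9          # parameter count from this card
bytes_per_param = 2      # BF16/FP16 = 16 bits per weight
weight_gb = params * bytes_per_param / 1e9
print(f"{weight_gb:.1f} GB")  # → 24.4 GB (weights only)
```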
|
|
|
|
|
|
|
|
## Template used during training |
|
|
|
|
|
```python
prompt = "Context: {DDL}\nQuestion: {NL_QUERY}\nAnswer:"
```
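
At inference time the template can be filled programmatically. The sketch below uses a made-up schema and question; the resulting string would then be passed to the model (e.g. via a transformers text-generation pipeline, not shown here so the snippet runs without the 12B weights):

```python
# Training-time prompt template from this card.
TEMPLATE = "Context: {DDL}\nQuestion: {NL_QUERY}\nAnswer:"

def build_prompt(ddl: str, question: str) -> str:
    """Fill the template with a DDL schema and a natural-language question."""
    return TEMPLATE.format(DDL=ddl, NL_QUERY=question)

# Hypothetical example schema and question.
prompt = build_prompt(
    "CREATE TABLE employees (id INTEGER, name TEXT, salary REAL);",
    "Who are the three highest-paid employees?",
)
print(prompt)
```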