|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- text2sql |
|
|
- spatial-sql |
|
|
- postgis |
|
|
- city-information-modeling |
|
|
- cim |
|
|
- fine-tuned |
|
|
- bird-baseline |
|
|
- lora |
|
|
- qlora |
|
|
base_model: defog/sqlcoder-7b-2 |
|
|
datasets: |
|
|
- taherdoust/ai4cimdb |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline) |
|
|
|
|
|
**Fine-tuned for thesis comparison: Generic BIRD pre-training vs Domain-Specific Training** |
|
|
|
|
|
This model is a fine-tuned version of [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) on the [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) dataset for City Information Modeling (CIM) spatial SQL generation. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**Purpose**: Academic baseline for thesis comparison - demonstrates the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions. |
|
|
|
|
|
**Training Strategy**: Minimal adaptation (1 epoch only) to measure transfer learning effectiveness from BIRD to PostGIS spatial SQL domain. |
|
|
|
|
|
### Key Characteristics |
|
|
|
|
|
- **Base Model**: SQLCoder 7B-2 (StarCoder-based, pre-trained on Spider + BIRD + commercial SQL) |
|
|
- **Training**: 1 epoch fine-tuning on CIM spatial SQL dataset |
|
|
- **Training Time**: 71.7 hours on NVIDIA Quadro RTX 6000 24GB |
|
|
- **Method**: QLoRA (4-bit quantization + LoRA rank 16) |
|
|
- **Trainable Parameters**: 39,976,960 (0.59% of 6.78B total) |
|
|
|
|
|
### Research Question |
|
|
|
|
|
**How does a generic text-to-SQL model perform on specialized spatial SQL tasks?** |
|
|
|
|
|
This model establishes a baseline for comparing: |
|
|
- Generic BIRD pre-training (standard SQL) vs Domain-specific training (PostGIS spatial) |
|
|
- Transfer learning effectiveness for specialized SQL dialects |
|
|
- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.) |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
Generate PostGIS spatial SQL queries for City Information Modeling databases from natural language questions. |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
import torch |
|
|
|
|
|
model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype=torch.float16, |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
question = "Find all buildings within 100 meters of census zone SEZ123" |
|
|
|
|
|
prompt = f"""### Task |
|
|
Generate a SQL query to answer the question. |
|
|
|
|
|
### Database Schema |
|
|
- cim_vector.cim_wizard_building (building_id, building_geometry, project_id) |
|
|
- cim_census.censusgeo (id, sez2011, census_geometry, population) |
|
|
|
|
|
### Question |
|
|
{question} |
|
|
|
|
|
### SQL Query |
|
|
""" |
|
|
|
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
outputs = model.generate(**inputs, max_new_tokens=512) |
|
|
sql = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
print(sql) |
|
|
``` |
|
|
|
|
|
### Thesis Comparison Use |
|
|
|
|
|
Compare this model's performance against domain-specific models to quantify the PostGIS knowledge gap: |
|
|
|
|
|
**Expected Performance:** |
|
|
- **Standard SQL**: 85-90% (similar to domain models) |
|
|
- **PostGIS Spatial Functions**: 30-50% (vs 85-92% domain models) ← **Gap** |
|
|
- **CIM Domain Terms**: 40-60% (vs 85-90% domain models) |
|
|
- **Overall EX Accuracy**: 60-75% baseline → 75-85% after 1 epoch fine-tuning |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) |
|
|
- **Training Samples**: 88,480 (70% of 126,400 curated) |
|
|
- **Validation Samples**: 18,960 (15%) |
|
|
- **Test Samples**: 18,960 (15%) |
|
|
- **Total Raw Samples**: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini) |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
**QLoRA Configuration:** |
|
|
```python |
|
|
BitsAndBytesConfig: |
|
|
- load_in_4bit: True |
|
|
- bnb_4bit_quant_type: "nf4" |
|
|
- bnb_4bit_compute_dtype: bfloat16 |
|
|
|
|
|
LoraConfig: |
|
|
- r: 16 |
|
|
- lora_alpha: 32 |
|
|
- lora_dropout: 0.1 |
|
|
- target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"] |
|
|
``` |
|
|
|
|
|
**Training Hyperparameters:** |
|
|
- **Epochs**: 1 (minimal adaptation for baseline comparison) |
|
|
- **Batch Size**: 2 per device |
|
|
- **Gradient Accumulation**: 8 steps (effective batch size: 16) |
|
|
- **Learning Rate**: 2.0e-4 (higher for 1-epoch training) |
|
|
- **LR Scheduler**: Cosine with 10% warmup |
|
|
- **Optimizer**: Paged AdamW 8-bit |
|
|
- **Precision**: bfloat16 |
|
|
- **Gradient Checkpointing**: Enabled |
|
|
- **Max Sequence Length**: 2048 tokens |
|
|
|
|
|
**Training Results:** |
|
|
- **Training Time**: 71 hours 43 minutes (258,221 seconds) |
|
|
- **Final Training Loss**: 0.0980 |
|
|
- **Training Speed**: 0.343 samples/sec |
|
|
- **Total Steps**: 5,530 steps |
|
|
- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM) |
|
|
|
|
|
### Database Schema Context |
|
|
|
|
|
**CIM Database (PostgreSQL + PostGIS):** |
|
|
|
|
|
**cim_vector Schema:** |
|
|
- `cim_wizard_building`: Building geometries (POLYGON) |
|
|
- `cim_wizard_building_properties`: Building attributes (height, area, energy) |
|
|
- `cim_wizard_project_scenario`: Project and scenario metadata |
|
|
- `network_buses`, `network_lines`: Electrical grid infrastructure (POINT, LINESTRING) |
|
|
|
|
|
**cim_census Schema:** |
|
|
- `censusgeo`: Italian ISTAT 2011 census zones (POLYGON) |
|
|
- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5) |
|
|
|
|
|
**cim_raster Schema:** |
|
|
- `dtm`: Digital Terrain Model (RASTER) |
|
|
- `dsm`: Digital Surface Model (RASTER) |
|
|
- Building height calculation via raster-vector operations |
|
|
|
|
|
## Evaluation Metrics |
|
|
|
|
|
### Expected Performance (Thesis Hypothesis) |
|
|
|
|
|
| Metric | SQLCoder (BIRD) | Domain Models | Gap | |
|
|
|--------|----------------|---------------|-----| |
|
|
| **Standard SQL** | 85-90% | 85-92% | ±2-5% | |
|
|
| **PostGIS Functions** | 30-50% | 85-92% | **35-50%** ← Research Gap | |
|
|
| **CIM Domain Terms** | 40-60% | 85-90% | 25-40% | |
|
|
| **Multi-Schema** | 60-75% | 82-90% | 15-25% | |
|
|
| **Overall EX** | 60-75% | 82-92% | **15-25%** | |
|
|
|
|
|
**Key Finding**: Generic BIRD models struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance. |
|
|
|
|
|
### Evaluation Modes |
|
|
|
|
|
**EM (Exact Match)**: String-level comparison (25-35% expected) |
|
|
**EX (Execution Accuracy)**: Result-level comparison (60-75% expected) |
|
|
**EA (Eventual Accuracy)**: Agent mode with self-correction (65-80% expected) |
|
|
|
|
|
## Academic Contribution |
|
|
|
|
|
### Thesis Context |
|
|
|
|
|
This model serves as a controlled baseline for demonstrating: |
|
|
|
|
|
1. **PostGIS Knowledge Gap**: BIRD pre-training lacks spatial function exposure |
|
|
2. **Domain Terminology Gap**: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training |
|
|
3. **Transfer Learning Limits**: 1 epoch fine-tuning improves but doesn't close the gap |
|
|
4. **Multi-Schema Complexity**: Cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models |
|
|
|
|
|
### Comparison Framework |
|
|
|
|
|
**Models for Comparison:** |
|
|
- **This Model** (SQLCoder 7B BIRD): Generic baseline |
|
|
- **Llama 3.1 8B** (Domain-specific): 3 epochs on CIM data |
|
|
- **Qwen 2.5 14B** (Domain-specific): 3 epochs on CIM data |
|
|
- **DeepSeek-Coder 6.7B** (Domain-specific): 3 epochs on CIM data |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM, 250W TDP) |
|
|
- **Training Time**: 71.7 hours |
|
|
- **Estimated Energy**: ~17.9 kWh (250W × 71.7h) |
|
|
- **Carbon Footprint**: ~7.2 kg CO₂ (401 g CO₂/kWh, Italy grid 2024) |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Base**: StarCoder (7B parameters) |
|
|
- **Attention**: Multi-head attention with 32 heads |
|
|
- **Layers**: 32 transformer layers |
|
|
- **Vocabulary**: 49,152 tokens |
|
|
- **Context Window**: 8,192 tokens |
|
|
- **Activation**: GELU |
|
|
|
|
|
### LoRA Adapter |
|
|
|
|
|
- **Adapter Size**: ~260 MB (safetensors) |
|
|
- **Trainable Params**: 39,976,960 (0.59%) |
|
|
- **Target Modules**: 7 modules (q/k/v/o/gate/up/down projections) |
|
|
- **Rank**: 16 |
|
|
- **Alpha**: 32 |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{taherdoust2025sqlcoder_cim, |
|
|
title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison}, |
|
|
author={Taherdoust, Ali}, |
|
|
year={2025}, |
|
|
institution={Politecnico di Torino}, |
|
|
note={Fine-tuned on ai4cimdb dataset for thesis comparison: |
|
|
Generic BIRD pre-training vs Domain-specific PostGIS training}, |
|
|
url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison} |
|
|
} |
|
|
|
|
|
@software{defog2024sqlcoder, |
|
|
title={SQLCoder: A State-of-the-Art LLM for SQL Generation}, |
|
|
author={Defog.ai}, |
|
|
year={2024}, |
|
|
url={https://github.com/defog-ai/sqlcoder} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Base Model**: [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) by Defog.ai |
|
|
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples) |
|
|
- **Institution**: Politecnico di Torino, DENERG Department |
|
|
- **Infrastructure**: ECLab ipazia126 GPU server |
|
|
- **Frameworks**: HuggingFace Transformers, PEFT, BitsAndBytes |
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0 (inherited from SQLCoder base model) |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
Ali Taherdoust (Politecnico di Torino) |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
ali.taherdoustmohammadi@polito.it |
|
|
|