---
language:
- en
license: apache-2.0
tags:
- text2sql
- spatial-sql
- postgis
- city-information-modeling
- cim
- fine-tuned
- bird-baseline
- lora
- qlora
base_model: defog/sqlcoder-7b-2
datasets:
- taherdoust/ai4cimdb
library_name: transformers
pipeline_tag: text-generation
---
# SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline)
**Fine-tuned for thesis comparison: generic BIRD pre-training vs. domain-specific training**
This model is a fine-tuned version of [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) on the [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) dataset for City Information Modeling (CIM) spatial SQL generation.
## Model Description
**Purpose**: Academic baseline for thesis comparison; it demonstrates the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions.
**Training Strategy**: Minimal adaptation (1 epoch only) to measure transfer learning effectiveness from BIRD to PostGIS spatial SQL domain.
### Key Characteristics
- **Base Model**: SQLCoder 7B-2 (StarCoder-based, pre-trained on Spider + BIRD + commercial SQL)
- **Training**: 1 epoch fine-tuning on CIM spatial SQL dataset
- **Training Time**: 71.7 hours on NVIDIA Quadro RTX 6000 24GB
- **Method**: QLoRA (4-bit quantization + LoRA rank 16)
- **Trainable Parameters**: 39,976,960 (0.59% of 6.78B total)
### Research Question
**How does a generic text-to-SQL model perform on specialized spatial SQL tasks?**
This model establishes a baseline for comparing:
- Generic BIRD pre-training (standard SQL) vs. domain-specific training (PostGIS spatial SQL)
- Transfer learning effectiveness for specialized SQL dialects
- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.)
## Intended Use
### Direct Use
Generate PostGIS spatial SQL queries for City Information Modeling databases from natural language questions.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
device_map="auto"
)
question = "Find all buildings within 100 meters of census zone SEZ123"
prompt = f"""### Task
Generate a SQL query to answer the question.
### Database Schema
- cim_vector.cim_wizard_building (building_id, building_geometry, project_id)
- cim_census.censusgeo (id, sez2011, census_geometry, population)
### Question
{question}
### SQL Query
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(sql)
```
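Greedy decoding (`do_sample=False`) keeps the output deterministic, which helps when reproducing the EM/EX numbers reported below.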
### Thesis Comparison Use
Compare this model's performance against domain-specific models to quantify the PostGIS knowledge gap:
**Expected Performance:**
- **Standard SQL**: 85-90% (similar to domain models)
- **PostGIS Spatial Functions**: 30-50% (vs 85-92% domain models) ← **Gap**
- **CIM Domain Terms**: 40-60% (vs 85-90% domain models)
- **Overall EX Accuracy**: 60-75% baseline → 75-85% after 1 epoch fine-tuning
## Training Details
### Training Data
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb)
- **Training Samples**: 88,480 (70% of 126,400 curated)
- **Validation Samples**: 18,960 (15%)
- **Test Samples**: 18,960 (15%)
- **Total Raw Samples**: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini)
### Training Procedure
**QLoRA Configuration:**
```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```
**Training Hyperparameters:**
- **Epochs**: 1 (minimal adaptation for baseline comparison)
- **Batch Size**: 2 per device
- **Gradient Accumulation**: 8 steps (effective batch size: 16)
- **Learning Rate**: 2.0e-4 (higher for 1-epoch training)
- **LR Scheduler**: Cosine with 10% warmup
- **Optimizer**: Paged AdamW 8-bit
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled
- **Max Sequence Length**: 2048 tokens
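For reference, these hyperparameters map onto standard `transformers` `TrainingArguments` roughly as follows (a sketch, not the exact training script; the output directory is a placeholder):
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./sqlcoder-7b-cim-baseline",  # placeholder path
    num_train_epochs=1,                       # minimal adaptation for the baseline
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,            # effective batch size: 2 x 8 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                         # 10% warmup
    optim="paged_adamw_8bit",
    bf16=True,
    gradient_checkpointing=True,
)
```
The 2,048-token maximum sequence length is typically passed to the trainer itself (e.g., `max_seq_length` in trl's `SFTTrainer`).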
**Training Results:**
- **Training Time**: 71 hours 43 minutes (258,221 seconds)
- **Final Training Loss**: 0.0980
- **Training Speed**: 0.343 samples/sec
- **Total Steps**: 5,530 (88,480 samples ÷ effective batch size 16)
- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM)
### Database Schema Context
**CIM Database (PostgreSQL + PostGIS):**
**cim_vector Schema:**
- `cim_wizard_building`: Building geometries (POLYGON)
- `cim_wizard_building_properties`: Building attributes (height, area, energy)
- `cim_wizard_project_scenario`: Project and scenario metadata
- `network_buses`, `network_lines`: Electrical grid infrastructure (POINT, LINESTRING)
**cim_census Schema:**
- `censusgeo`: Italian ISTAT 2011 census zones (POLYGON)
- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5)
**cim_raster Schema:**
- `dtm`: Digital Terrain Model (RASTER)
- `dsm`: Digital Surface Model (RASTER)
- Building height calculation via raster-vector operations
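To make the schema concrete, here is the kind of query the model is expected to produce for the example question used earlier. This is an illustrative target, not a gold label from the dataset; the `::geography` casts assume lon/lat geometries and would be unnecessary with a projected CRS:
```python
# Illustrative target for: "Find all buildings within 100 meters of census zone SEZ123"
example_sql = """
SELECT b.building_id
FROM cim_vector.cim_wizard_building AS b
JOIN cim_census.censusgeo AS c
  ON ST_DWithin(b.building_geometry::geography,
                c.census_geometry::geography,
                100)  -- distance in meters on the geography type
WHERE c.sez2011 = 'SEZ123';
"""
print(example_sql)
```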
## Evaluation Metrics
### Expected Performance (Thesis Hypothesis)
| Metric | SQLCoder (BIRD) | Domain Models | Gap |
|--------|----------------|---------------|-----|
| **Standard SQL** | 85-90% | 85-92% | ±2-5% |
| **PostGIS Functions** | 30-50% | 85-92% | **35-50%** ← Research Gap |
| **CIM Domain Terms** | 40-60% | 85-90% | 25-40% |
| **Multi-Schema** | 60-75% | 82-90% | 15-25% |
| **Overall EX** | 60-75% | 82-92% | **15-25%** |
**Key Finding**: Generic BIRD models struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance.
### Evaluation Modes
- **EM (Exact Match)**: string-level comparison (25-35% expected)
- **EX (Execution Accuracy)**: result-level comparison (60-75% expected)
- **EA (Eventual Accuracy)**: agent mode with self-correction (65-80% expected)
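A minimal sketch of how EX can be scored, assuming a live PostGIS instance reachable via `psycopg2` (the actual evaluation harness is not included in this card):
```python
import psycopg2

def execution_match(pred_sql: str, gold_sql: str, dsn: str) -> bool:
    """EX: True if both queries execute and return the same multiset of rows."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        try:
            cur.execute(pred_sql)
            pred_rows = cur.fetchall()
        except psycopg2.Error:
            conn.rollback()  # clear the aborted transaction
            return False     # predicted SQL failed to parse or execute
        cur.execute(gold_sql)
        gold_rows = cur.fetchall()
    # Order-insensitive comparison: EX compares results, not SQL strings
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))
```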
## Academic Contribution
### Thesis Context
This model serves as a controlled baseline for demonstrating:
1. **PostGIS Knowledge Gap**: BIRD pre-training lacks spatial function exposure
2. **Domain Terminology Gap**: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training
3. **Transfer Learning Limits**: 1 epoch fine-tuning improves but doesn't close the gap
4. **Multi-Schema Complexity**: Cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models
### Comparison Framework
**Models for Comparison:**
- **This Model** (SQLCoder 7B BIRD): Generic baseline
- **Llama 3.1 8B** (Domain-specific): 3 epochs on CIM data
- **Qwen 2.5 14B** (Domain-specific): 3 epochs on CIM data
- **DeepSeek-Coder 6.7B** (Domain-specific): 3 epochs on CIM data
## Environmental Impact
- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM, 250W TDP)
- **Training Time**: 71.7 hours
- **Estimated Energy**: ~17.9 kWh (250W × 71.7h)
- **Carbon Footprint**: ~7.2 kg CO₂ (401 g CO₂/kWh, Italy grid 2024)
## Technical Specifications
### Model Architecture
- **Base**: StarCoder (7B parameters)
- **Attention**: Multi-head attention with 32 heads
- **Layers**: 32 transformer layers
- **Vocabulary**: 49,152 tokens
- **Context Window**: 8,192 tokens
- **Activation**: GELU
### LoRA Adapter
- **Adapter Size**: ~260 MB (safetensors)
- **Trainable Params**: 39,976,960 (0.59%)
- **Target Modules**: 7 modules (q/k/v/o/gate/up/down projections)
- **Rank**: 16
- **Alpha**: 32
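If the LoRA weights are consumed as a standalone adapter rather than a merged checkpoint, they can be attached to the base model with PEFT (a sketch; it assumes the adapter is hosted under this repo id):
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model, then attach the ~260 MB LoRA adapter on top
base = AutoModelForCausalLM.from_pretrained(
    "defog/sqlcoder-7b-2",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison")
tokenizer = AutoTokenizer.from_pretrained("defog/sqlcoder-7b-2")
```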
## Citation
```bibtex
@misc{taherdoust2025sqlcoder_cim,
title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison},
author={Taherdoust, Ali},
year={2025},
institution={Politecnico di Torino},
note={Fine-tuned on ai4cimdb dataset for thesis comparison:
Generic BIRD pre-training vs Domain-specific PostGIS training},
url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison}
}
@software{defog2024sqlcoder,
title={SQLCoder: A State-of-the-Art LLM for SQL Generation},
author={Defog.ai},
year={2024},
url={https://github.com/defog-ai/sqlcoder}
}
```
## Acknowledgments
- **Base Model**: [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) by Defog.ai
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples)
- **Institution**: Politecnico di Torino, DENERG Department
- **Infrastructure**: ECLab ipazia126 GPU server
- **Frameworks**: HuggingFace Transformers, PEFT, BitsAndBytes
## License
Apache 2.0 (inherited from SQLCoder base model)
## Model Card Authors
Ali Taherdoust (Politecnico di Torino)
## Model Card Contact
ali.taherdoustmohammadi@polito.it