---
language:
- en
license: apache-2.0
tags:
- text2sql
- spatial-sql
- postgis
- city-information-modeling
- cim
- fine-tuned
- bird-baseline
- lora
- qlora
base_model: defog/sqlcoder-7b-2
datasets:
- taherdoust/ai4cimdb
library_name: transformers
pipeline_tag: text-generation
---

# SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline)

**Fine-tuned for thesis comparison: Generic BIRD pre-training vs Domain-Specific Training**

This model is a fine-tuned version of [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) on the [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) dataset for City Information Modeling (CIM) spatial SQL generation.

## Model Description

**Purpose**: Academic baseline for thesis comparison - demonstrates the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions.

**Training Strategy**: Minimal adaptation (1 epoch only) to measure transfer learning effectiveness from BIRD to the PostGIS spatial SQL domain.

### Key Characteristics

- **Base Model**: SQLCoder 7B-2 (StarCoder-based, pre-trained on Spider + BIRD + commercial SQL)
- **Training**: 1 epoch of fine-tuning on the CIM spatial SQL dataset
- **Training Time**: 71.7 hours on an NVIDIA Quadro RTX 6000 (24GB)
- **Method**: QLoRA (4-bit quantization + LoRA rank 16)
- **Trainable Parameters**: 39,976,960 (0.59% of 6.78B total)

### Research Question

**How does a generic text-to-SQL model perform on specialized spatial SQL tasks?**

This model establishes a baseline for comparing:

- Generic BIRD pre-training (standard SQL) vs domain-specific training (PostGIS spatial)
- Transfer learning effectiveness for specialized SQL dialects
- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.)

## Intended Use

### Direct Use

Generate PostGIS spatial SQL queries for City Information Modeling databases from natural language questions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

question = "Find all buildings within 100 meters of census zone SEZ123"

prompt = f"""### Task
Generate a SQL query to answer the question.

### Database Schema
- cim_vector.cim_wizard_building (building_id, building_geometry, project_id)
- cim_census.censusgeo (id, sez2011, census_geometry, population)

### Question
{question}

### SQL Query
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql)
```
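The snippet above assumes merged weights are hosted in the repository. If the repo instead ships only the LoRA adapter (~260 MB, see Technical Specifications below), the adapter can be attached to the 4-bit quantized base model with PEFT. A minimal sketch, assuming the adapter is published under the same repo id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit, matching the QLoRA training setup
base = AutoModelForCausalLM.from_pretrained(
    "defog/sqlcoder-7b-2",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("defog/sqlcoder-7b-2")

# Attach the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(
    base, "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
)
```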
### Thesis Comparison Use

Compare this model's performance against domain-specific models to quantify the PostGIS knowledge gap:

**Expected Performance:**

- **Standard SQL**: 85-90% (similar to domain models)
- **PostGIS Spatial Functions**: 30-50% (vs 85-92% for domain models) ← **Gap**
- **CIM Domain Terms**: 40-60% (vs 85-90% for domain models)
- **Overall EX Accuracy**: 60-75% baseline → 75-85% after 1 epoch of fine-tuning

## Training Details

### Training Data

- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb)
- **Training Samples**: 88,480 (70% of 126,400 curated)
- **Validation Samples**: 18,960 (15%)
- **Test Samples**: 18,960 (15%)
- **Total Raw Samples**: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini)

### Training Procedure

**QLoRA Configuration:**

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

**Training Hyperparameters:**

- **Epochs**: 1 (minimal adaptation for baseline comparison)
- **Batch Size**: 2 per device
- **Gradient Accumulation**: 8 steps (effective batch size: 16)
- **Learning Rate**: 2.0e-4 (higher than a typical multi-epoch schedule, to compensate for single-epoch training)
- **LR Scheduler**: Cosine with 10% warmup
- **Optimizer**: Paged AdamW 8-bit
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled
- **Max Sequence Length**: 2048 tokens

**Training Results:**

- **Training Time**: 71 hours 43 minutes (258,221 seconds)
- **Final Training Loss**: 0.0980
- **Training Speed**: 0.343 samples/sec
- **Total Steps**: 5,530
- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM)

### Database Schema Context

**CIM Database (PostgreSQL + PostGIS):**

**cim_vector Schema:**
- `cim_wizard_building`: Building geometries (POLYGON)
- `cim_wizard_building_properties`: Building attributes (height, area, energy)
- `cim_wizard_project_scenario`: Project and scenario metadata
- `network_buses`, `network_lines`: Electrical grid infrastructure (POINT, LINESTRING)

**cim_census Schema:**
- `censusgeo`: Italian ISTAT 2011 census zones (POLYGON)
- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5)

**cim_raster Schema:**
- `dtm`: Digital Terrain Model (RASTER)
- `dsm`: Digital Surface Model (RASTER)
- Building height calculation via raster-vector operations

## Evaluation Metrics

### Expected Performance (Thesis Hypothesis)

| Metric | SQLCoder (BIRD) | Domain Models | Gap |
|--------|-----------------|---------------|-----|
| **Standard SQL** | 85-90% | 85-92% | ±2-5% |
| **PostGIS Functions** | 30-50% | 85-92% | **35-50%** ← Research Gap |
| **CIM Domain Terms** | 40-60% | 85-90% | 25-40% |
| **Multi-Schema** | 60-75% | 82-90% | 15-25% |
| **Overall EX** | 60-75% | 82-92% | **15-25%** |

**Key Finding**: Generic BIRD models struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance.

### Evaluation Modes

- **EM (Exact Match)**: String-level comparison (25-35% expected)
- **EX (Execution Accuracy)**: Result-level comparison (60-75% expected)
- **EA (Eventual Accuracy)**: Agent mode with self-correction (65-80% expected)
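For concreteness, EX can be scored by executing both the predicted and the gold query against a live PostGIS instance and comparing result sets. A minimal sketch, assuming `psycopg2` and a connection string for the CIM database; the function name `execution_match` is illustrative and the actual thesis evaluation harness may differ:

```python
import psycopg2

def execution_match(pred_sql: str, gold_sql: str, dsn: str) -> bool:
    """EX metric sketch: do predicted and gold SQL return the same result set?"""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            try:
                cur.execute(pred_sql)
                pred_rows = cur.fetchall()
            except psycopg2.Error:
                conn.rollback()  # predicted query failed to execute
                return False
            cur.execute(gold_sql)
            gold_rows = cur.fetchall()
    # Order-insensitive comparison; questions that require a specific
    # ordering would compare the row lists directly instead.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Example: execution_match(model_sql, reference_sql, "dbname=cim host=localhost")
```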
## Academic Contribution

### Thesis Context

This model serves as a controlled baseline for demonstrating:

1. **PostGIS Knowledge Gap**: BIRD pre-training lacks spatial function exposure
2. **Domain Terminology Gap**: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training
3. **Transfer Learning Limits**: 1 epoch of fine-tuning improves performance but does not close the gap
4. **Multi-Schema Complexity**: Cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models

### Comparison Framework

**Models for Comparison:**

- **This Model** (SQLCoder 7B BIRD): Generic baseline
- **Llama 3.1 8B** (Domain-specific): 3 epochs on CIM data
- **Qwen 2.5 14B** (Domain-specific): 3 epochs on CIM data
- **DeepSeek-Coder 6.7B** (Domain-specific): 3 epochs on CIM data

## Environmental Impact

- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM, 250W TDP)
- **Training Time**: 71.7 hours
- **Estimated Energy**: ~17.9 kWh (250W × 71.7h)
- **Carbon Footprint**: ~7.2 kg CO₂ (401 g CO₂/kWh, Italy grid 2024)

## Technical Specifications

### Model Architecture

- **Base**: StarCoder (7B parameters)
- **Attention**: Multi-head attention with 32 heads
- **Layers**: 32 transformer layers
- **Vocabulary**: 49,152 tokens
- **Context Window**: 8,192 tokens
- **Activation**: GELU

### LoRA Adapter

- **Adapter Size**: ~260 MB (safetensors)
- **Trainable Params**: 39,976,960 (0.59%)
- **Target Modules**: 7 modules (q/k/v/o/gate/up/down projections)
- **Rank**: 16
- **Alpha**: 32

## Citation

```bibtex
@misc{taherdoust2025sqlcoder_cim,
  title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison},
  author={Taherdoust, Ali},
  year={2025},
  institution={Politecnico di Torino},
  note={Fine-tuned on ai4cimdb dataset for thesis comparison: Generic BIRD pre-training vs Domain-specific PostGIS training},
  url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison}
}

@software{defog2024sqlcoder,
  title={SQLCoder: A State-of-the-Art LLM for SQL Generation},
  author={Defog.ai},
  year={2024},
  url={https://github.com/defog-ai/sqlcoder}
}
```

## Acknowledgments

- **Base Model**: [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) by Defog.ai
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples)
- **Institution**: Politecnico di Torino, DENERG Department
- **Infrastructure**: ECLab ipazia126 GPU server
- **Frameworks**: HuggingFace Transformers, PEFT, BitsAndBytes

## License

Apache 2.0 (inherited from the SQLCoder base model)

## Model Card Authors

Ali Taherdoust (Politecnico di Torino)

## Model Card Contact

ali.taherdoustmohammadi@polito.it