File size: 8,839 Bytes
2dcc8f6
7f255e4
 
 
 
 
 
 
 
 
 
 
 
 
2dcc8f6
7f255e4
 
 
 
2dcc8f6
 
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
2dcc8f6
7f255e4
2dcc8f6
 
 
7f255e4
2dcc8f6
7f255e4
 
 
2dcc8f6
7f255e4
 
 
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
2dcc8f6
7f255e4
 
 
2dcc8f6
7f255e4
 
2dcc8f6
7f255e4
 
2dcc8f6
7f255e4
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
2dcc8f6
 
 
 
 
7f255e4
 
 
 
 
2dcc8f6
 
 
7f255e4
 
 
 
 
 
2dcc8f6
7f255e4
 
 
 
 
 
2dcc8f6
7f255e4
 
 
 
 
 
 
 
 
 
2dcc8f6
7f255e4
 
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
2dcc8f6
7f255e4
 
 
2dcc8f6
7f255e4
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
2dcc8f6
 
 
7f255e4
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
 
 
 
 
 
2dcc8f6
7f255e4
 
 
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
 
 
 
 
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
7f255e4
2dcc8f6
 
 
7f255e4
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
---
language:
- en
license: apache-2.0
tags:
- text2sql
- spatial-sql
- postgis
- city-information-modeling
- cim
- fine-tuned
- bird-baseline
- lora
- qlora
base_model: defog/sqlcoder-7b-2
datasets:
- taherdoust/ai4cimdb
library_name: transformers
pipeline_tag: text-generation
---

# SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline)

**Fine-tuned for thesis comparison: Generic BIRD pre-training vs Domain-Specific Training**

This model is a fine-tuned version of [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) on the [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) dataset for City Information Modeling (CIM) spatial SQL generation.

## Model Description

**Purpose**: Academic baseline for thesis comparison - demonstrates the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions.

**Training Strategy**: Minimal adaptation (1 epoch only) to measure transfer learning effectiveness from BIRD to PostGIS spatial SQL domain.

### Key Characteristics

- **Base Model**: SQLCoder 7B-2 (StarCoder-based, pre-trained on Spider + BIRD + commercial SQL)
- **Training**: 1 epoch fine-tuning on CIM spatial SQL dataset
- **Training Time**: 71.7 hours on NVIDIA Quadro RTX 6000 24GB
- **Method**: QLoRA (4-bit quantization + LoRA rank 16)
- **Trainable Parameters**: 39,976,960 (0.59% of 6.78B total)

### Research Question

**How does a generic text-to-SQL model perform on specialized spatial SQL tasks?**

This model establishes a baseline for comparing:
- Generic BIRD pre-training (standard SQL) vs Domain-specific training (PostGIS spatial)
- Transfer learning effectiveness for specialized SQL dialects
- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.)

## Intended Use

### Direct Use

Generate PostGIS spatial SQL queries for City Information Modeling databases from natural language questions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

question = "Find all buildings within 100 meters of census zone SEZ123"

prompt = f"""### Task
Generate a SQL query to answer the question.

### Database Schema
- cim_vector.cim_wizard_building (building_id, building_geometry, project_id)
- cim_census.censusgeo (id, sez2011, census_geometry, population)

### Question
{question}

### SQL Query
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql)
```

### Thesis Comparison Use

Compare this model's performance against domain-specific models to quantify the PostGIS knowledge gap:

**Expected Performance:**
- **Standard SQL**: 85-90% (similar to domain models)
- **PostGIS Spatial Functions**: 30-50% (vs 85-92% domain models) ← **Gap**
- **CIM Domain Terms**: 40-60% (vs 85-90% domain models)
- **Overall EX Accuracy**: 60-75% baseline → 75-85% after 1 epoch fine-tuning

## Training Details

### Training Data

- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb)
- **Training Samples**: 88,480 (70% of 126,400 curated)
- **Validation Samples**: 18,960 (15%)
- **Test Samples**: 18,960 (15%)
- **Total Raw Samples**: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini)

### Training Procedure

**QLoRA Configuration:**
```python
BitsAndBytesConfig:
  - load_in_4bit: True
  - bnb_4bit_quant_type: "nf4"
  - bnb_4bit_compute_dtype: bfloat16

LoraConfig:
  - r: 16
  - lora_alpha: 32
  - lora_dropout: 0.1
  - target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
```

**Training Hyperparameters:**
- **Epochs**: 1 (minimal adaptation for baseline comparison)
- **Batch Size**: 2 per device
- **Gradient Accumulation**: 8 steps (effective batch size: 16)
- **Learning Rate**: 2.0e-4 (higher for 1-epoch training)
- **LR Scheduler**: Cosine with 10% warmup
- **Optimizer**: Paged AdamW 8-bit
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled
- **Max Sequence Length**: 2048 tokens

**Training Results:**
- **Training Time**: 71 hours 43 minutes (258,221 seconds)
- **Final Training Loss**: 0.0980
- **Training Speed**: 0.343 samples/sec
- **Total Steps**: 5,530 steps
- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM)

### Database Schema Context

**CIM Database (PostgreSQL + PostGIS):**

**cim_vector Schema:**
- `cim_wizard_building`: Building geometries (POLYGON)
- `cim_wizard_building_properties`: Building attributes (height, area, energy)
- `cim_wizard_project_scenario`: Project and scenario metadata
- `network_buses`, `network_lines`: Electrical grid infrastructure (POINT, LINESTRING)

**cim_census Schema:**
- `censusgeo`: Italian ISTAT 2011 census zones (POLYGON)
- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5)

**cim_raster Schema:**
- `dtm`: Digital Terrain Model (RASTER)
- `dsm`: Digital Surface Model (RASTER)
- Building height calculation via raster-vector operations

## Evaluation Metrics

### Expected Performance (Thesis Hypothesis)

| Metric | SQLCoder (BIRD) | Domain Models | Gap |
|--------|----------------|---------------|-----|
| **Standard SQL** | 85-90% | 85-92% | ±2-5% |
| **PostGIS Functions** | 30-50% | 85-92% | **35-50%** ← Research Gap |
| **CIM Domain Terms** | 40-60% | 85-90% | 25-40% |
| **Multi-Schema** | 60-75% | 82-90% | 15-25% |
| **Overall EX** | 60-75% | 82-92% | **15-25%** |

**Key Finding**: Generic BIRD models struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance.

### Evaluation Modes

**EM (Exact Match)**: String-level comparison (25-35% expected)
**EX (Execution Accuracy)**: Result-level comparison (60-75% expected)
**EA (Eventual Accuracy)**: Agent mode with self-correction (65-80% expected)

## Academic Contribution

### Thesis Context

This model serves as a controlled baseline for demonstrating:

1. **PostGIS Knowledge Gap**: BIRD pre-training lacks spatial function exposure
2. **Domain Terminology Gap**: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training
3. **Transfer Learning Limits**: 1 epoch fine-tuning improves but doesn't close the gap
4. **Multi-Schema Complexity**: Cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models

### Comparison Framework

**Models for Comparison:**
- **This Model** (SQLCoder 7B BIRD): Generic baseline
- **Llama 3.1 8B** (Domain-specific): 3 epochs on CIM data
- **Qwen 2.5 14B** (Domain-specific): 3 epochs on CIM data
- **DeepSeek-Coder 6.7B** (Domain-specific): 3 epochs on CIM data

## Environmental Impact

- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM, 250W TDP)
- **Training Time**: 71.7 hours
- **Estimated Energy**: ~17.9 kWh (250W × 71.7h)
- **Carbon Footprint**: ~7.2 kg CO₂ (401 g CO₂/kWh, Italy grid 2024)

## Technical Specifications

### Model Architecture

- **Base**: StarCoder (7B parameters)
- **Attention**: Multi-head attention with 32 heads
- **Layers**: 32 transformer layers
- **Vocabulary**: 49,152 tokens
- **Context Window**: 8,192 tokens
- **Activation**: GELU

### LoRA Adapter

- **Adapter Size**: ~260 MB (safetensors)
- **Trainable Params**: 39,976,960 (0.59%)
- **Target Modules**: 7 modules (q/k/v/o/gate/up/down projections)
- **Rank**: 16
- **Alpha**: 32

## Citation

```bibtex
@misc{taherdoust2025sqlcoder_cim,
  title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison},
  author={Taherdoust, Ali},
  year={2025},
  institution={Politecnico di Torino},
  note={Fine-tuned on ai4cimdb dataset for thesis comparison: 
        Generic BIRD pre-training vs Domain-specific PostGIS training},
  url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison}
}

@software{defog2024sqlcoder,
  title={SQLCoder: A State-of-the-Art LLM for SQL Generation},
  author={Defog.ai},
  year={2024},
  url={https://github.com/defog-ai/sqlcoder}
}
```

## Acknowledgments

- **Base Model**: [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) by Defog.ai
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples)
- **Institution**: Politecnico di Torino, DENERG Department
- **Infrastructure**: ECLab ipazia126 GPU server
- **Frameworks**: HuggingFace Transformers, PEFT, BitsAndBytes

## License

Apache 2.0 (inherited from SQLCoder base model)

## Model Card Authors

Ali Taherdoust (Politecnico di Torino)

## Model Card Contact

ali.taherdoustmohammadi@polito.it