---
language:
- en
license: apache-2.0
tags:
- text2sql
- spatial-sql
- postgis
- city-information-modeling
- cim
- fine-tuned
- bird-baseline
- lora
- qlora
base_model: defog/sqlcoder-7b-2
datasets:
- taherdoust/ai4cimdb
library_name: transformers
pipeline_tag: text-generation
---

# SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline)

**Fine-tuned for thesis comparison: Generic BIRD pre-training vs Domain-Specific Training**

This model is a fine-tuned version of [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) on the [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) dataset for City Information Modeling (CIM) spatial SQL generation.

## Model Description

**Purpose**: Academic baseline for thesis comparison - demonstrates the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions.

**Training Strategy**: Minimal adaptation (1 epoch only) to measure transfer learning effectiveness from BIRD to the PostGIS spatial SQL domain.

### Key Characteristics

- **Base Model**: SQLCoder 7B-2 (StarCoder-based, pre-trained on Spider + BIRD + commercial SQL)
- **Training**: 1 epoch of fine-tuning on the CIM spatial SQL dataset
- **Training Time**: 71.7 hours on an NVIDIA Quadro RTX 6000 (24GB)
- **Method**: QLoRA (4-bit quantization + LoRA rank 16)
- **Trainable Parameters**: 39,976,960 (0.59% of 6.78B total)

### Research Question

**How does a generic text-to-SQL model perform on specialized spatial SQL tasks?**

This model establishes a baseline for comparing:

- Generic BIRD pre-training (standard SQL) vs domain-specific training (PostGIS spatial)
- Transfer learning effectiveness for specialized SQL dialects
- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.)

## Intended Use

### Direct Use

Generate PostGIS spatial SQL queries for City Information Modeling databases from natural language questions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

question = "Find all buildings within 100 meters of census zone SEZ123"

prompt = f"""### Task
Generate a SQL query to answer the question.

### Database Schema
- cim_vector.cim_wizard_building (building_id, building_geometry, project_id)
- cim_census.censusgeo (id, sez2011, census_geometry, population)

### Question
{question}

### SQL Query
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql)
```
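The snippet above assumes merged weights are hosted in the repository. If the repo instead ships only the LoRA adapter (~260 MB, see Technical Specifications below), the adapter can be attached to the 4-bit quantized base model with PEFT. A minimal sketch, assuming the adapter is published under the same repo id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit, matching the QLoRA training setup
base = AutoModelForCausalLM.from_pretrained(
    "defog/sqlcoder-7b-2",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("defog/sqlcoder-7b-2")

# Attach the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(
    base, "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
)
```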
### Thesis Comparison Use

Compare this model's performance against domain-specific models to quantify the PostGIS knowledge gap:

**Expected Performance:**

- **Standard SQL**: 85-90% (similar to domain models)
- **PostGIS Spatial Functions**: 30-50% (vs 85-92% for domain models) ← **Gap**
- **CIM Domain Terms**: 40-60% (vs 85-90% for domain models)
- **Overall EX Accuracy**: 60-75% baseline → 75-85% after 1 epoch of fine-tuning

## Training Details

### Training Data

- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb)
- **Training Samples**: 88,480 (70% of 126,400 curated)
- **Validation Samples**: 18,960 (15%)
- **Test Samples**: 18,960 (15%)
- **Total Raw Samples**: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini)

### Training Procedure

**QLoRA Configuration:**

```python
from transformers import BitsAndBytesConfig
from peft import LoraConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

**Training Hyperparameters:**

- **Epochs**: 1 (minimal adaptation for baseline comparison)
- **Batch Size**: 2 per device
- **Gradient Accumulation**: 8 steps (effective batch size: 16)
- **Learning Rate**: 2.0e-4 (higher than a typical multi-epoch schedule, to compensate for single-epoch training)
- **LR Scheduler**: Cosine with 10% warmup
- **Optimizer**: Paged AdamW 8-bit
- **Precision**: bfloat16
- **Gradient Checkpointing**: Enabled
- **Max Sequence Length**: 2048 tokens

**Training Results:**

- **Training Time**: 71 hours 43 minutes (258,221 seconds)
- **Final Training Loss**: 0.0980
- **Training Speed**: 0.343 samples/sec
- **Total Steps**: 5,530
- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM)

### Database Schema Context

**CIM Database (PostgreSQL + PostGIS):**

**cim_vector Schema:**
- `cim_wizard_building`: Building geometries (POLYGON)
- `cim_wizard_building_properties`: Building attributes (height, area, energy)
- `cim_wizard_project_scenario`: Project and scenario metadata
- `network_buses`, `network_lines`: Electrical grid infrastructure (POINT, LINESTRING)

**cim_census Schema:**
- `censusgeo`: Italian ISTAT 2011 census zones (POLYGON)
- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5)

**cim_raster Schema:**
- `dtm`: Digital Terrain Model (RASTER)
- `dsm`: Digital Surface Model (RASTER)
- Building height calculation via raster-vector operations

## Evaluation Metrics

### Expected Performance (Thesis Hypothesis)

| Metric | SQLCoder (BIRD) | Domain Models | Gap |
|--------|-----------------|---------------|-----|
| **Standard SQL** | 85-90% | 85-92% | ±2-5% |
| **PostGIS Functions** | 30-50% | 85-92% | **35-50%** ← Research Gap |
| **CIM Domain Terms** | 40-60% | 85-90% | 25-40% |
| **Multi-Schema** | 60-75% | 82-90% | 15-25% |
| **Overall EX** | 60-75% | 82-92% | **15-25%** |

**Key Finding**: Generic BIRD models struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance.

### Evaluation Modes

- **EM (Exact Match)**: String-level comparison (25-35% expected)
- **EX (Execution Accuracy)**: Result-level comparison (60-75% expected)
- **EA (Eventual Accuracy)**: Agent mode with self-correction (65-80% expected)
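For concreteness, EX can be scored by executing both the predicted and the gold query against a live PostGIS instance and comparing result sets. A minimal sketch, assuming `psycopg2` and a connection string for the CIM database; the function name `execution_match` is illustrative and the actual thesis evaluation harness may differ:

```python
import psycopg2

def execution_match(pred_sql: str, gold_sql: str, dsn: str) -> bool:
    """EX metric sketch: do predicted and gold SQL return the same result set?"""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            try:
                cur.execute(pred_sql)
                pred_rows = cur.fetchall()
            except psycopg2.Error:
                conn.rollback()  # predicted query failed to execute
                return False
            cur.execute(gold_sql)
            gold_rows = cur.fetchall()
    # Order-insensitive comparison; questions that require a specific
    # ordering would compare the row lists directly instead.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Example: execution_match(model_sql, reference_sql, "dbname=cim host=localhost")
```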
## Academic Contribution

### Thesis Context

This model serves as a controlled baseline for demonstrating:

1. **PostGIS Knowledge Gap**: BIRD pre-training lacks spatial function exposure
2. **Domain Terminology Gap**: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training
3. **Transfer Learning Limits**: 1 epoch of fine-tuning improves performance but does not close the gap
4. **Multi-Schema Complexity**: Cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models

### Comparison Framework

**Models for Comparison:**

- **This Model** (SQLCoder 7B BIRD): Generic baseline
- **Llama 3.1 8B** (Domain-specific): 3 epochs on CIM data
- **Qwen 2.5 14B** (Domain-specific): 3 epochs on CIM data
- **DeepSeek-Coder 6.7B** (Domain-specific): 3 epochs on CIM data

## Environmental Impact

- **Hardware**: NVIDIA Quadro RTX 6000 (24GB VRAM, 250W TDP)
- **Training Time**: 71.7 hours
- **Estimated Energy**: ~17.9 kWh (250W × 71.7h)
- **Carbon Footprint**: ~7.2 kg CO₂ (401 g CO₂/kWh, Italy grid 2024)

## Technical Specifications

### Model Architecture

- **Base**: StarCoder (7B parameters)
- **Attention**: Multi-head attention with 32 heads
- **Layers**: 32 transformer layers
- **Vocabulary**: 49,152 tokens
- **Context Window**: 8,192 tokens
- **Activation**: GELU

### LoRA Adapter

- **Adapter Size**: ~260 MB (safetensors)
- **Trainable Params**: 39,976,960 (0.59%)
- **Target Modules**: 7 modules (q/k/v/o/gate/up/down projections)
- **Rank**: 16
- **Alpha**: 32

## Citation

```bibtex
@misc{taherdoust2025sqlcoder_cim,
  title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison},
  author={Taherdoust, Ali},
  year={2025},
  institution={Politecnico di Torino},
  note={Fine-tuned on ai4cimdb dataset for thesis comparison: Generic BIRD pre-training vs Domain-specific PostGIS training},
  url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison}
}

@software{defog2024sqlcoder,
  title={SQLCoder: A State-of-the-Art LLM for SQL Generation},
  author={Defog.ai},
  year={2024},
  url={https://github.com/defog-ai/sqlcoder}
}
```

## Acknowledgments

- **Base Model**: [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) by Defog.ai
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples)
- **Institution**: Politecnico di Torino, DENERG Department
- **Infrastructure**: ECLab ipazia126 GPU server
- **Frameworks**: HuggingFace Transformers, PEFT, BitsAndBytes

## License

Apache 2.0 (inherited from the SQLCoder base model)

## Model Card Authors

Ali Taherdoust (Politecnico di Torino)

## Model Card Contact

ali.taherdoustmohammadi@polito.it