Upload README.md

7f255e4 verified about 1 month ago

8.84 kB

	---
	language:
	- en
	license: apache-2.0
	tags:
	- text2sql
	- spatial-sql
	- postgis
	- city-information-modeling
	- cim
	- fine-tuned
	- bird-baseline
	- lora
	- qlora
	base_model: defog/sqlcoder-7b-2
	datasets:
	- taherdoust/ai4cimdb
	library_name: transformers
	pipeline_tag: text-generation
	---

	# SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline)

	Fine-tuned for thesis comparison: Generic BIRD pre-training vs Domain-Specific Training

	This model is a fine-tuned version of [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) on the [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) dataset for City Information Modeling (CIM) spatial SQL generation.

	## Model Description

	Purpose: Academic baseline for thesis comparison - demonstrates the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions.

	Training Strategy: Minimal adaptation (1 epoch only) to measure transfer learning effectiveness from BIRD to PostGIS spatial SQL domain.

	### Key Characteristics

	- Base Model: SQLCoder 7B-2 (StarCoder-based, pre-trained on Spider + BIRD + commercial SQL)
	- Training: 1 epoch fine-tuning on CIM spatial SQL dataset
	- Training Time: 71.7 hours on NVIDIA Quadro RTX 6000 24GB
	- Method: QLoRA (4-bit quantization + LoRA rank 16)
	- Trainable Parameters: 39,976,960 (0.59% of 6.78B total)

	### Research Question

	How does a generic text-to-SQL model perform on specialized spatial SQL tasks?

	This model establishes a baseline for comparing:
	- Generic BIRD pre-training (standard SQL) vs Domain-specific training (PostGIS spatial)
	- Transfer learning effectiveness for specialized SQL dialects
	- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.)

	## Intended Use

	### Direct Use

	Generate PostGIS spatial SQL queries for City Information Modeling databases from natural language questions.

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(
	model_name,
	torch_dtype=torch.float16,
	device_map="auto"
	)

	question = "Find all buildings within 100 meters of census zone SEZ123"

	prompt = f"""### Task
	Generate a SQL query to answer the question.

	### Database Schema
	- cim_vector.cim_wizard_building (building_id, building_geometry, project_id)
	- cim_census.censusgeo (id, sez2011, census_geometry, population)

	### Question
	{question}

	### SQL Query
	"""

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=512)
	sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(sql)
	```

	### Thesis Comparison Use

	Compare this model's performance against domain-specific models to quantify the PostGIS knowledge gap:

	Expected Performance:
	- Standard SQL: 85-90% (similar to domain models)
	- PostGIS Spatial Functions: 30-50% (vs 85-92% domain models) ← Gap
	- CIM Domain Terms: 40-60% (vs 85-90% domain models)
	- Overall EX Accuracy: 60-75% baseline → 75-85% after 1 epoch fine-tuning

	## Training Details

	### Training Data

	- Dataset: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb)
	- Training Samples: 88,480 (70% of 126,400 curated)
	- Validation Samples: 18,960 (15%)
	- Test Samples: 18,960 (15%)
	- Total Raw Samples: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini)

	### Training Procedure

	QLoRA Configuration:
	```python
	BitsAndBytesConfig:
	- load_in_4bit: True
	- bnb_4bit_quant_type: "nf4"
	- bnb_4bit_compute_dtype: bfloat16

	LoraConfig:
	- r: 16
	- lora_alpha: 32
	- lora_dropout: 0.1
	- target_modules: ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
	```

	Training Hyperparameters:
	- Epochs: 1 (minimal adaptation for baseline comparison)
	- Batch Size: 2 per device
	- Gradient Accumulation: 8 steps (effective batch size: 16)
	- Learning Rate: 2.0e-4 (higher for 1-epoch training)
	- LR Scheduler: Cosine with 10% warmup
	- Optimizer: Paged AdamW 8-bit
	- Precision: bfloat16
	- Gradient Checkpointing: Enabled
	- Max Sequence Length: 2048 tokens

	Training Results:
	- Training Time: 71 hours 43 minutes (258,221 seconds)
	- Final Training Loss: 0.0980
	- Training Speed: 0.343 samples/sec
	- Total Steps: 5,530 steps
	- Hardware: NVIDIA Quadro RTX 6000 (24GB VRAM)

	### Database Schema Context

	CIM Database (PostgreSQL + PostGIS):

	cim_vector Schema:
	- `cim_wizard_building`: Building geometries (POLYGON)
	- `cim_wizard_building_properties`: Building attributes (height, area, energy)
	- `cim_wizard_project_scenario`: Project and scenario metadata
	- `network_buses`, `network_lines`: Electrical grid infrastructure (POINT, LINESTRING)

	cim_census Schema:
	- `censusgeo`: Italian ISTAT 2011 census zones (POLYGON)
	- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5)

	cim_raster Schema:
	- `dtm`: Digital Terrain Model (RASTER)
	- `dsm`: Digital Surface Model (RASTER)
	- Building height calculation via raster-vector operations

	## Evaluation Metrics

	### Expected Performance (Thesis Hypothesis)

	\| Metric \| SQLCoder (BIRD) \| Domain Models \| Gap \|
	\|--------\|----------------\|---------------\|-----\|
	\| Standard SQL \| 85-90% \| 85-92% \| ±2-5% \|
	\| PostGIS Functions \| 30-50% \| 85-92% \| 35-50% ← Research Gap \|
	\| CIM Domain Terms \| 40-60% \| 85-90% \| 25-40% \|
	\| Multi-Schema \| 60-75% \| 82-90% \| 15-25% \|
	\| Overall EX \| 60-75% \| 82-92% \| 15-25% \|

	Key Finding: Generic BIRD models struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance.

	### Evaluation Modes

	EM (Exact Match): String-level comparison (25-35% expected)
	EX (Execution Accuracy): Result-level comparison (60-75% expected)
	EA (Eventual Accuracy): Agent mode with self-correction (65-80% expected)

	## Academic Contribution

	### Thesis Context

	This model serves as a controlled baseline for demonstrating:

	1. PostGIS Knowledge Gap: BIRD pre-training lacks spatial function exposure
	2. Domain Terminology Gap: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training
	3. Transfer Learning Limits: 1 epoch fine-tuning improves but doesn't close the gap
	4. Multi-Schema Complexity: Cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models

	### Comparison Framework

	Models for Comparison:
	- This Model (SQLCoder 7B BIRD): Generic baseline
	- Llama 3.1 8B (Domain-specific): 3 epochs on CIM data
	- Qwen 2.5 14B (Domain-specific): 3 epochs on CIM data
	- DeepSeek-Coder 6.7B (Domain-specific): 3 epochs on CIM data

	## Environmental Impact

	- Hardware: NVIDIA Quadro RTX 6000 (24GB VRAM, 250W TDP)
	- Training Time: 71.7 hours
	- Estimated Energy: ~17.9 kWh (250W × 71.7h)
	- Carbon Footprint: ~7.2 kg CO₂ (401 g CO₂/kWh, Italy grid 2024)

	## Technical Specifications

	### Model Architecture

	- Base: StarCoder (7B parameters)
	- Attention: Multi-head attention with 32 heads
	- Layers: 32 transformer layers
	- Vocabulary: 49,152 tokens
	- Context Window: 8,192 tokens
	- Activation: GELU

	### LoRA Adapter

	- Adapter Size: ~260 MB (safetensors)
	- Trainable Params: 39,976,960 (0.59%)
	- Target Modules: 7 modules (q/k/v/o/gate/up/down projections)
	- Rank: 16
	- Alpha: 32

	## Citation

	```bibtex
	@misc{taherdoust2025sqlcoder_cim,
	title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison},
	author={Taherdoust, Ali},
	year={2025},
	institution={Politecnico di Torino},
	note={Fine-tuned on ai4cimdb dataset for thesis comparison:
	Generic BIRD pre-training vs Domain-specific PostGIS training},
	url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison}
	}

	@software{defog2024sqlcoder,
	title={SQLCoder: A State-of-the-Art LLM for SQL Generation},
	author={Defog.ai},
	year={2024},
	url={https://github.com/defog-ai/sqlcoder}
	}
	```

	## Acknowledgments

	- Base Model: [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) by Defog.ai
	- Dataset: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples)
	- Institution: Politecnico di Torino, DENERG Department
	- Infrastructure: ECLab ipazia126 GPU server
	- Frameworks: HuggingFace Transformers, PEFT, BitsAndBytes

	## License

	Apache 2.0 (inherited from SQLCoder base model)

	## Model Card Authors

	Ali Taherdoust (Politecnico di Torino)

	## Model Card Contact

	ali.taherdoustmohammadi@polito.it