taherdoust committed · 7f255e4 · verified · parent: 2dcc8f6

Upload README.md (README.md: +190 −125)
---
language:
- en
license: apache-2.0
tags:
- text2sql
- spatial-sql
- postgis
- city-information-modeling
- cim
- fine-tuned
- bird-baseline
- lora
- qlora
base_model: defog/sqlcoder-7b-2
datasets:
- taherdoust/ai4cimdb
library_name: transformers
pipeline_tag: text-generation
---

# SQLCoder 7B - CIM Spatial SQL (BIRD Pre-trained Baseline)

**Fine-tuned for thesis comparison: generic BIRD pre-training vs. domain-specific training**

This model is a fine-tuned version of [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) on the [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) dataset for City Information Modeling (CIM) spatial SQL generation.

## Model Description

**Purpose**: Academic baseline for thesis comparison. It quantifies the performance gap between generic text-to-SQL models (trained on Spider/BIRD) and domain-specific models when handling PostGIS spatial functions.

**Training Strategy**: Minimal adaptation (1 epoch only) to measure how well transfer learning carries over from BIRD to the PostGIS spatial SQL domain.

### Key Characteristics

- **Base Model**: SQLCoder 7B-2 (StarCoder-based, pre-trained on Spider + BIRD + commercial SQL)
- **Training**: 1 epoch of fine-tuning on the CIM spatial SQL dataset
- **Training Time**: 71.7 hours on an NVIDIA Quadro RTX 6000 (24 GB)
- **Method**: QLoRA (4-bit quantization + LoRA rank 16)
- **Trainable Parameters**: 39,976,960 (0.59% of 6.78B total)

### Research Question

**How does a generic text-to-SQL model perform on specialized spatial SQL tasks?**

This model establishes a baseline for comparing:

- Generic BIRD pre-training (standard SQL) vs. domain-specific training (PostGIS spatial SQL)
- Transfer-learning effectiveness for specialized SQL dialects
- Performance gaps on PostGIS spatial functions (ST_Intersects, ST_Within, ST_Distance, etc.)

## Intended Use

### Direct Use

Generate PostGIS spatial SQL queries for City Information Modeling databases from natural-language questions.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

question = "Find all buildings within 100 meters of census zone SEZ123"

prompt = f"""### Task
Generate a SQL query to answer the question.

### Database Schema
- cim_vector.cim_wizard_building (building_id, building_geometry, project_id)
- cim_census.censusgeo (id, sez2011, census_geometry, population)

### Question
{question}

### SQL Query
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql)
```

### Thesis Comparison Use

Compare this model's performance against the domain-specific models to quantify the PostGIS knowledge gap.

**Expected Performance:**

- **Standard SQL**: 85-90% (similar to domain models)
- **PostGIS Spatial Functions**: 30-50% (vs. 85-92% for domain models) ← **the gap**
- **CIM Domain Terms**: 40-60% (vs. 85-90% for domain models)
- **Overall EX Accuracy**: 60-75% baseline → 75-85% after 1 epoch of fine-tuning

## Training Details

### Training Data

- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb)
- **Training Samples**: 88,480 (70% of 126,400 curated)
- **Validation Samples**: 18,960 (15%)
- **Test Samples**: 18,960 (15%)
- **Total Raw Samples**: 176,837 (3-stage generation: templates → CTGAN → GPT-4o-mini)

### Training Procedure

**QLoRA Configuration:**

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

**Training Hyperparameters:**

- **Epochs**: 1 (minimal adaptation for baseline comparison)
- **Batch Size**: 2 per device
- **Gradient Accumulation**: 8 steps (effective batch size: 16)
- **Learning Rate**: 2.0e-4 (higher, to suit 1-epoch training)
- **LR Scheduler**: cosine with 10% warmup
- **Optimizer**: Paged AdamW 8-bit
- **Precision**: bfloat16
- **Gradient Checkpointing**: enabled
- **Max Sequence Length**: 2048 tokens

**Training Results:**

- **Training Time**: 71 hours 43 minutes (258,221 seconds)
- **Final Training Loss**: 0.0980
- **Training Speed**: 0.343 samples/sec
- **Total Steps**: 5,530
- **Hardware**: NVIDIA Quadro RTX 6000 (24 GB VRAM)
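
The reported figures are mutually consistent; a quick sanity check, using only the numbers listed above:

```python
# Consistency check of the reported training figures.
train_samples = 88_480          # training split size
per_device_batch = 2
grad_accum = 8
wall_clock_seconds = 258_221

effective_batch = per_device_batch * grad_accum      # 16
steps_per_epoch = train_samples // effective_batch   # matches the 5,530 reported steps (1 epoch)
throughput = train_samples / wall_clock_seconds      # matches 0.343 samples/sec

print(effective_batch, steps_per_epoch, round(throughput, 3))  # 16 5530 0.343
```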

### Database Schema Context

**CIM Database (PostgreSQL + PostGIS):**

**cim_vector Schema:**
- `cim_wizard_building`: building geometries (POLYGON)
- `cim_wizard_building_properties`: building attributes (height, area, energy)
- `cim_wizard_project_scenario`: project and scenario metadata
- `network_buses`, `network_lines`: electrical grid infrastructure (POINT, LINESTRING)

**cim_census Schema:**
- `censusgeo`: Italian ISTAT 2011 census zones (POLYGON)
- Demographic data: population, age distribution (E8-E16), housing (ST3-ST5)

**cim_raster Schema:**
- `dtm`: Digital Terrain Model (RASTER)
- `dsm`: Digital Surface Model (RASTER)
- Building heights computed via raster-vector operations
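
The example question from the usage section ("buildings within 100 meters of census zone SEZ123") maps onto this schema roughly as follows. This is an illustrative hand-written query, not model output; it uses table and column names from the lists above and assumes the geometries share a metric SRID:

```sql
-- Buildings within 100 m of census zone SEZ123 (illustrative target query)
SELECT b.building_id
FROM cim_vector.cim_wizard_building AS b
JOIN cim_census.censusgeo AS c
  ON ST_DWithin(b.building_geometry, c.census_geometry, 100)
WHERE c.sez2011 = 'SEZ123';
```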

## Evaluation Metrics

### Expected Performance (Thesis Hypothesis)

| Metric | SQLCoder (BIRD) | Domain Models | Gap |
|--------|-----------------|---------------|-----|
| **Standard SQL** | 85-90% | 85-92% | ±2-5% |
| **PostGIS Functions** | 30-50% | 85-92% | **35-50%** ← research gap |
| **CIM Domain Terms** | 40-60% | 85-90% | 25-40% |
| **Multi-Schema** | 60-75% | 82-90% | 15-25% |
| **Overall EX** | 60-75% | 82-92% | **15-25%** |

**Key Finding**: Generic BIRD models struggle with specialized SQL dialects (PostGIS spatial functions) despite strong standard SQL performance.

### Evaluation Modes

- **EM (Exact Match)**: string-level comparison (25-35% expected)
- **EX (Execution Accuracy)**: result-level comparison (60-75% expected)
- **EA (Eventual Accuracy)**: agent mode with self-correction (65-80% expected)
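
A minimal sketch of the EX metric, assuming the common definition (two queries match if they return the same rows regardless of string form). SQLite stands in for the PostGIS database here, and `execution_match` is a hypothetical helper, not part of the evaluation harness:

```python
import sqlite3

def execution_match(pred_sql, gold_sql, conn):
    """EX: queries match if they return the same multiset of rows (order ignored)."""
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False  # a non-executable prediction counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Toy table standing in for cim_vector.cim_wizard_building
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE building (building_id INTEGER, height REAL)")
conn.executemany("INSERT INTO building VALUES (?, ?)", [(1, 10.0), (2, 25.0)])

# Different strings (EM fails), identical result sets (EX passes)
print(execution_match(
    "SELECT building_id FROM building WHERE height > 20",
    "SELECT building_id FROM building WHERE height >= 20.1",
    conn))  # True
```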

## Academic Contribution

### Thesis Context

This model serves as a controlled baseline for demonstrating:

1. **PostGIS knowledge gap**: BIRD pre-training includes little exposure to spatial functions
2. **Domain terminology gap**: CIM-specific terms (SEZ2011, TABULA, E8-E16) require domain training
3. **Transfer-learning limits**: 1 epoch of fine-tuning improves performance but does not close the gap
4. **Multi-schema complexity**: cross-schema joins (cim_vector + cim_census + cim_raster) challenge generic models

### Comparison Framework

**Models for Comparison:**

- **This model** (SQLCoder 7B, BIRD): generic baseline
- **Llama 3.1 8B** (domain-specific): 3 epochs on CIM data
- **Qwen 2.5 14B** (domain-specific): 3 epochs on CIM data
- **DeepSeek-Coder 6.7B** (domain-specific): 3 epochs on CIM data

## Environmental Impact

- **Hardware**: NVIDIA Quadro RTX 6000 (24 GB VRAM, 250 W TDP)
- **Training Time**: 71.7 hours
- **Estimated Energy**: ~17.9 kWh (250 W × 71.7 h)
- **Carbon Footprint**: ~7.2 kg CO₂ (at 401 g CO₂/kWh, Italy grid 2024)
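
The estimate above is straightforward arithmetic (GPU TDP as a proxy for draw, times wall-clock hours, times grid carbon intensity):

```python
# Back-of-envelope energy and carbon estimate, from the figures reported above.
tdp_watts = 250        # Quadro RTX 6000 TDP
hours = 71.7           # wall-clock training time
grid_g_per_kwh = 401   # g CO2 per kWh (Italy grid, 2024)

energy_kwh = tdp_watts * hours / 1000
co2_kg = energy_kwh * grid_g_per_kwh / 1000

print(f"{energy_kwh:.1f} kWh, {co2_kg:.1f} kg CO2")  # 17.9 kWh, 7.2 kg CO2
```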
 
 
 
 
 
 
 
 
 

## Technical Specifications

### Model Architecture

- **Base**: StarCoder (7B parameters)
- **Attention**: multi-head attention with 32 heads
- **Layers**: 32 transformer layers
- **Vocabulary**: 49,152 tokens
- **Context Window**: 8,192 tokens
- **Activation**: GELU

### LoRA Adapter

- **Adapter Size**: ~260 MB (safetensors)
- **Trainable Params**: 39,976,960 (0.59%)
- **Target Modules**: 7 modules (q/k/v/o/gate/up/down projections)
- **Rank**: 16
- **Alpha**: 32

## Citation

```bibtex
@misc{taherdoust2025sqlcoder_cim,
  title={SQLCoder 7B for CIM Spatial SQL: BIRD Baseline Comparison},
  author={Taherdoust, Ali},
  year={2025},
  institution={Politecnico di Torino},
  note={Fine-tuned on ai4cimdb dataset for thesis comparison:
        Generic BIRD pre-training vs Domain-specific PostGIS training},
  url={https://huggingface.co/taherdoust/sqlcoder-7b-cim-q2sql-bird-comparison}
}

@software{defog2024sqlcoder,
  title={SQLCoder: A State-of-the-Art LLM for SQL Generation},
  author={Defog.ai},
  year={2024},
  url={https://github.com/defog-ai/sqlcoder}
}
```

## Acknowledgments

- **Base Model**: [defog/sqlcoder-7b-2](https://huggingface.co/defog/sqlcoder-7b-2) by Defog.ai
- **Dataset**: [taherdoust/ai4cimdb](https://huggingface.co/datasets/taherdoust/ai4cimdb) (176K samples)
- **Institution**: Politecnico di Torino, DENERG Department
- **Infrastructure**: ECLab ipazia126 GPU server
- **Frameworks**: Hugging Face Transformers, PEFT, bitsandbytes

## License

Apache 2.0 (inherited from the SQLCoder base model)

## Model Card Authors

Ali Taherdoust (Politecnico di Torino)

## Model Card Contact

ali.taherdoustmohammadi@polito.it