---
license: mit
language:
- en
tags:
- text-generation
- sql
- text-to-sql
- gemma
- fine-tuned
- database
- nlp
base_model: google/gemma-7b
datasets:
- estu-research/sql-training-dataset
metrics:
- accuracy
- exact_match
library_name: transformers
pipeline_tag: text-generation
---

# Gemma-7B SQL Expert (Fine-Tuned)

A fine-tuned version of Google's Gemma-7B for translating natural language questions into SQL queries.

## Model Details

- **Base Model**: [google/gemma-7b](https://huggingface.co/google/gemma-7b)
- **Fine-tuned by**: ESTU Research Team (Kulalı, Aydın, Alhan, Fidan)
- **Institution**: Eskisehir Technical University
- **Project**: TÜBİTAK 2209-A Research Program
- **License**: MIT
- **Language**: English
- **Task**: Natural Language to SQL Translation

## Performance

- **Execution Accuracy**: 76.0%
- **Exact Match**: 65.4%
- **Average Latency**: 500ms
- **Model Size**: 14.1 GB (full) / 183 MB (LoRA adapters)

## Training Details

### Training Data
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset)
- **Examples**: 1,000+ natural language to SQL pairs
- **Domain**: Sales database queries (customers, orders, products, employees)
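
The dataset is hosted on the Hub and can be inspected directly with the `datasets` library. A minimal sketch; the split and column names are not documented in this card, so the `"train"` split below is an assumption:

```python
from datasets import load_dataset

ds = load_dataset("estu-research/sql-training-dataset")
print(ds)              # shows the available splits and fields
print(ds["train"][0])  # first NL/SQL pair (assumes a "train" split exists)
```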

### Training Configuration
```json
{
  "base_model": "google/gemma-7b",
  "method": "LoRA",
  "rank": 16,
  "alpha": 32,
  "dropout": 0.05,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
  "epochs": 3,
  "batch_size": 8,
  "learning_rate": 1.5e-4,
  "training_time": "10.8 hours (A100 GPU)"
}
```
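
For reference, the hyperparameters above map onto a `peft` `LoraConfig` roughly as follows. This is a sketch, not the actual training script, which is not published with this card:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
config = LoraConfig(
    r=16,                # LoRA rank from the configuration above
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```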

### Training Results
```
Epoch 1: Loss 1.456 | Val Loss 1.512 | Accuracy 68.2%
Epoch 2: Loss 0.521 | Val Loss 0.589 | Accuracy 72.8%
Epoch 3: Loss 0.234 | Val Loss 0.267 | Accuracy 76.0%
```

## Usage

### Installation
```bash
pip install transformers torch accelerate
```

### Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load in half precision and place weights on GPU if one is available
model = AutoModelForCausalLM.from_pretrained(
    "estu-research/gemma-7b-sql-ft",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")

# Example query: the prompt carries the schema so the model can ground column names
question = """
Schema: CREATE TABLE customers (customerNumber INT, customerName VARCHAR(50), country VARCHAR(50));
Question: List all customers from France
"""

inputs = tokenizer(question, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens; outputs[0] also contains the prompt
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(sql)
# Output: SELECT * FROM customers WHERE country = 'France';
```
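
If you only need the 183 MB LoRA adapters listed under Performance rather than the full 14.1 GB checkpoint, they can be applied on top of the base model with `peft`. The adapter repo id below is a hypothetical placeholder, since this card does not state where the adapters are published separately:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-7b", torch_dtype="auto", device_map="auto"
)
# NOTE: hypothetical adapter repo id; adjust to the actual location
model = PeftModel.from_pretrained(base, "estu-research/gemma-7b-sql-lora")
```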

### Advanced Usage with Pipeline
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="estu-research/gemma-7b-sql-ft")

result = pipe(
    "Schema: CREATE TABLE products (productName VARCHAR, price DECIMAL);\nQuestion: Show top 10 expensive products",
    max_new_tokens=200,
    do_sample=True,   # required for temperature to take effect
    temperature=0.1,  # keep generations near-deterministic
)
print(result[0]["generated_text"])
```
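
Both examples use the same prompt layout: the schema on one line, the question on the next. A small helper keeps that format consistent; the exact template used during training is an assumption inferred from the examples above:

```python
def build_prompt(schema: str, question: str) -> str:
    """Format a request the way the usage examples above do (assumed template)."""
    return f"Schema: {schema}\nQuestion: {question}"

prompt = build_prompt(
    "CREATE TABLE employees (employeeNumber INT, lastName VARCHAR(50));",
    "How many employees are there?",
)
```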

## Example Queries

| Natural Language | Generated SQL |
|------------------|---------------|
| List top 5 customers by sales | `SELECT customerName, SUM(amount) as total FROM customers JOIN orders USING(customerId) GROUP BY customerId ORDER BY total DESC LIMIT 5;` |
| Show products never ordered | `SELECT p.productName FROM products p LEFT JOIN orderDetails od ON p.productCode = od.productCode WHERE od.productCode IS NULL;` |
| Total revenue by country | `SELECT country, SUM(amount) as revenue FROM customers JOIN orders USING(customerId) GROUP BY country ORDER BY revenue DESC;` |

## Comparison with Other Models

| Model | Accuracy | Latency | Cost |
|-------|----------|---------|------|
| **Gemma-7B (FT)** | **76.0%** | 500ms | Free |
| Llama-3-8B (FT) | 78.2% | 450ms | Free |
| GPT-4o-mini (FT) | 97.8% | 800ms | $0.30/1K |
| GPT-3.5 Turbo | 78.9% | 500ms | $0.05/1K |

## Limitations

- Trained primarily on sales database schema
- May struggle with very complex nested queries
- Best performance on English language queries
- Requires GPU for optimal inference speed

## Intended Use

- **Primary**: Natural language to SQL translation for analytics
- **Secondary**: SQL query assistance and education
- **Not For**: Production databases without query validation (see the sketch below)
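
One lightweight form of the query validation recommended above is to parse model output before executing it. A minimal sketch using `sqlglot`, which is not a dependency of this project, just one option:

```python
import sqlglot
from sqlglot.errors import ParseError

def is_safe_select(sql: str) -> bool:
    """Accept only SQL that parses cleanly and is read-only."""
    try:
        statements = sqlglot.parse(sql)
    except ParseError:
        return False
    # Reject empty output and anything that is not a plain SELECT
    return bool(statements) and all(
        s is not None and s.key == "select" for s in statements
    )

assert is_safe_select("SELECT * FROM customers WHERE country = 'France';")
assert not is_safe_select("DROP TABLE customers;")
```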

## Citation

```bibtex
@misc{gemma7b-sql-ft,
  title={Gemma-7B SQL Expert: Fine-Tuned Model for Text-to-SQL},
  author={Kulalı and Aydın and Alhan and Fidan},
  institution={Eskisehir Technical University},
  year={2024},
  url={https://huggingface.co/estu-research/gemma-7b-sql-ft}
}
```

## Links

- **GitHub**: [Japyh/llm-based-dbms](https://github.com/Japyh/llm-based-dbms)
- **Research Paper**: [docs/research_paper_draft.md](https://github.com/Japyh/llm-based-dbms/blob/main/docs/research_paper_draft.md)
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset)
- **Organization**: [estu-research](https://huggingface.co/estu-research)

## Acknowledgments

This work was supported by TÜBİTAK 2209-A Research Grant at Eskisehir Technical University.

## License

MIT License - See LICENSE file for details