---
license: mit
language:
- en
tags:
- text-generation
- sql
- text-to-sql
- gemma
- fine-tuned
- database
- nlp
base_model: google/gemma-7b
datasets:
- estu-research/sql-training-dataset
metrics:
- accuracy
- exact_match
library_name: transformers
pipeline_tag: text-generation
---
# Gemma-7B SQL Expert (Fine-Tuned)
A fine-tuned version of Google's Gemma-7B model that converts natural-language questions into SQL queries.
## Model Details
- **Base Model**: [google/gemma-7b](https://huggingface.co/google/gemma-7b)
- **Fine-tuned by**: ESTU Research Team (Kulalı, Aydın, Alhan, Fidan)
- **Institution**: Eskisehir Technical University
- **Project**: TÜBİTAK 2209-A Research
- **License**: MIT
- **Language**: English
- **Task**: Natural Language to SQL Translation
## Performance
- **Execution Accuracy**: 76.0%
- **Exact Match**: 65.4%
- **Average Latency**: 500ms
- **Model Size**: 14.1 GB (full) / 183 MB (LoRA adapters)
## Training Details
### Training Data
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset)
- **Examples**: 1,000+ natural language to SQL pairs
- **Domain**: Sales database queries (customers, orders, products, employees)
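During fine-tuning, each (schema, question, SQL) pair is typically serialized into a single prompt string. The template below is an illustrative assumption, not the dataset's documented format:

```python
# Hypothetical sketch of how a training pair might be serialized for
# supervised fine-tuning. The exact template used by the dataset is an
# assumption here.
def format_example(schema: str, question: str, sql: str) -> str:
    """Serialize one (schema, question, SQL) pair into a prompt string."""
    return f"Schema: {schema}\nQuestion: {question}\nSQL: {sql}"

pair = format_example(
    "CREATE TABLE customers (customerNumber INT, country VARCHAR(50));",
    "List all customers from France",
    "SELECT * FROM customers WHERE country = 'France';",
)
print(pair)
```

At inference time the same template is used with the `SQL:` completion left for the model to fill in.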
### Training Configuration
```python
{
    "base_model": "google/gemma-7b",
    "method": "LoRA",
    "rank": 16,
    "alpha": 32,
    "dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 1.5e-4,
    "training_time": "10.8 hours (A100 GPU)"
}
```
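The hyperparameters above map onto the `peft` library roughly as follows. This is a sketch assuming the standard `peft`/`transformers` LoRA workflow, not the team's exact training script:

```python
from peft import LoraConfig

# Mirrors the configuration listed above; task_type is CAUSAL_LM because
# Gemma is a decoder-only model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
# model = get_peft_model(base_model, lora_config)  # wraps google/gemma-7b
```

Only the LoRA adapters (183 MB) are trained; the 7B base weights stay frozen, which is why the adapter download is so much smaller than the full model.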
### Training Results
```
Epoch 1: Loss 1.456 | Val Loss 1.512 | Accuracy 68.2%
Epoch 2: Loss 0.521 | Val Loss 0.589 | Accuracy 72.8%
Epoch 3: Loss 0.234 | Val Loss 0.267 | Accuracy 76.0%
```
## Usage
### Installation
```bash
pip install transformers torch
```
### Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("estu-research/gemma-7b-sql-ft")
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")
# Example query
question = """
Schema: CREATE TABLE customers (customerNumber INT, customerName VARCHAR(50), country VARCHAR(50));
Question: List all customers from France
"""
inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(sql)
# Output: SELECT * FROM customers WHERE country = 'France';
```
### Advanced Usage with Pipeline
```python
from transformers import pipeline
pipe = pipeline("text-generation", model="estu-research/gemma-7b-sql-ft")
result = pipe(
    "Schema: CREATE TABLE products (productName VARCHAR, price DECIMAL);\nQuestion: Show top 10 expensive products",
    max_new_tokens=200,
    temperature=0.1,
)
print(result[0]['generated_text'])
```
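Because causal generation returns the prompt followed by the continuation, callers usually strip the echoed prompt and keep only the first SQL statement. A minimal helper for that (hypothetical, not part of the model's API):

```python
def extract_sql(generated: str, prompt: str) -> str:
    """Drop the echoed prompt and return the first SQL statement."""
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    # Keep everything up to and including the first semicolon, if present.
    head, sep, _ = completion.partition(";")
    return (head + sep).strip()

prompt = "Question: List all customers from France\nSQL:"
generated = prompt + " SELECT * FROM customers WHERE country = 'France'; -- extra"
print(extract_sql(generated, prompt))
# SELECT * FROM customers WHERE country = 'France';
```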
## Example Queries
| Natural Language | Generated SQL |
|------------------|---------------|
| List top 5 customers by sales | `SELECT customerName, SUM(amount) as total FROM customers JOIN orders USING(customerId) GROUP BY customerId ORDER BY total DESC LIMIT 5;` |
| Show products never ordered | `SELECT p.productName FROM products p LEFT JOIN orderDetails od ON p.productCode = od.productCode WHERE od.productCode IS NULL;` |
| Total revenue by country | `SELECT country, SUM(amount) as revenue FROM customers JOIN orders USING(customerId) GROUP BY country ORDER BY revenue DESC;` |
## Comparison with Other Models
| Model | Accuracy | Latency | Cost |
|-------|----------|---------|------|
| **Gemma-7B (FT)** | **76.0%** | 500ms | Free |
| Llama-3-8B (FT) | 78.2% | 450ms | Free |
| GPT-4o-mini (FT) | 97.8% | 800ms | $0.30/1K |
| GPT-3.5 Turbo | 78.9% | 500ms | $0.05/1K |
## Limitations
- Trained primarily on sales database schema
- May struggle with very complex nested queries
- Best performance on English language queries
- Requires GPU for optimal inference speed
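The GPU-memory requirement can be reduced with quantized loading. A sketch using 4-bit quantization via `bitsandbytes` (assumes a CUDA GPU is available; expect some accuracy loss relative to the full-precision numbers above):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization cuts the ~14.1 GB footprint to roughly a third,
# at some cost in accuracy.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "estu-research/gemma-7b-sql-ft",
    quantization_config=bnb_config,
    device_map="auto",
)
```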
## Intended Use
- **Primary**: Natural language to SQL translation for analytics
- **Secondary**: SQL query assistance and education
- **Not For**: Production databases without query validation
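Query validation before execution can be as simple as rejecting anything that is not a single read-only statement that parses against the schema. A minimal sketch using the standard-library `sqlite3` (the schema and rules here are illustrative assumptions):

```python
import sqlite3

def is_safe_select(sql: str, schema_ddl: str) -> bool:
    """Accept only single SELECT statements that parse against the schema."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped or not stripped.lower().startswith("select"):
        return False  # multiple statements, or a write/DDL statement
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute(f"EXPLAIN {stripped}")  # parse and plan without running
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

schema = "CREATE TABLE customers (customerNumber INT, country TEXT);"
print(is_safe_select("SELECT * FROM customers WHERE country = 'France';", schema))  # True
print(is_safe_select("DROP TABLE customers;", schema))  # False
```

A production deployment would also want read-only database credentials and row/time limits; this check only filters the most obvious failure modes.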
## Citation
```bibtex
@misc{gemma7b-sql-ft,
  title={Gemma-7B SQL Expert: Fine-Tuned Model for Text-to-SQL},
  author={Kulalı and Aydın and Alhan and Fidan},
  institution={Eskisehir Technical University},
  year={2024},
  url={https://huggingface.co/estu-research/gemma-7b-sql-ft}
}
```
## Links
- **GitHub**: [Japyh/llm-based-dbms](https://github.com/Japyh/llm-based-dbms)
- **Research Paper**: [docs/research_paper_draft.md](https://github.com/Japyh/llm-based-dbms/blob/main/docs/research_paper_draft.md)
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset)
- **Organization**: [estu-research](https://huggingface.co/estu-research)
## Acknowledgments
This work was supported by TÜBİTAK 2209-A Research Grant at Eskisehir Technical University.
## License
MIT License. See the LICENSE file for details.