---
license: mit
language:
- en
tags:
- text-generation
- sql
- text-to-sql
- gemma
- fine-tuned
- database
- nlp
base_model: google/gemma-7b
datasets:
- estu-research/sql-training-dataset
metrics:
- accuracy
- exact_match
library_name: transformers
pipeline_tag: text-generation
---
|
|
|
|
|
# Gemma-7B SQL Expert (Fine-Tuned) |
|
|
|
|
|
A fine-tuned version of Google's Gemma-7B for translating natural language questions into SQL queries.
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: [google/gemma-7b](https://huggingface.co/google/gemma-7b) |
|
|
- **Fine-tuned by**: ESTU Research Team (Kulalı, Aydın, Alhan, Fidan) |
|
|
- **Institution**: Eskisehir Technical University |
|
|
- **Project**: TÜBİTAK 2209-A Research Grant
|
|
- **License**: MIT |
|
|
- **Language**: English |
|
|
- **Task**: Natural Language to SQL Translation |
|
|
|
|
|
## Performance |
|
|
|
|
|
- **Execution Accuracy**: 76.0% |
|
|
- **Exact Match**: 65.4% (both metrics are sketched in code after this list)
|
|
- **Average Latency**: 500ms |
|
|
- **Model Size**: 14.1 GB (full) / 183 MB (LoRA adapters) |
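
For reference, execution accuracy and exact match can be computed along the following lines. This is a minimal sketch assuming a SQLite copy of the evaluation database; it is not the project's actual evaluation harness.

```python
import sqlite3

def exact_match(pred_sql: str, gold_sql: str) -> bool:
    # Exact match: whitespace-normalized, case-folded string comparison.
    norm = lambda s: " ".join(s.strip().rstrip(";").split()).lower()
    return norm(pred_sql) == norm(gold_sql)

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    # Execution accuracy: both queries must return the same result set
    # when run against the same database.
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
            gold_rows = conn.execute(gold_sql).fetchall()
        except sqlite3.Error:
            return False  # queries that fail to execute count as wrong
    # Order-insensitive comparison, since row order is unspecified
    # unless the query has an ORDER BY clause.
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)
```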
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset) |
|
|
- **Examples**: 1,000+ natural-language-to-SQL pairs (an illustrative record follows this list)
|
|
- **Domain**: Sales database queries (customers, orders, products, employees) |
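
A training pair presumably mirrors the schema-plus-question prompt layout used in the Usage examples below. The record here is purely illustrative; the field names in the published dataset are an assumption.

```python
# Hypothetical record layout; field names are illustrative, not confirmed.
example = {
    "prompt": (
        "Schema: CREATE TABLE customers (customerNumber INT, "
        "customerName VARCHAR(50), country VARCHAR(50));\n"
        "Question: List all customers from France"
    ),
    "completion": "SELECT * FROM customers WHERE country = 'France';",
}
```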
|
|
|
|
|
### Training Configuration |
|
|
```python
{
    "base_model": "google/gemma-7b",
    "method": "LoRA",
    "rank": 16,
    "alpha": 32,
    "dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 1.5e-4,
    "training_time": "10.8 hours (A100 GPU)"
}
```
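
This configuration maps directly onto Hugging Face PEFT. Below is a minimal sketch of the adapter setup under these hyperparameters; the exact training loop used for this model is not published, so everything beyond the `LoraConfig` values is an assumption.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,                 # "rank"
    lora_alpha=32,        # "alpha"
    lora_dropout=0.05,    # "dropout"
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```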
|
|
|
|
|
### Training Results |
|
|
```
Epoch 1: Loss 1.456 | Val Loss 1.512 | Accuracy 68.2%
Epoch 2: Loss 0.521 | Val Loss 0.589 | Accuracy 72.8%
Epoch 3: Loss 0.234 | Val Loss 0.267 | Accuracy 76.0%
```
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
```bash
pip install transformers torch
```
|
|
|
|
|
### Quick Start |
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("estu-research/gemma-7b-sql-ft", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")

# Example query
question = """
Schema: CREATE TABLE customers (customerNumber INT, customerName VARCHAR(50), country VARCHAR(50));
Question: List all customers from France
"""

inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens so the prompt is not echoed back
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(sql)
# Output: SELECT * FROM customers WHERE country = 'France';
```
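
If the repository ships the 183 MB LoRA adapters rather than the merged 14.1 GB weights, they can be applied on top of the base model with PEFT instead. A sketch, assuming the adapter files live in this repo:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "estu-research/gemma-7b-sql-ft")
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")

# Optionally fold the adapters into the base weights for faster inference.
model = model.merge_and_unload()
```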
|
|
|
|
|
### Advanced Usage with Pipeline |
|
|
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="estu-research/gemma-7b-sql-ft")

result = pipe(
    "Schema: CREATE TABLE products (productName VARCHAR, price DECIMAL);\nQuestion: Show top 10 expensive products",
    max_new_tokens=200,
    do_sample=True,          # temperature only takes effect when sampling is enabled
    temperature=0.1,
    return_full_text=False,  # return only the generated SQL, not the prompt
)
print(result[0]["generated_text"])
```
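
Both snippets use the same `Schema: ... Question: ...` prompt layout, so a small helper keeps prompts consistent. `build_prompt` is a hypothetical convenience function, not part of this repository:

```python
def build_prompt(schema: str, question: str) -> str:
    # Format inputs in the Schema/Question layout shown in the examples above.
    return f"Schema: {schema}\nQuestion: {question}"

prompt = build_prompt(
    "CREATE TABLE employees (employeeNumber INT, lastName VARCHAR(50), officeCode INT);",
    "How many employees work in each office?",
)
```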
|
|
|
|
|
## Example Queries |
|
|
|
|
|
| Natural Language | Generated SQL |
|------------------|---------------|
| List top 5 customers by sales | `SELECT customerName, SUM(amount) as total FROM customers JOIN orders USING(customerId) GROUP BY customerId ORDER BY total DESC LIMIT 5;` |
| Show products never ordered | `SELECT p.productName FROM products p LEFT JOIN orderDetails od ON p.productCode = od.productCode WHERE od.productCode IS NULL;` |
| Total revenue by country | `SELECT country, SUM(amount) as revenue FROM customers JOIN orders USING(customerId) GROUP BY country ORDER BY revenue DESC;` |
|
|
|
|
|
## Comparison with Other Models |
|
|
|
|
|
| Model | Accuracy | Latency | Cost |
|-------|----------|---------|------|
| **Gemma-7B (FT)** | **76.0%** | 500ms | Free |
| Llama-3-8B (FT) | 78.2% | 450ms | Free |
| GPT-4o-mini (FT) | 97.8% | 800ms | $0.30/1K |
| GPT-3.5 Turbo | 78.9% | 500ms | $0.05/1K |
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained primarily on sales database schema |
|
|
- May struggle with very complex nested queries |
|
|
- Best performance on English language queries |
|
|
- Requires GPU for optimal inference speed |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
- **Primary**: Natural language to SQL translation for analytics |
|
|
- **Secondary**: SQL query assistance and education |
|
|
- **Not For**: Production databases without query validation (see the guard sketch below)
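
As a starting point for such validation, generated SQL can be restricted to single read-only statements before it reaches a database. A minimal sketch, not a complete defense against unsafe queries:

```python
import re

# Keywords that modify data or schema and should be rejected outright.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|REPLACE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    # Accept exactly one SELECT statement; reject multi-statement payloads
    # and anything containing a write/DDL keyword.
    statements = [s for s in sql.split(";") if s.strip()]
    if len(statements) != 1:
        return False
    first = statements[0].strip()
    return first.upper().startswith("SELECT") and not FORBIDDEN.search(first)

assert is_read_only("SELECT * FROM customers WHERE country = 'France';")
assert not is_read_only("DROP TABLE customers;")
```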
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{gemma7b-sql-ft,
  title={Gemma-7B SQL Expert: Fine-Tuned Model for Text-to-SQL},
  author={Kulalı and Aydın and Alhan and Fidan},
  institution={Eskisehir Technical University},
  year={2024},
  url={https://huggingface.co/estu-research/gemma-7b-sql-ft}
}
```
|
|
|
|
|
## Links |
|
|
|
|
|
- **GitHub**: [Japyh/llm-based-dbms](https://github.com/Japyh/llm-based-dbms) |
|
|
- **Research Paper**: [docs/research_paper_draft.md](https://github.com/Japyh/llm-based-dbms/blob/main/docs/research_paper_draft.md) |
|
|
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset) |
|
|
- **Organization**: [estu-research](https://huggingface.co/estu-research) |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work was supported by the TÜBİTAK 2209-A Research Grant at Eskisehir Technical University.
|
|
|
|
|
## License |
|
|
|
|
|
MIT License. See the LICENSE file for details.