---
license: mit
language:
- en
tags:
- text-generation
- sql
- text-to-sql
- gemma
- fine-tuned
- database
- nlp
base_model: google/gemma-7b
datasets:
- estu-research/sql-training-dataset
metrics:
- accuracy
- exact_match
library_name: transformers
pipeline_tag: text-generation
---
|
|
|
|
|
# Gemma-7B SQL Expert (Fine-Tuned) |
|
|
|
|
|
A fine-tuned version of Google's Gemma-7B for translating natural language questions into SQL queries.
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Base Model**: [google/gemma-7b](https://huggingface.co/google/gemma-7b) |
|
|
- **Fine-tuned by**: ESTU Research Team (Kulalı, Aydın, Alhan, Fidan) |
|
|
- **Institution**: Eskisehir Technical University |
|
|
- **Project**: TÜBİTAK 2209-A Research Grant
|
|
- **License**: MIT |
|
|
- **Language**: English |
|
|
- **Task**: Natural Language to SQL Translation |
|
|
|
|
|
## Performance |
|
|
|
|
|
- **Execution Accuracy**: 76.0% |
|
|
- **Exact Match**: 65.4% (both metrics are sketched in code after this list)
|
|
- **Average Latency**: 500ms |
|
|
- **Model Size**: 14.1 GB (full) / 183 MB (LoRA adapters) |
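
For reference, execution accuracy and exact match can be computed along the following lines. This is a minimal sketch assuming a SQLite copy of the evaluation database; it is not the project's actual evaluation harness.

```python
import sqlite3

def exact_match(pred_sql: str, gold_sql: str) -> bool:
    # Exact match: whitespace-normalized, case-folded string comparison.
    norm = lambda s: " ".join(s.strip().rstrip(";").split()).lower()
    return norm(pred_sql) == norm(gold_sql)

def execution_match(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    # Execution accuracy: both queries must return the same result set
    # when run against the same database.
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred_sql).fetchall()
            gold_rows = conn.execute(gold_sql).fetchall()
        except sqlite3.Error:
            return False  # queries that fail to execute count as wrong
    # Order-insensitive comparison, since row order is unspecified
    # unless the query has an ORDER BY clause.
    return sorted(pred_rows, key=repr) == sorted(gold_rows, key=repr)
```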
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset) |
|
|
- **Examples**: 1,000+ natural-language-to-SQL pairs (an illustrative record follows this list)
|
|
- **Domain**: Sales database queries (customers, orders, products, employees) |
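
A training pair presumably mirrors the schema-plus-question prompt layout used in the Usage examples below. The record here is purely illustrative; the field names in the published dataset are an assumption.

```python
# Hypothetical record layout; field names are illustrative, not confirmed.
example = {
    "prompt": (
        "Schema: CREATE TABLE customers (customerNumber INT, "
        "customerName VARCHAR(50), country VARCHAR(50));\n"
        "Question: List all customers from France"
    ),
    "completion": "SELECT * FROM customers WHERE country = 'France';",
}
```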
|
|
|
|
|
### Training Configuration |
|
|
```python
{
    "base_model": "google/gemma-7b",
    "method": "LoRA",
    "rank": 16,
    "alpha": 32,
    "dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "epochs": 3,
    "batch_size": 8,
    "learning_rate": 1.5e-4,
    "training_time": "10.8 hours (A100 GPU)"
}
```
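
This configuration maps directly onto Hugging Face PEFT. Below is a minimal sketch of the adapter setup under these hyperparameters; the exact training loop used for this model is not published, so everything beyond the `LoraConfig` values is an assumption.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=16,                 # "rank"
    lora_alpha=32,        # "alpha"
    lora_dropout=0.05,    # "dropout"
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```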
|
|
|
|
|
### Training Results |
|
|
```
Epoch 1: Loss 1.456 | Val Loss 1.512 | Accuracy 68.2%
Epoch 2: Loss 0.521 | Val Loss 0.589 | Accuracy 72.8%
Epoch 3: Loss 0.234 | Val Loss 0.267 | Accuracy 76.0%
```
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
```bash
pip install transformers torch
```
|
|
|
|
|
### Quick Start |
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("estu-research/gemma-7b-sql-ft", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")

# Example query
question = """
Schema: CREATE TABLE customers (customerNumber INT, customerName VARCHAR(50), country VARCHAR(50));
Question: List all customers from France
"""

inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens so the prompt is not echoed back
sql = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(sql)
# Output: SELECT * FROM customers WHERE country = 'France';
```
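
If the repository ships the 183 MB LoRA adapters rather than the merged 14.1 GB weights, they can be applied on top of the base model with PEFT instead. A sketch, assuming the adapter files live in this repo:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("google/gemma-7b", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "estu-research/gemma-7b-sql-ft")
tokenizer = AutoTokenizer.from_pretrained("estu-research/gemma-7b-sql-ft")

# Optionally fold the adapters into the base weights for faster inference.
model = model.merge_and_unload()
```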
|
|
|
|
|
### Advanced Usage with Pipeline |
|
|
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="estu-research/gemma-7b-sql-ft")

result = pipe(
    "Schema: CREATE TABLE products (productName VARCHAR, price DECIMAL);\nQuestion: Show top 10 expensive products",
    max_new_tokens=200,
    do_sample=True,          # temperature only takes effect when sampling is enabled
    temperature=0.1,
    return_full_text=False,  # return only the generated SQL, not the prompt
)
print(result[0]["generated_text"])
```
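
Both snippets use the same `Schema: ... Question: ...` prompt layout, so a small helper keeps prompts consistent. `build_prompt` is a hypothetical convenience function, not part of this repository:

```python
def build_prompt(schema: str, question: str) -> str:
    # Format inputs in the Schema/Question layout shown in the examples above.
    return f"Schema: {schema}\nQuestion: {question}"

prompt = build_prompt(
    "CREATE TABLE employees (employeeNumber INT, lastName VARCHAR(50), officeCode INT);",
    "How many employees work in each office?",
)
```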
|
|
|
|
|
## Example Queries |
|
|
|
|
|
| Natural Language | Generated SQL |
|------------------|---------------|
| List top 5 customers by sales | `SELECT customerName, SUM(amount) as total FROM customers JOIN orders USING(customerId) GROUP BY customerId ORDER BY total DESC LIMIT 5;` |
| Show products never ordered | `SELECT p.productName FROM products p LEFT JOIN orderDetails od ON p.productCode = od.productCode WHERE od.productCode IS NULL;` |
| Total revenue by country | `SELECT country, SUM(amount) as revenue FROM customers JOIN orders USING(customerId) GROUP BY country ORDER BY revenue DESC;` |
|
|
|
|
|
## Comparison with Other Models |
|
|
|
|
|
| Model | Accuracy | Latency | Cost |
|-------|----------|---------|------|
| **Gemma-7B (FT)** | **76.0%** | 500ms | Free |
| Llama-3-8B (FT) | 78.2% | 450ms | Free |
| GPT-4o-mini (FT) | 97.8% | 800ms | $0.30/1K |
| GPT-3.5 Turbo | 78.9% | 500ms | $0.05/1K |
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained primarily on sales database schema |
|
|
- May struggle with very complex nested queries |
|
|
- Best performance on English language queries |
|
|
- Requires GPU for optimal inference speed |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
- **Primary**: Natural language to SQL translation for analytics |
|
|
- **Secondary**: SQL query assistance and education |
|
|
- **Not For**: Production databases without query validation (see the guard sketch below)
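
As a starting point for such validation, generated SQL can be restricted to single read-only statements before it reaches a database. A minimal sketch, not a complete defense against unsafe queries:

```python
import re

# Keywords that modify data or schema and should be rejected outright.
FORBIDDEN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|TRUNCATE|REPLACE|GRANT)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    # Accept exactly one SELECT statement; reject multi-statement payloads
    # and anything containing a write/DDL keyword.
    statements = [s for s in sql.split(";") if s.strip()]
    if len(statements) != 1:
        return False
    first = statements[0].strip()
    return first.upper().startswith("SELECT") and not FORBIDDEN.search(first)

assert is_read_only("SELECT * FROM customers WHERE country = 'France';")
assert not is_read_only("DROP TABLE customers;")
```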
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{gemma7b-sql-ft,
  title={Gemma-7B SQL Expert: Fine-Tuned Model for Text-to-SQL},
  author={Kulalı and Aydın and Alhan and Fidan},
  institution={Eskisehir Technical University},
  year={2024},
  url={https://huggingface.co/estu-research/gemma-7b-sql-ft}
}
```
|
|
|
|
|
## Links |
|
|
|
|
|
- **GitHub**: [Japyh/llm-based-dbms](https://github.com/Japyh/llm-based-dbms) |
|
|
- **Research Paper**: [docs/research_paper_draft.md](https://github.com/Japyh/llm-based-dbms/blob/main/docs/research_paper_draft.md) |
|
|
- **Dataset**: [estu-research/sql-training-dataset](https://huggingface.co/datasets/estu-research/sql-training-dataset) |
|
|
- **Organization**: [estu-research](https://huggingface.co/estu-research) |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work was supported by the TÜBİTAK 2209-A Research Grant at Eskisehir Technical University.
|
|
|
|
|
## License |
|
|
|
|
|
MIT License. See the LICENSE file for details.