---
language: en
tags:
- text-to-sql
- codet5
- fine-tuned
- sql-generation
datasets:
- b-mc2/sql-create-context
---
# CodeT5+ 220M — SQL Query Generator
This model is CodeT5+ 220M fine-tuned on the sql-create-context dataset. Given a natural-language question and a table schema, it generates the corresponding SQL query.
## Training
- Base model: Salesforce/codet5p-220m (220M parameters)
- Dataset: b-mc2/sql-create-context (55k train / 23k val)
- Method: Full fine-tuning with schema-aware prompting
- Optimizer: AdamW, lr=3e-4, 5 epochs
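The schema-aware prompting listed above can be sketched as a small formatting step. This is a minimal illustration, not the actual training script: it assumes training used the same `Schema: ... Question: ...` template shown in the usage example, and that each dataset record exposes `context`, `question`, and `answer` fields. The helper names `build_prompt` and `to_training_pair` are hypothetical.

```python
def build_prompt(context, question):
    """Format a schema and a question into one source string.

    Assumption: this mirrors the template from the usage example.
    """
    return f"Schema: {context}\nQuestion: {question}"


def to_training_pair(record):
    """Turn one dataset record into a (source, target) pair for seq2seq training.

    Assumption: the record has "context", "question", and "answer" keys.
    """
    source = build_prompt(record["context"], record["question"])
    target = record["answer"]
    return source, target


# Hand-written record for illustration only, not taken from the dataset.
example = {
    "context": "CREATE TABLE employees (id INT, name VARCHAR, salary INT)",
    "question": "What is the average salary?",
    "answer": "SELECT AVG(salary) FROM employees",
}

source, target = to_training_pair(example)
print(source)
print(target)
```

Keeping the template in a single function makes it easier to use the identical format at training and inference time.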
## Usage
```python
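from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch

# Replace "your-username" with the account that hosts the fine-tuned checkpoint.
model = T5ForConditionalGeneration.from_pretrained("your-username/codet5p-sql-generator")
tokenizer = AutoTokenizer.from_pretrained("your-username/codet5p-sql-generator")
model.eval()

context = "CREATE TABLE employees (id INT, name VARCHAR, salary INT)"
question = "What is the average salary?"
# Schema-aware prompt: the schema comes first, then the question.
src = f"Schema: {context}\nQuestion: {question}"
# Truncate at 512 tokens, the input limit listed under the improvements below.
inputs = tokenizer(src, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))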
```
## Improvements over GPT-2 baseline
- Schema-aware prompting reduces column/table hallucination by placing the CREATE TABLE statement in the input
- Encoder-decoder architecture suited for seq2seq SQL generation
- Roughly 10x more data (78k examples across train and validation vs 7k)
- max_length=512 on the input leaves room for the schema, so it is less likely to be truncated
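Whether a generated query refers only to names that exist in the schema can be checked mechanically. The sketch below is one rough way to do that and is not part of the model or its evaluation: the helper names `schema_names` and `unknown_identifiers` are hypothetical, the keyword list is deliberately incomplete, and the regex-based parsing assumes a single simple `CREATE TABLE` statement without commas inside type definitions. Table aliases such as `T1` would be flagged as unknown.

```python
import re

# Small, deliberately incomplete keyword list; extend it for your SQL dialect.
SQL_KEYWORDS = {
    "select", "from", "where", "and", "or", "not", "in", "as", "on", "join",
    "inner", "left", "right", "outer", "group", "by", "order", "having",
    "limit", "distinct", "asc", "desc", "between", "like", "is", "null",
    "count", "sum", "avg", "min", "max", "union", "intersect", "except",
}


def schema_names(create_stmt):
    """Collect the table name and column names from one simple CREATE TABLE."""
    names = set()
    table = re.search(r"CREATE\s+TABLE\s+(\w+)", create_stmt, flags=re.IGNORECASE)
    if table:
        names.add(table.group(1).lower())
    body = re.search(r"\((.*)\)", create_stmt, flags=re.DOTALL)
    if body:
        for column_def in body.group(1).split(","):
            tokens = column_def.split()
            if tokens:
                # The first token of each column definition is its name.
                names.add(tokens[0].lower())
    return names


def unknown_identifiers(sql, create_stmt):
    """Return identifiers in `sql` that are neither keywords nor schema names."""
    # Drop quoted literals so their contents are not mistaken for identifiers.
    without_literals = re.sub(r"'[^']*'|\"[^\"]*\"", " ", sql)
    identifiers = {tok.lower() for tok in re.findall(r"[A-Za-z_]\w*", without_literals)}
    return identifiers - SQL_KEYWORDS - schema_names(create_stmt)


schema = "CREATE TABLE employees (id INT, name VARCHAR, salary INT)"
good = unknown_identifiers("SELECT AVG(salary) FROM employees", schema)
bad = unknown_identifiers("SELECT AVG(wage) FROM employees", schema)
print(good)  # set()
print(bad)   # {'wage'}
```

A non-empty result only suggests a possible hallucinated name. For anything beyond a quick sanity check, a real SQL parser would be more reliable.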