---
language: en
tags:
- text-to-sql
- codet5
- fine-tuned
- sql-generation
datasets:
- b-mc2/sql-create-context
---

# CodeT5+ 220M: SQL Query Generator

CodeT5+ 220M fine-tuned on the `b-mc2/sql-create-context` dataset to generate SQL queries from a natural-language question and the schema of the tables it refers to.

## Training
- Base model: `Salesforce/codet5p-220m` (220M parameters)
- Dataset: `b-mc2/sql-create-context` (55k train / 23k validation)
- Method: full fine-tuning with schema-aware prompting (the table schema is included in the input alongside the question)
- Optimizer: AdamW, learning rate 3e-4, 5 epochs

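As a rough illustration of schema-aware prompting, the sketch below turns one dataset row into a source/target text pair. It assumes the training inputs used the same `Schema: ... Question: ...` format as the Usage example, and that rows expose `question`, `context`, and `answer` fields. The helper names `build_prompt` and `to_seq2seq_pair` and the sample row are hypothetical, not taken from the actual training script.

```python
def build_prompt(context: str, question: str) -> str:
    """Format a schema and a question in the layout used by the Usage example."""
    return f"Schema: {context}\nQuestion: {question}"


def to_seq2seq_pair(example: dict) -> dict:
    """Map one dataset row to a source/target text pair (hypothetical helper)."""
    return {
        "source": build_prompt(example["context"], example["question"]),
        "target": example["answer"],
    }


# Illustrative row, written by hand rather than copied from the dataset.
row = {
    "question": "What is the average salary?",
    "context": "CREATE TABLE employees (id INT, name VARCHAR, salary INT)",
    "answer": "SELECT AVG(salary) FROM employees",
}

pair = to_seq2seq_pair(row)
print(pair["source"])
print(pair["target"])
```

The source text would then be tokenized as the encoder input and the target text as the decoder labels.
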
## Usage
```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Replace with the actual Hub repository ID of this model.
model_id = "your-username/codet5p-sql-generator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Prompt format: the CREATE TABLE schema first, then the question.
context = "CREATE TABLE employees (id INT, name VARCHAR, salary INT)"
question = "What is the average salary?"
src = f"Schema: {context}\nQuestion: {question}"

# Cap the input at 512 tokens, the max_length this model card states.
inputs = tokenizer(src, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Improvements over GPT-2 baseline
- Schema-aware prompting puts the table and column names in the input, which reduces column/table hallucination
- Encoder-decoder architecture, a natural fit for sequence-to-sequence SQL generation
- Roughly 10x more data (78k examples across train and validation vs 7k)
- `max_length=512` leaves room for the schema and reduces the risk of truncating it

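One way to spot-check a generated query for hallucinated column or table names is to compile it against the schema in an in-memory SQLite database. This is only a sketch and not part of the model's own evaluation. The function name `references_valid_schema` is hypothetical, and the check assumes the schema and query are valid in SQLite's dialect.

```python
import sqlite3


def references_valid_schema(schema_sql: str, query: str) -> bool:
    """Return True if `query` compiles against the tables in `schema_sql`."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)    # create the (empty) tables
        conn.execute(f"EXPLAIN {query}")  # compile only; unknown names raise
        return True
    except (sqlite3.Error, sqlite3.Warning):
        return False
    finally:
        conn.close()


schema = "CREATE TABLE employees (id INT, name VARCHAR, salary INT)"
ok = references_valid_schema(schema, "SELECT AVG(salary) FROM employees")
bad = references_valid_schema(schema, "SELECT AVG(wage) FROM employees")
print(ok, bad)
```

The check only catches names that SQLite cannot resolve. It says nothing about whether the query answers the question, and SQLite may treat a double-quoted unknown identifier as a string literal, so such cases can slip through.
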