---
language: en
tags:
- text-to-sql
- codet5
- fine-tuned
- sql-generation
datasets:
- b-mc2/sql-create-context
---

# CodeT5+ 220M: SQL Query Generator

CodeT5+ 220M fine-tuned on the sql-create-context dataset to generate SQL queries from a natural language question and a table schema.

## Training

- Base model: Salesforce/codet5p-220m (220M parameters)
- Dataset: b-mc2/sql-create-context (55k train / 23k val)
- Method: Full fine-tuning with schema-aware prompting
- Optimizer: AdamW, lr=3e-4, 5 epochs

## Usage

```python
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch

model = T5ForConditionalGeneration.from_pretrained("your-username/codet5p-sql-generator")
tokenizer = AutoTokenizer.from_pretrained("your-username/codet5p-sql-generator")
model.eval()

# The schema goes in as a CREATE TABLE statement, followed by the question.
context = "CREATE TABLE employees (id INT, name VARCHAR, salary INT)"
question = "What is the average salary?"
src = f"Schema: {context}\nQuestion: {question}"

# Cap the input at 512 tokens, matching the max_length noted below.
inputs = tokenizer(src, return_tensors="pt", truncation=True, max_length=512)

# Beam search, no gradients needed for inference.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Improvements over GPT-2 baseline

- Schema-aware prompting puts the table definition in the input, which curbs column/table hallucination
- Encoder-decoder architecture is well suited to seq2seq SQL generation
- Roughly 10x more data (78k examples across train and validation, vs 7k)
- max_length=512 avoids truncating the schema for typical inputs
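
## Prompt format sketch

The schema-aware prompt from the usage example can be wrapped in a small helper so that training and inference format their inputs identically. The following is a minimal sketch, not the actual preprocessing code: the function names `build_prompt` and `to_training_pair` are hypothetical, the record is a hand-written illustration rather than a row from the dataset, and it assumes each record exposes `question`, `context`, and `answer` fields as described on the b-mc2/sql-create-context dataset card.

```python
from typing import Dict, Tuple


def build_prompt(context: str, question: str) -> str:
    """Format a schema and a question in the 'Schema: ... / Question: ...' layout."""
    # Strip stray whitespace so the prompt layout stays consistent.
    return f"Schema: {context.strip()}\nQuestion: {question.strip()}"


def to_training_pair(record: Dict[str, str]) -> Tuple[str, str]:
    """Map one record to (source, target) strings for seq2seq fine-tuning.

    Assumes the record has 'context', 'question', and 'answer' keys.
    """
    source = build_prompt(record["context"], record["question"])
    target = record["answer"].strip()
    return source, target


# Hand-written illustrative record (not taken from the dataset).
example = {
    "question": "What is the average salary?",
    "context": "CREATE TABLE employees (id INT, name VARCHAR, salary INT)",
    "answer": "SELECT AVG(salary) FROM employees",
}

source, target = to_training_pair(example)
print(source)
print(target)
```

Using the same helper on both sides keeps the input the model sees at inference time in the format it saw during fine-tuning.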