---
language: en
tags:
- text-to-sql
- gpt2
- fine-tuned
- sql-generation
datasets:
- xlangai/spider
---
# GPT-2 Medium — SQL Query Generator

GPT-2 Medium fine-tuned on the Spider text-to-SQL dataset to generate SQL queries from natural-language questions.
## Training

- Base model: GPT-2 Medium (354M parameters)
- Dataset: Spider (7,000 training / 1,034 validation examples)
- Method: full fine-tuning (all weights updated)
- Best checkpoint: epoch 1 (validation loss 1.410)
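The exact preprocessing is not included in this card. A minimal sketch of how Spider-style examples can be serialized into single training strings for causal-LM fine-tuning — the `format_example` helper and the `Question:`/`SQL:` template are assumptions chosen to match the prompt format in the Usage section; the `question`/`query` field names follow the Spider dataset schema:

```python
# Sketch: turn one Spider-style example into a fine-tuning string for a
# causal LM. Appending the EOS token teaches the model where a query ends.

def format_example(example: dict, eos_token: str = "<|endoftext|>") -> str:
    """Serialize a {'question', 'query'} pair into one training string."""
    return f"Question: {example['question']}\nSQL: {example['query']}{eos_token}"

sample = {
    "question": "How many singers are there?",
    "query": "SELECT count(*) FROM singer",
}
print(format_example(sample))
```

Strings like these would then be tokenized and fed to a standard language-modeling objective (e.g. with the `transformers` `Trainer`).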
## Usage

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("your-username/gpt2-medium-sql-generator")
tokenizer = GPT2Tokenizer.from_pretrained("your-username/gpt2-medium-sql-generator")

prompt = "Question: How many singers are there?\nSQL:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding; cap new tokens so generation stops after the query.
output = model.generate(
    **inputs,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
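Because a decoder-only model echoes the prompt, the decoded output contains both the question and the SQL. A small post-processing helper (hypothetical, not part of the released model) can pull out just the query:

```python
def extract_sql(generated: str) -> str:
    """Return the text after the last 'SQL:' marker, trimmed to one statement."""
    sql = generated.rsplit("SQL:", 1)[-1].strip()
    # Keep only the first statement in case the model keeps generating.
    return sql.split(";")[0].strip()

text = "Question: How many singers are there?\nSQL: SELECT count(*) FROM singer"
print(extract_sql(text))  # SELECT count(*) FROM singer
```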
## Limitations

- GPT-2 Medium is small by current standards; generated SQL may hallucinate table and column names
- No schema awareness: the model sees only the question, so it works best on domains covered by the Spider training data (e.g. the singer/concert databases)
- Intended as a learning project demonstrating a full fine-tuning pipeline, not for production use