--- base_model: Qwen/Qwen3-4B-Base library_name: transformers pipeline_tag: text-generation tags: - text-to-sql - sql - qwen3 - llamafactory - spider - spider-test-suite --- # qwen_4b_sql `qwen_4b_sql` is a `Qwen3-4B-Base` model finetuned for text-to-SQL generation with full SFT on a cleaned split of `PipableAI/pip-txt-to-sql-spider-bird-dataset`. This repository tracks the stronger 4B checkpoint from our H20 single-GPU training runs. In our internal comparison, this checkpoint outperformed the corresponding `Qwen3-1.7B-Base` baseline on Spider execution accuracy. ## Base Model - Base model: [`Qwen/Qwen3-4B-Base`](https://huggingface.co/Qwen/Qwen3-4B-Base) - Finetuning framework: `LLaMA-Factory` - Training mode: `Full SFT` - Task: `schema + question -> SQL only` ## Training Data - Primary dataset: [`PipableAI/pip-txt-to-sql-spider-bird-dataset`](https://huggingface.co/datasets/PipableAI/pip-txt-to-sql-spider-bird-dataset) - We used a cleaned local split derived from that dataset for train/validation ## Training Setup - Hardware: single `NVIDIA H20 96GB` - Precision: `bf16` - Context length: `2048` - Per-device train batch size: `1` - Gradient accumulation steps: `8` - Effective batch size: `8` - Learning rate: `5e-6` - Scheduler: `cosine` - Warmup steps: `300` - Epochs: `4.0` - Template: `qwen3_nothink` - Best-checkpoint selection: `load_best_model_at_end = true` ## Spider Benchmark The following numbers are from Spider dev using the official evaluation tooling: - Official `match` evaluation from `test-suite-sql-eval` - Official Spider `Test Suite` execution evaluation ### Main Results | Metric | Score | | --- | ---: | | Spider official exact match | 35.0% | | Spider Test Suite execution accuracy | 67.6% | ### Difficulty Breakdown | Difficulty | Exact Match | Test Suite Exec | | --- | ---: | ---: | | Easy | 64.9% | 87.5% | | Medium | 37.4% | 72.9% | | Hard | 16.1% | 50.0% | | Extra | 3.6% | 42.2% | ## Notes - This model is stronger under execution-based Spider evaluation than our best `Qwen3-1.7B-Base` run. - In our experiments, exact-match metrics were often stricter than execution-based metrics because semantically valid SQL rewrites do not always match the Spider gold form exactly. - A later 4B rerun with altered training settings underperformed this checkpoint on Spider and is not the checkpoint published here. ## Intended Use This model is intended for: - text-to-SQL research baselines - schema-conditioned SQL generation experiments - single-turn SQL generation from natural language plus schema text It is not validated for: - production-grade database access control - unrestricted execution over arbitrary enterprise schemas - multi-turn agent workflows without extra prompting / tooling ## Example Usage ```python from transformers import AutoTokenizer, AutoModelForCausalLM model_id = "bsq1989/qwen_4b_sql" tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True) model = AutoModelForCausalLM.from_pretrained( model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True, ) prompt = """Generate SQL from the given schema and question. Output SQL only. Schema: CREATE TABLE twitter (TweetID INTEGER, UserID INTEGER, LocationID INTEGER, Lang TEXT, ...); CREATE TABLE location (LocationID INTEGER, Country TEXT, City TEXT, ...); Question: How many tweets are in English? """ messages = [{"role": "user", "content": prompt}] text = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, ) inputs = tokenizer(text, return_tensors="pt").to(model.device) outputs = model.generate(**inputs, max_new_tokens=256) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ## Limitations - Performance drops on more open-ended and heterogeneous SQL benchmarks than Spider. - The model can still produce invalid column references on out-of-distribution schemas. - Benchmark numbers here reflect our current internal setup and should be reproduced with the same evaluation pipeline for strict comparison.