	---
	base_model:
	- griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer
	license: apache-2.0
	language:
	- en
	tags:
	- text-to-sql
	- bird
	- grpo
	- finer-sql
	- code
	library_name: transformers
	pipeline_tag: text-generation
	---

	# FINER-SQL-3B-BIRD

	Trained from [`griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer`](https://huggingface.co/griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer) using GRPO with two dense rewards from the FINER-SQL paper:

	- 🧠 Memory Reward: aligns reasoning with verified traces
	- ⚙️ Atomic Reward: measures operation-level SQL overlap

	✅ 67.5% Execution Accuracy on BIRD Dev when training only on BIRD train. Inference runs on a single 12–24 GB GPU.

	📄 See other models: https://huggingface.co/collections/griffith-bigdata/finer-sql
	📄 GitHub: https://github.com/thanhdath/finer-sql/tree/main

	---

	## Comparison: FINER-SQL-3B-BIRD vs FINER-SQL-3B-Spider

	Both models share the same Qwen-2.5-Coder-3B-SQL-Writer base. They differ only in the GRPO fine-tuning dataset (BIRD train vs Spider train).

	| Model | BIRD Dev (n=30, vav) | Spider Dev (n=30, vav, +agg_hint) | When to use |
	|-------|----------------------|-----------------------------------|-------------|
	| FINER-SQL-3B-BIRD (this model) | 67.54% ✅ | 83.8% | Production BIRD; cross-domain SQL where train is BIRD-like |
	| [FINER-SQL-3B-Spider](https://huggingface.co/griffith-bigdata/FINER-SQL-3B-Spider) | 63.04% | 85.1% ✅ | Production Spider / spider-style schemas |

	Why two checkpoints? BIRD and Spider use different SQL annotation conventions (BIRD: verbose, evidence-based, alias-heavy; Spider: terse, aggregate-first GROUP BY, exact ORDER BY direction). A model trained on one dataset specialises to its annotations and loses accuracy on the other benchmark: in the table above, this model trails the Spider specialist by 1.3 pp on Spider Dev, and the Spider model trails this one by 4.5 pp on BIRD Dev. We tried joint training and inference-time prompt tricks; they bridge most of the gap, but the last 1–2 pp on each benchmark require the dataset-specific checkpoint.

	---

	## Inference

	### Quick start (vLLM)

	```python
	from vllm import LLM, SamplingParams

	llm = LLM(
	    model="griffith-bigdata/FINER-SQL-3B-BIRD",
	    dtype="bfloat16",
	    max_model_len=4096,
	    gpu_memory_utilization=0.85,
	)

	system_prompt = """You are a meticulous SQL expert. Generate a single, correct SQL query for the user question and the provided database schema.
	Follow this exact response format:

	Rules:
	- Output exactly one SQL statement.
	- The SQL must be executable on SQLite.
	- Do not include any explanatory text.
	- Output one SQL statement only. Do not include any extra text, tags, or code fences."""

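	# Placeholder inputs for illustration; replace with your own schema, question, and evidence
	schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
	question = "How many singers are older than 30?"
	evidence = ""  # BIRD supplies an evidence hint per question; leave empty if you have none

	# Generate n=30 candidates at temperature 1.0, then majority-vote with vav
	sampling = SamplingParams(n=30, temperature=1.0, max_tokens=2048)

	messages = [
	    {"role": "system", "content": system_prompt},
	    {"role": "user", "content": f"Database Schema:\n{schema}\n\nQuestion: {question}\n\nEvidence: {evidence}"},
	]
	output = llm.chat(messages, sampling)
	# Keep only the text after the reasoning block, if the model emitted one
	candidate_sqls = [c.text.split("</think>")[-1].strip() for c in output[0].outputs]
	# Then run value-aware voting (vav); see the GitHub repo for the selector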
	```

	### Recommended evaluation pipeline

	1. Generate n=30 SQL candidates per question with temperature=1.0
	2. Execute each candidate against the database, collect result tuples
	3. Group the candidates that executed successfully by execution result, then pick a candidate from the largest group whose result is neither empty nor all zeros (value-aware voting, "vav")
	4. Score against gold SQL with the official BIRD evaluator

	This pipeline gives 67.54% BIRD Dev EX. See [`evaluation/`](https://github.com/thanhdath/finer-sql/tree/main/evaluation) in the repo for the reference implementation.
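
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the reference selector from the repo. The function names (`execute`, `is_degenerate`, `value_aware_vote`) are hypothetical. The sketch also assumes three details the list above does not specify: results are compared without regard to row order, a result made only of NULLs counts as degenerate alongside all-zero and empty results, and the selector falls back to plain majority voting when every group is degenerate.

```python
import sqlite3
from collections import defaultdict


def execute(conn, sql):
    """Run one candidate; return a hashable result signature, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Assumption: sort rows so that row order does not split identical results.
    return tuple(sorted(rows, key=repr))


def is_degenerate(result):
    """True for an empty result, or one whose every value is NULL or zero."""
    return all(v is None or v == 0 for row in result for v in row)


def value_aware_vote(conn, candidates):
    """Pick one SQL string from the largest informative execution-result group."""
    groups = defaultdict(list)
    for sql in candidates:
        result = execute(conn, sql)
        if result is not None:  # keep only candidates that executed
            groups[result].append(sql)
    if not groups:
        # Assumption: if nothing executes, return the first candidate (or None).
        return candidates[0] if candidates else None
    informative = {r: g for r, g in groups.items() if not is_degenerate(r)}
    # Assumption: fall back to plain majority voting if every group is degenerate.
    pool = informative or groups
    return max(pool.values(), key=len)[0]
```

When running this against real databases, consider opening the connection read-only (for example `sqlite3.connect("file:db.sqlite?mode=ro", uri=True)`, where `db.sqlite` is a placeholder path) so that a stray non-SELECT candidate cannot modify the data.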

	---

	## Detailed BIRD Dev results (n=30, vav, V2 prompts)

	| Difficulty | Count | Execution Accuracy |
	|------------|-------|--------------------|
	| Simple | 925 | 74.16% |
	| Moderate | 464 | 58.41% |
	| Challenging | 145 | 54.48% |
	| All | 1534 | 67.54% |

	Recall@30 (any-correct rate among 30 candidates): 82.2%.
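
As a quick consistency check, the "All" row is the count-weighted average of the three difficulty buckets. The snippet below only reproduces that arithmetic from the numbers in the table; the variable names are illustrative.

```python
# (difficulty, question count, execution accuracy in %) copied from the table
buckets = [
    ("Simple", 925, 74.16),
    ("Moderate", 464, 58.41),
    ("Challenging", 145, 54.48),
]

total = sum(count for _, count, _ in buckets)
overall = sum(count * acc for _, count, acc in buckets) / total

print(total)              # 1534
print(round(overall, 2))  # 67.54
```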

	---

	## Cross-benchmark: this model on Spider Dev

	If you must use a single model across both benchmarks, this checkpoint plus a one-line system-prompt addition gets close to the Spider specialist:

	| Setup | Spider Official EX |
	|-------|--------------------|
	| Default (no prompt change) | 83.2% |
	| + "list aggregates BEFORE grouping columns" rule | 83.8% |
	| FINER-SQL-3B-Spider (specialist) | 85.1% |

	To enable the Spider-friendly mode, append to the system prompt:
	```
	- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) in the SELECT clause BEFORE the grouping column(s).
	```
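
In code, this amounts to adding one more bullet under the prompt's `Rules:` list. The sketch below is illustrative: `base_system_prompt` is a shortened placeholder for the full system prompt from the Quick start, and `with_spider_rule` is a hypothetical helper name.

```python
# Placeholder: in practice, use the full system prompt from the Quick start.
base_system_prompt = "You are a meticulous SQL expert.\n\nRules:\n- Output exactly one SQL statement."

SPIDER_RULE = (
    "- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) "
    "in the SELECT clause BEFORE the grouping column(s)."
)


def with_spider_rule(system_prompt: str) -> str:
    """Append the aggregate-ordering rule as one more bullet at the end of the prompt."""
    return system_prompt.rstrip() + "\n" + SPIDER_RULE


spider_system_prompt = with_spider_rule(base_system_prompt)
```

Pass `spider_system_prompt` as the system message in place of the default prompt when evaluating on Spider-style schemas.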

	---

	## Training

	| Parameter | Value |
	|-----------|-------|
	| Base model | `griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer` |
	| Algorithm | GRPO |
	| Train data | BIRD train (V2 prompts, top-30 GRAST schema) |
	| Learning rate | 8e-6 |
	| Num generations per prompt | 32 |
	| Gradient accumulation | 32 |
	| Max completion length | 2048 |
	| Max prompt length | 4096 |
	| Temperature (rollout) | 1.0 |
	| Selection during eval | vav (value-aware voting) |
	| Rewards | Execution + Atomic + Memory + Format |
	| GPU | 1× A6000 48 GB |

	---

	## License

	Inherits the base model's license (Apache 2.0). Not for medical, legal, or other safety-critical autonomous decision-making.

	---

	## Citation

	```bibtex
	@article{finer-sql-2026,
	title = {FINER-SQL: Fine-grained reasoning rewards for small Text-to-SQL models},
	author = {Thanh Dat and others},
	year = {2026},
	}
	```