---
base_model:
- griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer
license: apache-2.0
language:
- en
tags:
- text-to-sql
- bird
- grpo
- finer-sql
- code
library_name: transformers
pipeline_tag: text-generation
---
# FINER-SQL-3B-BIRD
Trained from [`griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer`](https://huggingface.co/griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer) using GRPO with two dense rewards from the FINER-SQL paper:
- 🧠 **Memory Reward**: aligns reasoning with verified traces
- ⚛️ **Atomic Reward**: measures operation-level SQL overlap

✅ **67.5% Execution Accuracy on BIRD Dev** when training only on BIRD train. Inference runs on a single 12–24 GB GPU.

See other models: https://huggingface.co/collections/griffith-bigdata/finer-sql

GitHub: https://github.com/thanhdath/finer-sql/tree/main
---
## Comparison: FINER-SQL-3B-BIRD vs FINER-SQL-3B-Spider
Both models share the same Qwen-2.5-Coder-3B-SQL-Writer base. They differ only in the GRPO fine-tuning dataset (BIRD train vs Spider train).
| Model | BIRD Dev (n=30, vav) | Spider Dev (n=30, vav, +agg_hint) | When to use |
|-------|---------------------|----------------------------------|-------------|
| **FINER-SQL-3B-BIRD** *(this model)* | **67.54%** ✅ | 83.8% | Production BIRD; cross-domain SQL where train is BIRD-like |
| [FINER-SQL-3B-Spider](https://huggingface.co/griffith-bigdata/FINER-SQL-3B-Spider) | 63.04% | **85.1%** ✅ | Production Spider / spider-style schemas |
**Why two checkpoints?** BIRD and Spider use different SQL annotation conventions (BIRD: verbose, evidence-based, alias-heavy; Spider: terse, aggregate-first GROUP BY, exact ORDER BY direction). A single model trained on either dataset specialises to its annotations and loses ~1–4 pp on the other benchmark. We tried joint training and inference-time prompt tricks; they bridge most of the gap, but the last 1–2 pp on each benchmark requires the dataset-specific checkpoint.
---
## Inference
### Quick start (vLLM)
```python
from vllm import LLM, SamplingParams
llm = LLM(
    model="griffith-bigdata/FINER-SQL-3B-BIRD",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)
system_prompt = """You are a meticulous SQL expert. Generate a single, correct SQL query for the user question and the provided database schema.
Follow this exact response format:
Rules:
- Output exactly one SQL statement.
- The SQL must be executable on SQLite.
- Do not include any explanatory text.
- Output one SQL statement only. Do not include any extra text, tags, or code fences."""
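# Placeholder inputs for illustration only; replace with your own schema, question, and evidence
schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
question = "How many singers are older than 30?"
evidence = ""
# Generate n=30 candidates with t=1.0, then majority-vote with VAV
sampling = SamplingParams(n=30, temperature=1.0, max_tokens=2048)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Database Schema:\n{schema}\n\nQuestion: {question}\n\nEvidence: {evidence}"},
]
output = llm.chat(messages, sampling)
# Keep only the text after the reasoning block, if the model emitted one
candidate_sqls = [c.text.split("</think>")[-1].strip() for c in output[0].outputs]
# Then run majority voting (vav); see the GitHub repo for the selector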
```
### Recommended evaluation pipeline
1. Generate n=30 SQL candidates per question with temperature=1.0
2. Execute each candidate against the database, collect result tuples
3. Group candidates by execution result; pick the candidate from the **largest non-empty success group** that does **not** match the all-zero or empty pattern (value-aware voting, "vav")
4. Score against gold SQL with the official BIRD evaluator
This pipeline gives 67.54% BIRD Dev EX. See [`evaluation/`](https://github.com/thanhdath/finer-sql/tree/main/evaluation) in the repo for the reference implementation.
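
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the reference selector from the repo: the function names (`run_sql`, `is_trivial`, `value_aware_vote`) are hypothetical, the reading of the "all-zero or empty pattern" as "no rows, or only `NULL`/`0` values" is an assumption, and so is the fallback to trivial results when no other candidate executes.

```python
import sqlite3
from collections import defaultdict


def run_sql(conn, sql):
    """Execute one candidate; return an order-insensitive result key, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def is_trivial(result):
    """Assumed 'all-zero or empty' test: no rows, or every value is NULL or 0."""
    return all(value in (None, 0) for row in result for value in row)


def value_aware_vote(conn, candidates):
    """Pick one SQL string from the largest group of candidates with identical results."""
    groups = defaultdict(list)
    for sql in candidates:
        result = run_sql(conn, sql)
        if result is not None:  # keep only candidates that executed successfully
            groups[result].append(sql)
    if not groups:
        # Nothing executed: fall back to the first candidate (assumption)
        return candidates[0] if candidates else None
    informative = {r: g for r, g in groups.items() if not is_trivial(r)}
    pool = informative or groups  # assumed fallback when every result is trivial
    return max(pool.values(), key=len)[0]
```

Sorting rows by `repr` makes two candidates that return the same rows in a different order vote together, while still distinguishing results that differ in duplicate rows.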
---
## Detailed BIRD Dev results (n=30, vav, V2 prompts)
| Difficulty | Count | Execution Accuracy |
|------------|-------|--------------------|
| Simple | 925 | 74.16% |
| Moderate | 464 | 58.41% |
| Challenging | 145 | 54.48% |
| **All** | **1534** | **67.54%** |
Recall@30 (any-correct rate among 30 candidates): **82.2%**.
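
The two summary numbers above can be sanity-checked with a few lines of Python. The sketch below assumes a hypothetical input layout (one list of per-candidate correctness flags per question) and simply recomputes the overall accuracy as the count-weighted mean of the per-difficulty rows in the table.

```python
def recall_at_k(correct_flags):
    """Fraction of questions with at least one correct candidate.

    correct_flags: one list of booleans per question, one flag per candidate.
    """
    if not correct_flags:
        return 0.0
    hits = sum(1 for flags in correct_flags if any(flags))
    return hits / len(correct_flags)


def weighted_accuracy(rows):
    """Count-weighted mean accuracy; rows are (count, accuracy_percent) pairs."""
    total = sum(count for count, _ in rows)
    return sum(count * acc for count, acc in rows) / total


# Per-difficulty rows from the table above: (count, execution accuracy in %)
bird_dev = [(925, 74.16), (464, 58.41), (145, 54.48)]
overall = round(weighted_accuracy(bird_dev), 2)  # matches the "All" row
```

Recall@30 is an upper bound on what any selector can reach from the same 30 candidates, so the gap between 82.2% and 67.54% is the headroom left for better candidate selection.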
---
## Cross-benchmark: this model on Spider Dev
If you must use a single model across both benchmarks, this checkpoint plus a one-line system-prompt addition gets close to the Spider specialist:
| Setup | Spider Official EX |
|-------|--------------------|
| Default (no prompt change) | 83.2% |
| **+ "list aggregates BEFORE grouping columns" rule** | **83.8%** |
| FINER-SQL-3B-Spider (specialist) | 85.1% |
To enable the Spider-friendly mode, append to the system prompt:
```
- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) in the SELECT clause BEFORE the grouping column(s).
```
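
In code, enabling this mode is a string append. A minimal sketch, where `SPIDER_RULE` and `with_spider_rule` are hypothetical names and `system_prompt` is the prompt string from the quick start:

```python
SPIDER_RULE = (
    "- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) "
    "in the SELECT clause BEFORE the grouping column(s)."
)


def with_spider_rule(system_prompt: str) -> str:
    """Return the system prompt with the Spider-friendly rule added as a last bullet."""
    return system_prompt.rstrip() + "\n" + SPIDER_RULE
```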
---
## Training
| Parameter | Value |
|-----------|-------|
| Base model | `griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer` |
| Algorithm | GRPO |
| Train data | BIRD train (V2 prompts, top-30 GRAST schema) |
| Learning rate | 8e-6 |
| Num generations per prompt | 32 |
| Gradient accumulation | 32 |
| Max completion length | 2048 |
| Max prompt length | 4096 |
| Temperature (rollout) | 1.0 |
| Selection during eval | vav (value-aware voting) |
| Rewards | Execution + Atomic + Memory + Format |
| GPU | 1× A6000 48 GB |
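
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage it is built on: the rollouts sampled for one prompt (32 here) are scored, and each reward is normalised against the mean and standard deviation of its own group. The way the four rewards are combined is not specified in this card, so the equal weights in `total_reward` are purely an assumption, as are the function names and the use of the population standard deviation.

```python
from statistics import mean, pstdev


def total_reward(execution, atomic, memory, fmt, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four reward terms; the equal weights are hypothetical."""
    parts = (execution, atomic, memory, fmt)
    return sum(w * p for w, p in zip(weights, parts))


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for the rollouts of a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group gets the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason dense rewards such as Atomic and Memory help when execution results are all correct or all wrong.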
---
## License
Inherits the base model's license (Apache 2.0). Not for medical, legal, or other safety-critical autonomous decision-making.
---
## Citation
```bibtex
@article{finer-sql-2026,
title = {FINER-SQL: Fine-grained reasoning rewards for small Text-to-SQL models},
author = {Thanh Dat and others},
year = {2026},
}
```