---
base_model:
- griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer
license: apache-2.0
language:
- en
tags:
- text-to-sql
- bird
- grpo
- finer-sql
- code
library_name: transformers
pipeline_tag: text-generation
---

# FINER-SQL-3B-BIRD

Trained from [`griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer`](https://huggingface.co/griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer) using GRPO with two dense rewards from the FINER-SQL paper:

🧠 **Memory Reward**: aligns reasoning with verified traces
⚙️ **Atomic Reward**: measures operation-level SQL overlap

✅ **67.5% Execution Accuracy on BIRD Dev** when training only on BIRD train. Inference runs on a single 12–24 GB GPU.

📄 See other models: https://huggingface.co/collections/griffith-bigdata/finer-sql
📄 GitHub: https://github.com/thanhdath/finer-sql/tree/main

---

## Comparison: FINER-SQL-3B-BIRD vs FINER-SQL-3B-Spider

Both models share the same Qwen-2.5-Coder-3B-SQL-Writer base. They differ only in the GRPO fine-tuning dataset (BIRD train vs Spider train).

| Model | BIRD Dev (n=30, vav) | Spider Dev (n=30, vav, +agg_hint) | When to use |
|-------|---------------------|----------------------------------|-------------|
| **FINER-SQL-3B-BIRD** *(this model)* | **67.54%** ✅ | 83.8% | Production use on BIRD; cross-domain SQL where the target data is BIRD-like |
| [FINER-SQL-3B-Spider](https://huggingface.co/griffith-bigdata/FINER-SQL-3B-Spider) | 63.04% | **85.1%** ✅ | Production use on Spider / Spider-style schemas |

**Why two checkpoints?** BIRD and Spider use different SQL annotation conventions (BIRD: verbose, evidence-based, alias-heavy; Spider: terse, aggregate-first GROUP BY, exact ORDER BY direction). A model trained on either dataset specialises to that dataset's annotations and loses ~1–4 pp on the other benchmark. We tried joint training and inference-time prompt tricks; they bridge most of the gap, but the last 1–2 pp on each benchmark require the dataset-specific checkpoint.

---

## Inference

### Quick start (vLLM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="griffith-bigdata/FINER-SQL-3B-BIRD",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

system_prompt = """You are a meticulous SQL expert. Generate a single, correct SQL query for the user question and the provided database schema.
Follow this exact response format:

Rules:
- Output exactly one SQL statement.
- The SQL must be executable on SQLite.
- Do not include any explanatory text.
- Output one SQL statement only. Do not include any extra text, tags, or code fences."""

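# Placeholder inputs (illustrative values, not taken from BIRD); replace with your own.
schema = "CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
question = "How many singers are older than 30?"
evidence = ""  # BIRD-style external knowledge hint; may be empty

# Generate n=30 candidates at temperature 1.0, then select one with value-aware voting (vav)
sampling = SamplingParams(n=30, temperature=1.0, max_tokens=2048)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Database Schema:\n{schema}\n\nQuestion: {question}\n\nEvidence: {evidence}"},
]
output = llm.chat(messages, sampling)
# Keep only the text after the reasoning block, if the model emitted one
candidate_sqls = [c.text.split("</think>")[-1].strip() for c in output[0].outputs]
# Then run majority voting (vav); see the GitHub repo for the selector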
```

### Recommended evaluation pipeline

1. Generate n=30 SQL candidates per question with temperature=1.0
2. Execute each candidate against the database, collect result tuples
3. Group the candidates that executed successfully by their execution result; pick a candidate from the **largest group** whose result is **neither empty nor all zeros** (value-aware voting, "vav")
4. Score against gold SQL with the official BIRD evaluator

This pipeline gives 67.54% BIRD Dev EX. See [`evaluation/`](https://github.com/thanhdath/finer-sql/tree/main/evaluation) in the repo for the reference implementation.
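
Steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the repo's selector: the function names (`execute`, `is_trivial`, `value_aware_vote`) are hypothetical, results are compared without regard to row order, "all-zero" is assumed to mean every value is 0 or NULL, and falling back to a trivial group when nothing else executed is also an assumption.

```python
import sqlite3
from collections import defaultdict


def execute(conn, sql):
    """Return a hashable, order-insensitive result, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return tuple(sorted(rows, key=repr))


def is_trivial(result):
    """True for an empty result or one whose every value is 0 or NULL (assumed definition)."""
    return all(value in (0, None) for row in result for value in row)


def value_aware_vote(conn, candidates):
    """Pick one SQL string from the largest group of candidates with a non-trivial result."""
    groups = defaultdict(list)
    for sql in candidates:
        result = execute(conn, sql)
        if result is not None:  # skip candidates that failed to execute
            groups[result].append(sql)
    if not groups:
        return candidates[0] if candidates else None
    informative = {r: sqls for r, sqls in groups.items() if not is_trivial(r)}
    pool = informative or groups  # assumed fallback: only trivial results exist
    return max(pool.values(), key=len)[0]


# Toy database and candidates, for illustration only
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE singer (name TEXT, age INTEGER);"
    "INSERT INTO singer VALUES ('A', 25), ('B', 35), ('C', 40);"
)
candidates = [
    "SELECT COUNT(*) FROM singer WHERE age > 100",
    "SELECT COUNT(*) FROM singer WHERE age > 100",
    "SELECT COUNT(*) FROM singer WHERE age > 100",
    "SELECT COUNT(*) FROM singer WHERE age > 30",
    "SELECT COUNT(*) FROM singer WHERE age >= 31",
    "SELECT COUNT(*) FROM sing",  # fails: no such table
]
chosen = value_aware_vote(conn, candidates)
print(chosen)
```

In this toy run the all-zero group is the largest (three candidates), but the vote skips it and returns a query from the two-candidate group that produced a non-zero count. A plain majority vote would have picked the all-zero answer.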

---

## Detailed BIRD Dev results (n=30, vav, V2 prompts)

| Difficulty | Count | Execution Accuracy |
|------------|-------|--------------------|
| Simple | 925 | 74.16% |
| Moderate | 464 | 58.41% |
| Challenging | 145 | 54.48% |
| **All** | **1534** | **67.54%** |

Recall@30 (any-correct rate among 30 candidates): **82.2%**.
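
As a sanity check, the overall figure is the count-weighted average of the per-difficulty rows. The snippet below recomputes it from the table and shows how an any-correct rate such as Recall@30 could be computed; `recall_at_k` and the toy flags are hypothetical and not taken from the evaluation code.

```python
# Per-difficulty (count, execution accuracy in %) from the table above
buckets = {
    "simple": (925, 74.16),
    "moderate": (464, 58.41),
    "challenging": (145, 54.48),
}

total = sum(count for count, _ in buckets.values())
overall = sum(count * acc for count, acc in buckets.values()) / total
print(total, round(overall, 2))  # 1534 67.54


def recall_at_k(correct_flags):
    """Fraction of questions with at least one correct candidate.

    correct_flags holds one list of booleans per question, one flag per candidate.
    """
    return sum(any(flags) for flags in correct_flags) / len(correct_flags)


# Toy flags for four questions with two candidates each (illustrative only)
toy_flags = [[False, True], [False, False], [True, True], [False, False]]
print(recall_at_k(toy_flags))  # 0.5
```

The gap between Recall@30 (82.2%) and the voted accuracy (67.54%) covers the questions where a correct candidate was generated but not selected.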

---

## Cross-benchmark: this model on Spider Dev

If you must use a single model across both benchmarks, this checkpoint plus a one-line system-prompt addition gets close to the Spider specialist:

| Setup | Spider Official EX |
|-------|--------------------|
| Default (no prompt change) | 83.2% |
| **+ "list aggregates BEFORE grouping columns" rule** | **83.8%** |
| FINER-SQL-3B-Spider (specialist) | 85.1% |

To enable the Spider-friendly mode, append to the system prompt:
```
- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) in the SELECT clause BEFORE the grouping column(s).
```
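
In code, the addition is a single string concatenation. The sketch below assumes the base prompt ends with its rule list, as in the quick start; `with_spider_rule` is a hypothetical helper name and `base_prompt` is a shortened stand-in for the full system prompt.

```python
SPIDER_RULE = (
    "- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) "
    "in the SELECT clause BEFORE the grouping column(s)."
)


def with_spider_rule(system_prompt: str) -> str:
    """Append the Spider-friendly rule as one more bullet under the prompt's rules."""
    return system_prompt.rstrip() + "\n" + SPIDER_RULE


# Shortened stand-in for the system prompt from the quick start
base_prompt = "Rules:\n- Output exactly one SQL statement.\n"
spider_prompt = with_spider_rule(base_prompt)
print(spider_prompt)
```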

---

## Training

| Parameter | Value |
|-----------|-------|
| Base model | `griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer` |
| Algorithm | GRPO |
| Train data | BIRD train (V2 prompts, top-30 GRAST schema) |
| Learning rate | 8e-6 |
| Num generations per prompt | 32 |
| Gradient accumulation | 32 |
| Max completion length | 2048 |
| Max prompt length | 4096 |
| Temperature (rollout) | 1.0 |
| Selection during eval | vav (value-aware voting) |
| Rewards | Execution + Atomic + Memory + Format |
| GPU | 1× A6000 48 GB |
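
For readers unfamiliar with GRPO, "Num generations per prompt" is the size of the group over which rewards are compared. The sketch below shows the standard group-relative advantage (reward minus the group mean, divided by the group standard deviation). It illustrates the general algorithm only: whether this training run normalizes by the standard deviation, and how the four rewards are weighted into a single score, is not stated here. The function name and toy rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize the rewards of completions sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four completions (training used 32 per prompt); rewards are made up
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above their group's mean get a positive advantage and are reinforced; those below get a negative one. If every completion in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason dense rewards help on prompts where execution alone rarely succeeds.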

---

## License

Inherits the base model's license (Apache 2.0). Not for medical, legal, or other safety-critical autonomous decision-making.

---

## Citation

```bibtex
@article{finer-sql-2026,
  title  = {FINER-SQL: Fine-grained reasoning rewards for small Text-to-SQL models},
  author = {Thanh Dat and others},
  year   = {2026},
}
```