Update README with comparison table to FINER-SQL-3B-Spider
#1
by thanhdath - opened
README.md
CHANGED
---
base_model:
- griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer
license: apache-2.0
language:
- en
tags:
- text-to-sql
- bird
- grpo
- finer-sql
- code
library_name: transformers
pipeline_tag: text-generation
---

# FINER-SQL-3B-BIRD

Trained from [`griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer`](https://huggingface.co/griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer) using GRPO with two dense rewards from the FINER-SQL paper:

- 🧠 **Memory Reward** – aligns reasoning with verified traces
- ⚖️ **Atomic Reward** – measures operation-level SQL overlap

✅ **67.5% execution accuracy on BIRD Dev** when training only on the BIRD train split. Inference runs on a single 12–24 GB GPU.

👉 See other models: https://huggingface.co/collections/griffith-bigdata/finer-sql

🔗 GitHub: https://github.com/thanhdath/finer-sql/tree/main

---

## Comparison: FINER-SQL-3B-BIRD vs FINER-SQL-3B-Spider

Both models share the same Qwen-2.5-Coder-3B-SQL-Writer base. They differ only in the GRPO fine-tuning dataset (BIRD train vs. Spider train).

| Model | BIRD Dev (n=30, vav) | Spider Dev (n=30, vav, +agg_hint) | When to use |
|-------|----------------------|-----------------------------------|-------------|
| **FINER-SQL-3B-BIRD** *(this model)* | **67.54%** ✅ | 83.8% | Production BIRD; cross-domain SQL where the training data is BIRD-like |
| [FINER-SQL-3B-Spider](https://huggingface.co/griffith-bigdata/FINER-SQL-3B-Spider) | 63.04% | **85.1%** ✅ | Production Spider / Spider-style schemas |

**Why two checkpoints?** BIRD and Spider use different SQL annotation conventions (BIRD: verbose, evidence-based, alias-heavy; Spider: terse, aggregate-first GROUP BY, exact ORDER BY direction). A single model trained on either dataset specialises to its annotations and loses roughly 1–4 pp on the other benchmark. We tried joint training and inference-time prompt tricks; they bridge most of the gap, but the last 1–2 pp on each benchmark requires the dataset-specific checkpoint.

---

## Inference

### Quick start (vLLM)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="griffith-bigdata/FINER-SQL-3B-BIRD",
    dtype="bfloat16",
    max_model_len=4096,
    gpu_memory_utilization=0.85,
)

system_prompt = """You are a meticulous SQL expert. Generate a single, correct SQL query for the user question and the provided database schema.
Follow this exact response format:

Rules:
- Output exactly one SQL statement.
- The SQL must be executable on SQLite.
- Do not include any explanatory text.
- Output one SQL statement only. Do not include any extra text, tags, or code fences."""

# Generate n=30 candidates at temperature 1.0, then majority-vote with vav
sampling = SamplingParams(n=30, temperature=1.0, max_tokens=2048)

# schema, question, evidence come from the BIRD example being answered
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Database Schema:\n{schema}\n\nQuestion: {question}\n\nEvidence: {evidence}"},
]
output = llm.chat(messages, sampling)
candidate_sqls = [c.text.split("</think>")[-1].strip() for c in output[0].outputs]
# Then run majority voting (vav) – see the GitHub repo for the selector
```

### Recommended evaluation pipeline

1. Generate n=30 SQL candidates per question at temperature 1.0.
2. Execute each candidate against the database and collect the result tuples.
3. Group candidates by execution result; pick the candidate from the **largest non-empty success group** that does **not** match the all-zero or empty pattern (value-aware voting, "vav").
4. Score against the gold SQL with the official BIRD evaluator.

This pipeline gives 67.54% BIRD Dev EX. See [`evaluation/`](https://github.com/thanhdath/finer-sql/tree/main/evaluation) in the repo for the reference implementation.
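The four steps above can be sketched as follows. This is an illustrative approximation of value-aware voting on SQLite, not the repo's reference selector (use `evaluation/` there for scoring):

```python
import sqlite3
from collections import Counter

def value_aware_vote(candidate_sqls, db_path):
    """Illustrative sketch of value-aware voting ("vav"), NOT the
    reference selector from the FINER-SQL repo.

    Executes every candidate, drops failures, empty results, and
    all-zero/NULL results, then returns a candidate from the largest
    group of candidates that agree on the same execution result.
    """
    executed = []  # (sql, hashable execution result)
    for sql in candidate_sqls:
        try:
            with sqlite3.connect(db_path) as conn:
                rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # unexecutable candidates never vote
        if not rows:
            continue  # empty-result pattern: excluded from voting
        if all(v in (0, None) for row in rows for v in row):
            continue  # all-zero/NULL pattern: excluded from voting
        executed.append((sql, tuple(map(tuple, rows))))

    if not executed:
        # Fallback when every candidate was filtered out
        return candidate_sqls[0] if candidate_sqls else None

    # Largest group of agreeing candidates wins; return its first member
    winner, _ = Counter(result for _, result in executed).most_common(1)[0]
    return next(sql for sql, result in executed if result == winner)
```

Two candidates that produce the same result tuples vote together even when their SQL text differs, which is what makes the vote value-aware rather than string-based.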

---

## Detailed BIRD Dev results (n=30, vav, V2 prompts)

| Difficulty | Count | Execution Accuracy |
|------------|-------|--------------------|
| Simple | 925 | 74.16% |
| Moderate | 464 | 58.41% |
| Challenging | 145 | 54.48% |
| **All** | **1534** | **67.54%** |

Recall@30 (any-correct rate among the 30 candidates): **82.2%**.
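As a quick consistency check (the per-difficulty correct counts below are inferred from the table, not taken from the repo), the overall figure is the count-weighted average of the three difficulty rows:

```python
# Correct counts implied by the table: 686/925, 271/464, 79/145
buckets = {"simple": (686, 925), "moderate": (271, 464), "challenging": (79, 145)}

per_difficulty = {k: round(100 * c / n, 2) for k, (c, n) in buckets.items()}
overall = round(100 * sum(c for c, _ in buckets.values())
                / sum(n for _, n in buckets.values()), 2)

print(per_difficulty)  # {'simple': 74.16, 'moderate': 58.41, 'challenging': 54.48}
print(overall)         # 67.54
```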

---

## Cross-benchmark: this model on Spider Dev

If you must use a single model across both benchmarks, this checkpoint plus a one-line system-prompt addition gets close to the Spider specialist:

| Setup | Spider Official EX |
|-------|--------------------|
| Default (no prompt change) | 83.2% |
| **+ "list aggregates BEFORE grouping columns" rule** | **83.8%** |
| FINER-SQL-3B-Spider (specialist) | 85.1% |

To enable the Spider-friendly mode, append this rule to the system prompt:
```
- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) in the SELECT clause BEFORE the grouping column(s).
```
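Wiring the rule into an existing system prompt is a one-line append; the `spiderify` helper and the `singer` example below are hypothetical, for illustration only:

```python
SPIDER_AGG_RULE = (
    "- When using GROUP BY, list aggregate functions (COUNT, SUM, AVG, MIN, MAX) "
    "in the SELECT clause BEFORE the grouping column(s)."
)

def spiderify(system_prompt: str) -> str:
    """Append the Spider-friendly GROUP BY rule to a system prompt."""
    return system_prompt.rstrip() + "\n" + SPIDER_AGG_RULE

# With the rule, generations tend toward the Spider annotation style, e.g.
#   SELECT COUNT(*), country FROM singer GROUP BY country
# rather than the column-first
#   SELECT country, COUNT(*) FROM singer GROUP BY country
```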

---

## Training

| Parameter | Value |
|-----------|-------|
| Base model | `griffith-bigdata/Qwen-2.5-Coder-3B-SQL-Writer` |
| Algorithm | GRPO |
| Training data | BIRD train (V2 prompts, top-30 GRAST schema) |
| Learning rate | 8e-6 |
| Generations per prompt | 32 |
| Gradient accumulation | 32 |
| Max completion length | 2048 |
| Max prompt length | 4096 |
| Rollout temperature | 1.0 |
| Selection during eval | vav (value-aware voting) |
| Rewards | Execution + Atomic + Memory + Format |
| GPU | 1× A6000 48 GB |

---

## License

Inherits the base model's license (Apache 2.0). Not intended for medical, legal, or other safety-critical autonomous decision-making.

---

## Citation

```bibtex
@article{finer-sql-2026,
  title  = {FINER-SQL: Fine-grained reasoning rewards for small Text-to-SQL models},
  author = {Thanh Dat and others},
  year   = {2026},
}
```