Update model card: add GitHub link, design docs, and benchmark setup guide
README.md CHANGED
```diff
@@ -111,6 +111,20 @@ Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic bench
 | **Total Questions** | 1,034 |
 | **Gold Errors Excluded** | 2 |
 
+### Context: Spider 1.0 Dev Set SOTA
+
+| Model | Size | EX (Dev) | Method |
+|-------|-----:|----------|--------|
+| MiniSeek | — | 91.2% | Proprietary |
+| DAIL-SQL + GPT-4 + SC | — | 86.6% | In-context learning |
+| DIN-SQL + GPT-4 | — | 85.3% | In-context learning |
+| SFT CodeS-7B | 7B | 85.4% | Fine-tuned |
+| SFT CodeS-3B | 3B | 83.3% | Fine-tuned |
+| SFT CodeS-1B | 1B | 77.9% | Fine-tuned |
+| **Pipe SQL 1.5B (ours)** | **1.5B** | **60.7%** | **Fine-tuned, agentic tool-calling** |
+
+Our model trails SFT CodeS-1B by ~17 points. Key differences: (1) Pipe SQL generates a novel SQL dialect (pipe syntax) rather than standard SQL, adding a transpilation step; (2) the agentic tool-calling interface adds overhead vs. direct SQL generation; (3) our focus is on demonstrating the pipe SQL paradigm, not maximizing Spider accuracy. Sources: [Spider leaderboard](https://yale-lily.github.io/spider), [CodeS (Li et al., 2024)](https://arxiv.org/abs/2402.16347).
+
 ### Detailed Breakdown
 
 | Status | Count | % of Total | Description |
```
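The transpilation step mentioned in the comparison above can be illustrated with a toy sketch. This is not the model's actual dialect or transpiler (neither appears in the card); the `transpile_pipe` helper and its two-clause grammar are illustrative assumptions only, showing why emitting pipe syntax adds an extra rewrite stage before a query can be executed against the database.

```python
# Toy sketch: rewrite a pipe-syntax query into standard SQL.
# Assumed grammar: a FROM stage, then optional WHERE / SELECT stages
# separated by "|>". The real Pipe SQL dialect is richer than this.

def transpile_pipe(query: str) -> str:
    """Rewrite 'FROM t |> WHERE ... |> SELECT ...' into standard SQL."""
    stages = [s.strip() for s in query.split("|>")]
    from_clause = stages[0]            # first stage names the source table
    where_clause = None
    select_clause = "SELECT *"         # default projection if none given
    for stage in stages[1:]:
        keyword = stage.split(None, 1)[0].upper()
        if keyword == "WHERE":
            where_clause = stage
        elif keyword == "SELECT":
            select_clause = stage
    parts = [select_clause, from_clause]
    if where_clause:
        parts.append(where_clause)
    return " ".join(parts)

print(transpile_pipe("FROM users |> WHERE age > 21 |> SELECT name"))
# SELECT name FROM users WHERE age > 21
```

Every generated query must survive this rewrite before execution, so transpiler gaps count against execution accuracy in a way that direct-SQL baselines like CodeS never pay.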