Zheyuan Zhao committed
Update model card: improve results detail and add benchmark reproduction guide

README.md
## Evaluation Results

Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct, matching the training workflow.

### Execution Accuracy

| Metric | Value |
|--------|-------|
| **Execution Accuracy** | **60.66%** (626 / 1,032) |
| **Prediction Rate** | 99.7% (1,031 / 1,034) |
| **Total Questions** | 1,034 |
| **Gold Errors Excluded** | 2 |
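As a sanity check, the headline metrics can be recomputed from the raw counts reported in this section (a minimal sketch; the two gold-error questions are excluded from the accuracy denominator):

```python
# Recompute the headline metrics from the raw evaluation counts.
# Counts are taken from the tables in this section.
total_questions = 1034
gold_errors = 2
matches = 626
predictions = 1031  # questions where the agent produced a pipe SQL query

execution_accuracy = matches / (total_questions - gold_errors)
prediction_rate = predictions / total_questions

print(f"Execution Accuracy: {execution_accuracy:.2%}")  # 60.66%
print(f"Prediction Rate:    {prediction_rate:.1%}")     # 99.7%
```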

### Detailed Breakdown

| Status | Count | % of Total | Description |
|--------|------:|------------|-------------|
| **Match** | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
| **Mismatch** | 209 | 20.2% | SQL executes but results differ from gold |
| **Execution Error** | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
| **Transpile Error** | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
| **No Prediction** | 3 | 0.3% | Agent did not produce a pipe SQL query |
| **Gold Error** | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |

### Evaluation Methodology

1. The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120s timeout)
2. The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
3. The predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
4. Both the predicted and gold SQL are executed against the Spider SQLite databases
5. Result sets are compared using order-insensitive set comparison with numeric tolerance

> **Note**: This is an **in-distribution** evaluation: the model was trained on Spider training data, and the dev set uses the same 20 databases.
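Steps 4 and 5 of the methodology can be sketched with the standard library alone. The snippet below is a simplified stand-in for the comparison logic (the real logic lives in `evaluation/evaluate.py` and may differ): it executes two queries against an in-memory SQLite database and compares the result multisets, ignoring row order, with floats rounded for numeric tolerance.

```python
import sqlite3
from collections import Counter

def normalize(row, ndigits=6):
    # Round floats so tiny numeric differences don't cause false mismatches.
    return tuple(round(v, ndigits) if isinstance(v, float) else v for v in row)

def results_match(db, predicted_sql, gold_sql):
    # Execute both queries and compare result multisets, ignoring row order.
    pred = Counter(normalize(r) for r in db.execute(predicted_sql))
    gold = Counter(normalize(r) for r in db.execute(gold_sql))
    return pred == gold

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (singer_id INTEGER, name TEXT, age REAL)")
db.executemany("INSERT INTO singer VALUES (?, ?, ?)",
               [(1, "Ann", 30.0), (2, "Bo", 41.0), (3, "Cy", 25.0)])

# Same rows in a different order still count as a match.
print(results_match(db,
                    "SELECT name FROM singer ORDER BY age",
                    "SELECT name FROM singer ORDER BY name"))
```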
## Tools
For inference with the correct chat template, see the evaluation server code in the [sqlglot repository](https://github.com/nittygritty-zzy/sqlglot/tree/main/evaluation/server).

## Reproducing the Benchmark

### Prerequisites

- **GPU**: NVIDIA GPU with >= 6 GB VRAM (the model runs in float16)
- **Python**: 3.11+ with pip/uv
- **Node.js**: 18+ with npm
- **Disk**: ~1 GB for the Spider databases, ~3 GB for model weights
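The ~3 GB weight figure follows directly from the parameter count; a quick back-of-the-envelope check (a sketch, with the remaining VRAM budget left for activations and the KV cache):

```python
# Rough VRAM estimate for a 1.5B-parameter model in float16:
# each parameter takes 2 bytes, so the weights alone need ~3 GB.
params = 1.5e9
bytes_per_param = 2  # float16
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB for weights")
# The >= 6 GB recommendation leaves headroom for activations and KV cache.
```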

### Step 1: Clone the Repository

```bash
git clone https://github.com/nittygritty-zzy/sqlglot.git
cd sqlglot
```

### Step 2: Set Up Python Environment

```bash
# Create virtual environment
uv venv .venv --python 3.11
source .venv/bin/activate        # Linux/macOS
# source .venv/Scripts/activate  # Windows (Git Bash)

# Install sqlglot (editable)
uv pip install -e .

# Install evaluation server dependencies
uv pip install fastapi uvicorn pydantic

# Install PyTorch with CUDA support
uv pip install torch --index-url https://download.pytorch.org/whl/cu126

# Install model loading dependencies
uv pip install transformers accelerate
```

Verify CUDA:

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Expected: True NVIDIA GeForce RTX ...
```

### Step 3: Download Spider 1.0 Dataset

The benchmark uses the Spider 1.0 dev set (1,034 questions across 20 SQLite databases).

```bash
# Install gdown for Google Drive downloads
uv pip install gdown

# Download and extract Spider 1.0 (~1 GB)
bash scripts/setup_data.sh
```

Verify:

```bash
ls data/spider/dev.json           # 1,034 questions
ls data/spider/database/ | wc -l  # ~166 databases (20 used by the dev set)
```
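The question count can also be checked programmatically. A stdlib sketch (it assumes `dev.json` is a JSON array with one object per question, each carrying a `db_id` field, which is the standard Spider layout):

```python
import json
from collections import Counter

def summarize_dev_set(questions):
    """Count questions and distinct databases in a Spider-style question list."""
    per_db = Counter(q["db_id"] for q in questions)
    return len(questions), len(per_db)

# With the real file: questions = json.load(open("data/spider/dev.json"))
# Tiny stand-in sample with the same shape:
sample = [
    {"db_id": "concert_singer", "question": "How many singers do we have?"},
    {"db_id": "concert_singer", "question": "List all singer names."},
    {"db_id": "pets_1", "question": "How many pets are there?"},
]
n_questions, n_dbs = summarize_dev_set(sample)
print(f"{n_questions} questions across {n_dbs} databases")
```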

### Step 4: Download the Model

```bash
# Option A: Use huggingface_hub (recommended)
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('enycin28/pipe-sql-1.5b', local_dir='finetuning_output/merged')
"

# Option B: Use git-lfs
git lfs install
git clone https://huggingface.co/enycin28/pipe-sql-1.5b finetuning_output/merged
```

### Step 5: Install Node.js Agent Dependencies

```bash
cd evaluation/agent
npm install
cd ../..
```

### Step 6: Run the Benchmark

#### Option A: Full Pipeline (Recommended)

```bash
# Run all 1,034 questions (takes ~2 hours on an RTX 4080)
bash evaluation/run_all.sh

# Smoke test with 5 questions first
bash evaluation/run_all.sh --limit 5
```

This script:

1. Starts the Python evaluation server (model inference + tool execution)
2. Waits for the server to be ready
3. Runs the TypeScript agent benchmark
4. Evaluates the results and prints execution accuracy

#### Option B: Run Components Separately

**Start the evaluation server:**

```bash
# Default: loads the model from finetuning_output/merged/
python -m evaluation.server.app

# Custom model path:
MODEL_PATH=path/to/model python -m evaluation.server.app
```
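Scripted setups can poll the server instead of watching the logs. A stdlib sketch (the health-check URL below is an assumption, not a documented endpoint; adjust the path to whatever the server actually exposes):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, timeout: float = 120.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers an HTTP request or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)
    return False

# Example (hypothetical endpoint path):
# if wait_for_server("http://localhost:8000/docs"):
#     print("Server ready")
```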

Wait for `Server ready` in the logs, then in a separate terminal:

**Run the agent benchmark:**

```bash
cd evaluation/agent
npx tsx src/main.ts --benchmark            # All 1,034 questions
npx tsx src/main.ts --benchmark --limit 5  # Smoke test
```

**Run a single question interactively:**

```bash
cd evaluation/agent
npx tsx src/main.ts "How many singers do we have?" concert_singer
```

**Evaluate results:**

```bash
python evaluation/evaluate.py --results evaluation_output/results.json
```

### Step 7: Review Results

Results are saved to `evaluation_output/`:

| File | Description |
|------|-------------|
| `results.json` | Agent predictions with conversation traces |
| `eval_results.json` | Per-question evaluation details (match/mismatch/error) |
| `eval_summary.json` | Aggregate metrics |
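A quick way to eyeball per-status counts without opening the JSON by hand (a sketch; it assumes each entry in `eval_results.json` carries a `status` field, which may differ from the repository's actual schema):

```python
import json
from collections import Counter

def summarize(eval_results):
    """Tally evaluation statuses, mirroring the Detailed Breakdown table."""
    counts = Counter(r["status"] for r in eval_results)
    total = sum(counts.values())
    return {status: (n, n / total) for status, n in counts.most_common()}

# With the real file:
# results = json.load(open("evaluation_output/eval_results.json"))
# Stand-in sample with the assumed shape:
sample = [{"status": "match"}, {"status": "match"}, {"status": "mismatch"}]
for status, (n, frac) in summarize(sample).items():
    print(f"{status}: {n} ({frac:.1%})")
```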

### Configuration

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `MODEL_PATH` | `finetuning_output/merged` | Path to the merged model directory |
| `SPIDER_DB_DIR` | `data/spider/database` | Spider database directory |
| `SPIDER_DIR` | `data/spider` | Spider data directory (contains `dev.json`) |
| `PORT` | `8000` | Evaluation server port |
| `SERVER_URL` | `http://localhost:8000` | URL the agent uses to reach the server |
| `OUTPUT_DIR` | `evaluation_output` | Agent output directory |
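The resolution pattern behind these variables can be sketched as follows (illustrative only; the actual lookup lives in the server and agent code and may differ):

```python
import os

DEFAULTS = {
    "MODEL_PATH": "finetuning_output/merged",
    "SPIDER_DB_DIR": "data/spider/database",
    "SPIDER_DIR": "data/spider",
    "PORT": "8000",
    "SERVER_URL": "http://localhost:8000",
    "OUTPUT_DIR": "evaluation_output",
}

def get_config(env=os.environ):
    """Resolve each setting from the environment, falling back to defaults."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}

# Overrides win; everything else keeps its default:
cfg = get_config({"PORT": "8080"})
print(cfg["PORT"], cfg["MODEL_PATH"])  # 8080 finetuning_output/merged
```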

### Troubleshooting

**Server fails to load the model**: Ensure `finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If you are using a different path, set `MODEL_PATH`.

**CUDA out of memory**: The 1.5B model needs ~3 GB of VRAM in float16. Close other GPU processes, or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.

**Agent produces garbled tool calls**: The 1.5B model sometimes generates garbled special tokens instead of proper `<tool_call>` tags. The inference server includes fallback parsing for bare function calls, so this is handled automatically.

**Spider databases not found**: Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.
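The fallback parsing for garbled tool calls can be pictured roughly like this (a simplified sketch; the function name, regexes, and example pipe SQL are illustrative, not the server's actual implementation). It first looks for a well-formed `<tool_call>` block and, failing that, for a bare `name({...})` style call:

```python
import json
import re

def parse_tool_call(text):
    """Extract a tool call from model output, with a bare-call fallback."""
    # Preferred path: a well-formed <tool_call>{...}</tool_call> block.
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if m:
        payload = json.loads(m.group(1))
        return payload["name"], payload.get("arguments", {})
    # Fallback: a bare call such as execute_pipe_sql({"query": "..."}).
    m = re.search(r"(\w+)\s*\(\s*(\{.*?\})\s*\)", text, re.DOTALL)
    if m:
        return m.group(1), json.loads(m.group(2))
    return None

# Well-formed tag:
print(parse_tool_call('<tool_call>{"name": "list_tables", "arguments": {}}</tool_call>'))
# Garbled output that still contains a bare call:
print(parse_tool_call('?? execute_pipe_sql({"query": "FROM singer |> AGGREGATE COUNT(*)"})'))
```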

## Limitations

- Trained and evaluated only on Spider 1.0 (SQLite databases)
- Context window limited to 2,048 tokens during training
- The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags; the inference server includes fallback parsing for bare function calls
- Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
- This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower

## License