Zheyuan Zhao committed on
Commit a49e9d7 · verified · 1 Parent(s): f191612

Update model card: improve results detail and add benchmark reproduction guide

Files changed (1)
  1. README.md +188 -11
README.md CHANGED
@@ -87,25 +87,37 @@ Conversations were generated from the [Spider 1.0](https://yale-lily.github.io/s
 
 ## Evaluation Results
 
- Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic benchmark pipeline with execution accuracy.
 
 | Metric | Value |
 |--------|-------|
 | **Execution Accuracy** | **60.66%** (626 / 1,032) |
 | **Prediction Rate** | 99.7% (1,031 / 1,034) |
 
- ### Status Breakdown
 
- | Status | Count | Percentage |
- |--------|-------|------------|
- | Match | 626 | 60.5% |
- | Mismatch | 209 | 20.2% |
- | Execution Error | 170 | 16.4% |
- | Transpile Error | 24 | 2.3% |
- | No Prediction | 3 | 0.3% |
- | Gold Error (excluded) | 2 | 0.2% |
 
- > **Note**: This is an **in-distribution** evaluation the model was trained on Spider training data, and the dev set uses the same 20 databases. Gold errors (2 questions where the reference SQL fails) are excluded from the accuracy denominator.
 
 ## Tools
 
@@ -150,12 +162,177 @@ Tables in database 'concert_singer':
 
 For inference with the correct chat template, see the evaluation server code in the [sqlglot repository](https://github.com/nittygritty-zzy/sqlglot/tree/main/evaluation/server).
 
 ## Limitations
 
 - Trained and evaluated only on Spider 1.0 (SQLite databases)
 - Context window limited to 2,048 tokens during training
 - The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags — the inference server includes fallback parsing for bare function calls
 - Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
 
 ## License
 
 ## Evaluation Results
 
+ Evaluated on the **Spider 1.0 dev set** (1,034 questions) using an agentic benchmark pipeline. The agent autonomously explores database schemas via tool calls, writes pipe SQL, and iterates until correct — matching the training workflow.
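The loop can be sketched in a few lines (illustrative Python only; the real agent is TypeScript under `evaluation/agent/`, and the `client` object with its `chat`/`run_tool` methods is a hypothetical stand-in):

```python
# Sketch of the agentic evaluation loop (hypothetical client API, not the
# project's actual TypeScript implementation).
def run_question(client, question, db_id, max_turns=10):
    messages = [{"role": "user", "content": f"[{db_id}] {question}"}]
    predicted_sql = None
    for _ in range(max_turns):
        reply = client.chat(messages)   # model generates the next turn
        call = reply.get("tool_call")
        if call is None:                # no tool call: the agent is done
            break
        result = client.run_tool(call)  # schema exploration or SQL execution
        messages.append({"role": "tool", "content": result})
        if call["name"] == "execute_pipe_sql":
            predicted_sql = call["arguments"]["sql"]  # last call is the prediction
    return predicted_sql
```

The agent may call schema tools any number of times; only the final `execute_pipe_sql` argument is scored.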
+
+ ### Execution Accuracy
 
 | Metric | Value |
 |--------|-------|
 | **Execution Accuracy** | **60.66%** (626 / 1,032) |
 | **Prediction Rate** | 99.7% (1,031 / 1,034) |
+ | **Total Questions** | 1,034 |
+ | **Gold Errors Excluded** | 2 |
+
+ ### Detailed Breakdown
 
+ | Status | Count | % of Total | Description |
+ |--------|------:|------------|-------------|
+ | **Match** | 626 | 60.5% | Predicted SQL produces identical results to gold SQL |
+ | **Mismatch** | 209 | 20.2% | SQL executes but results differ from gold |
+ | **Execution Error** | 170 | 16.4% | Transpiled SQL fails to execute against SQLite |
+ | **Transpile Error** | 24 | 2.3% | Pipe SQL cannot be transpiled to standard SQL |
+ | **No Prediction** | 3 | 0.3% | Agent did not produce a pipe SQL query |
+ | **Gold Error** | 2 | 0.2% | Reference gold SQL fails (excluded from denominator) |
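As a sanity check on the headline numbers (plain arithmetic, no project code): the accuracy denominator excludes the two gold errors, giving 626 / 1,032, and 1,031 of 1,034 questions received a prediction:

```python
# Recompute the headline metrics from the breakdown above.
total = 1034
gold_errors = 2      # excluded from the accuracy denominator
matches = 626
no_prediction = 3

execution_accuracy = matches / (total - gold_errors)   # 626 / 1032
prediction_rate = (total - no_prediction) / total      # 1031 / 1034

print(f"{execution_accuracy:.2%}")  # 60.66%
print(f"{prediction_rate:.1%}")     # 99.7%
```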
 
+ ### Evaluation Methodology
 
+ 1. The TypeScript agent runs each question through a multi-turn tool-calling loop (max 10 turns, 120s timeout)
+ 2. The agent's final `execute_pipe_sql` call is extracted as the predicted pipe SQL
+ 3. Predicted pipe SQL is transpiled to standard SQL using `sqlglot.transpile()`
+ 4. Both predicted and gold SQL are executed against the Spider SQLite databases
+ 5. Result sets are compared using order-insensitive set comparison with numeric tolerance
+
+ > **Note**: This is an **in-distribution** evaluation — the model was trained on Spider training data, and the dev set uses the same 20 databases.
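Step 5's check can be approximated as follows (a minimal sketch; `rows_match` is an illustrative helper, and the evaluator's actual normalization and tolerance rules may differ):

```python
import math

# Illustrative sketch of order-insensitive result-set comparison with
# numeric tolerance (hypothetical helper, not the project's evaluator).
def rows_match(pred_rows, gold_rows, tol=1e-6):
    if len(pred_rows) != len(gold_rows):
        return False

    def key(row):
        # Normalize numerics so 2 and 2.0 sort identically
        return tuple(float(v) if isinstance(v, (int, float)) else str(v) for v in row)

    for p, g in zip(sorted(pred_rows, key=key), sorted(gold_rows, key=key)):
        if len(p) != len(g):
            return False
        for pv, gv in zip(p, g):
            if isinstance(pv, (int, float)) and isinstance(gv, (int, float)):
                if not math.isclose(pv, gv, rel_tol=tol, abs_tol=tol):
                    return False
            elif str(pv) != str(gv):
                return False
    return True
```

Sorting rows before pairwise comparison makes the check order-insensitive, which matters because many Spider questions impose no ORDER BY.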
 
 ## Tools
 
 For inference with the correct chat template, see the evaluation server code in the [sqlglot repository](https://github.com/nittygritty-zzy/sqlglot/tree/main/evaluation/server).
 
+ ## Reproducing the Benchmark
+
+ ### Prerequisites
+
+ - **GPU**: NVIDIA GPU with >= 6 GB VRAM (model runs in float16)
+ - **Python**: 3.11+ with pip/uv
+ - **Node.js**: 18+ with npm
+ - **Disk**: ~1 GB for Spider databases, ~3 GB for model weights
+
+ ### Step 1: Clone the Repository
+
+ ```bash
+ git clone https://github.com/nittygritty-zzy/sqlglot.git
+ cd sqlglot
+ ```
+
+ ### Step 2: Set Up Python Environment
+
+ ```bash
+ # Create virtual environment
+ uv venv .venv --python 3.11
+ source .venv/bin/activate # Linux/macOS
+ # source .venv/Scripts/activate # Windows (Git Bash)
+
+ # Install sqlglot (editable)
+ uv pip install -e .
+
+ # Install evaluation server dependencies
+ uv pip install fastapi uvicorn pydantic
+
+ # Install PyTorch with CUDA support
+ uv pip install torch --index-url https://download.pytorch.org/whl/cu126
+
+ # Install model loading dependencies
+ uv pip install transformers accelerate
+ ```
+
+ Verify CUDA:
+ ```bash
+ python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
+ # Expected: True NVIDIA GeForce RTX ...
+ ```
+
+ ### Step 3: Download Spider 1.0 Dataset
+
+ The benchmark uses the Spider 1.0 dev set (1,034 questions across 20 SQLite databases).
+
+ ```bash
+ # Install gdown for Google Drive downloads
+ uv pip install gdown
+
+ # Download and extract Spider 1.0 (~1 GB)
+ bash scripts/setup_data.sh
+ ```
+
+ Verify:
+ ```bash
+ ls data/spider/dev.json # 1,034 questions
+ ls data/spider/database/ | wc -l # ~166 databases (20 used by dev set)
+ ```
+
+ ### Step 4: Download the Model
+
+ ```bash
+ # Option A: Use huggingface_hub (recommended)
+ pip install huggingface_hub
+ python -c "
+ from huggingface_hub import snapshot_download
+ snapshot_download('enycin28/pipe-sql-1.5b', local_dir='finetuning_output/merged')
+ "
+
+ # Option B: Use git-lfs
+ git lfs install
+ git clone https://huggingface.co/enycin28/pipe-sql-1.5b finetuning_output/merged
+ ```
+
+ ### Step 5: Install Node.js Agent Dependencies
+
+ ```bash
+ cd evaluation/agent
+ npm install
+ cd ../..
+ ```
+
+ ### Step 6: Run the Benchmark
+
+ #### Option A: Full Pipeline (Recommended)
+
+ ```bash
+ # Run all 1,034 questions (takes ~2 hours on an RTX 4080)
+ bash evaluation/run_all.sh
+
+ # Smoke test with 5 questions first
+ bash evaluation/run_all.sh --limit 5
+ ```
+
+ This script:
+ 1. Starts the Python evaluation server (model inference + tool execution)
+ 2. Waits for the server to be ready
+ 3. Runs the TypeScript agent benchmark
+ 4. Evaluates results and prints execution accuracy
+
+ #### Option B: Run Components Separately
+
+ **Start the evaluation server:**
+ ```bash
+ # Default: loads model from finetuning_output/merged/
+ python -m evaluation.server.app
+
+ # Custom model path:
+ MODEL_PATH=path/to/model python -m evaluation.server.app
+ ```
+
+ Wait for `Server ready` in the logs, then in a separate terminal:
+
+ **Run the agent benchmark:**
+ ```bash
+ cd evaluation/agent
+ npx tsx src/main.ts --benchmark # All 1,034 questions
+ npx tsx src/main.ts --benchmark --limit 5 # Smoke test
+ ```
+
+ **Run a single question interactively:**
+ ```bash
+ cd evaluation/agent
+ npx tsx src/main.ts "How many singers do we have?" concert_singer
+ ```
+
+ **Evaluate results:**
+ ```bash
+ python evaluation/evaluate.py --results evaluation_output/results.json
+ ```
+
+ ### Step 7: Review Results
+
+ Results are saved to `evaluation_output/`:
+
+ | File | Description |
+ |------|-------------|
+ | `results.json` | Agent predictions with conversation traces |
+ | `eval_results.json` | Per-question evaluation details (match/mismatch/error) |
+ | `eval_summary.json` | Aggregate metrics |
+
+ ### Configuration
+
+ | Environment Variable | Default | Description |
+ |---------------------|---------|-------------|
+ | `MODEL_PATH` | `finetuning_output/merged` | Path to merged model directory |
+ | `SPIDER_DB_DIR` | `data/spider/database` | Spider database directory |
+ | `SPIDER_DIR` | `data/spider` | Spider data directory (contains `dev.json`) |
+ | `PORT` | `8000` | Evaluation server port |
+ | `SERVER_URL` | `http://localhost:8000` | Agent-to-server connection URL |
+ | `OUTPUT_DIR` | `evaluation_output` | Agent output directory |
+
+ ### Troubleshooting
+
+ **Server fails to load model**: Ensure `finetuning_output/merged/` contains `config.json`, `model.safetensors`, and `tokenizer.json`. If using a different path, set `MODEL_PATH`.
+
+ **CUDA out of memory**: The 1.5B model needs ~3 GB VRAM in float16. Close other GPU processes or use `CUDA_VISIBLE_DEVICES=0` to select a specific GPU.
+
+ **Agent produces garbled tool calls**: The 1.5B model sometimes generates garbled special tokens instead of proper `<tool_call>` tags. The inference server includes fallback parsing for bare function calls — this is handled automatically.
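For illustration, the fallback idea looks roughly like this (a hypothetical `parse_tool_call` sketch assuming the model emits a JSON object with `name` and `arguments` keys; the server's real parser may differ):

```python
import json
import re

# Illustrative fallback parser (hypothetical, not the server's actual code).
# Tries well-formed <tool_call> tags first, then falls back to a bare JSON
# object with "name"/"arguments" keys when the tags are garbled or missing.
def parse_tool_call(text):
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.DOTALL)
    if m is None:
        # Fallback: locate a bare function-call object in the raw output
        m = re.search(r'\{\s*"name"\s*:.*?"arguments"\s*:\s*\{.*?\}\s*\}', text, re.DOTALL)
    if m is None:
        return None
    try:
        return json.loads(m.group(1) if m.lastindex else m.group(0))
    except json.JSONDecodeError:
        return None
```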
+
+ **Spider databases not found**: Run `bash scripts/setup_data.sh` to download Spider 1.0. The script downloads from Google Drive via `gdown`.
+
 ## Limitations
 
 - Trained and evaluated only on Spider 1.0 (SQLite databases)
 - Context window limited to 2,048 tokens during training
 - The 1.5B model may generate garbled special tokens instead of proper `<tool_call>` tags — the inference server includes fallback parsing for bare function calls
 - Performance on out-of-distribution databases (different schemas/domains) has not been extensively tested
+ - This is an in-distribution evaluation; real-world performance on unseen databases will likely be lower
 
 ## License