File size: 8,647 Bytes
afdfdd2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
---
language:
  - en
license: apache-2.0
task_categories:
  - text2text-generation
  - text-generation
tags:
  - postgresql
  - sql
  - plpgsql
  - text-to-sql
  - code-generation
  - database
  - postgres
  - neurondb
pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset"
size_categories:
  - 100K<n<1M
dataset_info:
  features:
    - name: question
      dtype: string
    - name: schema
      dtype: string
    - name: sql
      dtype: string
    - name: explanation
      dtype: string
    - name: validation_query
      dtype: string
    - name: source
      dtype: string
    - name: difficulty
      dtype: string
    - name: category
      dtype: string
    - name: is_postgresql_specific
      dtype: bool
    - name: sql_length
      dtype: int32
    - name: num_statements
      dtype: int32
  splits:
    - name: train
      num_examples: 194398
    - name: validation
      num_examples: 13693
    - name: test
      num_examples: 3448
configs:
  - config_name: default
    data_files:
      - split: train
        path: train.jsonl
      - split: validation
        path: validation.jsonl
      - split: test
        path: test.jsonl
---

# NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset

A large-scale, curated instruction dataset for training and evaluating LLMs on
**PostgreSQL-specific** SQL and PL/pgSQL generation. Every row is a
(question, schema, SQL) triplet with rich metadata for filtering and analysis.

## Dataset Summary

| Metric | Value |
|--------|-------|
| **Total rows** | 211,539 |
| **PostgreSQL-specific rows** | 11,998 (5.7%) |
| **Schema fill rate** | 82.2% |
| **Explanation fill rate** | 17.8% |
| **SQL length (median)** | 83 chars |
| **SQL length (max)** | 61,419 chars |

## Splits

| Split | Rows |
|-------|------|
| `train` | 194,398 |
| `validation` | 13,693 |
| `test` | 3,448 |

## Schema

Each row contains **11 fields**:

| Field | Type | Description |
|-------|------|-------------|
| `question` | `string` | Natural language instruction or question |
| `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable |
| `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer |
| `explanation` | `string?` | Short explanation of what the SQL does |
| `validation_query` | `string?` | Query to validate the answer produces correct results |
| `source` | `string` | Origin of this instruction pair (see Sources below) |
| `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` |
| `category` | `string` | SQL category (see Categories below) |
| `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax |
| `sql_length` | `int32` | Character length of the SQL field |
| `num_statements` | `int32` | Number of SQL statements (semicolon count) |

## Sources

Data is aggregated from multiple high-quality sources, each tagged:

| Source | Rows |
|--------|------|
| `community_sql_datasets` | 115,811 |
| `sql_create_context` | 78,392 |
| `postgresql_regression_tests` | 11,622 |
| `pgtap_tests` | 4,181 |
| `plpgsql_source` | 1,529 |
| `synthetic_text_to_sql` | 4 |

### Source Descriptions

- **`postgresql_regression_tests`** β€” SQL extracted from PostgreSQL's own regression test suite
- **`postgresql_docs`** β€” Examples from official PostgreSQL SGML documentation
- **`postgresql_contrib`** β€” SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
- **`pgtap_tests`** β€” pgTAP unit test SQL
- **`plpgsql_source`** β€” PL/pgSQL functions from the PostgreSQL source tree
- **`pgbench_scripts`** β€” pgbench benchmark scripts
- **`handcrafted_advanced`** β€” Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
- **`sql_create_context`** β€” WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
- **`synthetic_text_to_sql`** β€” Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
- **`community_sql_datasets`** β€” Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)

## Difficulty Distribution

| Difficulty | Rows |
|------------|------|
| `basic` | 147,920 |
| `intermediate` | 56,469 |
| `advanced` | 7,150 |

## Categories

| Category | Rows |
|----------|------|
| `query_select` | 136,225 |
| `query_aggregation` | 32,050 |
| `query_join` | 10,597 |
| `dml_insert` | 8,763 |
| `other` | 4,093 |
| `dml_update` | 3,664 |
| `dml_delete` | 3,647 |
| `ddl_table` | 3,430 |
| `query_window_function` | 3,055 |
| `plpgsql_function` | 1,912 |
| `ddl_advanced` | 1,143 |
| `ddl_index` | 806 |
| `plpgsql` | 742 |
| `ddl_view` | 541 |
| `plpgsql_trigger` | 401 |
| `ddl_alter` | 235 |
| `admin_maintenance` | 125 |
| `dcl_security` | 92 |
| `query_recursive_cte` | 18 |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("neurondb/neurondb-postgresql-sql")

# Filter for advanced PostgreSQL-specific queries
advanced_pg = ds["train"].filter(
    lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
)

# Filter by category
window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")

# Filter by source
gold = ds["train"].filter(
    lambda x: x["source"] in [
        "postgresql_regression_tests",
        "postgresql_docs",
        "handcrafted_advanced",
    ]
)
```

## Intended Use

- **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation
- **Evaluating** text-to-SQL models on PostgreSQL-specific syntax
- **Benchmarking** SQL generation quality across difficulty levels
- **Building** PostgreSQL-aware coding assistants

## Data Quality

- All rows have non-empty `question` and `sql` fields
- MySQL-only and T-SQL-only syntax has been filtered out
- Duplicate (question, SQL) pairs have been removed
- Rows with trivially short SQL (< 10 chars) are excluded
- Each row is tagged with source, difficulty, and category for easy filtering

## Examples


#### Example 1 β€” basic / query_select
**Source:** `sql_create_context`

**Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?

**Schema:**
```sql
CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
```

**SQL:**
```sql
SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
```


#### Example 2 β€” intermediate / query_join
**Source:** `community_sql_datasets`

**Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?

**Schema:**
```sql
CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
```

**SQL:**
```sql
SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
```


#### Example 3 β€” advanced / plpgsql_function
**Source:** `community_sql_datasets`

**Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).

**SQL:**
```sql
$$ language plpgsql;

select * from sc_test();

create or replace function sc_test() returns setof integer as $$
declare
  c refcursor;
```

**Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql.


#### Example 4 β€” advanced / query_window_function
**Source:** `community_sql_datasets`

**Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence?

**Schema:**
```sql
CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
```

**SQL:**
```sql
SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
```



## Citation

If you use this dataset, please cite:

```bibtex
@dataset{neurondb_postgresql_sql_2026,
  title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
  author={NeuronDB Team},
  year={2026},
  url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
}
```

## License

Apache 2.0