File size: 8,647 Bytes
afdfdd2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 | ---
language:
- en
license: apache-2.0
task_categories:
- text2text-generation
- text-generation
tags:
- postgresql
- sql
- plpgsql
- text-to-sql
- code-generation
- database
- postgres
- neurondb
pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset"
size_categories:
- 100K<n<1M
dataset_info:
features:
- name: question
dtype: string
- name: schema
dtype: string
- name: sql
dtype: string
- name: explanation
dtype: string
- name: validation_query
dtype: string
- name: source
dtype: string
- name: difficulty
dtype: string
- name: category
dtype: string
- name: is_postgresql_specific
dtype: bool
- name: sql_length
dtype: int32
- name: num_statements
dtype: int32
splits:
- name: train
num_examples: 194398
- name: validation
num_examples: 13693
- name: test
num_examples: 3448
configs:
- config_name: default
data_files:
- split: train
path: train.jsonl
- split: validation
path: validation.jsonl
- split: test
path: test.jsonl
---
# NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset
A large-scale, curated instruction dataset for training and evaluating LLMs on
**PostgreSQL-specific** SQL and PL/pgSQL generation. Every row is a
(question, schema, SQL) triplet with rich metadata for filtering and analysis.
## Dataset Summary
| Metric | Value |
|--------|-------|
| **Total rows** | 211,539 |
| **PostgreSQL-specific rows** | 11,998 (5.7%) |
| **Schema fill rate** | 82.2% |
| **Explanation fill rate** | 17.8% |
| **SQL length (median)** | 83 chars |
| **SQL length (max)** | 61,419 chars |
## Splits
| Split | Rows |
|-------|------|
| `train` | 194,398 |
| `validation` | 13,693 |
| `test` | 3,448 |
## Schema
Each row contains **11 fields**:
| Field | Type | Description |
|-------|------|-------------|
| `question` | `string` | Natural language instruction or question |
| `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable |
| `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer |
| `explanation` | `string?` | Short explanation of what the SQL does |
| `validation_query` | `string?` | Query to validate the answer produces correct results |
| `source` | `string` | Origin of this instruction pair (see Sources below) |
| `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` |
| `category` | `string` | SQL category (see Categories below) |
| `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax |
| `sql_length` | `int32` | Character length of the SQL field |
| `num_statements` | `int32` | Number of SQL statements (semicolon count) |
## Sources
Data is aggregated from multiple high-quality sources, each tagged:
| Source | Rows |
|--------|------|
| `community_sql_datasets` | 115,811 |
| `sql_create_context` | 78,392 |
| `postgresql_regression_tests` | 11,622 |
| `pgtap_tests` | 4,181 |
| `plpgsql_source` | 1,529 |
| `synthetic_text_to_sql` | 4 |
### Source Descriptions
- **`postgresql_regression_tests`** β SQL extracted from PostgreSQL's own regression test suite
- **`postgresql_docs`** β Examples from official PostgreSQL SGML documentation
- **`postgresql_contrib`** β SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
- **`pgtap_tests`** β pgTAP unit test SQL
- **`plpgsql_source`** β PL/pgSQL functions from the PostgreSQL source tree
- **`pgbench_scripts`** β pgbench benchmark scripts
- **`handcrafted_advanced`** β Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
- **`sql_create_context`** β WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
- **`synthetic_text_to_sql`** β Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
- **`community_sql_datasets`** β Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)
## Difficulty Distribution
| Difficulty | Rows |
|------------|------|
| `basic` | 147,920 |
| `intermediate` | 56,469 |
| `advanced` | 7,150 |
## Categories
| Category | Rows |
|----------|------|
| `query_select` | 136,225 |
| `query_aggregation` | 32,050 |
| `query_join` | 10,597 |
| `dml_insert` | 8,763 |
| `other` | 4,093 |
| `dml_update` | 3,664 |
| `dml_delete` | 3,647 |
| `ddl_table` | 3,430 |
| `query_window_function` | 3,055 |
| `plpgsql_function` | 1,912 |
| `ddl_advanced` | 1,143 |
| `ddl_index` | 806 |
| `plpgsql` | 742 |
| `ddl_view` | 541 |
| `plpgsql_trigger` | 401 |
| `ddl_alter` | 235 |
| `admin_maintenance` | 125 |
| `dcl_security` | 92 |
| `query_recursive_cte` | 18 |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("neurondb/neurondb-postgresql-sql")
# Filter for advanced PostgreSQL-specific queries
advanced_pg = ds["train"].filter(
lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
)
# Filter by category
window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")
# Filter by source
gold = ds["train"].filter(
lambda x: x["source"] in [
"postgresql_regression_tests",
"postgresql_docs",
"handcrafted_advanced",
]
)
```
## Intended Use
- **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation
- **Evaluating** text-to-SQL models on PostgreSQL-specific syntax
- **Benchmarking** SQL generation quality across difficulty levels
- **Building** PostgreSQL-aware coding assistants
## Data Quality
- All rows have non-empty `question` and `sql` fields
- MySQL-only and T-SQL-only syntax has been filtered out
- Duplicate (question, SQL) pairs have been removed
- Rows with trivially short SQL (< 10 chars) are excluded
- Each row is tagged with source, difficulty, and category for easy filtering
## Examples
#### Example 1 β basic / query_select
**Source:** `sql_create_context`
**Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?
**Schema:**
```sql
CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
```
**SQL:**
```sql
SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
```
#### Example 2 β intermediate / query_join
**Source:** `community_sql_datasets`
**Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?
**Schema:**
```sql
CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
```
**SQL:**
```sql
SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
```
#### Example 3 β advanced / plpgsql_function
**Source:** `community_sql_datasets`
**Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).
**SQL:**
```sql
$$ language plpgsql;
select * from sc_test();
create or replace function sc_test() returns setof integer as $$
declare
c refcursor;
```
**Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql.
#### Example 4 β advanced / query_window_function
**Source:** `community_sql_datasets`
**Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence?
**Schema:**
```sql
CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
```
**SQL:**
```sql
SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
```
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{neurondb_postgresql_sql_2026,
title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
author={NeuronDB Team},
year={2026},
url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
}
```
## License
Apache 2.0
|