neurondb's picture
Added dataset files.
afdfdd2 verified
---
language:
- en
license: apache-2.0
task_categories:
- text2text-generation
- text-generation
tags:
- postgresql
- sql
- plpgsql
- text-to-sql
- code-generation
- database
- postgres
- neurondb
pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset"
size_categories:
- 100K<n<1M
dataset_info:
features:
- name: question
dtype: string
- name: schema
dtype: string
- name: sql
dtype: string
- name: explanation
dtype: string
- name: validation_query
dtype: string
- name: source
dtype: string
- name: difficulty
dtype: string
- name: category
dtype: string
- name: is_postgresql_specific
dtype: bool
- name: sql_length
dtype: int32
- name: num_statements
dtype: int32
splits:
- name: train
num_examples: 194398
- name: validation
num_examples: 13693
- name: test
num_examples: 3448
configs:
- config_name: default
data_files:
- split: train
path: train.jsonl
- split: validation
path: validation.jsonl
- split: test
path: test.jsonl
---
# NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset
A large-scale, curated instruction dataset for training and evaluating LLMs on
**PostgreSQL-specific** SQL and PL/pgSQL generation. Every row is a
(question, schema, SQL) triplet with rich metadata for filtering and analysis.
## Dataset Summary
| Metric | Value |
|--------|-------|
| **Total rows** | 211,539 |
| **PostgreSQL-specific rows** | 11,998 (5.7%) |
| **Schema fill rate** | 82.2% |
| **Explanation fill rate** | 17.8% |
| **SQL length (median)** | 83 chars |
| **SQL length (max)** | 61,419 chars |
## Splits
| Split | Rows |
|-------|------|
| `train` | 194,398 |
| `validation` | 13,693 |
| `test` | 3,448 |
## Schema
Each row contains **11 fields**:
| Field | Type | Description |
|-------|------|-------------|
| `question` | `string` | Natural language instruction or question |
| `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable |
| `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer |
| `explanation` | `string?` | Short explanation of what the SQL does |
| `validation_query` | `string?` | Query to validate the answer produces correct results |
| `source` | `string` | Origin of this instruction pair (see Sources below) |
| `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` |
| `category` | `string` | SQL category (see Categories below) |
| `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax |
| `sql_length` | `int32` | Character length of the SQL field |
| `num_statements` | `int32` | Number of SQL statements (semicolon count) |
## Sources
Data is aggregated from multiple high-quality sources, each tagged:
| Source | Rows |
|--------|------|
| `community_sql_datasets` | 115,811 |
| `sql_create_context` | 78,392 |
| `postgresql_regression_tests` | 11,622 |
| `pgtap_tests` | 4,181 |
| `plpgsql_source` | 1,529 |
| `synthetic_text_to_sql` | 4 |
### Source Descriptions
- **`postgresql_regression_tests`** — SQL extracted from PostgreSQL's own regression test suite
- **`postgresql_docs`** — Examples from official PostgreSQL SGML documentation
- **`postgresql_contrib`** — SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
- **`pgtap_tests`** — pgTAP unit test SQL
- **`plpgsql_source`** — PL/pgSQL functions from the PostgreSQL source tree
- **`pgbench_scripts`** — pgbench benchmark scripts
- **`handcrafted_advanced`** — Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
- **`sql_create_context`** — WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
- **`synthetic_text_to_sql`** — Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
- **`community_sql_datasets`** — Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)
## Difficulty Distribution
| Difficulty | Rows |
|------------|------|
| `basic` | 147,920 |
| `intermediate` | 56,469 |
| `advanced` | 7,150 |
## Categories
| Category | Rows |
|----------|------|
| `query_select` | 136,225 |
| `query_aggregation` | 32,050 |
| `query_join` | 10,597 |
| `dml_insert` | 8,763 |
| `other` | 4,093 |
| `dml_update` | 3,664 |
| `dml_delete` | 3,647 |
| `ddl_table` | 3,430 |
| `query_window_function` | 3,055 |
| `plpgsql_function` | 1,912 |
| `ddl_advanced` | 1,143 |
| `ddl_index` | 806 |
| `plpgsql` | 742 |
| `ddl_view` | 541 |
| `plpgsql_trigger` | 401 |
| `ddl_alter` | 235 |
| `admin_maintenance` | 125 |
| `dcl_security` | 92 |
| `query_recursive_cte` | 18 |
## Usage
```python
from datasets import load_dataset
ds = load_dataset("neurondb/neurondb-postgresql-sql")
# Filter for advanced PostgreSQL-specific queries
advanced_pg = ds["train"].filter(
lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
)
# Filter by category
window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")
# Filter by source
gold = ds["train"].filter(
lambda x: x["source"] in [
"postgresql_regression_tests",
"postgresql_docs",
"handcrafted_advanced",
]
)
```
## Intended Use
- **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation
- **Evaluating** text-to-SQL models on PostgreSQL-specific syntax
- **Benchmarking** SQL generation quality across difficulty levels
- **Building** PostgreSQL-aware coding assistants
## Data Quality
- All rows have non-empty `question` and `sql` fields
- MySQL-only and T-SQL-only syntax has been filtered out
- Duplicate (question, SQL) pairs have been removed
- Rows with trivially short SQL (< 10 chars) are excluded
- Each row is tagged with source, difficulty, and category for easy filtering
## Examples
#### Example 1 — basic / query_select
**Source:** `sql_create_context`
**Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?
**Schema:**
```sql
CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
```
**SQL:**
```sql
SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
```
#### Example 2 — intermediate / query_join
**Source:** `community_sql_datasets`
**Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?
**Schema:**
```sql
CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
```
**SQL:**
```sql
SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
```
#### Example 3 — advanced / plpgsql_function
**Source:** `community_sql_datasets`
**Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).
**SQL:**
```sql
$$ language plpgsql;
select * from sc_test();
create or replace function sc_test() returns setof integer as $$
declare
c refcursor;
```
**Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql.
#### Example 4 — advanced / query_window_function
**Source:** `community_sql_datasets`
**Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence?
**Schema:**
```sql
CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
```
**SQL:**
```sql
SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
```
## Citation
If you use this dataset, please cite:
```bibtex
@dataset{neurondb_postgresql_sql_2026,
title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
author={NeuronDB Team},
year={2026},
url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
}
```
## License
Apache 2.0