File size: 8,647 Bytes

afdfdd2

---
language:
  - en
license: apache-2.0
task_categories:
  - text2text-generation
  - text-generation
tags:
  - postgresql
  - sql
  - plpgsql
  - text-to-sql
  - code-generation
  - database
  - postgres
  - neurondb
pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset"
size_categories:
  - 100K<n<1M
dataset_info:
  features:
    - name: question
      dtype: string
    - name: schema
      dtype: string
    - name: sql
      dtype: string
    - name: explanation
      dtype: string
    - name: validation_query
      dtype: string
    - name: source
      dtype: string
    - name: difficulty
      dtype: string
    - name: category
      dtype: string
    - name: is_postgresql_specific
      dtype: bool
    - name: sql_length
      dtype: int32
    - name: num_statements
      dtype: int32
  splits:
    - name: train
      num_examples: 194398
    - name: validation
      num_examples: 13693
    - name: test
      num_examples: 3448
configs:
  - config_name: default
    data_files:
      - split: train
        path: train.jsonl
      - split: validation
        path: validation.jsonl
      - split: test
        path: test.jsonl
---

# NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset

A large-scale, curated instruction dataset for training and evaluating LLMs on
**PostgreSQL-specific** SQL and PL/pgSQL generation. Every row is a
(question, schema, SQL) triplet with rich metadata for filtering and analysis.

## Dataset Summary

| Metric | Value |
|--------|-------|
| **Total rows** | 211,539 |
| **PostgreSQL-specific rows** | 11,998 (5.7%) |
| **Schema fill rate** | 82.2% |
| **Explanation fill rate** | 17.8% |
| **SQL length (median)** | 83 chars |
| **SQL length (max)** | 61,419 chars |

## Splits

| Split | Rows |
|-------|------|
| `train` | 194,398 |
| `validation` | 13,693 |
| `test` | 3,448 |

## Schema

Each row contains **11 fields**:

| Field | Type | Description |
|-------|------|-------------|
| `question` | `string` | Natural language instruction or question |
| `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable |
| `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer |
| `explanation` | `string?` | Short explanation of what the SQL does |
| `validation_query` | `string?` | Query to validate the answer produces correct results |
| `source` | `string` | Origin of this instruction pair (see Sources below) |
| `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` |
| `category` | `string` | SQL category (see Categories below) |
| `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax |
| `sql_length` | `int32` | Character length of the SQL field |
| `num_statements` | `int32` | Number of SQL statements (semicolon count) |

## Sources

Data is aggregated from multiple high-quality sources, each tagged:

| Source | Rows |
|--------|------|
| `community_sql_datasets` | 115,811 |
| `sql_create_context` | 78,392 |
| `postgresql_regression_tests` | 11,622 |
| `pgtap_tests` | 4,181 |
| `plpgsql_source` | 1,529 |
| `synthetic_text_to_sql` | 4 |

### Source Descriptions

- **`postgresql_regression_tests`** — SQL extracted from PostgreSQL's own regression test suite
- **`postgresql_docs`** — Examples from official PostgreSQL SGML documentation
- **`postgresql_contrib`** — SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
- **`pgtap_tests`** — pgTAP unit test SQL
- **`plpgsql_source`** — PL/pgSQL functions from the PostgreSQL source tree
- **`pgbench_scripts`** — pgbench benchmark scripts
- **`handcrafted_advanced`** — Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
- **`sql_create_context`** — WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
- **`synthetic_text_to_sql`** — Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
- **`community_sql_datasets`** — Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)

## Difficulty Distribution

| Difficulty | Rows |
|------------|------|
| `basic` | 147,920 |
| `intermediate` | 56,469 |
| `advanced` | 7,150 |

## Categories

| Category | Rows |
|----------|------|
| `query_select` | 136,225 |
| `query_aggregation` | 32,050 |
| `query_join` | 10,597 |
| `dml_insert` | 8,763 |
| `other` | 4,093 |
| `dml_update` | 3,664 |
| `dml_delete` | 3,647 |
| `ddl_table` | 3,430 |
| `query_window_function` | 3,055 |
| `plpgsql_function` | 1,912 |
| `ddl_advanced` | 1,143 |
| `ddl_index` | 806 |
| `plpgsql` | 742 |
| `ddl_view` | 541 |
| `plpgsql_trigger` | 401 |
| `ddl_alter` | 235 |
| `admin_maintenance` | 125 |
| `dcl_security` | 92 |
| `query_recursive_cte` | 18 |

## Usage

```python
from datasets import load_dataset

ds = load_dataset("neurondb/neurondb-postgresql-sql")

# Filter for advanced PostgreSQL-specific queries
advanced_pg = ds["train"].filter(
    lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
)

# Filter by category
window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")

# Filter by source
gold = ds["train"].filter(
    lambda x: x["source"] in [
        "postgresql_regression_tests",
        "postgresql_docs",
        "handcrafted_advanced",
    ]
)
```

## Intended Use

- **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation
- **Evaluating** text-to-SQL models on PostgreSQL-specific syntax
- **Benchmarking** SQL generation quality across difficulty levels
- **Building** PostgreSQL-aware coding assistants

## Data Quality

- All rows have non-empty `question` and `sql` fields
- MySQL-only and T-SQL-only syntax has been filtered out
- Duplicate (question, SQL) pairs have been removed
- Rows with trivially short SQL (< 10 chars) are excluded
- Each row is tagged with source, difficulty, and category for easy filtering

## Examples


#### Example 1 — basic / query_select
**Source:** `sql_create_context`

**Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?

**Schema:**
```sql
CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
```

**SQL:**
```sql
SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
```


#### Example 2 — intermediate / query_join
**Source:** `community_sql_datasets`

**Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?

**Schema:**
```sql
CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
```

**SQL:**
```sql
SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
```


#### Example 3 — advanced / plpgsql_function
**Source:** `community_sql_datasets`

**Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).

**SQL:**
```sql
$$ language plpgsql;

select * from sc_test();

create or replace function sc_test() returns setof integer as $$
declare
  c refcursor;
```

**Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql.


#### Example 4 — advanced / query_window_function
**Source:** `community_sql_datasets`

**Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence?

**Schema:**
```sql
CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
```

**SQL:**
```sql
SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
```



## Citation

If you use this dataset, please cite:

```bibtex
@dataset{neurondb_postgresql_sql_2026,
  title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
  author={NeuronDB Team},
  year={2026},
  url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
}
```

## License

Apache 2.0