| | --- |
| | language: |
| | - en |
| | license: apache-2.0 |
| | task_categories: |
| | - text2text-generation |
| | - text-generation |
| | tags: |
| | - postgresql |
| | - sql |
| | - plpgsql |
| | - text-to-sql |
| | - code-generation |
| | - database |
| | - postgres |
| | - neurondb |
| | pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset" |
| | size_categories: |
| | - 100K<n<1M |
| | dataset_info: |
| | features: |
| | - name: question |
| | dtype: string |
| | - name: schema |
| | dtype: string |
| | - name: sql |
| | dtype: string |
| | - name: explanation |
| | dtype: string |
| | - name: validation_query |
| | dtype: string |
| | - name: source |
| | dtype: string |
| | - name: difficulty |
| | dtype: string |
| | - name: category |
| | dtype: string |
| | - name: is_postgresql_specific |
| | dtype: bool |
| | - name: sql_length |
| | dtype: int32 |
| | - name: num_statements |
| | dtype: int32 |
| | splits: |
| | - name: train |
| | num_examples: 194398 |
| | - name: validation |
| | num_examples: 13693 |
| | - name: test |
| | num_examples: 3448 |
| | configs: |
| | - config_name: default |
| | data_files: |
| | - split: train |
| | path: train.jsonl |
| | - split: validation |
| | path: validation.jsonl |
| | - split: test |
| | path: test.jsonl |
| | --- |
| | |
| | # NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset |
| |
|
| | A large-scale, curated instruction dataset for training and evaluating LLMs on |
| | **PostgreSQL-specific** SQL and PL/pgSQL generation. Every row is a |
| | (question, schema, SQL) triplet with rich metadata for filtering and analysis. |
| |
|
| | ## Dataset Summary |
| |
|
| | | Metric | Value | |
| | |--------|-------| |
| | | **Total rows** | 211,539 | |
| | | **PostgreSQL-specific rows** | 11,998 (5.7%) | |
| | | **Schema fill rate** | 82.2% | |
| | | **Explanation fill rate** | 17.8% | |
| | | **SQL length (median)** | 83 chars | |
| | | **SQL length (max)** | 61,419 chars | |
| |
|
| | ## Splits |
| |
|
| | | Split | Rows | |
| | |-------|------| |
| | | `train` | 194,398 | |
| | | `validation` | 13,693 | |
| | | `test` | 3,448 | |
| |
|
| | ## Schema |
| |
|
| | Each row contains **11 fields**: |
| |
|
| | | Field | Type | Description | |
| | |-------|------|-------------| |
| | | `question` | `string` | Natural language instruction or question | |
| | | `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable | |
| | | `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer | |
| | | `explanation` | `string?` | Short explanation of what the SQL does | |
| | | `validation_query` | `string?` | Query to validate the answer produces correct results | |
| | | `source` | `string` | Origin of this instruction pair (see Sources below) | |
| | | `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` | |
| | | `category` | `string` | SQL category (see Categories below) | |
| | | `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax | |
| | | `sql_length` | `int32` | Character length of the SQL field | |
| | | `num_statements` | `int32` | Number of SQL statements (semicolon count) | |
| |
|
| | ## Sources |
| |
|
| | Data is aggregated from multiple high-quality sources, each tagged: |
| |
|
| | | Source | Rows | |
| | |--------|------| |
| | | `community_sql_datasets` | 115,811 | |
| | | `sql_create_context` | 78,392 | |
| | | `postgresql_regression_tests` | 11,622 | |
| | | `pgtap_tests` | 4,181 | |
| | | `plpgsql_source` | 1,529 | |
| | | `synthetic_text_to_sql` | 4 | |
| |
|
| | ### Source Descriptions |
| |
|
| | - **`postgresql_regression_tests`** — SQL extracted from PostgreSQL's own regression test suite |
| | - **`postgresql_docs`** — Examples from official PostgreSQL SGML documentation |
| | - **`postgresql_contrib`** — SQL from contrib modules (pg_trgm, hstore, ltree, etc.) |
| | - **`pgtap_tests`** — pgTAP unit test SQL |
| | - **`plpgsql_source`** — PL/pgSQL functions from the PostgreSQL source tree |
| | - **`pgbench_scripts`** — pgbench benchmark scripts |
| | - **`handcrafted_advanced`** — Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.) |
| | - **`sql_create_context`** — WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context) |
| | - **`synthetic_text_to_sql`** — Synthetically generated text-to-SQL pairs (gretelai, NumbersStation) |
| | - **`community_sql_datasets`** — Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql) |
| |
|
| | ## Difficulty Distribution |
| |
|
| | | Difficulty | Rows | |
| | |------------|------| |
| | | `basic` | 147,920 | |
| | | `intermediate` | 56,469 | |
| | | `advanced` | 7,150 | |
| |
|
| | ## Categories |
| |
|
| | | Category | Rows | |
| | |----------|------| |
| | | `query_select` | 136,225 | |
| | | `query_aggregation` | 32,050 | |
| | | `query_join` | 10,597 | |
| | | `dml_insert` | 8,763 | |
| | | `other` | 4,093 | |
| | | `dml_update` | 3,664 | |
| | | `dml_delete` | 3,647 | |
| | | `ddl_table` | 3,430 | |
| | | `query_window_function` | 3,055 | |
| | | `plpgsql_function` | 1,912 | |
| | | `ddl_advanced` | 1,143 | |
| | | `ddl_index` | 806 | |
| | | `plpgsql` | 742 | |
| | | `ddl_view` | 541 | |
| | | `plpgsql_trigger` | 401 | |
| | | `ddl_alter` | 235 | |
| | | `admin_maintenance` | 125 | |
| | | `dcl_security` | 92 | |
| | | `query_recursive_cte` | 18 | |
| |
|
| | ## Usage |
| |
|
| | ```python |
| | from datasets import load_dataset |
| | |
| | ds = load_dataset("neurondb/neurondb-postgresql-sql") |
| | |
| | # Filter for advanced PostgreSQL-specific queries |
| | advanced_pg = ds["train"].filter( |
| | lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"] |
| | ) |
| | |
| | # Filter by category |
| | window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function") |
| | |
| | # Filter by source |
| | gold = ds["train"].filter( |
| | lambda x: x["source"] in [ |
| | "postgresql_regression_tests", |
| | "postgresql_docs", |
| | "handcrafted_advanced", |
| | ] |
| | ) |
| | ``` |
| |
|
| | ## Intended Use |
| |
|
| | - **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation |
| | - **Evaluating** text-to-SQL models on PostgreSQL-specific syntax |
| | - **Benchmarking** SQL generation quality across difficulty levels |
| | - **Building** PostgreSQL-aware coding assistants |
| |
|
| | ## Data Quality |
| |
|
| | - All rows have non-empty `question` and `sql` fields |
| | - MySQL-only and T-SQL-only syntax has been filtered out |
| | - Duplicate (question, SQL) pairs have been removed |
| | - Rows with trivially short SQL (< 10 chars) are excluded |
| | - Each row is tagged with source, difficulty, and category for easy filtering |
| |
|
| | ## Examples |
| |
|
| |
|
| | #### Example 1 — basic / query_select |
| | **Source:** `sql_create_context` |
| | |
| | **Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t? |
| | |
| | **Schema:** |
| | ```sql |
| | CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR) |
| | ``` |
| | |
| | **SQL:** |
| | ```sql |
| | SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t'; |
| | ``` |
| | |
| | |
| | #### Example 2 — intermediate / query_join |
| | **Source:** `community_sql_datasets` |
| |
|
| | **Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan? |
| |
|
| | **Schema:** |
| | ```sql |
| | CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT); |
| | ``` |
| |
|
| | **SQL:** |
| | ```sql |
| | SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id; |
| | ``` |
| |
|
| |
|
| | #### Example 3 — advanced / plpgsql_function |
| | **Source:** `community_sql_datasets` |
| | |
| | **Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352). |
| | |
| | **SQL:** |
| | ```sql |
| | $$ language plpgsql; |
| | |
| | select * from sc_test(); |
| |
|
| | create or replace function sc_test() returns setof integer as $$ |
| | declare |
| | c refcursor; |
| | ``` |
| | |
| | **Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql. |
| | |
| | |
| | #### Example 4 — advanced / query_window_function |
| | **Source:** `community_sql_datasets` |
| | |
| | **Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence? |
| | |
| | **Schema:** |
| | ```sql |
| | CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35); |
| | ``` |
| | |
| | **SQL:** |
| | ```sql |
| | SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education; |
| | ``` |
| | |
| | |
| | |
| | ## Citation |
| | |
| | If you use this dataset, please cite: |
| | |
| | ```bibtex |
| | @dataset{neurondb_postgresql_sql_2026, |
| | title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset}, |
| | author={NeuronDB Team}, |
| | year={2026}, |
| | url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql}, |
| | } |
| | ``` |
| | |
| | ## License |
| | |
| | Apache 2.0 |
| | |