Added dataset files.

afdfdd2 verified 14 days ago

8.65 kB

	---
	language:
	- en
	license: apache-2.0
	task_categories:
	- text2text-generation
	- text-generation
	tags:
	- postgresql
	- sql
	- plpgsql
	- text-to-sql
	- code-generation
	- database
	- postgres
	- neurondb
	pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset"
	size_categories:
	- 100K<n<1M
	dataset_info:
	features:
	- name: question
	dtype: string
	- name: schema
	dtype: string
	- name: sql
	dtype: string
	- name: explanation
	dtype: string
	- name: validation_query
	dtype: string
	- name: source
	dtype: string
	- name: difficulty
	dtype: string
	- name: category
	dtype: string
	- name: is_postgresql_specific
	dtype: bool
	- name: sql_length
	dtype: int32
	- name: num_statements
	dtype: int32
	splits:
	- name: train
	num_examples: 194398
	- name: validation
	num_examples: 13693
	- name: test
	num_examples: 3448
	configs:
	- config_name: default
	data_files:
	- split: train
	path: train.jsonl
	- split: validation
	path: validation.jsonl
	- split: test
	path: test.jsonl
	---

	# NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset

	A large-scale, curated instruction dataset for training and evaluating LLMs on
	PostgreSQL-specific SQL and PL/pgSQL generation. Every row is a
	(question, schema, SQL) triplet with rich metadata for filtering and analysis.

	## Dataset Summary

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Total rows \| 211,539 \|
	\| PostgreSQL-specific rows \| 11,998 (5.7%) \|
	\| Schema fill rate \| 82.2% \|
	\| Explanation fill rate \| 17.8% \|
	\| SQL length (median) \| 83 chars \|
	\| SQL length (max) \| 61,419 chars \|

	## Splits

	\| Split \| Rows \|
	\|-------\|------\|
	\| `train` \| 194,398 \|
	\| `validation` \| 13,693 \|
	\| `test` \| 3,448 \|

	## Schema

	Each row contains 11 fields:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `question` \| `string` \| Natural language instruction or question \|
	\| `schema` \| `string?` \| DDL schema context (CREATE TABLE statements), null if not applicable \|
	\| `sql` \| `string` \| Ground truth PostgreSQL SQL or PL/pgSQL answer \|
	\| `explanation` \| `string?` \| Short explanation of what the SQL does \|
	\| `validation_query` \| `string?` \| Query to validate the answer produces correct results \|
	\| `source` \| `string` \| Origin of this instruction pair (see Sources below) \|
	\| `difficulty` \| `string` \| One of: `basic`, `intermediate`, `advanced` \|
	\| `category` \| `string` \| SQL category (see Categories below) \|
	\| `is_postgresql_specific` \| `bool` \| True if SQL uses PostgreSQL-specific syntax \|
	\| `sql_length` \| `int32` \| Character length of the SQL field \|
	\| `num_statements` \| `int32` \| Number of SQL statements (semicolon count) \|

	## Sources

	Data is aggregated from multiple high-quality sources, each tagged:

	\| Source \| Rows \|
	\|--------\|------\|
	\| `community_sql_datasets` \| 115,811 \|
	\| `sql_create_context` \| 78,392 \|
	\| `postgresql_regression_tests` \| 11,622 \|
	\| `pgtap_tests` \| 4,181 \|
	\| `plpgsql_source` \| 1,529 \|
	\| `synthetic_text_to_sql` \| 4 \|

	### Source Descriptions

	- `postgresql_regression_tests` — SQL extracted from PostgreSQL's own regression test suite
	- `postgresql_docs` — Examples from official PostgreSQL SGML documentation
	- `postgresql_contrib` — SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
	- `pgtap_tests` — pgTAP unit test SQL
	- `plpgsql_source` — PL/pgSQL functions from the PostgreSQL source tree
	- `pgbench_scripts` — pgbench benchmark scripts
	- `handcrafted_advanced` — Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
	- `sql_create_context` — WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
	- `synthetic_text_to_sql` — Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
	- `community_sql_datasets` — Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)

	## Difficulty Distribution

	\| Difficulty \| Rows \|
	\|------------\|------\|
	\| `basic` \| 147,920 \|
	\| `intermediate` \| 56,469 \|
	\| `advanced` \| 7,150 \|

	## Categories

	\| Category \| Rows \|
	\|----------\|------\|
	\| `query_select` \| 136,225 \|
	\| `query_aggregation` \| 32,050 \|
	\| `query_join` \| 10,597 \|
	\| `dml_insert` \| 8,763 \|
	\| `other` \| 4,093 \|
	\| `dml_update` \| 3,664 \|
	\| `dml_delete` \| 3,647 \|
	\| `ddl_table` \| 3,430 \|
	\| `query_window_function` \| 3,055 \|
	\| `plpgsql_function` \| 1,912 \|
	\| `ddl_advanced` \| 1,143 \|
	\| `ddl_index` \| 806 \|
	\| `plpgsql` \| 742 \|
	\| `ddl_view` \| 541 \|
	\| `plpgsql_trigger` \| 401 \|
	\| `ddl_alter` \| 235 \|
	\| `admin_maintenance` \| 125 \|
	\| `dcl_security` \| 92 \|
	\| `query_recursive_cte` \| 18 \|

	## Usage

	```python
	from datasets import load_dataset

	ds = load_dataset("neurondb/neurondb-postgresql-sql")

	# Filter for advanced PostgreSQL-specific queries
	advanced_pg = ds["train"].filter(
	lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
	)

	# Filter by category
	window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")

	# Filter by source
	gold = ds["train"].filter(
	lambda x: x["source"] in [
	"postgresql_regression_tests",
	"postgresql_docs",
	"handcrafted_advanced",
	]
	)
	```

	## Intended Use

	- Fine-tuning LLMs for PostgreSQL SQL and PL/pgSQL code generation
	- Evaluating text-to-SQL models on PostgreSQL-specific syntax
	- Benchmarking SQL generation quality across difficulty levels
	- Building PostgreSQL-aware coding assistants

	## Data Quality

	- All rows have non-empty `question` and `sql` fields
	- MySQL-only and T-SQL-only syntax has been filtered out
	- Duplicate (question, SQL) pairs have been removed
	- Rows with trivially short SQL (< 10 chars) are excluded
	- Each row is tagged with source, difficulty, and category for easy filtering

	## Examples


	#### Example 1 — basic / query_select
	Source: `sql_create_context`

	Question: Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?

	Schema:
	```sql
	CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
	```

	SQL:
	```sql
	SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
	```


	#### Example 2 — intermediate / query_join
	Source: `community_sql_datasets`

	Question: What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?

	Schema:
	```sql
	CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
	```

	SQL:
	```sql
	SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
	```


	#### Example 3 — advanced / plpgsql_function
	Source: `community_sql_datasets`

	Question: Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).

	SQL:
	```sql
	$$ language plpgsql;

	select * from sc_test();

	create or replace function sc_test() returns setof integer as $$
	declare
	c refcursor;
	```

	Explanation: PL/pgSQL object from PostgreSQL core test for Plpgsql.


	#### Example 4 — advanced / query_window_function
	Source: `community_sql_datasets`

	Question: What is the difference in the number of attendees for each community education program between the first and last occurrence?

	Schema:
	```sql
	CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
	```

	SQL:
	```sql
	SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
	```



	## Citation

	If you use this dataset, please cite:

	```bibtex
	@dataset{neurondb_postgresql_sql_2026,
	title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
	author={NeuronDB Team},
	year={2026},
	url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
	}
	```

	## License

	Apache 2.0