Added dataset files.

Browse files

Files changed (5) hide show

.gitattributes +1 -0
README.md +289 -3
test.jsonl +0 -0
train.jsonl +3 -0
validation.jsonl +0 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+train.jsonl filter=lfs diff=lfs merge=lfs -text

README.md CHANGED Viewed

@@ -1,3 +1,289 @@
----
-license: apache-2.0
----

+---
+language:
+  - en
+license: apache-2.0
+task_categories:
+  - text2text-generation
+  - text-generation
+tags:
+  - postgresql
+  - sql
+  - plpgsql
+  - text-to-sql
+  - code-generation
+  - database
+  - postgres
+  - neurondb
+pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset"
+size_categories:
+  - 100K<n<1M
+dataset_info:
+  features:
+    - name: question
+      dtype: string
+    - name: schema
+      dtype: string
+    - name: sql
+      dtype: string
+    - name: explanation
+      dtype: string
+    - name: validation_query
+      dtype: string
+    - name: source
+      dtype: string
+    - name: difficulty
+      dtype: string
+    - name: category
+      dtype: string
+    - name: is_postgresql_specific
+      dtype: bool
+    - name: sql_length
+      dtype: int32
+    - name: num_statements
+      dtype: int32
+  splits:
+    - name: train
+      num_examples: 194398
+    - name: validation
+      num_examples: 13693
+    - name: test
+      num_examples: 3448
+configs:
+  - config_name: default
+    data_files:
+      - split: train
+        path: train.jsonl
+      - split: validation
+        path: validation.jsonl
+      - split: test
+        path: test.jsonl
+---
+# NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset
+A large-scale, curated instruction dataset for training and evaluating LLMs on
+**PostgreSQL-specific** SQL and PL/pgSQL generation. Every row is a
+(question, schema, SQL) triplet with rich metadata for filtering and analysis.
+## Dataset Summary
+| Metric | Value |
+|--------|-------|
+| **Total rows** | 211,539 |
+| **PostgreSQL-specific rows** | 11,998 (5.7%) |
+| **Schema fill rate** | 82.2% |
+| **Explanation fill rate** | 17.8% |
+| **SQL length (median)** | 83 chars |
+| **SQL length (max)** | 61,419 chars |
+## Splits
+| Split | Rows |
+|-------|------|
+| `train` | 194,398 |
+| `validation` | 13,693 |
+| `test` | 3,448 |
+## Schema
+Each row contains **11 fields**:
+| Field | Type | Description |
+|-------|------|-------------|
+| `question` | `string` | Natural language instruction or question |
+| `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable |
+| `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer |
+| `explanation` | `string?` | Short explanation of what the SQL does |
+| `validation_query` | `string?` | Query to validate the answer produces correct results |
+| `source` | `string` | Origin of this instruction pair (see Sources below) |
+| `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` |
+| `category` | `string` | SQL category (see Categories below) |
+| `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax |
+| `sql_length` | `int32` | Character length of the SQL field |
+| `num_statements` | `int32` | Number of SQL statements (semicolon count) |
+## Sources
+Data is aggregated from multiple high-quality sources, each tagged:
+| Source | Rows |
+|--------|------|
+| `community_sql_datasets` | 115,811 |
+| `sql_create_context` | 78,392 |
+| `postgresql_regression_tests` | 11,622 |
+| `pgtap_tests` | 4,181 |
+| `plpgsql_source` | 1,529 |
+| `synthetic_text_to_sql` | 4 |
+### Source Descriptions
+- **`postgresql_regression_tests`** — SQL extracted from PostgreSQL's own regression test suite
+- **`postgresql_docs`** — Examples from official PostgreSQL SGML documentation
+- **`postgresql_contrib`** — SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
+- **`pgtap_tests`** — pgTAP unit test SQL
+- **`plpgsql_source`** — PL/pgSQL functions from the PostgreSQL source tree
+- **`pgbench_scripts`** — pgbench benchmark scripts
+- **`handcrafted_advanced`** — Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
+- **`sql_create_context`** — WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
+- **`synthetic_text_to_sql`** — Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
+- **`community_sql_datasets`** — Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)
+## Difficulty Distribution
+| Difficulty | Rows |
+|------------|------|
+| `basic` | 147,920 |
+| `intermediate` | 56,469 |
+| `advanced` | 7,150 |
+## Categories
+| Category | Rows |
+|----------|------|
+| `query_select` | 136,225 |
+| `query_aggregation` | 32,050 |
+| `query_join` | 10,597 |
+| `dml_insert` | 8,763 |
+| `other` | 4,093 |
+| `dml_update` | 3,664 |
+| `dml_delete` | 3,647 |
+| `ddl_table` | 3,430 |
+| `query_window_function` | 3,055 |
+| `plpgsql_function` | 1,912 |
+| `ddl_advanced` | 1,143 |
+| `ddl_index` | 806 |
+| `plpgsql` | 742 |
+| `ddl_view` | 541 |
+| `plpgsql_trigger` | 401 |
+| `ddl_alter` | 235 |
+| `admin_maintenance` | 125 |
+| `dcl_security` | 92 |
+| `query_recursive_cte` | 18 |
+## Usage
+```python
+from datasets import load_dataset
+ds = load_dataset("neurondb/neurondb-postgresql-sql")
+# Filter for advanced PostgreSQL-specific queries
+advanced_pg = ds["train"].filter(
+    lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
+)
+# Filter by category
+window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")
+# Filter by source
+gold = ds["train"].filter(
+    lambda x: x["source"] in [
+        "postgresql_regression_tests",
+        "postgresql_docs",
+        "handcrafted_advanced",
+    ]
+)
+```
+## Intended Use
+- **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation
+- **Evaluating** text-to-SQL models on PostgreSQL-specific syntax
+- **Benchmarking** SQL generation quality across difficulty levels
+- **Building** PostgreSQL-aware coding assistants
+## Data Quality
+- All rows have non-empty `question` and `sql` fields
+- MySQL-only and T-SQL-only syntax has been filtered out
+- Duplicate (question, SQL) pairs have been removed
+- Rows with trivially short SQL (< 10 chars) are excluded
+- Each row is tagged with source, difficulty, and category for easy filtering
+## Examples
+#### Example 1 — basic / query_select
+**Source:** `sql_create_context`
+**Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?
+**Schema:**
+```sql
+CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
+```
+**SQL:**
+```sql
+SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
+```
+#### Example 2 — intermediate / query_join
+**Source:** `community_sql_datasets`
+**Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?
+**Schema:**
+```sql
+CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
+```
+**SQL:**
+```sql
+SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
+```
+#### Example 3 — advanced / plpgsql_function
+**Source:** `community_sql_datasets`
+**Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).
+**SQL:**
+```sql
+$$ language plpgsql;
+select * from sc_test();
+create or replace function sc_test() returns setof integer as $$
+declare
+  c refcursor;
+```
+**Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql.
+#### Example 4 — advanced / query_window_function
+**Source:** `community_sql_datasets`
+**Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence?
+**Schema:**
+```sql
+CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
+```
+**SQL:**
+```sql
+SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
+```
+## Citation
+If you use this dataset, please cite:
+```bibtex
+@dataset{neurondb_postgresql_sql_2026,
+  title={NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset},
+  author={NeuronDB Team},
+  year={2026},
+  url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
+}
+```
+## License
+Apache 2.0

test.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff

train.jsonl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:05eec03e064a64097d8ca3386ef8e9f50d2a9e4763d659cfc30e5a3285e0b0d5
+size 121583836

validation.jsonl ADDED Viewed

The diff for this file is too large to render. See raw diff