neurondb committed on
Commit afdfdd2 · verified · 1 Parent(s): 623978d

Added dataset files.

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +289 -3
  3. test.jsonl +0 -0
  4. train.jsonl +3 -0
  5. validation.jsonl +0 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ train.jsonl filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,289 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ task_categories:
6
+ - text2text-generation
7
+ - text-generation
8
+ tags:
9
+ - postgresql
10
+ - sql
11
+ - plpgsql
12
+ - text-to-sql
13
+ - code-generation
14
+ - database
15
+ - postgres
16
+ - neurondb
17
+ pretty_name: "NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset"
18
+ size_categories:
19
+ - 100K<n<1M
20
+ dataset_info:
21
+ features:
22
+ - name: question
23
+ dtype: string
24
+ - name: schema
25
+ dtype: string
26
+ - name: sql
27
+ dtype: string
28
+ - name: explanation
29
+ dtype: string
30
+ - name: validation_query
31
+ dtype: string
32
+ - name: source
33
+ dtype: string
34
+ - name: difficulty
35
+ dtype: string
36
+ - name: category
37
+ dtype: string
38
+ - name: is_postgresql_specific
39
+ dtype: bool
40
+ - name: sql_length
41
+ dtype: int32
42
+ - name: num_statements
43
+ dtype: int32
44
+ splits:
45
+ - name: train
46
+ num_examples: 194398
47
+ - name: validation
48
+ num_examples: 13693
49
+ - name: test
50
+ num_examples: 3448
51
+ configs:
52
+ - config_name: default
53
+ data_files:
54
+ - split: train
55
+ path: train.jsonl
56
+ - split: validation
57
+ path: validation.jsonl
58
+ - split: test
59
+ path: test.jsonl
60
+ ---
61
+
62
+ # NeuronDB PostgreSQL SQL & PL/pgSQL Instruction Dataset
63
+
64
+ A large-scale, curated instruction dataset for training and evaluating LLMs on
65
+ **PostgreSQL-specific** SQL and PL/pgSQL generation. Every row is a
66
+ (question, schema, SQL) triplet (the schema may be null) with metadata for filtering and analysis.
67
+
68
+ ## Dataset Summary
69
+
70
+ | Metric | Value |
71
+ |--------|-------|
72
+ | **Total rows** | 211,539 |
73
+ | **PostgreSQL-specific rows** | 11,998 (5.7%) |
74
+ | **Schema fill rate** | 82.2% |
75
+ | **Explanation fill rate** | 17.8% |
76
+ | **SQL length (median)** | 83 chars |
77
+ | **SQL length (max)** | 61,419 chars |
78
+
79
+ ## Splits
80
+
81
+ | Split | Rows |
82
+ |-------|------|
83
+ | `train` | 194,398 |
84
+ | `validation` | 13,693 |
85
+ | `test` | 3,448 |
86
+
87
+ ## Schema
88
+
89
+ Each row contains **11 fields**:
90
+
91
+ | Field | Type | Description |
92
+ |-------|------|-------------|
93
+ | `question` | `string` | Natural language instruction or question |
94
+ | `schema` | `string?` | DDL schema context (CREATE TABLE statements), null if not applicable |
95
+ | `sql` | `string` | Ground truth PostgreSQL SQL or PL/pgSQL answer |
96
+ | `explanation` | `string?` | Short explanation of what the SQL does |
97
+ | `validation_query` | `string?` | Query to validate the answer produces correct results |
98
+ | `source` | `string` | Origin of this instruction pair (see Sources below) |
99
+ | `difficulty` | `string` | One of: `basic`, `intermediate`, `advanced` |
100
+ | `category` | `string` | SQL category (see Categories below) |
101
+ | `is_postgresql_specific` | `bool` | True if SQL uses PostgreSQL-specific syntax |
102
+ | `sql_length` | `int32` | Character length of the SQL field |
103
+ | `num_statements` | `int32` | Number of SQL statements (semicolon count) |
104
+
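The two derived fields in the table above can be recomputed from `sql` as a sanity check. This is a minimal sketch, assuming `sql_length` is the Python character count and `num_statements` is a raw count of `;` characters. The README does not say whether semicolons inside string literals or dollar-quoted bodies are excluded, so this sketch counts all of them. `derive_metadata` is a hypothetical helper name, not part of the dataset tooling.

```python
def derive_metadata(sql: str) -> dict:
    """Recompute the derived fields described in the schema table.

    Assumption: num_statements is a plain semicolon count, including any
    semicolons that appear inside literals or function bodies.
    """
    return {
        "sql_length": len(sql),
        "num_statements": sql.count(";"),
    }


row = {"sql": "SELECT 1; SELECT 2;"}
meta = derive_metadata(row["sql"])
print(meta)
```

Comparing these values against the stored fields on a sample of rows would show whether the assumption about semicolon counting holds.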
105
+ ## Sources
106
+
107
+ Data is aggregated from multiple sources; each row records its origin in the `source` field:
108
+
109
+ | Source | Rows |
110
+ |--------|------|
111
+ | `community_sql_datasets` | 115,811 |
112
+ | `sql_create_context` | 78,392 |
113
+ | `postgresql_regression_tests` | 11,622 |
114
+ | `pgtap_tests` | 4,181 |
115
+ | `plpgsql_source` | 1,529 |
116
+ | `synthetic_text_to_sql` | 4 |
117
+
118
+ ### Source Descriptions
119
+
120
+ - **`postgresql_regression_tests`**: SQL extracted from PostgreSQL's own regression test suite
120
+ - **`postgresql_docs`**: Examples from official PostgreSQL SGML documentation
121
+ - **`postgresql_contrib`**: SQL from contrib modules (pg_trgm, hstore, ltree, etc.)
122
+ - **`pgtap_tests`**: pgTAP unit test SQL
123
+ - **`plpgsql_source`**: PL/pgSQL functions from the PostgreSQL source tree
124
+ - **`pgbench_scripts`**: pgbench benchmark scripts
125
+ - **`handcrafted_advanced`**: Hand-written examples covering advanced patterns (window functions, CTEs, JSONB, RLS, triggers, partitioning, custom aggregates, etc.)
126
+ - **`sql_create_context`**: WikiSQL/Spider-derived text-to-SQL pairs (b-mc2/sql-create-context)
127
+ - **`synthetic_text_to_sql`**: Synthetically generated text-to-SQL pairs (gretelai, NumbersStation)
128
+ - **`community_sql_datasets`**: Other community SQL datasets (Clinton/text-to-sql-v1, knowrohit07/know_sql)
130
+
131
+ ## Difficulty Distribution
132
+
133
+ | Difficulty | Rows |
134
+ |------------|------|
135
+ | `basic` | 147,920 |
136
+ | `intermediate` | 56,469 |
137
+ | `advanced` | 7,150 |
138
+
139
+ ## Categories
140
+
141
+ | Category | Rows |
142
+ |----------|------|
143
+ | `query_select` | 136,225 |
144
+ | `query_aggregation` | 32,050 |
145
+ | `query_join` | 10,597 |
146
+ | `dml_insert` | 8,763 |
147
+ | `other` | 4,093 |
148
+ | `dml_update` | 3,664 |
149
+ | `dml_delete` | 3,647 |
150
+ | `ddl_table` | 3,430 |
151
+ | `query_window_function` | 3,055 |
152
+ | `plpgsql_function` | 1,912 |
153
+ | `ddl_advanced` | 1,143 |
154
+ | `ddl_index` | 806 |
155
+ | `plpgsql` | 742 |
156
+ | `ddl_view` | 541 |
157
+ | `plpgsql_trigger` | 401 |
158
+ | `ddl_alter` | 235 |
159
+ | `admin_maintenance` | 125 |
160
+ | `dcl_security` | 92 |
161
+ | `query_recursive_cte` | 18 |
162
+
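Distribution tables like the ones for difficulty and category can be rebuilt from the rows themselves. The sketch below assumes rows are plain dictionaries with the fields listed in the schema table; `distribution` is a hypothetical helper, and the sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import Counter


def distribution(rows, field):
    """Return (value, count) pairs for one metadata field, most common first."""
    return Counter(row[field] for row in rows).most_common()


# Illustrative rows only; real rows carry all 11 fields.
sample = [
    {"category": "query_select", "difficulty": "basic"},
    {"category": "query_select", "difficulty": "basic"},
    {"category": "query_join", "difficulty": "intermediate"},
]

by_category = distribution(sample, "category")
by_difficulty = distribution(sample, "difficulty")
print(by_category)
print(by_difficulty)
```

The same helper works on any iterable of row dictionaries, so it can be pointed at a loaded split to check the published counts.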
163
+ ## Usage
164
+
165
+ ```python
166
+ from datasets import load_dataset
167
+
168
+ ds = load_dataset("neurondb/neurondb-postgresql-sql")
169
+
170
+ # Filter for advanced PostgreSQL-specific queries
171
+ advanced_pg = ds["train"].filter(
171
+     lambda x: x["difficulty"] == "advanced" and x["is_postgresql_specific"]
172
+ )
174
+
175
+ # Filter by category
176
+ window_fns = ds["train"].filter(lambda x: x["category"] == "query_window_function")
177
+
178
+ # Filter by source (sources absent from the Sources table contribute no rows)
178
+ gold = ds["train"].filter(
179
+     lambda x: x["source"] in [
180
+         "postgresql_regression_tests",
181
+         "postgresql_docs",
182
+         "handcrafted_advanced",
183
+     ]
184
+ )
186
+ ```
187
+
188
+ ## Intended Use
189
+
190
+ - **Fine-tuning** LLMs for PostgreSQL SQL and PL/pgSQL code generation
191
+ - **Evaluating** text-to-SQL models on PostgreSQL-specific syntax
192
+ - **Benchmarking** SQL generation quality across difficulty levels
193
+ - **Building** PostgreSQL-aware coding assistants
194
+
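For the fine-tuning use case, each row has to be turned into a prompt and a target. The following is a sketch under stated assumptions: the `### Schema` / `### Question` / `### SQL` template is invented here for illustration and is not prescribed by the dataset, and `build_prompt` and `to_training_pair` are hypothetical helper names. The only dataset-specific behavior it relies on is that `schema` may be null.

```python
from typing import Optional


def build_prompt(question: str, schema: Optional[str]) -> str:
    """Assemble a prompt, omitting the schema section when schema is null or empty."""
    parts = []
    if schema:
        parts.append("### Schema\n" + schema.strip())
    parts.append("### Question\n" + question.strip())
    parts.append("### SQL\n")
    return "\n\n".join(parts)


def to_training_pair(row: dict) -> dict:
    """Map one dataset row to a prompt/completion pair."""
    return {
        "prompt": build_prompt(row["question"], row.get("schema")),
        "completion": row["sql"].strip(),
    }


row = {
    "question": "Which manufacturer made a locomotive with a type of 4-6-4t?",
    "schema": "CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)",
    "sql": "SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';",
}
pair = to_training_pair(row)
no_schema = to_training_pair({"question": "q", "schema": None, "sql": "SELECT 1;"})
print(pair["prompt"] + pair["completion"])
```

Dropping the schema section entirely, rather than emitting an empty one, keeps prompts for schema-less rows from teaching the model to expect a blank schema.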
195
+ ## Data Quality
196
+
197
+ - All rows have non-empty `question` and `sql` fields
198
+ - MySQL-only and T-SQL-only syntax has been filtered out
199
+ - Duplicate (question, SQL) pairs have been removed
200
+ - Rows with trivially short SQL (< 10 chars) are excluded
201
+ - Each row is tagged with source, difficulty, and category for easy filtering
202
+
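Three of the checks listed above (non-empty fields, duplicate removal, minimum SQL length) can be expressed as a single pass over the rows. This is a sketch, not the pipeline used to build the dataset: it assumes duplicates are exact matches on the stripped (question, SQL) pair, and it leaves out the MySQL/T-SQL dialect filter because the README does not describe how that filter works. `clean_rows` is a hypothetical name.

```python
def clean_rows(rows, min_sql_chars=10):
    """Keep rows with non-empty question/sql, SQL of at least min_sql_chars,
    and a (question, sql) pair that has not been seen before."""
    seen = set()
    kept = []
    for row in rows:
        question = (row.get("question") or "").strip()
        sql = (row.get("sql") or "").strip()
        if not question or not sql:
            continue  # empty question or SQL
        if len(sql) < min_sql_chars:
            continue  # trivially short SQL
        key = (question, sql)
        if key in seen:
            continue  # exact duplicate pair
        seen.add(key)
        kept.append(row)
    return kept


rows = [
    {"question": "List users", "sql": "SELECT * FROM users;"},
    {"question": "List users", "sql": "SELECT * FROM users;"},  # duplicate
    {"question": "", "sql": "SELECT * FROM users;"},            # empty question
    {"question": "One", "sql": "SELECT 1;"},                    # 9 chars, too short
    {"question": "Count users", "sql": "SELECT count(*) FROM users;"},
]
cleaned = clean_rows(rows)
print([r["question"] for r in cleaned])
```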
203
+ ## Examples
204
+
205
+
206
+ #### Example 1: basic / query_select
207
+ **Source:** `sql_create_context`
208
+
209
+ **Question:** Generate PostgreSQL SQL for: Which manufacturer made a locomotive with a type of 4-6-4t?
210
+
211
+ **Schema:**
212
+ ```sql
213
+ CREATE TABLE table_name_40 (manufacturer VARCHAR, type VARCHAR)
214
+ ```
215
+
216
+ **SQL:**
217
+ ```sql
218
+ SELECT manufacturer FROM table_name_40 WHERE type = '4-6-4t';
219
+ ```
220
+
221
+
222
+ #### Example 2: intermediate / query_join
223
+ **Source:** `community_sql_datasets`
224
+
225
+ **Question:** What is the average account balance for customers who have a Shariah-compliant mortgage or a socially responsible loan?
226
+
227
+ **Schema:**
228
+ ```sql
229
+ CREATE TABLE shariah_mortgages (mortgage_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE socially_responsible_loans (loan_id INT, customer_id INT, account_balance DECIMAL); CREATE TABLE shariah_loans (loan_id INT, mortgage_id INT);
230
+ ```
231
+
232
+ **SQL:**
233
+ ```sql
234
+ SELECT AVG(CASE WHEN sm.customer_id IS NOT NULL THEN sm.account_balance ELSE srl.account_balance END) FROM shariah_mortgages sm RIGHT JOIN socially_responsible_loans srl ON sm.customer_id = srl.customer_id JOIN shariah_loans sl ON sm.mortgage_id = sl.mortgage_id OR srl.loan_id = sl.loan_id;
235
+ ```
236
+
237
+
238
+ #### Example 3: advanced / plpgsql_function
239
+ **Source:** `community_sql_datasets`
240
+
241
+ **Question:** Write the PL/pgSQL object from PostgreSQL regression test 'plpgsql' (example 352).
242
+
243
+ **SQL:**
244
+ ```sql
245
+ $$ language plpgsql;
246
+
247
+ select * from sc_test();
248
+
249
+ create or replace function sc_test() returns setof integer as $$
250
+ declare
251
+ c refcursor;
252
+ ```
253
+
254
+ **Explanation:** PL/pgSQL object from PostgreSQL core test for Plpgsql.
255
+
256
+
257
+ #### Example 4: advanced / query_window_function
258
+ **Source:** `community_sql_datasets`
259
+
260
+ **Question:** What is the difference in the number of attendees for each community education program between the first and last occurrence?
261
+
262
+ **Schema:**
263
+ ```sql
264
+ CREATE TABLE community_education (program_name VARCHAR(255), location VARCHAR(255), date DATE, num_attendees INT); INSERT INTO community_education (program_name, location, date, num_attendees) VALUES ('Wildlife Awareness', 'New York', '2020-01-01', 50), ('Wildlife Awareness', 'Florida', '2020-03-10', 75), ('Nature Walk', 'California', '2019-05-15', 25), ('Nature Walk', 'California', '2020-05-15', 35);
265
+ ```
266
+
267
+ **SQL:**
268
+ ```sql
269
+ SELECT program_name, num_attendees - FIRST_VALUE(num_attendees) OVER (PARTITION BY program_name ORDER BY date) as diff FROM community_education;
270
+ ```
271
+
272
+
273
+
274
+ ## Citation
275
+
276
+ If you use this dataset, please cite:
277
+
278
+ ```bibtex
279
+ @dataset{neurondb_postgresql_sql_2026,
280
+ title={NeuronDB PostgreSQL SQL \& PL/pgSQL Instruction Dataset},
281
+ author={NeuronDB Team},
282
+ year={2026},
283
+ url={https://huggingface.co/datasets/neurondb/neurondb-postgresql-sql},
284
+ }
285
+ ```
286
+
287
+ ## License
288
+
289
+ Apache 2.0
test.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
train.jsonl ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:05eec03e064a64097d8ca3386ef8e9f50d2a9e4763d659cfc30e5a3285e0b0d5
3
+ size 121583836
validation.jsonl ADDED
The diff for this file is too large to render. See raw diff