denverbaumgartner/SynQL-Spider-Train-bucket / README.md


	---
	dataset_info:
	  features:
	  - name: question
	    dtype: string
	  - name: query
	    dtype: string
	  - name: db_id
	    dtype: string
	  - name: topic_id
	    dtype: int64
	  - name: query_id
	    dtype: int64
	  splits:
	  - name: train
	    num_bytes: 34772083
	    num_examples: 114955
	  download_size: 9670141
	  dataset_size: 34772083
	configs:
	- config_name: default
	  data_files:
	  - split: train
	    path: data/train-*
	---

	# Dataset Card for SynQL-Spider-Train
	- Developed by: Semiotic Labs
	- Dataset type: Text-to-SQL
	- License: Apache-2.0

	## Dataset Details
	Example view of data:
	```json
	[
	  {
	    "question": "What are the names of browsers that have a market share greater than 10% but less than 50%?",
	    "query": "SELECT name FROM browser WHERE market_share > 10 AND market_share < 50",
	    "db_id": "browser_web",
	    "topic_id": 2,
	    "query_id": 19
	  },
	  ...
	  {
	    "question": "<Generated Question>",
	    "query": "<Generated Query>",
	    "db_id": "<Database ID Used For Generation>",
	    "topic_id": "<Topic ID Used For Generation>",
	    "query_id": "<Query ID Used For Generation>"
	  }
	]
	```
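The front matter declares `question`, `query`, and `db_id` as strings and `topic_id` and `query_id` as `int64`. A minimal sketch of a per-record check against those declared types is shown here; `EXPECTED_TYPES` and `validate_record` are hypothetical names introduced for illustration and are not part of any dataset tooling.

```python
# Hypothetical helper: checks one record against the feature types
# declared in the dataset card's front matter.
EXPECTED_TYPES = {
    "question": str,
    "query": str,
    "db_id": str,
    "topic_id": int,   # declared as int64
    "query_id": int,   # declared as int64
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems


record = {
    "question": "What are the names of browsers that have a market share "
                "greater than 10% but less than 50%?",
    "query": "SELECT name FROM browser WHERE market_share > 10 AND market_share < 50",
    "db_id": "browser_web",
    "topic_id": 2,
    "query_id": 19,
}
problems = validate_record(record)
print(problems or "record conforms to the declared features")
```

A record whose IDs arrive as strings (for example after a round trip through a loosely typed JSON export) would be reported rather than silently accepted.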

	- The topics used for generation can be found in the `semiotic/SynQL-Spider-Train-Topics` dataset ([link](https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Topics)).
	- The templates used for generation can be found in the `semiotic/SynQL-Spider-Train-Source-Templates` dataset ([link](https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Source-Templates)).
	- The database schemas used for generation can be found in the Spider dataset ([link](https://yale-lily.github.io/spider)).

	An example prompt used for generation is as follows:

	```
	System Prompt:
	Your task is to create a SQL query and an associated question based on a given subject, query structure, and
	schema. **The query must strictly adhere to the provided query structure and be a valid SQL query. The
	question should be relevant to the subject and accurately answered by the query**. Follow these guidelines:

	1) The query must be valid and logical SQL.
	2) The query must match the query structure exactly.
	3) The question must match the topic of the subject.
	4) The query must answer the question.
	5) The query must comply with the given table schema.
	6) Do not ask overly vague or specific questions that a user would not typically ask.

	Do not modify the query structure. Do not keep any placeholder ('?') values. For example:
	Query Structure: SELECT ? FROM ? WHERE ? = ?;
	Generated Query: SELECT column_one FROM table_one WHERE column_two = 1

	The response must be in the following JSON format:
	Response Format: {"question": "<generated question>", "query": "<generated query>"}

	User Prompt:
	Given the following topic, query structure, and schema, generate a unique question and SQL query. The
	generated SQL query must strictly adhere to the provided query structure and be valid, logical, SQL. The
	question should be relevant to the topic, and the query should accurately answer the question using the given
	schema.
	**Do not generate low-quality questions or queries**. These include queries that have irrelevant structure, such
	as unnecessary joins. **The SQL query must be valid**, both in its syntax and relation to the database schema.
	- Schema:
	CREATE TABLE "Web_client_accelerator" (
	"id" int,
	"name" text,
	"Operating_system" text,
	"Client" text,
	"Connection" text,
	PRIMARY key("id")
	)
	CREATE TABLE "browser" (
	"id" int,
	"name" text,
	"market_share" real,
	PRIMARY key("id")
	)
	CREATE TABLE "accelerator_compatible_browser" (
	"accelerator_id" int,
	"browser_id" int,
	"compatible_since_year" int,
	PRIMARY key("accelerator_id", "browser_id"),
	FOREIGN KEY ("accelerator_id") REFERENCES `Web_client_accelerator`("id"),
	FOREIGN KEY ("browser_id") REFERENCES `browser`("id")
	)
	- Question Topic: Web Client Accelerator Information (Questions specifically related to the web client
	accelerator. Avoid questions related to browser or compatibility)
	- Query Structure: SELECT COUNT(DISTINCT columnOne) FROM tableOne WHERE columnTwo = 1

	Response Format: {question: <generated question>, query: <generated query>}
	```
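Several of the guidelines in the prompt (valid SQL, compliance with the schema, no leftover `?` placeholders, JSON response format) can be checked mechanically. The following is a minimal sketch of such a check using Python's standard `sqlite3` and `json` modules. It assumes SQLite as the engine, which is how the Spider databases are distributed; `check_response` is a hypothetical helper, the sample response is hand-written for illustration, and nothing here is claimed to be the filtering the dataset authors actually applied.

```python
import json
import sqlite3

# Schema copied from the example prompt (browser_web database).
SCHEMA = """
CREATE TABLE "Web_client_accelerator" (
    "id" int,
    "name" text,
    "Operating_system" text,
    "Client" text,
    "Connection" text,
    PRIMARY KEY ("id")
);
CREATE TABLE "browser" (
    "id" int,
    "name" text,
    "market_share" real,
    PRIMARY KEY ("id")
);
CREATE TABLE "accelerator_compatible_browser" (
    "accelerator_id" int,
    "browser_id" int,
    "compatible_since_year" int,
    PRIMARY KEY ("accelerator_id", "browser_id"),
    FOREIGN KEY ("accelerator_id") REFERENCES "Web_client_accelerator"("id"),
    FOREIGN KEY ("browser_id") REFERENCES "browser"("id")
);
"""


def check_response(raw_response: str, schema: str = SCHEMA) -> tuple[bool, str]:
    """Hypothetical check: returns (ok, reason) for one model response."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict) or not {"question", "query"} <= payload.keys():
        return False, "missing 'question' or 'query' key"
    query = payload["query"]
    # Crude test: also rejects a literal '?' inside a quoted string.
    if "?" in query:
        return False, "placeholder '?' left in query"
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema)
        # EXPLAIN makes SQLite compile the statement without running it,
        # which surfaces syntax errors and unknown tables or columns.
        conn.execute(f"EXPLAIN {query}")
    except sqlite3.Error as exc:
        return False, f"SQL error: {exc}"
    finally:
        conn.close()
    return True, "ok"


# Hand-written sample response matching the example query structure.
sample = json.dumps({
    "question": "How many distinct operating systems does the web client "
                "accelerator with id 1 support?",
    "query": "SELECT COUNT(DISTINCT Operating_system) "
             "FROM Web_client_accelerator WHERE id = 1",
})
print(check_response(sample))
```

This only covers syntax and schema compliance. Whether the query matches the template exactly, or actually answers the question, needs separate checks.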

	### Dataset Composition and Inputs
	| Dataset/Split | # Databases | # Tables/DB | # QQPs | # Topics | # SQL Templates |
	|--------------|-------------|-------------|---------|----------|-----------------|
	| SYNQL-Spider/train | 140 | 5.26 | 114,955 | 764 | 15,775 |
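The averages implied by the composition figures can be derived with simple division. The sketch below does only that arithmetic; the variable names are illustrative, "QQPs" is read here as question-query pairs, and the results are plain ratios of the totals, not per-database or per-template statistics.

```python
# Totals taken from the composition table.
num_databases = 140
num_qqps = 114_955       # question-query pairs
num_topics = 764
num_templates = 15_775

qqps_per_database = num_qqps / num_databases
qqps_per_topic = num_qqps / num_topics
qqps_per_template = num_qqps / num_templates

print(f"QQPs per database: {qqps_per_database:.1f}")   # 821.1
print(f"QQPs per topic:    {qqps_per_topic:.1f}")      # 150.5
print(f"QQPs per template: {qqps_per_template:.1f}")   # 7.3
```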

	### SQL Query Difficulty Distribution
	| Dataset/Split | Easy | Medium | Hard | Extra |
	|--------------|------|--------|------|-------|
	| SYNQL-Spider/train | 2.2% | 16.6% | 16.1% | 65.1% |
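To turn the difficulty percentages into rough example counts, multiply each share by the 114,955 training examples. The sketch below does this; because the published percentages are rounded to one decimal place, the counts are estimates and need not sum exactly to the total.

```python
total_examples = 114_955  # num_examples from the card's front matter

# Shares (in percent) from the difficulty distribution table.
difficulty_share = {"Easy": 2.2, "Medium": 16.6, "Hard": 16.1, "Extra": 65.1}

approx_counts = {
    level: round(total_examples * pct / 100)
    for level, pct in difficulty_share.items()
}

for level, count in approx_counts.items():
    print(f"{level:<6} ~{count:,}")
```

Under these figures roughly 75 thousand of the examples fall in the Extra bucket, so any evaluation that reports a single aggregate number is dominated by the hardest queries.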
