---
dataset_info:
  features:
  - name: question
    dtype: string
  - name: query
    dtype: string
  - name: db_id
    dtype: string
  - name: topic_id
    dtype: int64
  - name: query_id
    dtype: int64
  splits:
  - name: train
    num_bytes: 34772083
    num_examples: 114955
  download_size: 9670141
  dataset_size: 34772083
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

# Dataset Card for SynQL-Spider-Train

- Developed by: Semiotic Labs
- Dataset type: [Text-to-SQL]
- License: [Apache-2.0]

## Dataset Details

Example view of data:

```json
[
  {
    "question": "What are the names of browsers that have a market share greater than 10% but less than 50%?",
    "query": "SELECT name FROM browser WHERE market_share > 10 AND market_share < 50",
    "db_id": "browser_web",
    "topic_id": "2",
    "query_id": "19"
  },
  ...
  {
    "question": "<Generated Question>",
    "query": "<Generated Query>",
    "db_id": "<Database ID Used For Generation>",
    "topic_id": "<Topic ID Used For Generation>",
    "query_id": "<Query ID Used For Generation>"
  },
]
```

- The topics used for generation can be found in the `semiotic/SynQL-Spider-Train-Topics` dataset ([link](https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Topics)).
- The templates used for generation can be found in the `semiotic/SynQL-Spider-Train-Source-Templates` dataset ([link](https://huggingface.co/datasets/semiotic/SynQL-Spider-Train-Source-Templates)).
- The database schemas used for generation can be found in the Spider dataset ([link](https://yale-lily.github.io/spider)).
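
The example view shows `topic_id` and `query_id` as quoted strings, while the card metadata declares both as `int64`. The following is a minimal sketch of how a consumer might normalize a record to the declared feature types. `FEATURES` and `normalize_record` are hypothetical names introduced for illustration; they are not part of the dataset or of any library.

```python
from typing import Any

# Column names and Python types mirroring the features declared in the card
# metadata (string -> str, int64 -> int). Hypothetical helper, for illustration.
FEATURES = {
    "question": str,
    "query": str,
    "db_id": str,
    "topic_id": int,
    "query_id": int,
}


def normalize_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` with every declared feature present and typed."""
    missing = FEATURES.keys() - record.keys()
    if missing:
        raise KeyError(f"missing fields: {sorted(missing)}")

    normalized = {}
    for name, expected in FEATURES.items():
        value = record[name]
        # IDs may arrive as numeric strings (as in the example view above).
        if expected is int and isinstance(value, str):
            value = int(value)  # raises ValueError on non-numeric text
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{name!r} should be {expected.__name__}")
        normalized[name] = value
    return normalized


example = {
    "question": "What are the names of browsers that have a market share "
                "greater than 10% but less than 50%?",
    "query": "SELECT name FROM browser WHERE market_share > 10 AND market_share < 50",
    "db_id": "browser_web",
    "topic_id": "2",
    "query_id": "19",
}
record = normalize_record(example)
```

After normalization, `record["topic_id"]` and `record["query_id"]` are Python integers, which is the form needed to join against the topics and templates datasets by ID.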

An example prompt used for generation is as follows:

```
**System Prompt:**
Your task is to create a SQL query and an associated question based on a given subject, query structure, and
schema. **The query must strictly adhere to the provided query structure and be a valid SQL query. The
question should be relevant to the subject and accurately answered by the query**. Follow these guidelines:
1) The query must be valid and logical SQL.
2) The query must match the query structure exactly.
3) The question must match the topic of the subject.
4) The query must answer the question.
5) The query must comply with the given table schema.
6) Do not ask overly vague or specific questions that a user would not typically ask.
Do not modify the query structure. Do not keep any placeholder ('?') values. For example:
Query Structure: SELECT ? FROM ? WHERE ? = ?;
Generated Query: SELECT column_one FROM table_one WHERE column_two = 1
The response must be in the following JSON format:
Response Format: {"question": "<generated question>", "query": "<generated query>"}

**User Prompt:**
Given the following topic, query structure, and schema, generate a unique question and SQL query. The
generated SQL query must strictly adhere to the provided query structure and be valid, logical, SQL. The
question should be relevant to the topic, and the query should accurately answer the question using the given
schema.
**Do not generate low-quality questions or queries**. These include queries that have irrelevant structure, such
as unnecessary joins. **The SQL query must be valid**, both in its syntax and relation to the database schema.
- Schema:
CREATE TABLE "Web_client_accelerator" (
"id" int,
"name" text,
"Operating_system" text,
"Client" text,
"Connection" text,
PRIMARY key("id")
)
CREATE TABLE "browser" (
"id" int,
"name" text,
"market_share" real,
PRIMARY key("id")
)
CREATE TABLE "accelerator_compatible_browser" (
"accelerator_id" int,
"browser_id" int,
"compatible_since_year" int,
PRIMARY key("accelerator_id", "browser_id"),
FOREIGN KEY ("accelerator_id") REFERENCES `Web_client_accelerator`("id"),
FOREIGN KEY ("browser_id") REFERENCES `browser`("id")
)
- Question Topic: Web Client Accelerator Information (Questions specifically related to the web client
accelerator. Avoid questions related to browser or compatibility)
- Query Structure: SELECT COUNT(DISTINCT columnOne) FROM tableOne WHERE columnTwo = 1
Response Format: {question: <generated question>, query: <generated query>}
```
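
The prompt asks for a JSON response whose query has no leftover `?` placeholders and complies with the given schema. The sketch below shows one way such a response could be checked with Python's standard `json` and `sqlite3` modules. It is an illustration only, not the validation procedure used to build this dataset. `check_response` and `SCHEMA` are hypothetical names, the schema is an excerpt of the one in the prompt with a terminating semicolon added, and the check does not verify that the query matches the query structure or answers the question.

```python
import json
import sqlite3

# Excerpt of the schema from the prompt above (one table only).
SCHEMA = """
CREATE TABLE "Web_client_accelerator" (
"id" int,
"name" text,
"Operating_system" text,
"Client" text,
"Connection" text,
PRIMARY key("id")
);
"""


def check_response(raw_response: str, schema_sql: str) -> dict:
    """Parse a model response and verify that its query compiles against the schema."""
    pair = json.loads(raw_response)
    question, query = pair["question"], pair["query"]

    # Crude placeholder check: it would also reject a literal '?' inside a string.
    if "?" in query:
        raise ValueError("query still contains a '?' placeholder")

    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        # EXPLAIN compiles the statement without running it, so syntax errors
        # and unknown tables or columns raise sqlite3.OperationalError.
        conn.execute(f"EXPLAIN {query}")
    finally:
        conn.close()
    return {"question": question, "query": query}


response = json.dumps({
    "question": "How many distinct web client accelerator names have id 1?",
    "query": "SELECT COUNT(DISTINCT name) FROM Web_client_accelerator WHERE id = 1",
})
checked = check_response(response, SCHEMA)
```

An empty in-memory database is enough here because only compilation is tested; checking that the query returns sensible results would require the populated Spider databases.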

### Dataset Composition and Inputs

| Dataset/Split | # Databases | # Tables/DB | # QQPs (question-query pairs) | # Topics | # SQL Templates |
|--------------|-------------|-------------|---------|----------|-----------------|
| SYNQL-Spider/train | 140 | 5.26 | 114,955 | 764 | 15,775 |
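
To put the composition figures in proportion, the short sketch below derives the average number of pairs per database, per topic, and per SQL template from the counts in the table. The variable names are illustrative, and the results are averages only; the card does not state how evenly pairs are spread across databases, topics, or templates.

```python
# Counts copied from the composition table above.
composition = {
    "databases": 140,
    "pairs": 114_955,
    "topics": 764,
    "sql_templates": 15_775,
}

# Average number of question-query pairs per unit of each generation input.
pairs_per_database = composition["pairs"] / composition["databases"]
pairs_per_topic = composition["pairs"] / composition["topics"]
pairs_per_template = composition["pairs"] / composition["sql_templates"]

print(f"pairs per database: {pairs_per_database:.1f}")
print(f"pairs per topic:    {pairs_per_topic:.1f}")
print(f"pairs per template: {pairs_per_template:.2f}")
```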

### SQL Query Difficulty Distribution

| Dataset/Split | Easy | Medium | Hard | Extra |
|--------------|------|--------|------|-------|
| SYNQL-Spider/train | 2.2% | 16.6% | 16.1% | 65.1% |
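
The percentages can be turned into approximate example counts using the 114,955 training examples listed in the card metadata. This is a rough sketch: the published shares are rounded to one decimal place, so the derived counts are estimates and need not sum exactly to the total. `approximate_counts` is a hypothetical helper name.

```python
TOTAL_EXAMPLES = 114_955  # num_examples for the train split

# Shares (in percent) copied from the difficulty table above.
DIFFICULTY_SHARE = {"easy": 2.2, "medium": 16.6, "hard": 16.1, "extra": 65.1}


def approximate_counts(total: int, shares: dict[str, float]) -> dict[str, int]:
    """Estimate per-difficulty counts from rounded percentage shares."""
    return {level: round(total * pct / 100) for level, pct in shares.items()}


counts = approximate_counts(TOTAL_EXAMPLES, DIFFICULTY_SHARE)
for level, count in counts.items():
    print(f"{level:>6}: ~{count:,}")
```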