Buckets:
| # SQL Console: Query Hugging Face datasets in your browser | |
| You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio. | |
| To learn more about the SQL Console, see the SQL Console blog post. | |
| Through the SQL Console, you can: | |
| - Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_checkout [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_) | |
| - Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_) | |
| - Download the results of the query to a Parquet or CSV file | |
| - Embed the results of the query in your own webpage using an iframe | |
| - Query datasets with natural language | |
| > [!TIP] | |
| > You can also use the DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the DuckDB Datasets documentation for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI. | |
| ## Examples | |
| ### Filtering | |
| The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query: | |
| Here's the SQL to sort by length of the reasoning | |
| ```sql | |
| SELECT * | |
| FROM train | |
| WHERE LENGTH(reasoning_chains) > 10; | |
| ``` | |
| ### Histogram | |
| Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values. | |
| For example, to plot a histogram of the `Rating` column in the [Lichess/chess-puzzles](https://huggingface.co/datasets/Lichess/chess-puzzles) dataset, you can use the following query: | |
| Learn more about the `histogram` function and parameters here. | |
| ```sql | |
| from histogram(train, Rating) | |
| ``` | |
| ### Regex Matching | |
| One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the `regexp` function to match patterns in your data. | |
| Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the [GeneralReasoning/GeneralThought-195k](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K) dataset for instructions that contain markdown code blocks. | |
| Learn more about the DuckDB regex functions here. | |
| ```sql | |
| SELECT * | |
| FROM train | |
| WHERE regexp_matches(model_answer, '```') | |
| LIMIT 10; | |
| ``` | |
| ### Saved Queries and Embeds API | |
| You can create, update, and delete SQL Console embeds programmatically. Embeds are saved queries that can be shared via link or embedded in other pages. | |
| **Create an embed:** | |
| ``` | |
| POST /api/datasets/{namespace}/{repo}/sql-console/embed | |
| Content-Type: application/json | |
| Authorization: Bearer {token} | |
| { | |
| "sql": "SELECT * FROM train LIMIT 10", | |
| "title": "Sample rows", | |
| "private": false, | |
| "views": [{"key": "default/train", "displayName": "Train", "viewName": "train"}] | |
| } | |
| ``` | |
| **Update an embed:** | |
| ``` | |
| PATCH /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id} | |
| Content-Type: application/json | |
| Authorization: Bearer {token} | |
| { | |
| "sql": "SELECT * FROM train LIMIT 20", | |
| "title": "Updated title", | |
| "private": true | |
| } | |
| ``` | |
| **Delete an embed:** | |
| ``` | |
| DELETE /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id} | |
| Authorization: Bearer {token} | |
| ``` | |
| ### Leakage Detection | |
| Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set. | |
| Learn more about leakage detection here. | |
| ```sql | |
| WITH | |
| overlapping_rows AS ( | |
| SELECT COALESCE( | |
| (SELECT COUNT(*) AS overlap_count | |
| FROM train | |
| INTERSECT | |
| SELECT COUNT(*) AS overlap_count | |
| FROM test), | |
| 0 | |
| ) AS overlap_count | |
| ), | |
| total_unique_rows AS ( | |
| SELECT COUNT(*) AS total_count | |
| FROM ( | |
| SELECT * FROM train | |
| UNION | |
| SELECT * FROM test | |
| ) combined | |
| ) | |
| SELECT | |
| overlap_count, | |
| total_count, | |
| CASE | |
| WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count) | |
| ELSE 0 | |
| END AS overlap_percentage | |
| FROM overlapping_rows, total_unique_rows; | |
| ``` | |
Xet Storage Details
- Size:
- 4.71 kB
- Xet hash:
- d559719a27964e862ded99e6c5c3729dde58c52435fce3908b6ec321c55ec43e
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.