Buckets:

hf-doc-build/doc-dev / hub /pr_2396 /en /datasets-viewer-sql-console.md
HuggingFaceDocBuilder's picture
|
download
raw
4.71 kB
# SQL Console: Query Hugging Face datasets in your browser
You can run SQL queries on the dataset in the browser using the SQL Console. The SQL Console is powered by [DuckDB](https://duckdb.org/) WASM and runs entirely in the browser. You can access the SQL Console from the Data Studio.
To learn more about the SQL Console, see the SQL Console blog post.
Through the SQL Console, you can:
- Run [DuckDB SQL queries](https://duckdb.org/docs/sql/query_syntax/select) on the dataset (_checkout [SQL Snippets](https://huggingface.co/spaces/cfahlgren1/sql-snippets) for useful queries_)
- Share results of the query with others via a link (_check out [this example](https://huggingface.co/datasets/gretelai/synthetic-gsm8k-reflection-405b?sql_console=true&sql=FROM+histogram%28%0A++train%2C%0A++topic%2C%0A++bin_count+%3A%3D+10%0A%29)_)
- Download the results of the query to a Parquet or CSV file
- Embed the results of the query in your own webpage using an iframe
- Query datasets with natural language
> [!TIP]
> You can also use the DuckDB locally through the CLI to query the dataset via the `hf://` protocol. See the DuckDB Datasets documentation for more information. The SQL Console provides a convenient `Copy to DuckDB CLI` button that generates the SQL query for creating views and executing your query in the DuckDB CLI.
## Examples
### Filtering
The SQL Console makes filtering datasets really easy. For example, if you want to filter the `SkunkworksAI/reasoning-0.01` dataset for instructions and responses with a reasoning length of at least 10, you can use the following query:
Here's the SQL to sort by length of the reasoning
```sql
SELECT *
FROM train
WHERE LENGTH(reasoning_chains) > 10;
```
### Histogram
Many dataset authors choose to include statistics about the distribution of the data in the dataset. Using the DuckDB `histogram` function, we can plot a histogram of a column's values.
For example, to plot a histogram of the `Rating` column in the [Lichess/chess-puzzles](https://huggingface.co/datasets/Lichess/chess-puzzles) dataset, you can use the following query:
Learn more about the `histogram` function and parameters here.
```sql
from histogram(train, Rating)
```
### Regex Matching
One of the most powerful features of DuckDB is the deep support for regular expressions. You can use the `regexp` function to match patterns in your data.
Using the [regexp_matches](https://duckdb.org/docs/sql/functions/char.html#regexp_matchesstring-pattern) function, we can filter the [GeneralReasoning/GeneralThought-195k](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-195K) dataset for instructions that contain markdown code blocks.
Learn more about the DuckDB regex functions here.
```sql
SELECT *
FROM train
WHERE regexp_matches(model_answer, '```')
LIMIT 10;
```
### Saved Queries and Embeds API
You can create, update, and delete SQL Console embeds programmatically. Embeds are saved queries that can be shared via link or embedded in other pages.
**Create an embed:**
```
POST /api/datasets/{namespace}/{repo}/sql-console/embed
Content-Type: application/json
Authorization: Bearer {token}
{
"sql": "SELECT * FROM train LIMIT 10",
"title": "Sample rows",
"private": false,
"views": [{"key": "default/train", "displayName": "Train", "viewName": "train"}]
}
```
**Update an embed:**
```
PATCH /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id}
Content-Type: application/json
Authorization: Bearer {token}
{
"sql": "SELECT * FROM train LIMIT 20",
"title": "Updated title",
"private": true
}
```
**Delete an embed:**
```
DELETE /api/datasets/{namespace}/{repo}/sql-console/embed/{embed_id}
Authorization: Bearer {token}
```
### Leakage Detection
Leakage detection is the process of identifying whether data in a dataset is present in multiple splits, for example, whether the test set is present in the training set.
Learn more about leakage detection here.
```sql
WITH
overlapping_rows AS (
SELECT COALESCE(
(SELECT COUNT(*) AS overlap_count
FROM train
INTERSECT
SELECT COUNT(*) AS overlap_count
FROM test),
0
) AS overlap_count
),
total_unique_rows AS (
SELECT COUNT(*) AS total_count
FROM (
SELECT * FROM train
UNION
SELECT * FROM test
) combined
)
SELECT
overlap_count,
total_count,
CASE
WHEN total_count > 0 THEN (overlap_count * 100.0 / total_count)
ELSE 0
END AS overlap_percentage
FROM overlapping_rows, total_unique_rows;
```

Xet Storage Details

Size:
4.71 kB
·
Xet hash:
d559719a27964e862ded99e6c5c3729dde58c52435fce3908b6ec321c55ec43e

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.