Buckets:
| # Analyze a dataset on the Hub | |
| In the Quickstart, you were introduced to various endpoints for interacting with datasets on the Hub. One of the most useful ones is the `/parquet` endpoint, which allows you to get a dataset stored on the Hub and analyze it. This is a great way to explore the dataset, and get a better understanding of it's contents. | |
| To demonstrate, this guide will show you an end-to-end example of how to retrieve a dataset from the Hub and do some basic data analysis with the Pandas library. | |
| ## Get a dataset | |
| The [Hub](https://huggingface.co/datasets) is home to more than 200,000 datasets across a wide variety of tasks, sizes, and languages. For this example, you'll use the [`codeparrot/codecomplex`](https://huggingface.co/datasets/codeparrot/codecomplex) dataset, but feel free to explore and find another dataset that interests you! The dataset contains Java code from programming competitions, and the time complexity of the code is labeled by a group of algorithm experts. | |
| Let's say you're interested in the average length of the submitted code as it relates to the time complexity. Here's how you can get started. | |
| Use the `/parquet` endpoint to convert the dataset to a Parquet file and return the URL to it: | |
| ```python | |
| import requests | |
| API_URL = "https://datasets-server.huggingface.co/parquet?dataset=codeparrot/codecomplex" | |
| def query(): | |
| response = requests.get(API_URL) | |
| return response.json() | |
| data = query() | |
| ``` | |
| ```js | |
| import fetch from "node-fetch"; | |
| async function query(data) { | |
| const response = await fetch( | |
| "https://datasets-server.huggingface.co/parquet?dataset=codeparrot/codecomplex", | |
| { | |
| method: "GET" | |
| } | |
| ); | |
| const result = await response.json(); | |
| return result; | |
| } | |
| query().then((response) => { | |
| console.log(JSON.stringify(response)); | |
| }); | |
| ``` | |
| ```curl | |
| curl https://datasets-server.huggingface.co/parquet?dataset=codeparrot/codecomplex \ | |
| -X GET | |
| ``` | |
| ```json | |
| {"parquet_files": | |
| [ | |
| {"dataset": "codeparrot/codecomplex", "config": "default", "split": "train", "url": "https://huggingface.co/datasets/codeparrot/codecomplex/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet", "filename": "0000.parquet", "size": 4115908} | |
| ], | |
| "pending": [], "failed": [], "partial": false | |
| } | |
| ``` | |
| ## Read dataset with Pandas | |
| With the URL, you can read the Parquet file into a Pandas DataFrame: | |
| ```py | |
| import pandas as pd | |
| url = "https://huggingface.co/datasets/codeparrot/codecomplex/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet" | |
| df = pd.read_parquet(url) | |
| df.head(5) | |
| ``` | |
| | src | complexity | problem | from | | |
| |--------------------------------------------------:|-----------:|--------------------------------:|-----------:| | |
| | import java.io.*;\nimport java.math.BigInteger... | quadratic | 1179_B. Tolik and His Uncle | CODEFORCES | | |
| | import java.util.Scanner;\n \npublic class pil... | linear | 1197_B. Pillars | CODEFORCES | | |
| | import java.io.BufferedReader;\nimport java.io... | linear | 1059_C. Sequence Transformation | CODEFORCES | | |
| | import java.util.*;\n\nimport java.io.*;\npubl... | linear | 1011_A. Stages | CODEFORCES | | |
| | import java.io.OutputStream;\nimport java.io.I... | linear | 1190_C. Tokitsukaze and Duel | CODEFORCES | | |
| ## Calculate mean code length by time complexity | |
| Pandas is a powerful library for data analysis; group the dataset by time complexity, apply a function to calculate the average length of the code snippet, and plot the results: | |
| ```py | |
| df.groupby('complexity')['src'].apply(lambda x: x.str.len().mean()).sort_values(ascending=False).plot.barh(color="orange") | |
| ``` | |
Xet Storage Details
- Size:
- 3.77 kB
- Xet hash:
- 8aaa6194a8836714b60d86e5994c34e79e2af25981abcc112c63f0c75f8dba59
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.