Check dataset validity

Before you download a dataset from the Hub, it is helpful to know if a specific dataset you're interested in is available. The dataset viewer provides the /is-valid endpoint to check if a specific dataset works without any errors.

The API endpoint will return an error for datasets that cannot be loaded with the 🤗 Datasets library, for example, because the data hasn't been uploaded or the format is not supported.

The largest datasets are partially supported by the dataset viewer. If they are streamable, the dataset viewer can extract the first 100 rows without downloading the whole dataset. This is especially useful for previewing large datasets, where downloading the whole dataset may take hours. See the preview field in the response of /is-valid to check if a dataset is partially supported.

This guide shows you how to check dataset validity programmatically, but feel free to try it out with Postman, RapidAPI, or ReDoc.

Check if a dataset is valid

/is-valid checks whether a specific dataset loads without any error. Its dataset query parameter is required and takes the name of the dataset:

Python:

import requests

# Placeholder: replace with your own Hugging Face user access token
API_TOKEN = "hf_xxx"
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/is-valid?dataset=cornell-movie-review-data/rotten_tomatoes"

def query():
    # Send a GET request and decode the JSON body
    response = requests.get(API_URL, headers=headers)
    return response.json()

data = query()

JavaScript:

import fetch from "node-fetch";

// API_TOKEN must hold your Hugging Face user access token
async function query() {
    const response = await fetch(
        "https://datasets-server.huggingface.co/is-valid?dataset=cornell-movie-review-data/rotten_tomatoes",
        {
            headers: { Authorization: `Bearer ${API_TOKEN}` },
            method: "GET"
        }
    );
    const result = await response.json();
    return result;
}

query().then((response) => {
    console.log(JSON.stringify(response));
});

cURL:

curl "https://datasets-server.huggingface.co/is-valid?dataset=cornell-movie-review-data/rotten_tomatoes" \
        -X GET \
        -H "Authorization: Bearer ${API_TOKEN}"

The response looks like this if a dataset is valid:

{
  "viewer": true,
  "preview": true,
  "search": true,
  "filter": true,
  "statistics": true
}

The response looks like this if a dataset is valid but /search is not available for it:

{
  "viewer": true,
  "preview": true,
  "search": false,
  "filter": true,
  "statistics": true
}

The response looks like this if a dataset is valid but /filter is not available for it:

{
  "viewer": true,
  "preview": true,
  "search": true,
  "filter": false,
  "statistics": true
}

Similarly, if the statistics are not available:

{
  "viewer": true,
  "preview": true,
  "search": true,
  "filter": true,
  "statistics": false
}

If only the first rows of a dataset are available, then the response looks like:

{
  "viewer": false,
  "preview": true,
  "search": true,
  "filter": true,
  "statistics": true
}

Finally, if the dataset is not valid at all, then the response is:

{
  "viewer": false,
  "preview": false,
  "search": false,
  "filter": false,
  "statistics": false
}
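
The responses above differ only in which flags are set, so a small helper can turn them into a single status. The following is a minimal sketch, not part of the dataset viewer API: the function name summarize_validity and the labels "valid", "preview only", "partial", and "not valid" are hypothetical, and treating a missing key as false is an assumption.

```python
from typing import List, Mapping, Tuple

# Feature flags shown in the /is-valid responses above.
FEATURES = ("viewer", "preview", "search", "filter", "statistics")


def summarize_validity(response: Mapping[str, bool]) -> Tuple[str, List[str]]:
    """Return a coarse status label and the list of unavailable features.

    The labels are hypothetical names chosen for this sketch.
    A missing key is treated as False (an assumption).
    """
    unavailable = [name for name in FEATURES if not response.get(name, False)]
    if len(unavailable) == len(FEATURES):
        status = "not valid"      # every flag is false
    elif response.get("viewer", False):
        status = "valid"          # full viewer; some features may still be off
    elif response.get("preview", False):
        status = "preview only"   # only the first rows are available
    else:
        status = "partial"        # any other combination of flags
    return status, unavailable


if __name__ == "__main__":
    example = {
        "viewer": False,
        "preview": True,
        "search": True,
        "filter": True,
        "statistics": True,
    }
    print(summarize_validity(example))  # ('preview only', ['viewer'])
```

Returning the list of unavailable features alongside the label lets a caller decide, for instance, to skip a /search call while still using the rest of the API.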

Some cases where a dataset is not valid are:

  • the dataset viewer is disabled

  • the dataset is gated but the access is not granted: no token is passed or the passed token is not authorized

  • the dataset is private but the owner is not a PRO user or an Enterprise Hub org

  • the dataset contains no data or the data format is not supported

Remember: if a dataset is gated, you'll need to provide your user token to submit a successful query!
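
Because the token is only needed for gated or private datasets, it can help to build the request so that the Authorization header is added only when a token is available. This is a minimal sketch using the standard library; the helper name build_is_valid_request and the BASE_URL constant are hypothetical, and the URL is the one used in the examples earlier in this guide.

```python
from typing import Dict, Optional, Tuple
from urllib.parse import urlencode

# Endpoint used in the examples earlier in this guide.
BASE_URL = "https://datasets-server.huggingface.co/is-valid"


def build_is_valid_request(
    dataset: str, token: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for an /is-valid query (hypothetical helper)."""
    # safe="/" keeps the slash between namespace and dataset name unescaped.
    query = urlencode({"dataset": dataset}, safe="/")
    headers: Dict[str, str] = {}
    if token:
        # Only send the Authorization header when a token was provided.
        headers["Authorization"] = f"Bearer {token}"
    return f"{BASE_URL}?{query}", headers


if __name__ == "__main__":
    url, headers = build_is_valid_request("cornell-movie-review-data/rotten_tomatoes")
    print(url)
    print(headers)
```

The returned pair can be passed directly to an HTTP client, for example requests.get(url, headers=headers).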
