Buckets:

hf-doc-build/doc-dev / hub /main /en /datasets-data-files-configuration.html
rtrm's picture
download
raw
14 kB
<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Data files Configuration&quot;,&quot;local&quot;:&quot;data-files-configuration&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;What are splits and subsets?&quot;,&quot;local&quot;:&quot;what-are-splits-and-subsets&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;File names and splits&quot;,&quot;local&quot;:&quot;file-names-and-splits&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Manual configuration&quot;,&quot;local&quot;:&quot;manual-configuration&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Supported file formats&quot;,&quot;local&quot;:&quot;supported-file-formats&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Image and Audio datasets&quot;,&quot;local&quot;:&quot;image-and-audio-datasets&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/hub/main/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/start.d0cd5065.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/scheduler.d6170356.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/singletons.d032f1eb.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/paths.752f1c6b.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/entry/app.b6abe3c1.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/index.fcd4cc08.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/0.f045427f.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/nodes/17.d73144be.js">
<link rel="modulepreload" href="/docs/hub/main/en/_app/immutable/chunks/EditOnGithub.da2b595c.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Data files Configuration&quot;,&quot;local&quot;:&quot;data-files-configuration&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;What are splits and subsets?&quot;,&quot;local&quot;:&quot;what-are-splits-and-subsets&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;File names and splits&quot;,&quot;local&quot;:&quot;file-names-and-splits&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Manual configuration&quot;,&quot;local&quot;:&quot;manual-configuration&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Supported file formats&quot;,&quot;local&quot;:&quot;supported-file-formats&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2},{&quot;title&quot;:&quot;Image and Audio datasets&quot;,&quot;local&quot;:&quot;image-and-audio-datasets&quot;,&quot;sections&quot;:[],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="data-files-configuration" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#data-files-configuration"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Data files Configuration</span></h1> <p data-svelte-h="svelte-1sicc2g">There are no constraints on how to structure dataset repositories.</p> <p data-svelte-h="svelte-rffpu1">However, if you want the Dataset Viewer to show certain data files, or to separate your dataset in train/validation/test splits, you need to structure your dataset accordingly.
Often it is as simple as naming your data files according to their split names, e.g. <code>train.csv</code> and <code>test.csv</code>.</p> <h2 class="relative group"><a id="what-are-splits-and-subsets" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#what-are-splits-and-subsets"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>What are splits and subsets?</span></h2> <p data-svelte-h="svelte-1scij8h">Machine learning datasets typically have splits and may also have subsets. A dataset is generally made of <em>splits</em> (e.g. <code>train</code> and <code>test</code>) that are used during different stages of training and evaluating a model. A <em>subset</em> (also called <em>configuration</em>) is a sub-dataset contained within a larger dataset. Subsets are especially common in multilingual speech datasets where there may be a different subset for each language. If you’re interested in learning more about splits and subsets, check out the <a href="/docs/datasets-server/configs_and_splits">Splits and subsets</a> guide!</p> <p data-svelte-h="svelte-1p3cuco"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/split-configs-server.gif" alt="split-configs-server"></p> <h2 class="relative group"><a id="file-names-and-splits" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#file-names-and-splits"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>File names and splits</span></h2> <p data-svelte-h="svelte-1gvdkiz">To structure your dataset by naming your data files or directories according to their split names, see the <a href="./datasets-file-names-and-splits">File names and splits</a> documentation and the <a href="https://huggingface.co/collections/datasets-examples/file-names-and-splits-655e28af4471bd95709eb135" rel="nofollow">companion collection of example datasets</a>.</p> <h2 class="relative group"><a id="manual-configuration" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#manual-configuration"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Manual configuration</span></h2> <p data-svelte-h="svelte-irk0xr">You can choose the data files to show in the Dataset Viewer for your dataset using YAML.
It is useful if you want to specify which file goes into which split manually.</p> <p data-svelte-h="svelte-24qwdr">You can also define multiple subsets for your dataset, and pass dataset building parameters (e.g. the separator to use for CSV files).</p> <p data-svelte-h="svelte-1bzrp3h">See the documentation on <a href="./datasets-manual-configuration">Manual configuration</a> for more information. Look also to the <a href="https://huggingface.co/collections/datasets-examples/manual-configuration-655e293cea26da0acab95b87" rel="nofollow">example datasets</a>.</p> <h2 class="relative group"><a id="supported-file-formats" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#supported-file-formats"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Supported file formats</span></h2> <p data-svelte-h="svelte-13m965x">See the <a href="./datasets-adding#file-formats">File formats</a> doc page to find the list of supported formats and recommendations for your dataset. If your dataset uses CSV or TSV files, you can find more information in the <a href="https://huggingface.co/collections/datasets-examples/format-csv-and-tsv-655f681cb9673a4249cccb3d" rel="nofollow">example datasets</a>.</p> <h2 class="relative group"><a id="image-and-audio-datasets" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#image-and-audio-datasets"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Image and Audio datasets</span></h2> <p data-svelte-h="svelte-r9o772">For image and audio classification datasets, you can also use directories to name the image and audio classes.
And if your images/audio files have metadata (e.g. captions, bounding boxes, transcriptions, etc.), you can have metadata files next to them.</p> <p data-svelte-h="svelte-1s3m8si">We provide two guides that you can check out:</p> <ul data-svelte-h="svelte-zbloxx"><li><a href="./datasets-image">How to create an image dataset</a> (<a href="https://huggingface.co/collections/datasets-examples/image-dataset-6568e7cf28639db76eb92d65" rel="nofollow">example datasets</a>)</li> <li><a href="./datasets-audio">How to create an audio dataset</a> (<a href="https://huggingface.co/collections/datasets-examples/audio-dataset-66aca0b73e8f69e3d069e607" rel="nofollow">example datasets</a>)</li></ul> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/hub-docs/blob/main/docs/hub/datasets-data-files-configuration.md" target="_blank"><span data-svelte-h="svelte-1kd6by1">&lt;</span> <span data-svelte-h="svelte-x0xyl0">&gt;</span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p>
<script>
{
__sveltekit_1vatp3t = {
assets: "/docs/hub/main/en",
base: "/docs/hub/main/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/hub/main/en/_app/immutable/entry/start.d0cd5065.js"),
import("/docs/hub/main/en/_app/immutable/entry/app.b6abe3c1.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 17],
data,
form: null,
error: null
});
});
}
</script>

Xet Storage Details

Size:
14 kB
·
Xet hash:
c05b174c268a160604685cd33fda5eed342cd2925b02a2092c7d033e55ce3800

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.