Buckets:
| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Bagaimana jika dataset saya tidak ada di Hub?","local":"what-if-my-dataset-isnt-on-the-hub","sections":[{"title":"Bekerja dengan dataset lokal dan jarak jauh","local":"working-with-local-and-remote-datasets","sections":[],"depth":2},{"title":"Memuat dataset lokal","local":"loading-a-local-dataset","sections":[],"depth":2},{"title":"Memuat dataset jarak jauh","local":"loading-a-remote-dataset","sections":[],"depth":2}],"depth":1}"> | |
| <link href="/docs/course/pr_1054/id/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/entry/start.4f92af03.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/scheduler.36a0863c.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/singletons.7dc7b9a4.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/index.733708bb.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/paths.cf097d06.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/entry/app.19cef1b6.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/index.156fee99.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/nodes/0.1203e4a0.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/each.e59479a4.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/nodes/37.c0458f12.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/Tip.8a648467.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/Youtube.a5d6d567.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/CodeBlock.4cf998e6.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/CourseFloatingBanner.16bb8bff.js"> | |
| <link rel="modulepreload" href="/docs/course/pr_1054/id/_app/immutable/chunks/getInferenceSnippets.472bc46d.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{"title":"Bagaimana jika dataset saya tidak ada di Hub?","local":"what-if-my-dataset-isnt-on-the-hub","sections":[{"title":"Bekerja dengan dataset lokal dan jarak jauh","local":"working-with-local-and-remote-datasets","sections":[],"depth":2},{"title":"Memuat dataset lokal","local":"loading-a-local-dataset","sections":[],"depth":2},{"title":"Memuat dataset jarak jauh","local":"loading-a-remote-dataset","sections":[],"depth":2}],"depth":1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="what-if-my-dataset-isnt-on-the-hub" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#what-if-my-dataset-isnt-on-the-hub"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Bagaimana jika dataset saya tidak ada di Hub?</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"><a href="https://discuss.huggingface.co/t/chapter-5-questions" target="_blank"><img alt="Ask a Question" class="!m-0" src="https://img.shields.io/badge/Ask%20a%20question-ffcb4c.svg?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgLTEgMTA0IDEwNiI+PGRlZnM+PHN0eWxlPi5jbHMtMXtmaWxsOiMyMzFmMjA7fS5jbHMtMntmaWxsOiNmZmY5YWU7fS5jbHMtM3tmaWxsOiMwMGFlZWY7fS5jbHMtNHtmaWxsOiMwMGE5NGY7fS5jbHMtNXtmaWxsOiNmMTVkMjI7fS5jbHMtNntmaWxsOiNlMzFiMjM7fTwvc3R5bGU+PC9kZWZzPjx0aXRsZT5EaXNjb3Vyc2VfbG9nbzwvdGl0bGU+PGcgaWQ9IkxheWVyXzIiPjxnIGlkPSJMYXllcl8zIj48cGF0aCBjbGFzcz0iY2xzLTEiIGQ9Ik01MS44NywwQzIzLjcxLDAsMCwyMi44MywwLDUxYzAsLjkxLDAsNTIuODEsMCw1Mi44MWw1MS44Ni0uMDVjMjguMTYsMCw1MS0yMy43MSw1MS01MS44N1M4MCwwLDUxLjg3LDBaIi8+PHBhdGggY2xhc3M9ImNscy0yIiBkPSJNNTIuMzcsMTkuNzRBMzEuNjIsMzEuNjIsMCwwLDAsMjQuNTgsNjYuNDFsLTUuNzIsMTguNEwzOS40LDgwLjE3YTMxLjYxLDMxLjYxLDAsMSwwLDEzLTYwLjQzWiIvPjxwYXRoIGNsYXNzPSJjbHMtMyIgZD0iTTc3LjQ1LDMyLjEyYTMxLjYsMzEuNiwwLDAsMS0zOC4wNSw0OEwxOC44Niw4NC44MmwyMC45MS0yLjQ3QTMxLjYsMzEuNiwwLDAsMCw3Ny40NSwzMi4xMloiLz48cGF0aCBjbGFzcz0iY2xzLTQiIGQ9Ik03MS42MywyNi4yOUEzMS42LDMxLjYsMCwwLDEsMzguOCw3OEwxOC44Niw4NC44MiwzOS40LDgwLjE3QTMxLjYsMzEuNiwwLDAsMCw3MS42MywyNi4yOVoiLz48cGF0aCBjbGFzcz0iY2xzLTUiIGQ9Ik0yNi40Nyw2Ny4xMWEzMS42MSwzMS42MSwwLDAsMSw1MS0zNUEzMS42MSwzMS42MSwwLDAsMCwyNC41OCw2Ni40MWwtNS43MiwxOC40WiIvPjxwYXRoIGNsYXNzPSJjbHMtNiIgZD0iTTI0LjU4LDY2LjQxQTMxLjYxLDMxLjYxLDAsMCwxLDcxLjYzLDI2LjI5YTMxLjYxLDMxLjYxLDAsMCwwLTQ5LDM5LjYzbC0zLjc2LDE4LjlaIi8+PC9nPjwvZz48L3N2Zz4="></a> <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter5/section2.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter5/section2.ipynb" target="_blank"><img alt="Open In Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"></a></div> <p data-svelte-h="svelte-1hj3pmj">Kamu sudah tahu cara menggunakan <a href="https://huggingface.co/datasets" rel="nofollow">Hugging Face Hub</a> untuk mengunduh dataset, tetapi sering kali kamu akan bekerja dengan data yang disimpan di laptop kamu atau di server jarak jauh. Dalam bagian ini, kami akan menunjukkan bagaimana 🤗 Datasets dapat digunakan untuk memuat dataset yang tidak tersedia di Hugging Face Hub.</p> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/HyQgpJTkRdE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <h2 class="relative group"><a id="working-with-local-and-remote-datasets" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#working-with-local-and-remote-datasets"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Bekerja dengan dataset lokal dan jarak jauh</span></h2> <p data-svelte-h="svelte-1bi69je">🤗 Datasets menyediakan skrip pemuatan untuk menangani pemuatan dataset lokal dan jarak jauh. Ini mendukung beberapa format data umum, seperti:</p> <table data-svelte-h="svelte-1auq8p8"><thead><tr><th align="center">Format Data</th> <th align="center">Skrip Pemuatan</th> <th align="center">Contoh</th></tr></thead> <tbody><tr><td align="center">CSV & TSV</td> <td align="center"><code>csv</code></td> <td align="center"><code>load_dataset("csv", data_files="my_file.csv")</code></td></tr> <tr><td align="center">File Teks</td> <td align="center"><code>text</code></td> <td align="center"><code>load_dataset("text", data_files="my_file.txt")</code></td></tr> <tr><td align="center">JSON & JSON Lines</td> <td align="center"><code>json</code></td> <td align="center"><code>load_dataset("json", data_files="my_file.jsonl")</code></td></tr> <tr><td align="center">Pickled DataFrames</td> <td align="center"><code>pandas</code></td> <td align="center"><code>load_dataset("pandas", data_files="my_dataframe.pkl")</code></td></tr></tbody></table> <p data-svelte-h="svelte-1pnjwf4">Seperti yang ditunjukkan dalam tabel, untuk setiap format data kita hanya perlu menentukan jenis skrip pemuatan di fungsi <code>load_dataset()</code>, bersama dengan argumen <code>data_files</code> yang menentukan jalur ke satu atau lebih file. Mari kita mulai dengan memuat dataset dari file lokal; nanti kita akan melihat cara melakukan hal yang sama dengan file jarak jauh.</p> <h2 class="relative group"><a id="loading-a-local-dataset" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#loading-a-local-dataset"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Memuat dataset lokal</span></h2> <p data-svelte-h="svelte-sqilue">Untuk contoh ini, kita akan menggunakan <a href="https://github.com/crux82/squad-it/" rel="nofollow">dataset SQuAD-it</a>, yang merupakan dataset skala besar untuk tugas tanya-jawab dalam bahasa Italia.</p> <p data-svelte-h="svelte-1fkqfks">Bagian pelatihan dan pengujian di-host di GitHub, jadi kita bisa mengunduhnya dengan perintah <code>wget</code> sederhana:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz | |
| !wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1a1fjqm">Ini akan mengunduh dua file terkompresi bernama <em>SQuAD_it-train.json.gz</em> dan <em>SQuAD_it-test.json.gz</em>, yang bisa kita dekompresi dengan perintah <code>gzip</code> di Linux:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!gzip -dkv SQuAD_it-*.json.gz<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->SQuAD_it-test.json.gz: 87.4% -- diganti dengan SQuAD_it-test.json | |
| SQuAD_it-train.json.gz: 82.2% -- diganti dengan SQuAD_it-train.json<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-14vm9hh">Kita dapat melihat bahwa file terkompresi telah diganti dengan <em>SQuAD_it-train.json</em> dan <em>SQuAD_it-test.json</em>, dan datanya disimpan dalam format JSON.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-spzb0m">✎ Jika kamu bertanya-tanya kenapa ada karakter <code>!</code> di perintah shell di atas, itu karena kita menjalankannya di dalam Jupyter notebook. Cukup hapus awalan tersebut jika kamu ingin mengunduh dan mengekstrak dataset dari terminal biasa.</p></div> <p data-svelte-h="svelte-1y398gf">Untuk memuat file JSON dengan fungsi <code>load_dataset()</code>, kita hanya perlu tahu apakah kita berurusan dengan JSON biasa (mirip dengan kamus bertingkat) atau JSON Lines (baris per baris JSON). Seperti banyak dataset tanya-jawab, SQuAD-it menggunakan format bertingkat, dengan semua teks disimpan di field <code>data</code>. Ini berarti kita dapat memuat dataset dengan menentukan argumen <code>field</code> seperti berikut:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| squad_it_dataset = load_dataset(<span class="hljs-string">"json"</span>, data_files=<span class="hljs-string">"SQuAD_it-train.json"</span>, field=<span class="hljs-string">"data"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-111zd44">Secara default, memuat file lokal akan menghasilkan objek <code>DatasetDict</code> dengan split <code>train</code>. Kita dapat melihat ini dengan memeriksa objek <code>squad_it_dataset</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->squad_it_dataset<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
| train: Dataset({ | |
| features: [<span class="hljs-string">'title'</span>, <span class="hljs-string">'paragraphs'</span>], | |
| num_rows: <span class="hljs-number">442</span> | |
| }) | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-esfsni">Ini menunjukkan jumlah baris dan nama kolom yang terkait dengan set pelatihan. Kita bisa melihat salah satu contoh dengan mengakses indeks dari split <code>train</code> seperti ini:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->squad_it_dataset[<span class="hljs-string">"train"</span>][<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{ | |
| <span class="hljs-string">"title"</span>: <span class="hljs-string">"Terremoto del Sichuan del 2008"</span>, | |
| <span class="hljs-string">"paragraphs"</span>: [ | |
| { | |
| <span class="hljs-string">"context"</span>: <span class="hljs-string">"Il terremoto del Sichuan del 2008 o il terremoto..."</span>, | |
| <span class="hljs-string">"qas"</span>: [ | |
| { | |
| <span class="hljs-string">"answers"</span>: [{<span class="hljs-string">"answer_start"</span>: <span class="hljs-number">29</span>, <span class="hljs-string">"text"</span>: <span class="hljs-string">"2008"</span>}], | |
| <span class="hljs-string">"id"</span>: <span class="hljs-string">"56cdca7862d2951400fa6826"</span>, | |
| <span class="hljs-string">"question"</span>: <span class="hljs-string">"In quale anno si è verificato il terremoto nel Sichuan?"</span>, | |
| }, | |
| ... | |
| ], | |
| }, | |
| ... | |
| ], | |
| }<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-7rcn3r">Mantap, kita telah berhasil memuat dataset lokal pertama kita! Tapi walaupun ini bekerja untuk set pelatihan, yang sebenarnya kita inginkan adalah memasukkan split <code>train</code> dan <code>test</code> dalam satu objek <code>DatasetDict</code> sehingga kita bisa menerapkan fungsi <code>Dataset.map()</code> ke keduanya sekaligus. Untuk melakukan ini, kita bisa memberikan dictionary ke argumen <code>data_files</code> yang memetakan nama setiap split ke file yang sesuai:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->data_files = {<span class="hljs-string">"train"</span>: <span class="hljs-string">"SQuAD_it-train.json"</span>, <span class="hljs-string">"test"</span>: <span class="hljs-string">"SQuAD_it-test.json"</span>} | |
| squad_it_dataset = load_dataset(<span class="hljs-string">"json"</span>, data_files=data_files, field=<span class="hljs-string">"data"</span>) | |
| squad_it_dataset<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
| train: Dataset({ | |
| features: [<span class="hljs-string">'title'</span>, <span class="hljs-string">'paragraphs'</span>], | |
| num_rows: <span class="hljs-number">442</span> | |
| }) | |
| test: Dataset({ | |
| features: [<span class="hljs-string">'title'</span>, <span class="hljs-string">'paragraphs'</span>], | |
| num_rows: <span class="hljs-number">48</span> | |
| }) | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-cbcjyq">Ini adalah hasil yang kita inginkan. Sekarang, kita bisa menerapkan berbagai teknik praproses untuk membersihkan data, melakukan tokenisasi, dan sebagainya.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-wznez">Argumen <code>data_files</code> dari fungsi <code>load_dataset()</code> cukup fleksibel dan bisa berupa path file tunggal, daftar path file, atau dictionary yang memetakan nama split ke path file. Kamu juga bisa menggunakan glob untuk mencocokkan pola tertentu sesuai aturan shell Unix (misalnya, kamu bisa memuat semua file JSON dalam direktori sebagai satu split dengan <code>data_files="*.json"</code>). Lihat dokumentasi 🤗 Datasets <a href="https://huggingface.co/docs/datasets/loading#local-and-remote-files" rel="nofollow">di sini</a> untuk info lebih lanjut.</p></div> <p data-svelte-h="svelte-1xni715">Skrip pemuatan di 🤗 Datasets sebenarnya mendukung dekompresi otomatis dari file input, jadi kita bisa saja melewati langkah penggunaan <code>gzip</code> dan langsung mengarah ke file terkompresi di argumen <code>data_files</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->data_files = {<span class="hljs-string">"train"</span>: <span class="hljs-string">"SQuAD_it-train.json.gz"</span>, <span class="hljs-string">"test"</span>: <span class="hljs-string">"SQuAD_it-test.json.gz"</span>} | |
| squad_it_dataset = load_dataset(<span class="hljs-string">"json"</span>, data_files=data_files, field=<span class="hljs-string">"data"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-qjx6yw">Ini sangat berguna jika kamu tidak ingin mengekstrak file GZIP secara manual. Dekompressi otomatis ini juga berlaku untuk format umum lainnya seperti ZIP dan TAR, jadi kamu cukup arahkan <code>data_files</code> ke file terkompresi dan langsung bisa digunakan!</p> <p data-svelte-h="svelte-1k42t2f">Sekarang setelah kamu tahu cara memuat file lokal di laptop atau desktop, mari kita lihat cara memuat file dari server jarak jauh.</p> <h2 class="relative group"><a id="loading-a-remote-dataset" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#loading-a-remote-dataset"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Memuat dataset jarak jauh</span></h2> <p data-svelte-h="svelte-1dp71vw">Jika kamu bekerja sebagai data scientist atau programmer di perusahaan, ada kemungkinan besar bahwa dataset yang ingin kamu analisis disimpan di server jarak jauh. Untungnya, memuat file jarak jauh semudah memuat file lokal! Alih-alih memberikan path ke file lokal, kita arahkan argumen <code>data_files</code> dari <code>load_dataset()</code> ke satu atau lebih URL tempat file disimpan. Sebagai contoh, untuk dataset SQuAD-it yang di-host di GitHub, kita cukup arahkan <code>data_files</code> ke URL <em>SQuAD_it-*.json.gz</em> seperti ini:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->url = <span class="hljs-string">"https://github.com/crux82/squad-it/raw/master/"</span> | |
| data_files = { | |
| <span class="hljs-string">"train"</span>: url + <span class="hljs-string">"SQuAD_it-train.json.gz"</span>, | |
| <span class="hljs-string">"test"</span>: url + <span class="hljs-string">"SQuAD_it-test.json.gz"</span>, | |
| } | |
| squad_it_dataset = load_dataset(<span class="hljs-string">"json"</span>, data_files=data_files, field=<span class="hljs-string">"data"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1r6rbq9">Ini akan menghasilkan objek <code>DatasetDict</code> yang sama seperti sebelumnya, tetapi menghemat langkah pengunduhan dan dekompresi file <em>SQuAD_it-*.json.gz</em> secara manual. Ini mengakhiri pembahasan kita tentang berbagai cara memuat dataset yang tidak di-host di Hugging Face Hub. Sekarang setelah kita punya dataset untuk dimainkan, mari kita mulai mengolah data dengan berbagai teknik data wrangling!</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1o5i5nu">✏️ <strong>Coba sendiri!</strong> Pilih dataset lain yang di-host di GitHub atau di <a href="https://archive.ics.uci.edu/ml/index.php" rel="nofollow">UCI Machine Learning Repository</a> dan coba muat baik secara lokal maupun jarak jauh menggunakan teknik yang telah dijelaskan di atas. Untuk tantangan tambahan, coba muat dataset yang disimpan dalam format CSV atau teks (lihat <a href="https://huggingface.co/docs/datasets/loading#local-and-remote-files" rel="nofollow">dokumentasi</a> untuk info lebih lanjut tentang format-format ini).</p></div> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/course/blob/main/chapters/id/chapter5/2.mdx" target="_blank"><span data-svelte-h="svelte-1kd6by1"><</span> <span data-svelte-h="svelte-x0xyl0">></span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p> | |
| <script> | |
| { | |
| __sveltekit_ojy514 = { | |
| assets: "/docs/course/pr_1054/id", | |
| base: "/docs/course/pr_1054/id", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/course/pr_1054/id/_app/immutable/entry/start.4f92af03.js"), | |
| import("/docs/course/pr_1054/id/_app/immutable/entry/app.19cef1b6.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 37], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |
Xet Storage Details
- Size:
- 38.2 kB
- Xet hash:
- 08c570edaef7364827c92ed80a59864460b776619c2c268cc174a9c3f5707dad
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.