| <meta charset="utf-8" /><meta name="hf:doc:metadata" content="{"title":"Time to slice and dice","local":"time-to-slice-and-dice","sections":[{"title":"Slicing and dicing our data","local":"slicing-and-dicing-our-data","sections":[],"depth":2},{"title":"Creating new columns","local":"creating-new-columns","sections":[],"depth":2},{"title":"The map() method's superpowers","local":"the-map-methods-superpowers","sections":[],"depth":2},{"title":"From Datasets to DataFrames and back","local":"from-datasets-to-dataframes-and-back","sections":[],"depth":2},{"title":"Creating a validation set","local":"creating-a-validation-set","sections":[],"depth":2},{"title":"Saving a dataset","local":"saving-a-dataset","sections":[],"depth":2}],"depth":1}"> | |
| <p></p> <h1 class="relative group"><a id="time-to-slice-and-dice" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#time-to-slice-and-dice"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Time to slice and dice the
data</span></h1> <div class="flex space-x-1 absolute z-10 right-0 top-0"><a href="https://discuss.huggingface.co/t/chapter-5-questions" target="_blank"><img alt="Ask a Question" class="!m-0" src="https://img.shields.io/badge/Ask%20a%20question-ffcb4c.svg?logo=data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgLTEgMTA0IDEwNiI+PGRlZnM+PHN0eWxlPi5jbHMtMXtmaWxsOiMyMzFmMjA7fS5jbHMtMntmaWxsOiNmZmY5YWU7fS5jbHMtM3tmaWxsOiMwMGFlZWY7fS5jbHMtNHtmaWxsOiMwMGE5NGY7fS5jbHMtNXtmaWxsOiNmMTVkMjI7fS5jbHMtNntmaWxsOiNlMzFiMjM7fTwvc3R5bGU+PC9kZWZzPjx0aXRsZT5EaXNjb3Vyc2VfbG9nbzwvdGl0bGU+PGcgaWQ9IkxheWVyXzIiPjxnIGlkPSJMYXllcl8zIj48cGF0aCBjbGFzcz0iY2xzLTEiIGQ9Ik01MS44NywwQzIzLjcxLDAsMCwyMi44MywwLDUxYzAsLjkxLDAsNTIuODEsMCw1Mi44MWw1MS44Ni0uMDVjMjguMTYsMCw1MS0yMy43MSw1MS01MS44N1M4MCwwLDUxLjg3LDBaIi8+PHBhdGggY2xhc3M9ImNscy0yIiBkPSJNNTIuMzcsMTkuNzRBMzEuNjIsMzEuNjIsMCwwLDAsMjQuNTgsNjYuNDFsLTUuNzIsMTguNEwzOS40LDgwLjE3YTMxLjYxLDMxLjYxLDAsMSwwLDEzLTYwLjQzWiIvPjxwYXRoIGNsYXNzPSJjbHMtMyIgZD0iTTc3LjQ1LDMyLjEyYTMxLjYsMzEuNiwwLDAsMS0zOC4wNSw0OEwxOC44Niw4NC44MmwyMC45MS0yLjQ3QTMxLjYsMzEuNiwwLDAsMCw3Ny40NSwzMi4xMloiLz48cGF0aCBjbGFzcz0iY2xzLTQiIGQ9Ik03MS42MywyNi4yOUEzMS42LDMxLjYsMCwwLDEsMzguOCw3OEwxOC44Niw4NC44MiwzOS40LDgwLjE3QTMxLjYsMzEuNiwwLDAsMCw3MS42MywyNi4yOVoiLz48cGF0aCBjbGFzcz0iY2xzLTUiIGQ9Ik0yNi40Nyw2Ny4xMWEzMS42MSwzMS42MSwwLDAsMSw1MS0zNUEzMS42MSwzMS42MSwwLDAsMCwyNC41OCw2Ni40MWwtNS43MiwxOC40WiIvPjxwYXRoIGNsYXNzPSJjbHMtNiIgZD0iTTI0LjU4LDY2LjQxQTMxLjYxLDMxLjYxLDAsMCwxLDcxLjYzLDI2LjI5YTMxLjYxLDMxLjYxLDAsMCwwLTQ5LDM5LjYzbC0zLjc2LDE4LjlaIi8+PC9nPjwvZz48L3N2Zz4="></a> <a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/en/chapter5/section3.ipynb" target="_blank"><img alt="Open In Colab" class="!m-0" src="https://colab.research.google.com/assets/colab-badge.svg"></a> <a href="https://studiolab.sagemaker.aws/import/github/huggingface/notebooks/blob/master/course/en/chapter5/section3.ipynb" target="_blank"><img alt="Open In 
Studio Lab" class="!m-0" src="https://studiolab.sagemaker.aws/studiolab.svg"></a></div> <p data-svelte-h="svelte-d3pqtk">Most of the time, the data you work with will not be ready for training a model right away. In this section we'll explore the features that 🤗 Datasets provides for cleaning up your datasets.</p> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/tqfSFcPMgOI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <h2 class="relative group"><a id="slicing-and-dicing-our-data" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#slicing-and-dicing-our-data"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Slicing and dicing our data</span></h2> <p data-svelte-h="svelte-k5ge8a">Similar to Pandas, 🤗 Datasets provides several functions for manipulating the contents of <code>Dataset</code> and <code>DatasetDict</code> objects.
We already encountered the <code>Dataset.map()</code> method in <a href="/course/chapter3">Chapter 3</a>, and in this section we'll explore some of the other functions available.</p> <p data-svelte-h="svelte-wm7h0y">For this example we'll use the <a href="https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29" rel="nofollow">Drug Review Dataset</a> hosted on the <a href="https://archive.ics.uci.edu/ml/index.php" rel="nofollow">UC Irvine Machine Learning Repository</a>, which contains patient reviews of various drugs, the condition being treated, and a satisfaction rating from 1 to 10 stars.</p> <p data-svelte-h="svelte-e47dxt">First we need to download and extract the data, which can be done with the <code>wget</code> and <code>unzip</code> commands:</p> <div class="code-block relative "><pre class=""><!--
HTML_TAG_START -->!wget <span class="hljs-string">"https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"</span> | |
| !unzip drugsCom_raw.<span class="hljs-built_in">zip</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-13y73fd">Since TSV is a variant of CSV that uses tabs instead of commas as the separator, we can load these files with the <code>csv</code> loading script by specifying the <code>delimiter</code> argument in the <code>load_dataset()</code> function as follows:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_dataset | |
| data_files = {<span class="hljs-string">"train"</span>: <span class="hljs-string">"drugsComTrain_raw.tsv"</span>, <span class="hljs-string">"test"</span>: <span class="hljs-string">"drugsComTest_raw.tsv"</span>} | |
| <span class="hljs-comment"># \t is the tab character in Python</span> | |
| drug_dataset = load_dataset(<span class="hljs-string">"csv"</span>, data_files=data_files, delimiter=<span class="hljs-string">"\t"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-116dxnq">A good practice in any data analysis is to grab a small random sample to get a feel for the kind of data you are working with. In 🤗 Datasets, we can create a random sample by chaining the <code>Dataset.shuffle()</code> and <code>Dataset.select()</code> functions:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_sample = drug_dataset[<span class="hljs-string">"train"</span>].shuffle(seed=<span class="hljs-number">42</span>).select(<span class="hljs-built_in">range</span>(<span class="hljs-number">1000</span>)) | |
| <span class="hljs-comment"># Peek at the first few examples</span> | |
| drug_sample[:<span class="hljs-number">3</span>]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'Unnamed: 0'</span>: [<span class="hljs-number">87571</span>, <span class="hljs-number">178045</span>, <span class="hljs-number">80482</span>], | |
| <span class="hljs-string">'drugName'</span>: [<span class="hljs-string">'Naproxen'</span>, <span class="hljs-string">'Duloxetine'</span>, <span class="hljs-string">'Mobic'</span>], | |
| <span class="hljs-string">'condition'</span>: [<span class="hljs-string">'Gout, Acute'</span>, <span class="hljs-string">'ibromyalgia'</span>, <span class="hljs-string">'Inflammatory Conditions'</span>], | |
| <span class="hljs-string">'review'</span>: [<span class="hljs-string">'"like the previous person mention, I&#039;m a strong believer of aleve, it works faster for my gout than the prescription meds I take. No more going to the doctor for refills.....Aleve works!"'</span>, | |
| <span class="hljs-string">'"I have taken Cymbalta for about a year and a half for fibromyalgia pain. It is great\r\nas a pain reducer and an anti-depressant, however, the side effects outweighed \r\nany benefit I got from it. I had trouble with restlessness, being tired constantly,\r\ndizziness, dry mouth, numbness and tingling in my feet, and horrible sweating. I am\r\nbeing weaned off of it now. Went from 60 mg to 30mg and now to 15 mg. I will be\r\noff completely in about a week. The fibro pain is coming back, but I would rather deal with it than the side effects."'</span>, | |
| <span class="hljs-string">'"I have been taking Mobic for over a year with no side effects other than an elevated blood pressure. I had severe knee and ankle pain which completely went away after taking Mobic. I attempted to stop the medication however pain returned after a few days."'</span>], | |
| <span class="hljs-string">'rating'</span>: [<span class="hljs-number">9.0</span>, <span class="hljs-number">3.0</span>, <span class="hljs-number">10.0</span>], | |
| <span class="hljs-string">'date'</span>: [<span class="hljs-string">'September 2, 2015'</span>, <span class="hljs-string">'November 7, 2011'</span>, <span class="hljs-string">'June 5, 2013'</span>], | |
| <span class="hljs-string">'usefulCount'</span>: [<span class="hljs-number">36</span>, <span class="hljs-number">13</span>, <span class="hljs-number">128</span>]}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-njm8r6">Note that we fixed the seed in <code>Dataset.shuffle()</code> for reproducibility. <code>Dataset.select()</code> expects an iterable of indices, so we passed <code>range(1000)</code> to grab the first 1,000 examples from the shuffled dataset. From this sample we can already see a few quirks:</p> <ul data-svelte-h="svelte-1jhw97h"><li>The <code>Unnamed: 0</code> column looks like an anonymized ID for each patient.</li> <li>The <code>condition</code> column mixes uppercase and lowercase labels.</li> <li>The reviews vary in length and contain Python line separators (<code>\r\n</code>) as well as HTML character codes like <code>&#039;</code>.</li></ul> <p data-svelte-h="svelte-1ocdf4r">Let's take a look at how we can use 🤗 Datasets to deal with each of these issues.
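</p> <p>The three observations above can be reproduced on plain Python strings before touching the dataset itself. The following is a minimal sketch, assuming the strings below are shortened excerpts copied by hand from the sample output; <code>html.unescape</code> and <code>str.splitlines</code> are standard-library stand-ins chosen for illustration, not 🤗 Datasets calls, and not necessarily how the dataset is cleaned later on.</p>

```python
import html

# Assumption: shortened excerpts copied by hand from the sample shown above.
conditions = ["Gout, Acute", "ibromyalgia", "Inflammatory Conditions"]
review = "I&#039;m a strong believer of aleve"
multiline_review = "It is great\r\nas a pain reducer"

# Mixed casing: lowercasing gives one canonical form per label.
normalized = [condition.lower() for condition in conditions]

# HTML character codes: html.unescape turns &#039; back into an apostrophe.
clean_review = html.unescape(review)

# Line separators: str.splitlines() treats \r\n as a single line break.
lines = multiline_review.splitlines()

print(normalized)
print(clean_review)
print(lines)
```

<p>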
To test the hypothesis that the <code>Unnamed: 0</code> column holds patient IDs, we can use the <code>Dataset.unique()</code> function to verify that the number of IDs matches the number of rows in each split:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">for</span> split <span class="hljs-keyword">in</span> drug_dataset.keys(): | |
|     <span class="hljs-keyword">assert</span> <span class="hljs-built_in">len</span>(drug_dataset[split]) == <span class="hljs-built_in">len</span>(drug_dataset[split].unique(<span class="hljs-string">"Unnamed: 0"</span>))<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-weab9w">The assertions pass, which is consistent with our hypothesis, so let's tidy up the dataset a bit by renaming the <code>Unnamed: 0</code> column to something more meaningful. We can use the <code>DatasetDict.rename_column()</code> function to rename the column across both splits in one go:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset = drug_dataset.rename_column( | |
|     original_column_name=<span class="hljs-string">"Unnamed: 0"</span>, new_column_name=<span class="hljs-string">"patient_id"</span> | |
| ) | |
| drug_dataset<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
|     train: Dataset({ | |
|         features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>], | |
|         num_rows: <span class="hljs-number">161297</span> | |
|     }) | |
|     test: Dataset({ | |
|         features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>], | |
|         num_rows: <span class="hljs-number">53766</span> | |
|     }) | |
| })<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-w28ap4">✏️ <strong>Try it out!</strong> Use the <code>Dataset.unique()</code> function to find the number of unique drugs and conditions in the training and test sets.</p></div> <p data-svelte-h="svelte-1oer9u4">Next, let's normalize all the labels in the <code>condition</code> column using <code>Dataset.map()</code>. As we did with tokenization in <a href="/course/chapter3">Chapter 3</a>, we can define a simple function that is applied to every row in each split:</p> <div class="code-block relative "><pre class=""><!--
HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">lowercase_condition</span>(<span class="hljs-params">example</span>): | |
|     <span class="hljs-keyword">return</span> {<span class="hljs-string">"condition"</span>: example[<span class="hljs-string">"condition"</span>].lower()} | |
| drug_dataset.<span class="hljs-built_in">map</span>(lowercase_condition)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->AttributeError: <span class="hljs-string">'NoneType'</span> <span class="hljs-built_in">object</span> has no attribute <span class="hljs-string">'lower'</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-rwkkrv">Oh no, we've run into a problem! From the error we can tell that some entries in the <code>condition</code> column are <code>None</code>, which cannot be lowercased because they are not strings. Let's drop these rows using <code>Dataset.filter()</code>, which works much like <code>Dataset.map()</code> and expects a function that receives a single example of the dataset.
Instead of writing an explicit function like:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">filter_nones</span>(<span class="hljs-params">x</span>): | |
|     <span class="hljs-keyword">return</span> x[<span class="hljs-string">"condition"</span>] <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-90o9jf">and then running <code>drug_dataset.filter(filter_nones)</code>, we can do this in one line using a <em>lambda function</em>. In Python, lambda functions are small functions that you can define without explicitly naming them. They take the general form:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">lambda</span> &lt;arguments&gt; : &lt;expression&gt;<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-14khj95">where
<code>lambda</code> is one of Python's <a href="https://docs.python.org/3/reference/lexical_analysis.html#keywords" rel="nofollow">special keywords</a>, <code>&lt;arguments&gt;</code> is a comma-separated list of values that define the inputs to the function, and <code>&lt;expression&gt;</code> represents the operation you wish to execute. For example, we can define a simple lambda function that squares a number as follows:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">lambda</span> x: x * x<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-100lhe0">To apply this function to an input, we need to wrap it and the input in parentheses:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->(<span class="hljs-keyword">lambda</span> x: x * x)(<span class="hljs-number">3</span>)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-number">9</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-hbrpdp">Similarly, we can define lambda functions with multiple arguments by separating them with commas. For example, we can compute the area of a triangle as follows:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START
-->(<span class="hljs-keyword">lambda</span> base, height: <span class="hljs-number">0.5</span> * base * height)(<span class="hljs-number">4</span>, <span class="hljs-number">8</span>)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-number">16.0</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-5b9l4o">Lambda function sangat berguna saat Anda ingin mendefinisikan fungsi kecil yang hanya digunakan sekali. (Untuk informasi lebih lanjut, kami sarankan membaca <a href="https://realpython.com/python-lambda/" rel="nofollow">tutorial Real Python</a> yang sangat baik oleh Andre Burgaud). Dalam konteks 🤗 Datasets, kita bisa menggunakan lambda function untuk mendefinisikan operasi <code>map</code> dan <code>filter</code> sederhana. 
Jadi, mari kita gunakan trik ini untuk menghilangkan entri <code>None</code> dalam dataset kita:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->drug_dataset = drug_dataset.<span class="hljs-built_in">filter</span>(<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-string">"condition"</span>] <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1v66ea0">Dengan entri <code>None</code> dihapus, kita bisa menormalisasi kolom <code>condition</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 
text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->drug_dataset = drug_dataset.<span class="hljs-built_in">map</span>(lowercase_condition) | |
<span class="hljs-comment"># Check that lowercasing worked</span>
drug_dataset[<span class="hljs-string">"train"</span>][<span class="hljs-string">"condition"</span>][:<span class="hljs-number">3</span>]<!-- HTML_TAG_END --></pre></div>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->[<span class="hljs-string">'left ventricular dysfunction'</span>, <span class="hljs-string">'adhd'</span>, <span class="hljs-string">'birth control'</span>]<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-6eesrd">It worked! Now that we have cleaned up the labels, let's move on to cleaning up the reviews themselves.</p>
<h2 class="relative group"><a id="creating-new-columns" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#creating-new-columns"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Creating new columns</span></h2>
<p data-svelte-h="svelte-16m191x">Whenever you work with customer reviews, it is good practice to check the number of words in each review. A review might be a single word like “Great!” or a long essay with thousands of words, and depending on the use case, you will need to handle these two extremes differently. To compute the number of words in each review, we will use a simple approach based on splitting each text on whitespace.</p>
<p data-svelte-h="svelte-46td7s">Let's define a function that counts the words in a review:</p>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">compute_review_length</span>(<span class="hljs-params">example</span>):
    <span class="hljs-keyword">return</span> {<span class="hljs-string">"review_length"</span>: <span class="hljs-built_in">len</span>(example[<span class="hljs-string">"review"</span>].split())}<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-1g8kxfr">Unlike the <code>lowercase_condition()</code> function, <code>compute_review_length()</code> returns a dictionary whose key does <strong>not</strong> correspond to any of the column names in the dataset. In this case, when <code>compute_review_length()</code> is passed to <code>Dataset.map()</code>, it is applied to every row in the dataset to create a new column named <code>review_length</code>:</p>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset = drug_dataset.<span class="hljs-built_in">map</span>(compute_review_length)
<span class="hljs-comment"># Inspect the first training example</span>
drug_dataset[<span class="hljs-string">"train"</span>][<span class="hljs-number">0</span>]<!-- HTML_TAG_END --></pre></div>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'patient_id'</span>: <span class="hljs-number">206461</span>,
 <span class="hljs-string">'drugName'</span>: <span class="hljs-string">'Valsartan'</span>,
 <span class="hljs-string">'condition'</span>: <span class="hljs-string">'left ventricular dysfunction'</span>,
 <span class="hljs-string">'review'</span>: <span class="hljs-string">'"It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"'</span>,
 <span class="hljs-string">'rating'</span>: <span class="hljs-number">9.0</span>,
 <span class="hljs-string">'date'</span>: <span class="hljs-string">'May 20, 2012'</span>,
 <span class="hljs-string">'usefulCount'</span>: <span class="hljs-number">27</span>,
 <span class="hljs-string">'review_length'</span>: <span class="hljs-number">17</span>}<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-th61o">As expected, a <code>review_length</code> column has been added. We can sort by this column with <code>Dataset.sort()</code> to see what the extreme values look like:</p>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset[<span class="hljs-string">"train"</span>].sort(<span class="hljs-string">"review_length"</span>)[:<span class="hljs-number">3</span>]<!-- HTML_TAG_END --></pre></div>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'patient_id'</span>: [<span class="hljs-number">103488</span>, <span class="hljs-number">23627</span>, <span class="hljs-number">20558</span>],
 <span class="hljs-string">'drugName'</span>: [<span class="hljs-string">'Loestrin 21 1 / 20'</span>, <span class="hljs-string">'Chlorzoxazone'</span>, <span class="hljs-string">'Nucynta'</span>],
 <span class="hljs-string">'condition'</span>: [<span class="hljs-string">'birth control'</span>, <span class="hljs-string">'muscle spasm'</span>, <span class="hljs-string">'pain'</span>],
 <span class="hljs-string">'review'</span>: [<span class="hljs-string">'"Excellent."'</span>, <span class="hljs-string">'"useless"'</span>, <span class="hljs-string">'"ok"'</span>],
 <span class="hljs-string">'rating'</span>: [<span class="hljs-number">10.0</span>, <span class="hljs-number">1.0</span>, <span class="hljs-number">6.0</span>],
 <span class="hljs-string">'date'</span>: [<span class="hljs-string">'November 4, 2008'</span>, <span class="hljs-string">'March 24, 2017'</span>, <span class="hljs-string">'August 20, 2016'</span>],
 <span class="hljs-string">'usefulCount'</span>: [<span class="hljs-number">5</span>, <span class="hljs-number">2</span>, <span class="hljs-number">10</span>],
 <span class="hljs-string">'review_length'</span>: [<span class="hljs-number">1</span>, <span class="hljs-number">1</span>, <span class="hljs-number">1</span>]}<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-1tgni2t">As we suspected, some reviews contain just a single word. That may be enough for sentiment analysis, but it is not informative if we want to predict the condition.</p>
<div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-zt1nrz">🙋 An alternative way to add a new column to a dataset is the <code>Dataset.add_column()</code> function. It lets you provide the column as a Python list or NumPy array, and it can be handy in situations where <code>Dataset.map()</code> is not well suited to your analysis.</p></div>
<p data-svelte-h="svelte-1qn7o6i">Let's use the <code>Dataset.filter()</code> function to remove reviews that contain 30 words or fewer. Similar to what we did with the <code>condition</code> column, we can filter out the very short reviews by requiring that the review length exceed this threshold:</p>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset = drug_dataset.<span class="hljs-built_in">filter</span>(<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-string">"review_length"</span>] > <span class="hljs-number">30</span>)
<span class="hljs-built_in">print</span>(drug_dataset.num_rows)<!-- HTML_TAG_END --></pre></div>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">'train'</span>: <span class="hljs-number">138514</span>, <span class="hljs-string">'test'</span>: <span class="hljs-number">46108</span>}<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-ehn3jm">As you can see, this has removed around 15% of the reviews from the training and test sets.</p>
<div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-ap3402">✏️ <strong>Try it out!</strong> Use <code>Dataset.sort()</code> to inspect the reviews with the largest number of words. See the <a href="https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.sort" rel="nofollow">documentation</a> to find out how to sort in descending order.</p></div>
<p data-svelte-h="svelte-thy6wb">The last thing we need to deal with is the presence of HTML character codes in the reviews. We can use Python's <code>html</code> module to convert these codes back to the characters they stand for:</p>
<div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">import</span> html
text = <span class="hljs-string">"I&#039;m a transformer called BERT"</span>
| html.unescape(text)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-string">"I'm a transformer called BERT"</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-oay8tz">We'll use <code>Dataset.map()</code> to unescape the HTML characters across the whole corpus:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset = drug_dataset.<span class="hljs-built_in">map</span>(<span class="hljs-keyword">lambda</span> x: {<span class="hljs-string">"review"</span>: html.unescape(x[<span class="hljs-string">"review"</span>])})<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1vutis">As you can see, the <code>Dataset.map()</code> method is very useful for processing data, and we haven't even touched on everything it can do!</p> <h2 class="relative group"><a id="the-map-methods-superpowers" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#the-map-methods-superpowers"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>The map() method's superpowers</span></h2> <p data-svelte-h="svelte-1jhcam9">The <code>Dataset.map()</code> method takes a <code>batched</code> argument that, if set to <code>True</code>, sends a batch of examples to the map function at once (the batch size is configurable and defaults to 1,000). For instance, the previous <code>map()</code> call that unescaped the HTML characters took a while to run. We can speed it up by processing several elements at once with a list comprehension.</p> <p data-svelte-h="svelte-1uesc3">When you specify <code>batched=True</code>, the function receives a dictionary with the fields of the dataset, but each value is now a <em>list of values</em> rather than a single value. The function passed to <code>Dataset.map()</code> should return the same kind of structure: a dictionary with the fields we want to update or add, each holding a list of values. 
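</p> <p>As a rough illustration of that contract, here is a minimal pure-Python sketch that does not use the Datasets library at all. The names <code>unescape_batch</code> and <code>apply_batched</code> are hypothetical and introduced only for this sketch; <code>apply_batched</code> merely mimics the idea of slicing columns into batches and is not how <code>Dataset.map()</code> is implemented. The sample reviews are made up and are escaped with <code>html.escape</code> so that the snippet needs no hand-written entities:</p>
<div class="code-block relative "><pre class="">
```python
import html


def unescape_batch(batch):
    # With batched=True, each field maps to a list of values, one per example.
    return {"review": [html.unescape(text) for text in batch["review"]]}


def apply_batched(columns, function, batch_size=1000):
    # Hypothetical stand-in for a batched map: slice every column into
    # batches, call the function, and merge the returned fields back in.
    num_rows = len(next(iter(columns.values())))
    output = {name: [] for name in columns}
    for start in range(0, num_rows, batch_size):
        batch = {name: values[start : start + batch_size] for name, values in columns.items()}
        updated = {**batch, **function(batch)}
        for name, values in updated.items():
            output.setdefault(name, []).extend(values)
    return output


# Made-up reviews, escaped first so there is something to unescape.
originals = ["I'm a transformer called BERT", "It's great"]
columns = {"review": [html.escape(text) for text in originals], "rating": [9.0, 8.0]}

processed = apply_batched(columns, unescape_batch, batch_size=1)
print(processed["review"])
```
</pre></div>
<p>Columns that the function does not return (here <code>rating</code>) pass through unchanged, which is why the returned lists must have the same length as the incoming batch unless the old columns are dropped or resized.</p> <p>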
Here is another way to unescape the HTML characters, this time with <code>batched=True</code>:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->new_drug_dataset = drug_dataset.<span class="hljs-built_in">map</span>( | |
|     <span class="hljs-keyword">lambda</span> x: {<span class="hljs-string">"review"</span>: [html.unescape(o) <span class="hljs-keyword">for</span> o <span class="hljs-keyword">in</span> x[<span class="hljs-string">"review"</span>]]}, batched=<span class="hljs-literal">True</span> | |
| )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-2ie5wz">If you run this code in a notebook, you will see that this command executes much faster than the previous one. And that is not because our reviews have already been HTML-unescaped: if you re-execute the instruction from the previous section (without <code>batched=True</code>), it will take the same amount of time as before. The speedup comes from the fact that a <em>list comprehension</em> is usually faster than running the same code in a <em>for loop</em>, and we also gain some performance by accessing many elements at once instead of one by one.</p> <p data-svelte-h="svelte-b1zzqg">Using <code>Dataset.map()</code> with <code>batched=True</code> is essential to unlock the speed of the “fast” tokenizers that we'll encounter in <a href="/course/chapter6">Chapter 6</a>, which can quickly tokenize large lists of texts. For instance, to tokenize all the drug reviews with a fast tokenizer, we could use a function like this:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> transformers <span class="hljs-keyword">import</span> AutoTokenizer | |
| tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"bert-base-cased"</span>) | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">tokenize_function</span>(<span class="hljs-params">examples</span>): | |
|     <span class="hljs-keyword">return</span> tokenizer(examples[<span class="hljs-string">"review"</span>], truncation=<span class="hljs-literal">True</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1jf91ai">As you saw in <a href="/course/chapter3">Chapter 3</a>, we can pass one or several examples to the tokenizer, so we can use this function with or without <code>batched=True</code>. Let's take this opportunity to compare the performance of the different options. In a notebook, you can time a single instruction by adding <code>%time</code> in front of the line of code:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->%time tokenized_dataset = drug_dataset.<span class="hljs-built_in">map</span>(tokenize_function, batched=<span class="hljs-literal">True</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-13llebe">You can also time a whole cell by putting <code>%%time</code> at the beginning of the cell. On our hardware, this instruction took 10.8 seconds (the number shown after “Wall time”).</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-7ae342">✏️ <strong>Try it out!</strong> Execute the same instruction with and without <code>batched=True</code>, then try it with a slow tokenizer (add <code>use_fast=False</code> in <code>AutoTokenizer.from_pretrained()</code>) to see how the numbers compare on your hardware.</p></div> <p data-svelte-h="svelte-u4eisq">Here are the results we obtained:</p> <table data-svelte-h="svelte-1koka2u"><thead><tr><th>Options</th> <th>Fast tokenizer</th> <th>Slow tokenizer</th></tr></thead> <tbody><tr><td><code>batched=True</code></td> <td>10.8s</td> <td>4min 41s</td></tr> <tr><td><code>batched=False</code></td> <td>59.2s</td> <td>5min 3s</td></tr></tbody></table> <p data-svelte-h="svelte-jknv91">This means that using a fast tokenizer with <code>batched=True</code> is roughly 30 times faster than the slow tokenizer without batching (10.8s versus 5min 3s, about 28x). That is remarkable! This is the main reason why fast tokenizers are the default when using <code>AutoTokenizer</code>. They achieve this speed because, behind the scenes, the tokenization code runs in <strong>Rust</strong>, a language that makes it easy to parallelize code execution efficiently.</p> <p data-svelte-h="svelte-1061ozh">Parallelization is also the reason why the fast tokenizer is nearly 6x faster with batching (59.2s versus 10.8s, about 5.5x): you can't parallelize a single tokenization operation, but you can split thousands of texts across several threads or processes.</p> <p data-svelte-h="svelte-imjpwu"><code>Dataset.map()</code> also has some parallelization capabilities of its own. Since they are not backed by Rust, they won't let a slow tokenizer catch up with a fast one, but they can still help (especially if you are using a tokenizer that doesn't have a fast version). To enable multiprocessing, use the <code>num_proc</code> argument and specify the number of processes:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->slow_tokenizer = AutoTokenizer.from_pretrained(<span class="hljs-string">"bert-base-cased"</span>, use_fast=<span class="hljs-literal">False</span>) | |
| <span class="hljs-keyword">def</span> <span class="hljs-title function_">slow_tokenize_function</span>(<span class="hljs-params">examples</span>): | |
|     <span class="hljs-keyword">return</span> slow_tokenizer(examples[<span class="hljs-string">"review"</span>], truncation=<span class="hljs-literal">True</span>) | |
| tokenized_dataset = drug_dataset.<span class="hljs-built_in">map</span>(slow_tokenize_function, batched=<span class="hljs-literal">True</span>, num_proc=<span class="hljs-number">8</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1vf4e3g">You can experiment a little with timing to determine the optimal number of processes; in our case, 8 seemed to give the best speed gain. Here are the numbers we got with and without multiprocessing:</p> <table data-svelte-h="svelte-2qv38o"><thead><tr><th align="center">Options</th> <th align="center">Fast tokenizer</th> <th align="center">Slow tokenizer</th></tr></thead> <tbody><tr><td align="center"><code>batched=True</code></td> <td align="center">10.8s</td> <td align="center">4min 41s</td></tr> <tr><td align="center"><code>batched=False</code></td> <td align="center">59.2s</td> <td align="center">5min 3s</td></tr> <tr><td align="center"><code>batched=True</code>, <code>num_proc=8</code></td> <td align="center">6.52s</td> <td align="center">41.3s</td></tr> <tr><td align="center"><code>batched=False</code>, <code>num_proc=8</code></td> <td align="center">9.49s</td> <td align="center">45.2s</td></tr></tbody></table> <p data-svelte-h="svelte-1sgr3us">Those are much more reasonable results for the slow tokenizer, and the performance of the fast tokenizer also improved substantially. Note, however, that this won't always be the case: for values of <code>num_proc</code> other than 8, our tests showed that it was faster to use <code>batched=True</code> without that option. In general, we don't recommend using Python multiprocessing for fast tokenizers with <code>batched=True</code>.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1ypozum">Using <code>num_proc</code> to speed up your processing is usually a great idea, <strong>as long as the function you are using is not already doing some kind of parallel processing of its own</strong>.</p></div> <p data-svelte-h="svelte-wo2b12">Having all of this functionality condensed into a single method is already impressive, but there's more! With <code>Dataset.map()</code> and <code>batched=True</code> you can change the number of elements in your dataset. This is very useful in many situations where you want to create several training features from one example, and we will need to do this as part of the preprocessing for several of the NLP tasks we'll undertake in <a href="/course/chapter7">Chapter 7</a>.</p> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-rux6kr">💡 In machine learning, an <em>example</em> is usually defined as the set of <em>features</em> that we feed to the model. In some contexts, these features are the columns in a <code>Dataset</code>, but in others (such as question answering), a single example can yield several features.</p></div> <p data-svelte-h="svelte-14pjwdc">Let's see how it works! We will tokenize our examples and truncate them to a maximum length of 128 tokens, but we will ask the tokenizer to return <em>all</em> the chunks of the texts instead of just the first one. This can be done with <code>return_overflowing_tokens=True</code>:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">tokenize_and_split</span>(<span class="hljs-params">examples</span>): | |
|     <span class="hljs-keyword">return</span> tokenizer( | |
|         examples[<span class="hljs-string">"review"</span>], | |
|         truncation=<span class="hljs-literal">True</span>, | |
|         max_length=<span class="hljs-number">128</span>, | |
|         return_overflowing_tokens=<span class="hljs-literal">True</span>, | |
|     )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-9bi96x">Let's test this on one example before using <code>Dataset.map()</code> on the whole dataset:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->result = tokenize_and_split(drug_dataset[<span class="hljs-string">"train"</span>][<span class="hljs-number">0</span>]) | |
| [<span class="hljs-built_in">len</span>(inp) <span class="hljs-keyword">for</span> inp <span class="hljs-keyword">in</span> result[<span class="hljs-string">"input_ids"</span>]]<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->[<span class="hljs-number">128</span>, <span class="hljs-number">49</span>]<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-eoigwv">So, the first example in our training set became two features because it was tokenized to more than the maximum number of tokens we specified: the first of length 128 and the second of length 49. 
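</p> <p>To make the splitting concrete, the following is a minimal sketch of the chunking idea in plain Python. It is a deliberate simplification: the hypothetical <code>split_into_chunks</code> helper ignores the special tokens and any overlap between chunks that a real tokenizer may handle, and the 177 fake token IDs are made up so that the chunk lengths match the output above:</p>
<div class="code-block relative "><pre class="">
```python
def split_into_chunks(token_ids, max_length=128):
    # Hypothetical helper: cut one long sequence into consecutive pieces of
    # at most max_length tokens, keeping the overflow instead of dropping it.
    return [token_ids[start : start + max_length] for start in range(0, len(token_ids), max_length)]


fake_token_ids = list(range(177))  # made-up IDs standing in for one tokenized review
chunks = split_into_chunks(fake_token_ids)
print([len(chunk) for chunk in chunks])  # [128, 49]
```
</pre></div>
<p>With plain truncation only the first chunk would be kept; returning the overflow is what turns one example into several features.</p> <p>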
Now let's do this for all the elements of the dataset!</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->tokenized_dataset = drug_dataset.<span class="hljs-built_in">map</span>(tokenize_and_split, batched=<span class="hljs-literal">True</span>)<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->ArrowInvalid: Column <span class="hljs-number">1</span> named condition expected length <span class="hljs-number">1463</span> but got length <span class="hljs-number">1000</span><!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-vi1ehl">Oh no! That didn't work! Why not? The error message gives us a clue: there is a mismatch in the lengths of one of the columns, one being of length 1,463 and the other of length 1,000. If you've looked at the <a href="https://huggingface.co/docs/datasets/package_reference/main_classes#datasets.Dataset.map" rel="nofollow">documentation</a> for <code>Dataset.map()</code>, you may recall that 1,000 is the number of samples passed to the function we are mapping (the default batch size); here those 1,000 examples produced 1,463 new features, resulting in a shape error.</p> <p data-svelte-h="svelte-161ndjy">The problem is that we are trying to mix two datasets of different sizes: the <code>drug_dataset</code> columns have a certain number of examples (the 1,000 in the error message), but the <code>tokenized_dataset</code> we are building has more (the 1,463 in the error message; it is more than 1,000 because we are tokenizing long reviews into more than one example by using <code>return_overflowing_tokens=True</code>). That doesn't work for a <code>Dataset</code>, so we need to either remove the columns from the old dataset or make them the same size as in the new dataset. We can do the former with the <code>remove_columns</code> argument:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->tokenized_dataset = drug_dataset.<span class="hljs-built_in">map</span>( | |
|     tokenize_and_split, batched=<span class="hljs-literal">True</span>, remove_columns=drug_dataset[<span class="hljs-string">"train"</span>].column_names | |
| )<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1mar61l">Now this works. We can check that the new dataset has more elements than the original one by comparing their lengths:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-built_in">len</span>(tokenized_dataset[<span class="hljs-string">"train"</span>]), <span class="hljs-built_in">len</span>(drug_dataset[<span class="hljs-string">"train"</span>])<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->(<span class="hljs-number">206772</span>, <span class="hljs-number">138514</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1e9x2cg">We mentioned that we can also deal with the mismatched lengths by making the old columns the same size as the new ones. To do this, we need the <code>overflow_to_sample_mapping</code> field that the tokenizer returns when we set <code>return_overflowing_tokens=True</code>. It gives us a mapping from each new feature index to the index of the sample it originated from. 
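</p> <p>As a small, self-contained sketch of that mapping (with made-up column values, and a hand-written <code>sample_map</code> standing in for what the tokenizer would return), suppose a batch holds two reviews and the first one was split into two chunks:</p>
<div class="code-block relative "><pre class="">
```python
# Made-up batch with two reviews; assume the first was split into two chunks.
examples = {"drugName": ["Drug A", "Drug B"], "rating": [9.0, 4.0]}

# Hand-written stand-in for overflow_to_sample_mapping: position i holds the
# index of the original review that chunk i came from.
sample_map = [0, 0, 1]

# Repeat each original value once per chunk generated from its review.
expanded = {key: [values[i] for i in sample_map] for key, values in examples.items()}
print(expanded)
# {'drugName': ['Drug A', 'Drug A', 'Drug B'], 'rating': [9.0, 9.0, 4.0]}
```
</pre></div>
<p>Every expanded column now has three entries, one per chunk, so it lines up with the new tokenized features.</p> <p>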
Using this, we can associate each key present in the original dataset with a list of values of the right size, by repeating the values of each example as many times as it generates new features:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">def</span> <span class="hljs-title function_">tokenize_and_split</span>(<span class="hljs-params">examples</span>): | |
| result = tokenizer( | |
| examples[<span class="hljs-string">"review"</span>], | |
| truncation=<span class="hljs-literal">True</span>, | |
| max_length=<span class="hljs-number">128</span>, | |
| return_overflowing_tokens=<span class="hljs-literal">True</span>, | |
| ) | |
| <span class="hljs-comment"># Ambil pemetaan antara indeks baru dan lama</span> | |
| sample_map = result.pop(<span class="hljs-string">"overflow_to_sample_mapping"</span>) | |
| <span class="hljs-keyword">for</span> key, values <span class="hljs-keyword">in</span> examples.items(): | |
| result[key] = [values[i] <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> sample_map] | |
| <span class="hljs-keyword">return</span> result<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-2o21ti">Kita bisa melihat bahwa ini bekerja dengan <code>Dataset.map()</code> tanpa perlu menghapus kolom-kolom lama:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->tokenized_dataset = drug_dataset.<span class="hljs-built_in">map</span>(tokenize_and_split, batched=<span class="hljs-literal">True</span>) | |
| tokenized_dataset<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
|     train: Dataset({ | |
|         features: [<span class="hljs-string">'attention_mask'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'input_ids'</span>, <span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'review_length'</span>, <span class="hljs-string">'token_type_ids'</span>, <span class="hljs-string">'usefulCount'</span>], | |
|         num_rows: <span class="hljs-number">206772</span> | |
|     }) | |
|     test: Dataset({ | |
|         features: [<span class="hljs-string">'attention_mask'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'input_ids'</span>, <span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'review_length'</span>, <span class="hljs-string">'token_type_ids'</span>, <span class="hljs-string">'usefulCount'</span>], | |
|         num_rows: <span class="hljs-number">68876</span> | |
|     }) | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1429qjf">We get the same number of training features as before, but this time we have kept all the old columns. If you need them for post-processing after applying your model, this approach can be very useful.</p> <p data-svelte-h="svelte-1rve7p7">You have now seen how 🤗 Datasets can be used to preprocess a dataset in various ways. Although the processing functions of 🤗 Datasets cover most model training needs, there may be times when you need to switch to Pandas to use more powerful features such as <code>DataFrame.groupby()</code> or high-level APIs for visualization. Fortunately, 🤗 Datasets is designed to be interoperable with libraries such as Pandas, NumPy, PyTorch, TensorFlow, and JAX. Let's take a look at how this works.</p> <h2 class="relative group"><a id="from-datasets-to-dataframes-and-back" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#from-datasets-to-dataframes-and-back"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>From Datasets to DataFrames and back</span></h2>
<iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/tfcY1067A5Q" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <p data-svelte-h="svelte-6gl4v5">To enable conversion between various third-party libraries, 🤗 Datasets provides a <code>Dataset.set_format()</code> function. This function only changes the <strong>output format</strong> of the dataset, so you can switch to another format without affecting the <strong>underlying data format</strong>, which is Apache Arrow. The formatting is done in place. To demonstrate, let's convert our dataset to the Pandas format:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset.set_format(<span 
class="hljs-string">"pandas"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1dhujr7">Now when we access elements of the dataset, we get a <code>pandas.DataFrame</code> instead of a dictionary:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset[<span class="hljs-string">"train"</span>][:<span class="hljs-number">3</span>]<!-- HTML_TAG_END --></pre></div> <table border="1" class="dataframe" data-svelte-h="svelte-fhhlil"><thead><tr style="text-align: right;"><th></th> <th>patient_id</th> <th>drugName</th> <th>condition</th> <th>review</th> <th>rating</th> <th>date</th> <th>usefulCount</th> <th>review_length</th></tr></thead> <tbody><tr><th>0</th> <td>95260</td> <td>Guanfacine</td> <td>adhd</td> <td>"My son is halfway through his fourth week of Intuniv..."</td> <td>8.0</td> <td>April 27, 2010</td> <td>192</td> <td>141</td></tr> <tr><th>1</th> <td>92703</td> <td>Lybrel</td> <td>birth control</td> <td>"I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects..."</td> <td>5.0</td> <td>December 14, 2009</td> <td>17</td> <td>134</td></tr> <tr><th>2</th> <td>138000</td> <td>Ortho Evra</td> <td>birth control</td> <td>"This is my first time using any form of birth control..."</td> <td>8.0</td> <td>November 3, 2015</td> <td>10</td> <td>89</td></tr></tbody></table> <p data-svelte-h="svelte-y757pt">Let's create a <code>pandas.DataFrame</code> for the whole training set by selecting all the elements of <code>drug_dataset["train"]</code>:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->train_df = drug_dataset[<span 
class="hljs-string">"train"</span>][:]<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-14boj1z">🚨 Under the hood, <code>Dataset.set_format()</code> changes the return format of the dataset's <code>__getitem__()</code> method. This means that when we want to create a new object like <code>train_df</code> from a <code>Dataset</code> in the <code>"pandas"</code> format, we need to slice the whole dataset to obtain a <code>pandas.DataFrame</code>. You can verify that <code>drug_dataset["train"]</code> is still of type <code>Dataset</code>, irrespective of the output format.</p></div> <p data-svelte-h="svelte-16wthm2">From here we can use all the Pandas functionality we want. For example, we can use method chaining to compute the class distribution of the <code>condition</code> column:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->frequencies = ( | |
|     train_df[<span class="hljs-string">"condition"</span>] | |
|     .value_counts() | |
|     .to_frame() | |
|     .reset_index() | |
|     .rename(columns={<span class="hljs-string">"index"</span>: <span class="hljs-string">"condition"</span>, <span class="hljs-string">"count"</span>: <span class="hljs-string">"frequency"</span>}) | |
| ) | |
| frequencies.head()<!-- HTML_TAG_END --></pre></div> <table border="1" class="dataframe" data-svelte-h="svelte-10crns6"><thead><tr style="text-align: right;"><th></th> <th>condition</th> <th>frequency</th></tr></thead> <tbody><tr><th>0</th> <td>birth control</td> <td>27655</td></tr> <tr><th>1</th> <td>depression</td> <td>8023</td></tr> <tr><th>2</th> <td>acne</td> <td>5209</td></tr> <tr><th>3</th> <td>anxiety</td> <td>4991</td></tr> <tr><th>4</th> <td>pain</td> <td>4744</td></tr></tbody></table> <p data-svelte-h="svelte-1klc6vl">Once we are done with our Pandas analysis, we can create a new <code>Dataset</code> object from the <code>DataFrame</code> by using the <code>Dataset.from_pandas()</code> function:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span 
class="hljs-keyword">import</span> Dataset | |
| freq_dataset = Dataset.from_pandas(frequencies) | |
| freq_dataset<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->Dataset({ | |
|     features: [<span class="hljs-string">'condition'</span>, <span class="hljs-string">'frequency'</span>], | |
|     num_rows: <span class="hljs-number">819</span> | |
| })<!-- HTML_TAG_END --></pre></div> <div class="course-tip bg-gradient-to-br dark:bg-gradient-to-r before:border-green-500 dark:before:border-green-800 from-green-50 dark:from-gray-900 to-white dark:to-gray-950 border border-green-50 text-green-700 dark:text-gray-400"><p data-svelte-h="svelte-1mm036m">✏️ <strong>Try it out!</strong> Compute the average rating per drug and store the result in a new <code>Dataset</code>.</p></div> <p data-svelte-h="svelte-10dim3m">This wraps up our tour of the various preprocessing techniques available in 🤗 Datasets. To round out the section, let's create a <em>validation set</em> to prepare the dataset for training a <em>classifier</em>. Before doing so, we reset the output format of <code>drug_dataset</code> from <code>"pandas"</code> to <code>"arrow"</code>:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset.reset_format()<!-- HTML_TAG_END --></pre></div> <h2 class="relative group"><a id="creating-a-validation-set" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#creating-a-validation-set"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Creating a validation set</span></h2> <p data-svelte-h="svelte-geptc6">Although we have a <em>test set</em> we could use for evaluation, the best practice is to leave the <em>test set</em> untouched and create a separate <em>validation set</em> during development. Once you are happy with your model's performance on the <em>validation set</em>, you can do a final check on the <em>test set</em>. This process helps reduce the risk of <em>overfitting</em> to the <em>test set</em>, which can cause the model to fail on real-world data.</p> <p data-svelte-h="svelte-1mgp69w">🤗 Datasets provides a <code>Dataset.train_test_split()</code> function that is similar to the popular function from <code>scikit-learn</code>.
Let's use it to split our <em>training set</em> into <em>train</em> and <em>validation</em> splits (we set the <code>seed</code> argument so the result is reproducible):</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset_clean = drug_dataset[<span class="hljs-string">"train"</span>].train_test_split(train_size=<span class="hljs-number">0.8</span>, seed=<span class="hljs-number">42</span>) | |
| <span class="hljs-comment"># Rename the default "test" split to "validation"</span> | |
| drug_dataset_clean[<span class="hljs-string">"validation"</span>] = drug_dataset_clean.pop(<span class="hljs-string">"test"</span>) | |
| <span class="hljs-comment"># Add the "test" set to our `DatasetDict`</span> | |
| drug_dataset_clean[<span class="hljs-string">"test"</span>] = drug_dataset[<span class="hljs-string">"test"</span>] | |
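| <span class="hljs-comment"># Optional sanity check (an illustrative addition, not part of the original course code):</span> | |
| <span class="hljs-comment"># train_test_split() partitions the rows, so the new train and validation splits</span> | |
| <span class="hljs-comment"># should together contain exactly as many rows as the original training set</span> | |
| <span class="hljs-keyword">assert</span> <span class="hljs-built_in">len</span>(drug_dataset_clean[<span class="hljs-string">"train"</span>]) + <span class="hljs-built_in">len</span>(drug_dataset_clean[<span class="hljs-string">"validation"</span>]) == <span class="hljs-built_in">len</span>(drug_dataset[<span class="hljs-string">"train"</span>]) | |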
| drug_dataset_clean<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
|     train: Dataset({ | |
|         features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>, <span class="hljs-string">'review_length'</span>, <span class="hljs-string">'review_clean'</span>], | |
|         num_rows: <span class="hljs-number">110811</span> | |
|     }) | |
|     validation: Dataset({ | |
|         features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>, <span class="hljs-string">'review_length'</span>, <span class="hljs-string">'review_clean'</span>], | |
|         num_rows: <span class="hljs-number">27703</span> | |
|     }) | |
|     test: Dataset({ | |
|         features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>, <span class="hljs-string">'review_length'</span>, <span class="hljs-string">'review_clean'</span>], | |
|         num_rows: <span class="hljs-number">46108</span> | |
|     }) | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-88okrf">Great, we have now prepared a dataset that is ready for training a model! In <a href="/course/chapter5/5">section 5</a> we will see how to upload datasets to the Hugging Face Hub, but for now let's wrap up this analysis by looking at a few ways to save datasets to local storage.</p> <h2 class="relative group"><a id="saving-a-dataset" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#saving-a-dataset"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Saving a dataset</span></h2> <iframe class="w-full xl:w-4/6 h-80" src="https://www.youtube-nocookie.com/embed/blF9uxYcKHo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> <p data-svelte-h="svelte-wq00sw">Although 🤗 Datasets caches every downloaded dataset and the operations performed on it, there are times when you will still want to save a dataset to disk (for example, in case the cache gets deleted).
As shown in the following table, 🤗 Datasets provides three main functions to save a dataset in different formats:</p> <table data-svelte-h="svelte-1awdy6d"><thead><tr><th align="center">Data format</th> <th align="center">Function</th></tr></thead> <tbody><tr><td align="center">Arrow</td> <td align="center"><code>Dataset.save_to_disk()</code></td></tr> <tr><td align="center">CSV</td> <td align="center"><code>Dataset.to_csv()</code></td></tr> <tr><td align="center">JSON</td> <td align="center"><code>Dataset.to_json()</code></td></tr></tbody></table> <p data-svelte-h="svelte-1mji2hp">For example, let's save our cleaned dataset in the Arrow format:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug_dataset_clean.save_to_disk(<span class="hljs-string">"drug-reviews"</span>)<!-- HTML_TAG_END --></pre></div>
<p data-svelte-h="svelte-1mxp5bu">This will create a directory with the following structure:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START -->drug-reviews/ | |
| ├── dataset_dict.json | |
| ├── test | |
| │   ├── dataset.arrow | |
| │   ├── dataset_info.json | |
| │   └── <span class="hljs-keyword">state</span>.json | |
| ├── train | |
| │   ├── dataset.arrow | |
| │   ├── dataset_info.json | |
| │   ├── indices.arrow | |
| │   └── <span class="hljs-keyword">state</span>.json | |
| └── validation | |
|     ├── dataset.arrow | |
|     ├── dataset_info.json | |
|     ├── indices.arrow | |
|     └── <span class="hljs-keyword">state</span>.json<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-41v2o2">We can see that each <em>split</em> has its own <em>dataset.arrow</em> file, along with metadata in <em>dataset_info.json</em> and <em>state.json</em>. You can think of the Arrow format as a table of rows and columns that is optimized for high-performance applications that process large datasets.</p> <p data-svelte-h="svelte-pmdxe2">Once the dataset is saved, we can load it back using the <code>load_from_disk()</code> function as follows:</p> <div class="code-block relative "><pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">from</span> datasets <span class="hljs-keyword">import</span> load_from_disk | |
| drug_dataset_reloaded = load_from_disk(<span class="hljs-string">"drug-reviews"</span>) | |
| drug_dataset_reloaded<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->DatasetDict({ | |
| train: Dataset({ | |
| features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>, <span class="hljs-string">'review_length'</span>], | |
| num_rows: <span class="hljs-number">110811</span> | |
| }) | |
| validation: Dataset({ | |
| features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>, <span class="hljs-string">'review_length'</span>], | |
| num_rows: <span class="hljs-number">27703</span> | |
| }) | |
| test: Dataset({ | |
| features: [<span class="hljs-string">'patient_id'</span>, <span class="hljs-string">'drugName'</span>, <span class="hljs-string">'condition'</span>, <span class="hljs-string">'review'</span>, <span class="hljs-string">'rating'</span>, <span class="hljs-string">'date'</span>, <span class="hljs-string">'usefulCount'</span>, <span class="hljs-string">'review_length'</span>], | |
| num_rows: <span class="hljs-number">46108</span> | |
| }) | |
| })<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-fd14xw">For the CSV and JSON formats, we have to store each <em>split</em> as a separate file. One way to do this is to iterate over the split names and datasets returned by <code>DatasetDict.items()</code>:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-keyword">for</span> split, dataset <span class="hljs-keyword">in</span> drug_dataset_clean.items(): | |
|     dataset.to_json(<span class="hljs-string">f"drug-reviews-<span class="hljs-subst">{split}</span>.jsonl"</span>)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-143aaha">This saves each split in <a href="https://jsonlines.org" rel="nofollow">JSON Lines format</a>, where each row of the dataset is stored as a single line of JSON. Here is the first line of the training file:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->!head -n <span class="hljs-number">1</span> drug-reviews-train.jsonl<!-- HTML_TAG_END --></pre></div> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" 
xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->{<span class="hljs-string">"patient_id"</span>:<span class="hljs-number">141780</span>,<span class="hljs-string">"drugName"</span>:<span class="hljs-string">"Escitalopram"</span>,<span class="hljs-string">"condition"</span>:<span class="hljs-string">"depression"</span>,<span class="hljs-string">"review"</span>:<span class="hljs-string">"\"I seemed to experience the regular side effects of LEXAPRO, insomnia, low sex drive, sleepiness during the day. I am taking it at night because my doctor said if it made me tired to take it at night. I assumed it would and started out taking it at night. Strange dreams, some pleasant. I was diagnosed with fibromyalgia. Seems to be helping with the pain. Have had anxiety and depression in my family, and have tried quite a few other medications that haven't worked. Only have been on it for two weeks but feel more positive in my mind, want to accomplish more in my life. Hopefully the side effects will dwindle away, worth it to stick with it from hearing others responses. 
Great medication.\""</span>,<span class="hljs-string">"rating"</span>:<span class="hljs-number">9.0</span>,<span class="hljs-string">"date"</span>:<span class="hljs-string">"May 29, 2011"</span>,<span class="hljs-string">"usefulCount"</span>:<span class="hljs-number">10</span>,<span class="hljs-string">"review_length"</span>:<span class="hljs-number">125</span>}<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1goky84">We can load these files back using the technique from <a href="/course/chapter5/2">section 2</a>, as follows:</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START -->data_files = { | |
| <span class="hljs-string">"train"</span>: <span class="hljs-string">"drug-reviews-train.jsonl"</span>, | |
| <span class="hljs-string">"validation"</span>: <span class="hljs-string">"drug-reviews-validation.jsonl"</span>, | |
| <span class="hljs-string">"test"</span>: <span class="hljs-string">"drug-reviews-test.jsonl"</span>, | |
| } | |
| drug_dataset_reloaded = load_dataset(<span class="hljs-string">"json"</span>, data_files=data_files)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-s7tz5l">And that wraps up our exploration of <em>data wrangling</em> with 🤗 Datasets! Now that we have a clean dataset ready for training a model, here are a few ideas you could try:</p> <ol data-svelte-h="svelte-b79zas"><li>Use the techniques from <a href="/course/chapter3">Chapter 3</a> to train a <em>classifier</em> that can predict the patient's condition from the drug review.</li> <li>Use the <code>summarization</code> pipeline from <a href="/course/chapter1">Chapter 1</a> to generate summaries of the patient reviews.</li></ol> <p data-svelte-h="svelte-hguauf">Next, we'll look at how 🤗 Datasets lets you work with very large datasets without overwhelming your laptop!</p> <p></p> | |
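As a small aside on the JSON Lines export used in this section, the sketch below shows what "one JSON entry per line" means in practice, using only the Python standard library rather than 🤗 Datasets. The records, the helper names `write_jsonl` and `read_jsonl`, and the file name `example-train.jsonl` are hypothetical and chosen purely for illustration; they are not part of the drug review dataset or of the library's API.

```python
import json
import os
import tempfile

# Hypothetical records that only mimic a few columns of the drug review dataset.
records = [
    {"patient_id": 1, "drugName": "DrugA", "rating": 9.0},
    {"patient_id": 2, "drugName": "DrugB", "rating": 4.0},
]


def write_jsonl(path, rows):
    """Write each row as one JSON object on its own line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path):
    """Parse every non-empty line as an independent JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip the records through a temporary JSON Lines file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-train.jsonl")
    write_jsonl(path, records)
    reloaded = read_jsonl(path)
```

Because every line is a complete JSON document, a file in this format can be inspected with tools such as `head` or read one record at a time without parsing the whole file.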
| <script> | |
| { | |
| __sveltekit_ojy514 = { | |
| assets: "/docs/course/pr_1054/id", | |
| base: "/docs/course/pr_1054/id", | |
| env: {} | |
| }; | |
| const element = document.currentScript.parentElement; | |
| const data = [null,null]; | |
| Promise.all([ | |
| import("/docs/course/pr_1054/id/_app/immutable/entry/start.4f92af03.js"), | |
| import("/docs/course/pr_1054/id/_app/immutable/entry/app.19cef1b6.js") | |
| ]).then(([kit, app]) => { | |
| kit.start(app, element, { | |
| node_ids: [0, 38], | |
| data, | |
| form: null, | |
| error: null | |
| }); | |
| }); | |
| } | |
| </script> | |