Buckets:
| # Introduction[[introduction]] | |
| <CourseFloatingBanner | |
| chapter={5} | |
| classNames="absolute z-10 right-0 top-0" | |
| /> | |
| In [Chapter 3](/course/chapter3) you got your first taste of the 🤗 Datasets library and saw that there were three main steps when it came to fine-tuning a model: | |
| 1. Load a dataset from the Hugging Face Hub. | |
| 2. Preprocess the data with `Dataset.map()`. | |
| 3. Load and compute metrics. | |
| But this is just scratching the surface of what 🤗 Datasets can do! In this chapter, we will take a deep dive into the library. Along the way, we'll find answers to the following questions: | |
| * What do you do when your dataset is not on the Hub? | |
| * How can you slice and dice a dataset? (And what if you _really_ need to use Pandas?) | |
| * What do you do when your dataset is huge and will melt your laptop's RAM? | |
| * What the heck are "memory mapping" and Apache Arrow? | |
| * How can you create your own dataset and push it to the Hub? | |
| The techniques you learn here will prepare you for the advanced tokenization and fine-tuning tasks in [Chapter 6](/course/chapter6) and [Chapter 7](/course/chapter7) -- so grab a coffee and let's get started! | |
| <EditOnGithub source="https://github.com/huggingface/course/blob/main/chapters/en/chapter5/1.mdx" /> |
Xet Storage Details
- Size:
- 1.24 kB
- Xet hash:
- 1a22dd651965a63fc76a4c3f4b3f254f22d3caf908639a72146c1291830f7640
·
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.