Buckets:

rtrm's picture
|
download
raw
1.24 kB

Introduction[[introduction]]

In Chapter 3 you got your first taste of the 🤗 Datasets library and saw that there were three main steps when it came to fine-tuning a model:

  1. Load a dataset from the Hugging Face Hub.
  2. Preprocess the data with Dataset.map().
  3. Load and compute metrics.

But this is just scratching the surface of what 🤗 Datasets can do! In this chapter, we will take a deep dive into the library. Along the way, we'll find answers to the following questions:

  • What do you do when your dataset is not on the Hub?
  • How can you slice and dice a dataset? (And what if you really need to use Pandas?)
  • What do you do when your dataset is huge and will melt your laptop's RAM?
  • What the heck are "memory mapping" and Apache Arrow?
  • How can you create your own dataset and push it to the Hub?

The techniques you learn here will prepare you for the advanced tokenization and fine-tuning tasks in Chapter 6 and Chapter 7 -- so grab a coffee and let's get started!

Xet Storage Details

Size:
1.24 kB
·
Xet hash:
1a22dd651965a63fc76a4c3f4b3f254f22d3caf908639a72146c1291830f7640

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.