# Load tabular data
A tabular dataset is a generic dataset used to describe any data stored in rows and columns, where each row represents an example and each column represents a feature (which can be continuous or categorical). These datasets are commonly stored in CSV files, Pandas DataFrames, and database tables. This guide will show you how to load and create a tabular dataset from:
- CSV files
- Pandas DataFrames
- HDF5 files
- Databases

## CSV files

🤗 Datasets can read CSV files by specifying the generic `csv` dataset builder name in the [load_dataset()](/docs/datasets/pr_8021/en/package_reference/loading_methods#datasets.load_dataset) method. To load more than one CSV file, pass them as a list to the `data_files` parameter:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")

# load multiple CSV files
>>> dataset = load_dataset("csv", data_files=["my_file_1.csv", "my_file_2.csv", "my_file_3.csv"])
```

You can also map specific CSV files to the train and test splits:

```py
>>> dataset = load_dataset("csv", data_files={"train": ["my_train_file_1.csv", "my_train_file_2.csv"], "test": "my_test_file.csv"})
```

To load remote CSV files, pass the URLs instead:

```py
>>> base_url = "https://huggingface.co/datasets/lhoestq/demo1/resolve/main/data/"
>>> dataset = load_dataset("csv", data_files={"train": base_url + "train.csv", "test": base_url + "test.csv"})
```

To load zipped CSV files:

```py
>>> url = "https://domain.org/train_data.zip"
>>> data_files = {"train": url}
>>> dataset = load_dataset("csv", data_files=data_files)
```

## Pandas DataFrames

🤗 Datasets also supports loading datasets from [Pandas DataFrames](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) with the [from_pandas()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.from_pandas) method:
```py
>>> from datasets import Dataset
>>> import pandas as pd

# create a Pandas DataFrame (read_csv already returns one)
>>> df = pd.read_csv("https://huggingface.co/datasets/imodels/credit-card/raw/main/train.csv")

# load a Dataset from the Pandas DataFrame
>>> dataset = Dataset.from_pandas(df)
```
Use the `split` parameter to specify the name of the dataset split:
```py
>>> train_ds = Dataset.from_pandas(train_df, split="train")
>>> test_ds = Dataset.from_pandas(test_df, split="test")
```

If the dataset doesn't look as expected, you should explicitly [specify your dataset features](loading#specify-features). A [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length `0` or if the Series only contains `None/NaN` objects, the type is set to `null`.
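For example, here is a minimal sketch, using a toy DataFrame made up for illustration, of passing a `features` argument to `from_pandas()` to declare the column types yourself instead of letting Arrow infer them:

```py
>>> from datasets import Dataset, Features, Value
>>> import pandas as pd

# a column containing only None would otherwise be inferred as the `null` type
>>> df = pd.DataFrame({"text": ["a", "b"], "label": [None, None]})

# declare the intended type of each column explicitly
>>> features = Features({"text": Value("string"), "label": Value("int64")})
>>> dataset = Dataset.from_pandas(df, features=features)
```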
## HDF5 files

[HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("hdf5", data_files="data.h5")
```
Note that the HDF5 loader assumes the file has a "tabular" structure, i.e. that every dataset in the file has the same number of rows along its first dimension.
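To illustrate what such a file looks like, here is a minimal sketch that creates one with [h5py](https://www.h5py.org/) (a separate library, not part of 🤗 Datasets); the dataset names and shapes are made up for the example:

```py
>>> import h5py
>>> import numpy as np

# both datasets share the same first dimension (100 rows), as the loader expects
>>> with h5py.File("data.h5", "w") as f:
...     f.create_dataset("features", data=np.random.rand(100, 8))
...     f.create_dataset("labels", data=np.random.randint(0, 2, size=100))
```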
## Databases

Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.

### SQLite

SQLite is a small, lightweight database that is fast and easy to set up. You can use an existing database if you'd like, or follow along and start from scratch.

Start by creating a quick SQLite database with this [Covid-19 data](https://github.com/nytimes/covid-19-data/blob/master/us-states.csv) from the New York Times:

```py
>>> import sqlite3
>>> import pandas as pd

>>> conn = sqlite3.connect("us_covid_data.db")
>>> df = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
>>> df.to_sql("states", conn, if_exists="replace")
```

This creates a `states` table in the `us_covid_data.db` database which you can now load into a dataset.

To connect to the database, you'll need the [URI string](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) that identifies your database. Connecting to a database with a URI caches the returned dataset. The URI string differs for each database dialect, so be sure to check the [Database URLs](https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls) for whichever database you're using.

For SQLite, it is:

```py
>>> uri = "sqlite:///us_covid_data.db"
```

Load the table by passing the table name and URI to [from_sql()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.from_sql):

```py
>>> from datasets import Dataset
>>> ds = Dataset.from_sql("states", uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 54382
})
```
You can then use all of 🤗 Datasets' processing features, such as [filter()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.filter):
```py
>>> ds.filter(lambda x: x["state"] == "California")
```

You can also load a dataset from a SQL query instead of an entire table, which is useful for querying and joining multiple tables.

Load the dataset by passing your query and URI to [from_sql()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.from_sql):
```py
>>> from datasets import Dataset
>>> ds = Dataset.from_sql("SELECT * FROM states WHERE state='California';", uri)
>>> ds
Dataset({
    features: ['index', 'date', 'state', 'fips', 'cases', 'deaths'],
    num_rows: 1019
})
```
As before, you can apply any of 🤗 Datasets' processing features, such as [filter()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.filter):
```py
>>> ds.filter(lambda x: x["cases"] > 10000)
```

### PostgreSQL
You can also connect and load a dataset from a PostgreSQL database, but we won't demonstrate it directly in the documentation because the example is only meant to be run in a notebook. Instead, take a look at how to install and set up a PostgreSQL server in this [notebook](https://colab.research.google.com/github/nateraw/huggingface-hub-examples/blob/main/sql_with_huggingface_datasets.ipynb#scrollTo=d83yGQMPHGFi)!

After you've set up your PostgreSQL database, you can use the [from_sql()](/docs/datasets/pr_8021/en/package_reference/main_classes#datasets.Dataset.from_sql) method to load a dataset from a table or query.
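The call looks the same as in the SQLite example above; only the URI dialect changes. Here is a minimal sketch assuming a hypothetical local server, database name, and credentials (replace them with your own):

```py
>>> from datasets import Dataset

# hypothetical PostgreSQL connection details
>>> uri = "postgresql://username:password@localhost:5432/mydb"
>>> ds = Dataset.from_sql("states", uri)
```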