# Load text data

This guide shows you how to load text datasets. To learn how to load any type of dataset, take a look at the general loading guide.

Text files are one of the most common file types for storing a dataset. By default, 🤗 Datasets samples a text file line by line to build the dataset.

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("text", data_files={"train": ["my_text_1.txt", "my_text_2.txt"], "test": "my_test_file.txt"})

# Load from a directory
>>> dataset = load_dataset("text", data_dir="path/to/text/dataset")
```

To sample a text file by paragraph or even an entire document, use the `sample_by` parameter:

```py
# Sample by paragraph
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="paragraph")

# Sample by document
>>> dataset = load_dataset("text", data_files={"train": "my_train_file.txt", "test": "my_test_file.txt"}, sample_by="document")
```
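
As a rough mental model of the three sampling modes, the plain-Python sketch below splits one string the way each mode is described above. It is an illustration only, not the library's implementation: the helper name `sample_text` is hypothetical, and it assumes that a blank line separates paragraphs and that each sample is stored under a `text` key.

```python
def sample_text(content, sample_by="line"):
    """Approximate how one text file is split into samples (illustration only)."""
    if sample_by == "line":
        # One sample per line; blank lines become empty samples in this sketch.
        pieces = content.splitlines()
    elif sample_by == "paragraph":
        # Assumption: paragraphs are separated by a blank line.
        pieces = [p for p in content.split("\n\n") if p]
    elif sample_by == "document":
        # The whole file is a single sample.
        pieces = [content]
    else:
        raise ValueError(f"unsupported sample_by value: {sample_by!r}")
    return [{"text": piece} for piece in pieces]


content = "First line.\nSecond line.\n\nNew paragraph."
rows_by_line = sample_text(content)
rows_by_paragraph = sample_text(content, "paragraph")
rows_by_document = sample_text(content, "document")

print(len(rows_by_line), len(rows_by_paragraph), len(rows_by_document))
```

The coarser the sampling unit, the fewer and longer the samples, which matters when later processing such as tokenization expects text of a certain length.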

You can also use glob patterns to load specific files:

```py
>>> from datasets import load_dataset
>>> c4_subset = load_dataset("allenai/c4", data_files="en/c4-train.0000*-of-01024.json.gz")
```
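
To see which files a wildcard like the one above selects, it can help to test the pattern against candidate names first. The sketch below uses the standard library's `fnmatch` module as an approximation of shell-style wildcard matching. The library resolves patterns on its own, so treat this only as a quick sanity check. The shard names are hypothetical and simply follow the naming scheme visible in the pattern.

```python
from fnmatch import fnmatchcase

pattern = "en/c4-train.0000*-of-01024.json.gz"

# Hypothetical shard names that follow the naming scheme in the pattern.
shards = [f"en/c4-train.{i:05d}-of-01024.json.gz" for i in range(12)]

# Keep only the names the wildcard pattern would select.
matched = [name for name in shards if fnmatchcase(name, pattern)]

print(len(matched))  # 10: shards 00000 through 00009
print(matched[-1])   # en/c4-train.00009-of-01024.json.gz
```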

To load remote text files via HTTP, pass the URLs instead:

```py
>>> dataset = load_dataset("text", data_files="https://huggingface.co/datasets/hf-internal-testing/dataset_with_data_files/resolve/main/data/train.txt")
```
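
When several remote files are involved, building the URLs programmatically keeps the call readable. The helper below is a minimal sketch: `hub_file_url` is a hypothetical name, and the URL layout is inferred from the single example URL above rather than from a documented specification.

```python
def hub_file_url(repo_id, path_in_repo, revision="main"):
    """Build a file URL following the layout of the example URL above (assumed)."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{path_in_repo}"


train_url = hub_file_url("hf-internal-testing/dataset_with_data_files", "data/train.txt")

# Map split names to URLs, mirroring the data_files dictionaries used for local files.
data_files = {"train": train_url}

print(data_files["train"])
```

The resulting dictionary can then be passed as `load_dataset("text", data_files=data_files)`.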

To load XML data, use the `"xml"` loader, which is equivalent to `"text"` with `sample_by="document"`, so each XML file is loaded as a single sample:

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("xml", data_files={"train": ["my_xml_1.xml", "my_xml_2.xml"], "test": "my_xml_file.xml"})

# Load from a directory
>>> dataset = load_dataset("xml", data_dir="path/to/xml/dataset")
```
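
Because each sample holds a whole document as a raw string, a common next step is to parse that string yourself. The sketch below uses the standard library's `xml.etree.ElementTree` on a hypothetical document. The element names are made up for illustration. The name of the column holding the string is not stated above, so inspect `dataset["train"].column_names` before indexing into a row.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, standing in for the string stored in one sample.
document = (
    "<articles>"
    "<article id='1'><title>Hello</title></article>"
    "<article id='2'><title>World</title></article>"
    "</articles>"
)

# Parse the raw string and collect the text of every <title> element.
root = ET.fromstring(document)
titles = [article.findtext("title") for article in root.iter("article")]

print(titles)  # ['Hello', 'World']
```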