Text8 Dataset

This repository contains the Text8 dataset, a corpus of roughly 100 MB of cleaned English Wikipedia text that is commonly used for training word embeddings and language models.

Dataset Information

Files

  • text8_full.txt: Complete text8 corpus
  • text8_sentences.json: Text8 split into sentences for easier processing
  • dataset_info.json: Dataset metadata
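
If you prefer to work from text8_full.txt directly, the corpus can be cut into fixed-size word windows. The sketch below assumes the file follows the usual text8 layout (lowercase words separated by single spaces, with no punctuation or line breaks); the helper name chunk_words and its chunk_size parameter are hypothetical and not part of this repository, and the sample string only stands in for the real file contents.

```python
from typing import Iterator, List


def chunk_words(text: str, chunk_size: int = 1000) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    for start in range(0, len(words), chunk_size):
        yield words[start:start + chunk_size]


# Tiny stand-in for the contents of text8_full.txt (assumption: the real
# file is one long run of space-separated lowercase words).
sample = "anarchism originated as a term of abuse first used against early"
chunks = list(chunk_words(sample, chunk_size=4))

for chunk in chunks:
    print(" ".join(chunk))
```

For the real corpus you would read the file into a string first and pick a chunk_size that suits your model's context window.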

Usage

You can load this dataset in your training scripts using:

from huggingface_hub import hf_hub_download
import json

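# Download the sentence-split file (cached locally after the first call)
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    token="your_token",  # only needed if the repository is private or gated
)

# Load the JSON file and extract the list of sentences
with open(sentences_path, "r", encoding="utf-8") as f:
    data = json.load(f)

sentences = data["sentences"]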

# Use sentences for training
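
As a possible first training step, the loaded sentences can be turned into a word-to-index vocabulary. This is a minimal sketch that assumes each entry in sentences is a plain string of space-separated words (the README does not state the exact schema); build_vocab and min_count are hypothetical names introduced here for illustration, and the sample sentences stand in for the real data.

```python
from collections import Counter
from typing import Dict, Iterable


def build_vocab(sentences: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Map each word seen at least `min_count` times to an integer id.

    Ids are assigned in order of decreasing frequency.
    """
    counts = Counter(word for s in sentences for word in s.split())
    # most_common() sorts by count, highest first; ties keep first-seen order
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index for index, word in enumerate(kept)}


# Stand-in for the `sentences` list loaded above (assumption: list of strings)
sample_sentences = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocab(sample_sentences, min_count=2)
print(vocab)
```

Raising min_count drops rare words, which is a common way to keep the embedding table small on a corpus of this size.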

Citation

If you use this dataset, please cite the original source of the Text8 corpus (Matt Mahoney's text8 file, derived from English Wikipedia).