# Text8 Dataset

This repository contains the Text8 dataset, a large text corpus commonly used for training word embeddings and language models.

## Dataset Information

- **Source**: http://mattmahoney.net/dc/text8.zip
- **License**: Public domain
- **Format**: Plain-text corpus
- **Size**: ~100 MB of text

## Files

- `text8_full.txt`: Complete text8 corpus
- `text8_sentences.json`: Text8 split into sentences for easier processing
- `dataset_info.json`: Dataset metadata

## Usage

You can load this dataset in your training scripts using:

```python
import json

from huggingface_hub import hf_hub_download

# Download the sentence file (cached locally after the first call).
# A token is only required if the repository is private or gated;
# replace the placeholder with your own and never commit a real token.
# If the repository is hosted as a dataset repo, you may also need
# to pass repo_type="dataset".
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    token="your_token",
)

# Load the JSON file and read the list stored under the "sentences" key
with open(sentences_path, "r", encoding="utf-8") as f:
    data = json.load(f)

sentences = data["sentences"]

# Use sentences for training
```

## Citation

If you use this dataset, please cite the original source listed under Dataset Information.
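
## Example: Building a Vocabulary

Once the sentences are loaded as described under Usage, a common first step for word-embedding training is to map words to integer ids. The following is a minimal sketch, not part of this repository: it assumes each entry in `sentences` is a string of whitespace-separated words, and the helper names `build_vocab` and `encode` are hypothetical. A short in-line sample stands in for the downloaded data so the snippet runs on its own.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map every word seen at least `min_count` times to an integer id."""
    counts = Counter(word for s in sentences for word in s.split())
    # Most frequent words get the smallest ids; ties are broken
    # alphabetically so the mapping is deterministic.
    kept = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    return {w: i for i, w in enumerate(kept)}


def encode(sentence, vocab):
    """Convert a sentence to a list of ids, skipping unknown words."""
    return [vocab[w] for w in sentence.split() if w in vocab]


# Stand-in for the `sentences` list loaded from text8_sentences.json
# (assumed format: one whitespace-separated string per sentence).
sample = ["anarchism originated as a term of abuse", "a term of art"]

vocab = build_vocab(sample, min_count=2)
ids = encode(sample[1], vocab)
print(vocab)
print(ids)
```

On the real corpus you would pass the loaded `sentences` list instead of `sample`, and likely raise `min_count` to drop rare words.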