Text8 Dataset

This repository contains the Text8 dataset, a large text corpus commonly used for training word embeddings and language models.

Dataset Information

Source: http://mattmahoney.net/dc/text8.zip
License: Public domain
Format: Text corpus
Size: ~100 MB (the original text8 file is the first 100,000,000 characters of a cleaned English Wikipedia dump)

Files

text8_full.txt: Complete text8 corpus
text8_sentences.json: Text8 split into sentences for easier processing
dataset_info.json: Dataset metadata
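
Because text8_full.txt is roughly 100 MB, it can be convenient to stream it rather than read it in one call. The sketch below is one possible approach, not part of this repository: `iter_words` is a hypothetical helper, and it assumes the file is plain UTF-8 text with words separated by whitespace, as in the original text8 release.

```python
from typing import Iterator


def iter_words(path: str, chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield whitespace-separated words from a text file, one chunk at a time.

    Hypothetical helper: assumes plain UTF-8 text with whitespace-delimited words.
    """
    leftover = ""  # partial word carried over from the previous chunk
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # If the chunk ends mid-word, hold the last piece back
            # so it can be joined with the start of the next chunk.
            if words and not chunk[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            yield from words
    if leftover:
        yield leftover


# Example usage (downloads the file, so it is left commented out):
# from huggingface_hub import hf_hub_download
# path = hf_hub_download(repo_id="roshbeed/text8-dataset", filename="text8_full.txt")
# for word in iter_words(path):
#     ...
```

Reading in fixed-size chunks keeps memory use bounded regardless of whether the corpus is stored as a single long line.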

Usage

You can load this dataset in your training scripts using:

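import json

from huggingface_hub import hf_hub_download

# Download the sentence file (cached locally after the first call)
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    token="your_token",  # placeholder; only needed if the repo requires authentication
)

# Load the JSON file and pull out the list of sentences
with open(sentences_path, "r", encoding="utf-8") as f:
    data = json.load(f)

sentences = data["sentences"]

# Use sentences for training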
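
As an illustration of a first training step, the sketch below builds a word-to-id vocabulary from the loaded sentences and encodes one sentence as integer ids. It assumes each entry in `sentences` is a whitespace-delimited string; `build_vocab`, `encode`, the `<unk>` token, and the `min_count` threshold are hypothetical names chosen for this example, not part of the dataset.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Map each word seen at least `min_count` times to an integer id.

    Assumes every sentence is a whitespace-delimited string.
    Id 0 is reserved for the out-of-vocabulary token "<unk>".
    """
    counts = Counter(word for s in sentences for word in s.split())
    # Sort by descending frequency, then alphabetically, so ids are reproducible.
    kept = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    vocab = {"<unk>": 0}
    for word in kept:
        vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Convert a sentence to a list of ids, using "<unk>" for unknown words."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]


# Toy data standing in for the real `sentences` list
toy_sentences = ["the cat sat", "the dog sat", "the end"]
toy_vocab = build_vocab(toy_sentences, min_count=2)
toy_ids = encode("the cat sat", toy_vocab)
```

With the real data, `build_vocab(sentences)` would replace the toy call; the resulting id sequences can then be fed to a word-embedding or language-model training loop.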

Citation

If you use this dataset, please cite the original source: Matt Mahoney's text8 corpus (http://mattmahoney.net/dc/text8.zip).
