
Text8 Dataset

This repository contains the Text8 dataset, the first 100 MB (10^8 bytes) of a cleaned English Wikipedia dump, commonly used for training word embeddings and language models.

Dataset Information

Files

  • text8_full.txt: Complete text8 corpus
  • text8_sentences.json: Text8 split into sentences for easier processing
  • dataset_info.json: Dataset metadata
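For working with the full corpus file directly, the following is a minimal sketch rather than part of this repository. It assumes text8_full.txt follows the usual text8 layout (lowercase words separated by spaces, possibly on one very long line), which is not confirmed here. The helper name `iter_words` and its `chunk_size` parameter are hypothetical.

```python
def iter_words(path, chunk_size=1 << 20):
    """Yield whitespace-separated words from a file without loading it whole.

    Assumption: the file is plain text with words separated by whitespace.
    """
    leftover = ""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # If the chunk ends mid-word, hold the last piece back
            # so it can be joined with the start of the next chunk.
            if words and not chunk[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            yield from words
    # Emit the final word if the file did not end with whitespace.
    if leftover:
        yield leftover
```

Reading in fixed-size chunks keeps memory use bounded even when the corpus has no line breaks, where iterating line by line would pull the whole file into memory at once.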

Usage

You can load this dataset in your training scripts using:

from huggingface_hub import hf_hub_download
import json

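# Download the sentence-split file (cached locally after the first call)
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    token="your_token"  # required only for private or gated repos; otherwise omit
)

# Set the encoding explicitly so the result does not depend on the platform default
with open(sentences_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

sentences = data['sentences']

# Use sentences for training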
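
One possible first step before training is to build a vocabulary from the loaded sentences. This is a minimal sketch, assuming `sentences` is a list of whitespace-separated strings; the schema of text8_sentences.json is not documented here, so treat that as a guess. The helpers `build_vocab` and `encode` are hypothetical names, not part of this repository.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each word seen at least `min_count` times to an integer id.

    Assumption: each sentence is a whitespace-separated string.
    """
    counts = Counter(word for s in sentences for word in s.split())
    kept = [(word, c) for word, c in counts.items() if c >= min_count]
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return {word: idx for idx, (word, _) in enumerate(kept)}


def encode(sentence, vocab):
    """Convert a sentence to a list of ids, skipping out-of-vocabulary words."""
    return [vocab[word] for word in sentence.split() if word in vocab]
```

With these in place, `vocab = build_vocab(sentences, min_count=5)` followed by `[encode(s, vocab) for s in sentences]` would give integer sequences suitable as input to an embedding model. The `min_count` threshold of 5 is only an illustrative choice.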

Citation

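If you use this dataset, please cite the original source of the text8 corpus, Matt Mahoney's text compression test data (http://mattmahoney.net/dc/textdata.html).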
