Text8 Dataset
This repository contains the Text8 dataset, a large text corpus commonly used for training word embeddings and language models.
Dataset Information
- Source: http://mattmahoney.net/dc/text8.zip
- License: Public domain
- Format: Text corpus
- Size: ~100MB of plain text
Files
- text8_full.txt: Complete text8 corpus
- text8_sentences.json: Text8 split into sentences for easier processing
- dataset_info.json: Dataset metadata
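Since text8_full.txt is around 100MB, it may be convenient to stream it rather than read it all at once. The following is a minimal sketch, assuming the file is plain text with whitespace-separated words; the helper name iter_words and the chunk_size parameter are illustrative and not part of this repository.

```python
from typing import Iterator


def iter_words(path: str, chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield whitespace-separated words from a text file, reading it in chunks.

    Hypothetical helper: assumes the file is plain text with words
    separated by whitespace.
    """
    leftover = ""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            # If the chunk ends mid-word, hold the last piece back
            # so it can be joined with the start of the next chunk.
            if words and not chunk[-1].isspace():
                leftover = words.pop()
            else:
                leftover = ""
            yield from words
    # Emit the final word if the file did not end with whitespace.
    if leftover:
        yield leftover
```

For example, `sum(1 for _ in iter_words("text8_full.txt"))` would count the words in a local copy of the file without holding the whole corpus in memory.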
Usage
You can load this dataset in your training scripts using:
from huggingface_hub import hf_hub_download
import json

# Download the sentence-split file from the Hub
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    token="your_token",  # replace with your Hugging Face access token
)

# Load the JSON file and extract the sentences
with open(sentences_path, "r", encoding="utf-8") as f:
    data = json.load(f)

sentences = data["sentences"]
# Use sentences for training
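As one illustration of what "use sentences for training" might involve for word embeddings, the sketch below builds a word-to-index vocabulary from the loaded sentences. It assumes each entry of sentences is either a string of space-separated words or an already tokenized list; the card does not specify which. The function build_vocab and its min_count parameter are hypothetical names introduced here.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Return (word -> index mapping, word counts) for a list of sentences.

    Hypothetical helper: each sentence may be a string of
    whitespace-separated words or a list of tokens (an assumption,
    since the JSON layout is not documented in detail).
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split() if isinstance(sentence, str) else sentence
        counts.update(tokens)
    # Keep words seen at least min_count times, most frequent first.
    kept = [word for word, count in counts.most_common() if count >= min_count]
    vocab = {word: index for index, word in enumerate(kept)}
    return vocab, counts
```

Calling `vocab, counts = build_vocab(sentences, min_count=5)` would drop rare words, a common preprocessing step before training embeddings; the threshold of 5 is only an example value.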
Citation
If you use this dataset, please cite the original source: http://mattmahoney.net/dc/text8.zip