# Text8 Dataset

This repository contains the Text8 dataset: the first 100 MB (10^8 bytes) of a cleaned English Wikipedia dump, distributed by Matt Mahoney. It is commonly used for training word embeddings and language models.

## Dataset Information

- **Source**: http://mattmahoney.net/dc/text8.zip
- **License**: Public domain
- **Format**: Plain text (the upstream file contains only lowercase letters `a` to `z` and spaces)
- **Size**: ~100 MB

## Files

- `text8_full.txt`: Complete text8 corpus
- `text8_sentences.json`: Text8 split into sentences for easier processing
- `dataset_info.json`: Dataset metadata
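
The full corpus file can also be read directly. The following is a minimal sketch, assuming `text8_full.txt` keeps the upstream text8 layout (one long run of lowercase words separated by spaces). `iter_tokens` and its `chunk_size` parameter are hypothetical helpers for illustration, not part of this repository.

```python
from typing import Iterator


def iter_tokens(path: str, chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield whitespace-separated tokens from a large text file, reading it in chunks."""
    leftover = ""  # partial word carried over from the previous chunk
    with open(path, "r", encoding="utf-8") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            parts = chunk.split()
            # If the chunk ends mid-word, hold the last piece back for the next read.
            if parts and not chunk[-1].isspace():
                leftover = parts.pop()
            else:
                leftover = ""
            yield from parts
    # Emit the final word if the file does not end with whitespace.
    if leftover:
        yield leftover
```

For example, `sum(1 for _ in iter_tokens("text8_full.txt"))` would count the tokens while holding only about one chunk of text in memory at a time.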

## Usage

You can load this dataset in your training scripts using:

```python
from huggingface_hub import hf_hub_download
import json

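# Download the sentence file (cached locally after the first call)
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    # repo_type="dataset",  # uncomment if the repo is hosted as a dataset rather than a model
    token=None,  # pass an access token string only if the repo is private
)

with open(sentences_path, "r", encoding="utf-8") as f:
    data = json.load(f)

sentences = data["sentences"]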

# Use sentences for training
```
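
As one possible next step, the sketch below turns sentences into integer ID sequences, the usual input for an embedding model. It assumes `sentences` is a list of plain strings with space-separated words; the schema of `text8_sentences.json` is not documented here, so check it before relying on this. `build_vocab`, `encode`, `min_count`, and the `<unk>` token are illustrative choices, not part of the dataset.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(sentences: List[str], min_count: int = 5) -> Dict[str, int]:
    """Map each word seen at least `min_count` times to an integer ID (0 is reserved for <unk>)."""
    counts = Counter(word for s in sentences for word in s.split())
    vocab = {"<unk>": 0}
    # most_common() is sorted by descending count, so we can stop at the first rare word.
    for word, count in counts.most_common():
        if count < min_count:
            break
        vocab[word] = len(vocab)
    return vocab


def encode(sentence: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a sentence to a list of IDs, mapping out-of-vocabulary words to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(word, unk) for word in sentence.split()]


# Toy stand-in for the real data, so the example runs without a download.
toy_sentences = ["the cat sat", "the dog sat", "the end"]
toy_vocab = build_vocab(toy_sentences, min_count=2)
toy_ids = [encode(s, toy_vocab) for s in toy_sentences]
```

With the real data, the same calls would be `build_vocab(sentences)` followed by `encode(s, vocab)` for each sentence. Raising `min_count` shrinks the vocabulary at the cost of mapping more words to `<unk>`.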

## Citation

If you use this dataset, please cite the original source: Matt Mahoney's text8 corpus, available at http://mattmahoney.net/dc/text8.zip.