# Text8 Dataset
This repository contains the Text8 dataset, a large text corpus commonly used for training word embeddings and language models.
## Dataset Information
- **Source**: http://mattmahoney.net/dc/text8.zip
- **License**: Public domain
- **Format**: Text corpus
- **Size**: ~100 MB (the first 10^8 bytes of a cleaned English Wikipedia dump)
## Files
- `text8_full.txt`: Complete text8 corpus
- `text8_sentences.json`: Text8 split into sentences for easier processing
- `dataset_info.json`: Dataset metadata
## Usage
You can load this dataset in your training scripts using:
```python
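from huggingface_hub import hf_hub_download
import json

# Download the sentence file (cached locally after the first call).
# If the repository is hosted as a dataset repo, also pass repo_type="dataset".
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    token="your_token",  # only needed if the repository is private or gated
)

with open(sentences_path, "r", encoding="utf-8") as f:
    data = json.load(f)

sentences = data["sentences"]
# Use sentences for training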
```
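As a possible next step before training word embeddings, the loaded sentences can be turned into integer ids. The following is a minimal sketch, not part of this repository: `build_vocab` and `encode` are hypothetical helper names, and the code assumes each entry in `sentences` is a whitespace-separated string, which should be checked against the actual contents of `text8_sentences.json`.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each word seen at least `min_count` times to an integer id.

    Assumes each sentence is a whitespace-separated string (an assumption
    about the file format, not something this repository guarantees).
    """
    counts = Counter(word for s in sentences for word in s.split())
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    kept = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    return {w: i for i, w in enumerate(kept)}


def encode(sentences, vocab):
    """Convert sentences to lists of ids, dropping out-of-vocabulary words."""
    return [[vocab[w] for w in s.split() if w in vocab] for s in sentences]


# Tiny in-memory stand-in for the downloaded `sentences` list.
sample = ["anarchism originated as a term", "a term of abuse"]
vocab = build_vocab(sample)
ids = encode(sample, vocab)
```

With the real data, `build_vocab(sentences, min_count=5)` would drop rare words; the threshold of 5 is only an illustrative choice.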
## Citation
If you use this dataset, please cite the original source: Matt Mahoney's text8 corpus (http://mattmahoney.net/dc/text8.zip).