# Text8 Dataset

This repository contains the Text8 dataset, a large text corpus commonly used for training word embeddings and language models.

## Dataset Information

- **Source**: http://mattmahoney.net/dc/text8.zip
- **License**: Public domain
- **Format**: Text corpus
- **Size**: ~100 MB

## Files

- `text8_full.txt`: Complete text8 corpus
- `text8_sentences.json`: Text8 split into sentences for easier processing
- `dataset_info.json`: Dataset metadata

## Usage

To load the sentence file in a training script, download it with `hf_hub_download` and parse the JSON:

```python
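import json

from huggingface_hub import hf_hub_download

# Download the sentence-split file (cached locally after the first call)
sentences_path = hf_hub_download(
    repo_id="roshbeed/text8-dataset",
    filename="text8_sentences.json",
    token="your_token",  # replace with your Hugging Face access token
)

# Read the JSON file and extract the list of sentences
with open(sentences_path, "r", encoding="utf-8") as f:
    data = json.load(f)

sentences = data["sentences"]

# Use sentences for training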
```
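
Once the sentences are loaded, a common first step for word-embedding training is to build a vocabulary and convert each sentence to integer ids. The following is a minimal sketch, assuming each entry in `sentences` is a whitespace-separated string (the structure of the JSON beyond the `sentences` key is not documented here). The names `build_vocab`, `encode`, and `min_count` are illustrative and not part of this repository.

```python
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each word seen at least `min_count` times to an integer id."""
    counts = Counter(word for s in sentences for word in s.split())
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    kept = sorted(
        (w for w, c in counts.items() if c >= min_count),
        key=lambda w: (-counts[w], w),
    )
    return {word: idx for idx, word in enumerate(kept)}


def encode(sentence, vocab):
    """Convert a sentence to a list of ids, skipping out-of-vocabulary words."""
    return [vocab[w] for w in sentence.split() if w in vocab]


# Tiny stand-in for the `sentences` list loaded from text8_sentences.json
# (assumed format: plain strings of space-separated words).
sample = ["the cat sat", "the dog sat down"]
vocab = build_vocab(sample)
print(encode("the dog ran", vocab))
```

Raising `min_count` drops rare words, which is a common way to keep the vocabulary manageable on a corpus of this size.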

## Citation

If you use this dataset, please cite the original source: the text8 corpus distributed at http://mattmahoney.net/dc/text8.zip.