# TinyStories demo slice (~700 KB train + 70 KB valid)

This is a small slice of the **TinyStories** dataset by Eldan & Li (2023):
- Source: https://huggingface.co/datasets/roneneldan/TinyStories
- Original file: `TinyStoriesV2-GPT4-valid.txt` (~22 MB)
- This slice: the first ~700 KB of stories from that file, packed as raw uint8 bytes
- License: the upstream dataset is CC-BY-4.0; this redistribution preserves that license

## Files

| File | Size | Purpose |
|---|---|---|
| `train.bin` | ~700 KB | training shard (uint8 byte sequence) |
| `valid.bin` | ~70 KB | held-out validation shard |

## Why this slice and not the full thing

The full TinyStories train file is ~2 GB. We didn't want every kit user to download
2 GB just to do their first smoke-training run. 700 KB is enough to:

- Run 50–500 training steps in a few minutes on CPU and see loss fall
- Verify your install end-to-end
- Get a feel for how the trainer behaves before committing to a real run

A real ~10M-param training run needs far more data than this slice, at least
a few million bytes. Download the full dataset from the source URL above and
point `--data-dir` at the directory that holds it.
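
If you move to a larger corpus, one way to get it into the same flat byte layout is sketched below. This is a minimal sketch, not the kit's own tooling: the function name `pack_slice`, the `max_bytes` cutoff, and the choice to cut at the last complete `<|endoftext|>` marker are all assumptions made for illustration. It also assumes the trainer looks for `train.bin` and `valid.bin` inside `--data-dir`, as the file table above suggests.

```python
from pathlib import Path

# Story separator as it appears in the upstream text files.
EOT = b"<|endoftext|>"


def pack_slice(src, dst, max_bytes=None):
    """Copy a UTF-8 text file to a flat byte file, optionally truncated.

    Hypothetical helper: when max_bytes is given, the output is cut at the
    last complete <|endoftext|> marker that fits, so no story is left
    half-finished. Returns the number of bytes written.
    """
    # Reads the whole file into memory; fine for small slices, but a
    # multi-gigabyte file would be better streamed in chunks.
    raw = Path(src).read_bytes()
    if max_bytes is not None and len(raw) > max_bytes:
        cut = raw.rfind(EOT, 0, max_bytes)
        # Fall back to a hard cut if no marker fits inside the budget.
        raw = raw[: cut + len(EOT)] if cut != -1 else raw[:max_bytes]
    Path(dst).write_bytes(raw)
    return len(raw)
```

Because each byte is already a token, no tokenizer step is involved: the output file is just the leading bytes of the source text.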

## Format

Files are flat sequences of `uint8` bytes: no headers, and no separators between
stories beyond the `<|endoftext|>` strings already present in the text. The trainer
memory-maps these files and samples random windows of `seq_len` bytes. Each byte is
one token, so the vocabulary size is 256.
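
The sampling described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the trainer's actual code: the function name `sample_windows`, the `seed` parameter, and the next-byte target layout (targets are the inputs shifted by one position) are hypothetical, and a real trainer would more likely return arrays than Python lists.

```python
import mmap
import random


def sample_windows(path, seq_len, batch_size, seed=None):
    """Return batch_size (inputs, targets) pairs of byte-level token ids.

    Hypothetical helper: each pair comes from a random window of
    seq_len + 1 bytes, with targets being the inputs shifted by one
    (an assumed next-byte prediction setup).
    """
    rng = random.Random(seed)
    batch = []
    with open(path, "rb") as f:
        # Map the file read-only; pages are loaded lazily by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
            n = len(buf)
            if n < seq_len + 1:
                raise ValueError("file is shorter than seq_len + 1 bytes")
            for _ in range(batch_size):
                # Last valid start leaves room for seq_len + 1 bytes.
                start = rng.randrange(n - seq_len)
                window = buf[start : start + seq_len + 1]
                # Iterating over bytes yields ints in 0..255: the token ids.
                batch.append((list(window[:-1]), list(window[1:])))
    return batch
```

For example, `sample_windows("train.bin", seq_len=128, batch_size=4)` would return four pairs of 128-element lists, with every value in the range 0 to 255.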