# TinyStories demo slice (~700 KB train + 70 KB valid)

This is a small slice of the **TinyStories** dataset by Eldan & Li (2023):
- Source: https://huggingface.co/datasets/roneneldan/TinyStories
- Original file: `TinyStoriesV2-GPT4-valid.txt` (~22 MB)
- This slice: the first ~700 KB of stories from that file, packed as raw uint8 bytes
- License: the upstream dataset is CC-BY-4.0; this redistribution preserves that license

## Files

| File | Size | Purpose |
|---|---|---|
| `train.bin` | ~700 KB | training shard (uint8 byte sequence) |
| `valid.bin` | ~70 KB | held-out validation shard |
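
A quick way to sanity-check a shard is to read it back as bytes. The sketch below is illustrative and not part of the kit: `describe_shard` is a hypothetical helper, the file paths are assumptions, and it relies on the layout described under Format (raw UTF-8 text with literal `<|endoftext|>` markers).

```python
from pathlib import Path

EOT = b"<|endoftext|>"  # story marker as it appears in the raw text


def describe_shard(path):
    """Summarize a shard: size in bytes, marker count, and a decoded preview."""
    raw = Path(path).read_bytes()  # the demo shards are small enough to read whole
    return {
        "bytes": len(raw),
        "eot_markers": raw.count(EOT),
        # errors="replace" keeps the preview readable if the cut splits a character
        "preview": raw[:200].decode("utf-8", errors="replace"),
    }


if __name__ == "__main__":
    # Assumed paths: adjust to wherever train.bin and valid.bin actually live.
    for name in ("train.bin", "valid.bin"):
        if Path(name).exists():
            print(name, describe_shard(name))
```

The marker count only approximates the number of stories, since a shard may start or end partway through one.
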

## Why this slice and not the full thing

The full TinyStories train file is ~2 GB. We didn't want every kit user to download
2 GB just to do their first smoke-training run. 700 KB is enough to:

- Run 50–500 training steps in a few minutes on CPU and see the loss fall
- Verify your install end-to-end
- Get a feel for how the trainer behaves before committing to a real run

For a real ~10M-param training run you want far more data than this slice: millions
of bytes at minimum. Download the full dataset from the source URL above, save it as
`train.bin` and `valid.bin` in the raw-byte layout described under Format, and point
`--data-dir` at that directory.
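
Because the shards are raw bytes, converting an upstream text file needs no tokenizer: the file's bytes are the tokens. The sketch below shows one way to write a shard, optionally truncated to a byte budget at a story boundary. It is not the script used to build this slice; `take_first_stories`, `pack`, and the example paths are hypothetical, and cutting on `<|endoftext|>` is an assumption about what "the first ~700 KB of stories" means.

```python
from pathlib import Path

EOT = b"<|endoftext|>"  # story marker as it appears in the raw text


def take_first_stories(raw: bytes, max_bytes: int) -> bytes:
    """Return the longest prefix of at most max_bytes that ends on a story marker."""
    if len(raw) <= max_bytes:
        return raw
    # Last marker that lies entirely inside the first max_bytes bytes.
    start = raw.rfind(EOT, 0, max_bytes)
    if start == -1:
        return raw[:max_bytes]  # no complete story fits: fall back to a hard cut
    return raw[: start + len(EOT)]


def pack(src_txt, dst_bin, max_bytes=None) -> int:
    """Copy a text file's bytes into a .bin shard and return the bytes written."""
    # Reads the whole file into memory; for multi-GB inputs, stream in chunks instead.
    raw = Path(src_txt).read_bytes()
    if max_bytes is not None:
        raw = take_first_stories(raw, max_bytes)
    Path(dst_bin).parent.mkdir(parents=True, exist_ok=True)
    Path(dst_bin).write_bytes(raw)
    return len(raw)


# Example (hypothetical paths):
#   pack("downloads/full-train.txt", "data/train.bin")
#   pack("downloads/full-valid.txt", "data/valid.bin")
```
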

## Format

Files are flat sequences of `uint8` bytes: no headers, and no separators between
stories beyond the natural `<|endoftext|>` strings inside the text. The trainer
memory-maps these and samples random windows of `seq_len` bytes. Each byte IS a
token (vocabulary size = 256), so `<|endoftext|>` is stored as 13 ordinary byte
tokens rather than one special token.
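
The window sampling described above can be sketched with the standard library alone. This is a minimal illustration, not the kit's trainer: `sample_windows` is a hypothetical helper, and the one-byte shift between inputs and targets is an assumption based on ordinary next-token training rather than something this README states.

```python
import mmap
import random


def sample_windows(path, seq_len, batch_size, seed=0):
    """Return batch_size (inputs, targets) pairs, each a list of seq_len byte tokens."""
    rng = random.Random(seed)
    batch = []
    with open(path, "rb") as f:
        # Map the file read-only; pages are loaded on demand rather than all at once.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # Each window needs seq_len + 1 bytes so targets can be shifted by one.
            max_start = len(data) - seq_len - 1
            if max_start < 0:
                raise ValueError("file is shorter than seq_len + 1 bytes")
            for _ in range(batch_size):
                start = rng.randint(0, max_start)  # inclusive on both ends
                tokens = list(data[start : start + seq_len + 1])  # ints in 0..255
                batch.append((tokens[:-1], tokens[1:]))
    return batch


# Example (assumed path and sizes):
#   batch = sample_windows("train.bin", seq_len=128, batch_size=16)
```

A real trainer would usually turn these lists into tensors, and `numpy.memmap` is a common alternative to the `mmap` module for the same job.
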