TinyStories demo slice (~700 KB train + 70 KB valid)

This is a small slice of the TinyStories dataset by Eldan & Li (2023):

  • Source: https://huggingface.co/datasets/roneneldan/TinyStories
  • Original file: TinyStoriesV2-GPT4-valid.txt (~22 MB)
  • This slice: the first ~700 KB of stories from that file, packed as raw uint8 bytes
  • License: the upstream dataset is CC-BY-4.0; this redistribution preserves that license
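
A slice like this can be cut from the upstream text file in a few lines. The sketch below is an illustration, not the script that produced these files: `make_slice` is a hypothetical helper, and trimming back to the last complete `<|endoftext|>` marker is an assumption about how you might want to end the slice. This README does not say how the cut was actually made or how valid.bin was split off.

```python
def make_slice(text_path, out_path, max_bytes, sep=b"<|endoftext|>"):
    """Copy at most max_bytes from the start of text_path into out_path.

    Hypothetical helper. If a story separator occurs in the copied range,
    the slice is trimmed to end right after the last complete separator
    (an assumption, not a documented property of train.bin).
    """
    with open(text_path, "rb") as f:
        head = f.read(max_bytes)  # raw bytes; each one is already a uint8 value

    cut = head.rfind(sep)
    if cut != -1:
        head = head[: cut + len(sep)]

    with open(out_path, "wb") as f:
        f.write(head)
    return len(head)
```

Because the text is read in binary mode, no tokenizer or encoding step is involved: the bytes on disk are the training tokens.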

Files

File        Size      Purpose
train.bin   ~700 KB   training shard (uint8 byte sequence)
valid.bin   ~70 KB    held-out validation shard

Why this slice and not the full thing

The full TinyStories train file is ~2 GB. We didn't want every kit user to download 2 GB just to do their first smoke-training run. 700 KB is enough to:

  • Run 50–500 training steps in a few minutes on CPU and see loss fall
  • Verify your install end-to-end
  • Get a feel for how the trainer behaves before committing to a real run

For a real ~10M-param training run you want at least several million bytes of training data, well beyond this slice; download the full dataset from the source URL above and point --data-dir at it.

Format

Files are flat sequences of uint8 bytes — no headers, no separators between stories beyond the natural <|endoftext|> strings inside the text. The trainer memmaps these and samples random windows of seq_len bytes. Each byte IS a token (vocabulary size = 256).
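
To make the format concrete, here is a minimal sketch of reading a shard the way the paragraph above describes, using NumPy. It is not the kit's trainer code: `sample_batch` is a hypothetical name, and returning next-byte targets (the inputs shifted by one position) is an assumption based on the usual language-model setup.

```python
import numpy as np

def sample_batch(path, seq_len, batch_size, rng):
    """Sample random byte windows from a flat uint8 file.

    Hypothetical helper. Returns (inputs, targets), each of shape
    (batch_size, seq_len) and dtype uint8. Targets are the inputs
    shifted by one byte (assumed next-byte prediction).
    """
    # Memory-map the file: nothing is loaded until a window is sliced out.
    data = np.memmap(path, dtype=np.uint8, mode="r")

    # Each window needs seq_len + 1 bytes so the shifted targets fit.
    max_start = len(data) - seq_len - 1
    if max_start < 0:
        raise ValueError("file is shorter than seq_len + 1 bytes")

    # rng.integers excludes the upper bound, hence the + 1.
    starts = rng.integers(0, max_start + 1, size=batch_size)
    windows = np.stack([np.asarray(data[s : s + seq_len + 1]) for s in starts])
    return windows[:, :-1], windows[:, 1:]
```

Typical use would be `x, y = sample_batch("train.bin", seq_len=256, batch_size=32, rng=np.random.default_rng(0))`, where the `seq_len` and `batch_size` values are only examples. Since each token is a byte, a window can be turned back into text with `x[0].tobytes().decode("utf-8", errors="replace")`; the `errors="replace"` matters because a random window can start or end in the middle of a multi-byte UTF-8 character.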