	# TinyStories demo slice (~700 KB train + 70 KB valid)

	This is a small slice of the TinyStories dataset by Eldan & Li (2023):
	- Source: https://huggingface.co/datasets/roneneldan/TinyStories
	- Original file: `TinyStoriesV2-GPT4-valid.txt` (~22 MB)
	- This slice: the first ~700 KB of stories from that file, packed as raw uint8 bytes
	- License: the upstream dataset is CC-BY-4.0; this redistribution preserves that license
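Since each byte is a token, "packing as uint8" amounts to copying the leading bytes of the UTF-8 text file. Below is a minimal sketch of how a slice like this could be produced. The helper name `pack_slice` and the choice to trim back to the last complete story are assumptions made for illustration, not a description of the script that built the shipped files.

```python
from pathlib import Path

# Story separator that appears as plain text inside the TinyStories files.
EOT = b"<|endoftext|>"


def pack_slice(src_path, dst_path, max_bytes):
    """Copy roughly the first max_bytes bytes of src_path to dst_path.

    UTF-8 text is already a byte sequence, so no encoding step is needed.
    Trimming to the last complete story is an assumption of this sketch.
    """
    with open(src_path, "rb") as f:
        chunk = f.read(max_bytes)

    # If a separator is present, cut right after the last one so the slice
    # does not end in the middle of a story.
    cut = chunk.rfind(EOT)
    if cut != -1:
        chunk = chunk[: cut + len(EOT)]

    Path(dst_path).write_bytes(chunk)
    return len(chunk)
```

For example, `pack_slice("TinyStoriesV2-GPT4-valid.txt", "train.bin", 700_000)` would write a file of at most 700,000 bytes.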

	## Files

	| File | Size | Purpose |
	|---|---|---|
	| `train.bin` | ~700 KB | training shard (uint8 byte sequence) |
	| `valid.bin` | ~70 KB | held-out validation shard |

	## Why this slice and not the full thing

	The full TinyStories train file is ~2 GB. We didn't want every kit user to download
	2 GB just to do their first smoke-training run. 700 KB is enough to:

	- Run 50–500 training steps in a few minutes on CPU and see loss fall
	- Verify your install end-to-end
	- Get a feel for how the trainer behaves before committing to a real run

	For a real ~10M-param training run you want at least several million bytes of
	training text; download the full dataset from the source URL above and point
	`--data-dir` at it.

	## Format

	Files are flat sequences of `uint8` bytes: no headers, no separators between
	stories beyond the natural `<|endoftext|>` strings inside the text. The trainer
	memmaps these and samples random windows of `seq_len` bytes. Each byte IS a
	token (vocabulary size = 256).
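The loading scheme described above can be sketched with NumPy as follows. This is an illustration under assumptions, not the trainer's actual code: the function name `sample_batch`, the use of `numpy.memmap`, and the next-byte targets (inputs shifted by one position) are all choices made for this example.

```python
import numpy as np


def sample_batch(path, seq_len, batch_size, rng):
    """Sample random byte windows from a flat uint8 file.

    Returns (inputs, targets), each of shape (batch_size, seq_len).
    Targets are the inputs shifted by one byte, which is an assumption
    about the training objective (next-byte prediction).
    """
    # Memory-map the file so only the sampled windows are read from disk.
    data = np.memmap(path, dtype=np.uint8, mode="r")

    # Each window needs seq_len + 1 bytes, so the last valid start index
    # is len(data) - seq_len - 1 (the upper bound below is exclusive).
    starts = rng.integers(0, len(data) - seq_len, size=batch_size)

    x = np.stack([data[s : s + seq_len] for s in starts]).astype(np.int64)
    y = np.stack([data[s + 1 : s + 1 + seq_len] for s in starts]).astype(np.int64)
    return x, y
```

A call such as `sample_batch("train.bin", seq_len=256, batch_size=8, rng=np.random.default_rng(0))` would return two integer arrays whose values lie in the range 0 to 255, matching the vocabulary size of 256.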