	# Data Workflows

	This directory contains the dataset download helpers and export scripts used for the challenge.

	Canonical local layout:
	- `data/datasets/<dataset_name>/`
	- `data/tokenizers/`
	- `data/manifest.json`
	- `data/docs_selected.jsonl`
	- `data/docs_selected.source_manifest.json`

	## Downloading Published Data

	Download the cached FineWeb export for a tokenizer variant with:

	```bash
	python3 data/cached_challenge_fineweb.py --variant sp1024
	```

	This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
	By default, it downloads the full validation split plus 8B training tokens (80 train shards).

	To fetch more training shards, pass `--train-shards`:

	```bash
	python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180
	```

	The downloader is manifest-driven, so it can fetch just a prefix of the train shards from a larger published export. With the current shard size of `100_000_000` tokens, `10B` retokenized training tokens correspond to `100` train shards:

	```bash
	MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
	MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
	python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
	```

	Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first `N` train shards means training on a prefix of the same frozen, shuffled export, so the data order stays aligned with the baseline for that tokenizer family.
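
The prefix rule above can be sketched in Python. This is an illustration only, not the downloader's actual code: the `fineweb_train_*` naming, the zero-padded `.bin` filenames, and the `select_shards` helper are assumptions made for the example. Only the `fineweb_val_*` prefix comes from this README.

```python
def select_shards(shard_names, train_shards):
    """Return every validation shard plus the first `train_shards` train shards."""
    # Sorting zero-padded names restores the export's shard order (assumed naming).
    val = sorted(n for n in shard_names if n.startswith("fineweb_val_"))
    train = sorted(n for n in shard_names if n.startswith("fineweb_train_"))
    if not 0 <= train_shards <= len(train):
        raise ValueError(
            f"requested {train_shards} train shards, but {len(train)} are published"
        )
    # Validation is always taken in full; training is a prefix of the export.
    return val + train[:train_shards]


# Hypothetical listing: 1 validation shard and 5 train shards.
published = ["fineweb_val_000000.bin"] + [
    f"fineweb_train_{i:06d}.bin" for i in range(5)
]
selected = select_shards(published, train_shards=2)
print(selected)
```

Because the selection is always a prefix, two runs with different `N` share their first shards, which is what keeps shorter runs comparable with the baseline.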

	The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`.

	## Rebuilding Tokenizers From Published Docs

	To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache:

	```bash
	python3 data/download_hf_docs_and_tokenize.py \
	--repo-id your-hf-username/your-dataset-repo \
	--remote-root your_50B_export_root \
	--output-root /tmp/my_custom_tokenizer_export \
	--tokenizer-config ./data/tokenizer_specs.json \
	--max-train-tokens 8000000000
	```

	The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export.
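
A minimal sketch of that check, under two assumptions that this README does not confirm: that `docs_sha256` is a top-level key in the manifest JSON, and that it holds the hex SHA-256 digest of the raw bytes of `docs_selected.jsonl`. The helper names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large docs cache need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def docs_match_manifest(docs_path, manifest_path):
    """Compare the local docs file against the digest recorded in the manifest."""
    # Assumption: `docs_sha256` is a top-level key holding a hex digest.
    manifest = json.loads(Path(manifest_path).read_text(encoding="utf-8"))
    return file_sha256(docs_path) == manifest["docs_sha256"].lower()
```

With the canonical layout, this would be called as `docs_match_manifest("data/docs_selected.jsonl", "data/docs_selected.source_manifest.json")`.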

	## Useful Knobs

	For CPU-heavy exports, useful knobs are:

	```bash
	MATCHED_FINEWEB_SP_BATCH_SIZE=2048
	MATCHED_FINEWEB_TOKENIZER_THREADS=16
	MATCHED_FINEWEB_TIKTOKEN_THREADS=16
	MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512
	```

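	These control the following:
	- `MATCHED_FINEWEB_SP_BATCH_SIZE`: batched tokenizer encoding during shard export
	- `MATCHED_FINEWEB_TOKENIZER_THREADS`: tokenizer thread count
	- `MATCHED_FINEWEB_TIKTOKEN_THREADS`: tiktoken thread count
	- `MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE`: batched GPT-2 decode for the blobstore docs-cache path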
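
When an export is launched from Python rather than from a shell, the same knobs can be passed through the child process environment. The sketch below assumes the scripts read these variables from the environment, as the `MATCHED_FINEWEB_*` prefix suggests. The `export_env` helper is hypothetical, and the values are simply the ones shown above.

```python
import os

# Values copied from the example above; tune them for your machine.
KNOBS = {
    "MATCHED_FINEWEB_SP_BATCH_SIZE": 2048,
    "MATCHED_FINEWEB_TOKENIZER_THREADS": 16,
    "MATCHED_FINEWEB_TIKTOKEN_THREADS": 16,
    "MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE": 512,
}


def export_env(knobs=KNOBS, base=None):
    """Return a copy of the environment with every knob set as a string."""
    env = dict(os.environ if base is None else base)
    # Environment values must be strings, so integers are converted here.
    env.update({name: str(value) for name, value in knobs.items()})
    return env
```

The returned mapping can be handed to `subprocess.run(..., env=export_env())`, which leaves the parent process environment untouched.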

	When rebuilding locally, `--max-train-tokens 8000000000` matches the published 8B-train-token export. With the default shard size of `100_000_000`, that produces 80 train shards plus the full validation split.
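
The shard counts quoted in this README follow from dividing the token budget by the shard size. A small sketch of that arithmetic is below. The `train_shards_for` helper is hypothetical, and rounding a partial final shard up to a whole shard is an assumption about how the exporter behaves.

```python
SHARD_SIZE = 100_000_000  # tokens per train shard, as stated in this README


def train_shards_for(max_train_tokens, shard_size=SHARD_SIZE):
    """Number of train shards needed to hold `max_train_tokens`."""
    if max_train_tokens < 0 or shard_size <= 0:
        raise ValueError("token budget must be >= 0 and shard size must be > 0")
    # Ceiling division: a partial final shard still counts as one shard (assumed).
    return -(-max_train_tokens // shard_size)


print(train_shards_for(8_000_000_000))   # 80
print(train_shards_for(10_000_000_000))  # 100
```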