Datasets: Switching, Custom Data, Mixing, Streaming
This guide explains how to select built-in datasets, use your own data, mix multiple datasets, and stream large corpora.
Built-in datasets
Use --dataset <name> and (optionally) --dataset_subset.
Supported names (see src/data/datasets.py:DATASET_CONFIGS):
- wikitext (default subset: wikitext-2-raw-v1)
- wikitext-103 (streaming)
- openwebtext (Skylion007 mirror, streaming)
- slim-pajama (streaming)
- pile, pile-unc (streaming)
- c4 (subset: en, streaming)
- bookcorpus (open variant, streaming)
- oscar (subset: unshuffled_deduplicated_en, streaming)
- wikipedia (subset: 20231101.en, streaming)
- dummy (small in-memory samples)
Example:
python train_ultrathink.py \
--dataset c4 \
--dataset_subset en \
--streaming \
--tokenizer_name gpt2 \
--max_seq_length 512
Custom dataset from local file(s)
Use --dataset custom --data_path <path>.
Accepted formats (loaded through the datasets library): JSON/JSONL, plain text (.txt), and Parquet. For small local files, line-delimited JSON works well:
data.jsonl:
{"text": "First sample"}
{"text": "Second sample"}
Command:
python train_ultrathink.py \
--dataset custom \
--data_path /path/to/data.jsonl \
--text_column text \
--tokenizer_name gpt2 \
--max_seq_length 512
Notes:
- Set --max_samples to cap examples when iterating quickly.
- For remote URLs or globs, TextDataset auto-selects a datasets builder and can stream.
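Before pointing --data_path at a file, it can help to confirm that every line parses and carries the expected text column. The following is a minimal sketch using only the standard library; write_jsonl and check_jsonl are hypothetical helper names for illustration and are not part of the training code.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, texts, text_column="text"):
    """Write one JSON object per line, each holding a single text field."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({text_column: text}, ensure_ascii=False) + "\n")


def check_jsonl(path, text_column="text"):
    """Return the number of records; raise ValueError on a malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # JSONDecodeError is a ValueError
            if not isinstance(record, dict) or not isinstance(record.get(text_column), str):
                raise ValueError(f"line {line_no}: missing string field {text_column!r}")
            count += 1
    return count


# Build the two-line example file from this section and validate it.
with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "data.jsonl"
    write_jsonl(data_path, ["First sample", "Second sample"])
    num_records = check_jsonl(data_path)
    print(num_records)
```

The text_column argument mirrors the --text_column flag, so the same check applies if your records store text under a different key.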
Mixing multiple datasets
Use --mix_datasets to blend datasets with weights. This overrides --dataset.
Example (50/50 wikitext/openwebtext):
python train_ultrathink.py \
--mix_datasets "wikitext:0.5,openwebtext:0.5" \
--tokenizer_name gpt2 \
--max_seq_length 512 \
--streaming
Implementation references:
- Mixing logic: src/data/datasets.py:MixedDataset
- Parsing and creation: train_ultrathink.py:UltraThinkTrainer.load_datasets
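To make the weight syntax concrete, here is a rough sketch of how a spec such as "wikitext:0.5,openwebtext:0.5" could be parsed and used for weighted sampling. It is not the MixedDataset implementation: parse_mix_spec and mix are hypothetical names, the sources are plain Python lists, and the choice to normalize weights and to restart exhausted sources is an assumption made for illustration.

```python
import itertools
import random


def parse_mix_spec(spec):
    """Parse "name:weight,name:weight" into a dict of normalized weights."""
    weights = {}
    for part in spec.split(","):
        name, _, weight = part.strip().partition(":")
        if not name or not weight:
            raise ValueError(f"expected 'name:weight', got {part!r}")
        weights[name] = float(weight)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


def mix(sources, weights, num_samples, seed=0):
    """Yield (source_name, item) pairs, choosing the source of each by weight."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    # cycle() restarts a source once it runs out (an assumption of this sketch).
    iterators = {n: itertools.cycle(sources[n]) for n in names}
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        yield name, next(iterators[name])


weights = parse_mix_spec("wikitext:0.5,openwebtext:0.5")
sources = {"wikitext": ["w1", "w2"], "openwebtext": ["o1", "o2"]}
samples = list(mix(sources, weights, num_samples=6))
print(weights)
```

Sampling per example, rather than concatenating the datasets, keeps the blend ratio roughly constant throughout training even when the sources differ greatly in size.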
Streaming vs local loading
- --streaming uses HF datasets streaming: low memory footprint, good for very large corpora.
- Non-streaming loads examples into memory (or iterates but indexes normally). Use it with small datasets.
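The trade-off between the two modes can be illustrated without any external library. The sketch below is an analogy only and does not use HF datasets: stream_texts and load_texts are hypothetical helpers that contrast lazy, one-record-at-a-time iteration with loading everything into a list.

```python
import itertools
import json
import tempfile
from pathlib import Path


def stream_texts(path, text_column="text"):
    """Yield one text at a time; only the current line is held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)[text_column]


def load_texts(path, text_column="text"):
    """Read every text into a list, which supports len() and indexing."""
    return list(stream_texts(path, text_column))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.jsonl"
    lines = (json.dumps({"text": f"sample {i}"}) + "\n" for i in range(1000))
    path.write_text("".join(lines), encoding="utf-8")

    # Streaming: take the first three records without reading the rest.
    stream = stream_texts(path)
    first_three = list(itertools.islice(stream, 3))
    stream.close()  # release the file handle held by the generator

    # Non-streaming: the full dataset is materialized and can be indexed.
    everything = load_texts(path)

print(first_three, len(everything))
```

The streamed version cannot report its length or jump to an arbitrary index, which is the price paid for the low memory footprint.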
Tokenization parameters
- --tokenizer_name (default: gpt2)
- --max_seq_length controls padding='max_length' and truncation.
- Labels are masked on padding positions to avoid loss on padded tokens.
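The padding and label-masking behavior described above can be sketched on plain lists of token ids. This is an illustration under stated assumptions, not the project's tokenization code: pad_and_mask is a hypothetical helper, the ignore value -100 is the default ignore_index of torch.nn.CrossEntropyLoss, and reusing GPT-2's end-of-text id (50256) as the pad token is a common convention assumed here because GPT-2 ships without a dedicated pad token.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_and_mask(token_ids, max_seq_length, pad_token_id):
    """Truncate or pad to max_seq_length and build labels that skip padding."""
    ids = list(token_ids[:max_seq_length])  # truncation
    num_pad = max_seq_length - len(ids)
    input_ids = ids + [pad_token_id] * num_pad  # padding to a fixed length
    attention_mask = [1] * len(ids) + [0] * num_pad
    # Mask by position (attention_mask), not by token id, so a real
    # end-of-text token inside the sequence still contributes to the loss.
    labels = [
        tok if keep else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = pad_and_mask([11, 22, 33], max_seq_length=5, pad_token_id=50256)
print(example["labels"])
```

Masking by position rather than by comparing against the pad id matters precisely when the pad token doubles as the end-of-text token.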
Troubleshooting
- If you see non-finite loss at start:
  - Lower LR (e.g., 5e-5), keep --max_seq_length 512 initially.
  - Use --amp_warmup_steps 200 and --dre_warmup_steps 500.
- C4 deprecation warning: prefer --dataset c4 --dataset_subset en, which maps to allenai/c4 under the hood in some environments.