	# Datasets: Switching, Custom Data, Mixing, Streaming

	This guide explains how to select built-in datasets, use your own data, mix multiple datasets, and stream large corpora.

	## Built-in datasets
	Use `--dataset <name>` and (optionally) `--dataset_subset`.

	Supported names (see `src/data/datasets.py:DATASET_CONFIGS`):
	- `wikitext` (default subset: `wikitext-2-raw-v1`)
	- `wikitext-103` (streaming)
	- `openwebtext` (Skylion007 mirror, streaming)
	- `slim-pajama` (streaming)
	- `pile`, `pile-unc` (streaming)
	- `c4` (subset: `en`, streaming)
	- `bookcorpus` (open variant, streaming)
	- `oscar` (subset: `unshuffled_deduplicated_en`, streaming)
	- `wikipedia` (subset: `20231101.en`, streaming)
	- `dummy` (small in-memory samples)

	Example:
	```bash
	python train_ultrathink.py \
	--dataset c4 \
	--dataset_subset en \
	--streaming \
	--tokenizer_name gpt2 \
	--max_seq_length 512
	```

	## Custom dataset from local file(s)
	Use `--dataset custom --data_path <path>`.

	Accepted formats (loaded through the Hugging Face `datasets` library): JSON/JSONL, plain text (`.txt`), and Parquet. For small local files, line-delimited JSON works well:

	`data.jsonl`:
	```json
	{"text": "First sample"}
	{"text": "Second sample"}
	```
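A file in this shape can be produced and sanity-checked with a few lines of standard-library Python. This is a minimal sketch, not part of the repository: the helper names `write_jsonl` and `read_jsonl` are hypothetical, and it assumes the default `text` column used in the example above.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path, text_column="text"):
    """Write one JSON object per line, skipping blank samples."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in records:
            text = text.strip()
            if not text:
                continue  # empty rows add nothing to training
            f.write(json.dumps({text_column: text}, ensure_ascii=False) + "\n")
            written += 1
    return written


def read_jsonl(path):
    """Parse a JSONL file back into a list of dicts (fails loudly on bad lines)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical output location; point --data_path at wherever you write the file.
out_path = Path(tempfile.gettempdir()) / "data.jsonl"
count = write_jsonl(["First sample", "   ", "Second sample"], out_path)
rows = read_jsonl(out_path)
```

Reading the file back before training is a cheap way to catch malformed lines or a misspelled column name before they surface as a loader error.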

	Command:
	```bash
	python train_ultrathink.py \
	--dataset custom \
	--data_path /path/to/data.jsonl \
	--text_column text \
	--tokenizer_name gpt2 \
	--max_seq_length 512
	```

	Notes:
	- Set `--max_samples` to cap examples when iterating quickly.
	- For remote URLs or glob patterns, `TextDataset` auto-selects a `datasets` builder and can stream the data.

	## Mixing multiple datasets
	Use `--mix_datasets` to blend datasets with weights. This overrides `--dataset`.

	Example (50/50 wikitext/openwebtext):
	```bash
	python train_ultrathink.py \
	--mix_datasets "wikitext:0.5,openwebtext:0.5" \
	--tokenizer_name gpt2 \
	--max_seq_length 512 \
	--streaming
	```

	Implementation references:
	- Mixing logic: `src/data/datasets.py:MixedDataset`
	- Parsing and creation: `train_ultrathink.py:UltraThinkTrainer.load_datasets`
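To make the `name:weight` format concrete, here is a minimal sketch of weighted mixing. It is an illustration under stated assumptions, not the code in `MixedDataset`: the functions `parse_mix_spec` and `mix` are hypothetical, weights are assumed to be normalized to sum to 1, and each sample's source is assumed to be drawn at random in proportion to its weight.

```python
import itertools
import random


def parse_mix_spec(spec):
    """Parse 'name:weight,name:weight' into a dict of normalized weights."""
    weights = {}
    for part in spec.split(","):
        name, _, raw = part.strip().partition(":")
        if not name or not raw:
            raise ValueError(f"expected 'name:weight', got {part!r}")
        weight = float(raw)
        if weight <= 0:
            raise ValueError(f"weight must be positive, got {weight}")
        weights[name] = weight
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def mix(sources, weights, num_samples, seed=0):
    """Yield (source_name, sample) pairs, picking each source by its weight."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    iterators = {n: iter(sources[n]) for n in names}
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            sample = next(iterators[name])
        except StopIteration:
            return  # stop as soon as any source runs dry
        yield name, sample


weights = parse_mix_spec("wikitext:0.5,openwebtext:0.5")
# Stand-in sources; real ones would be tokenized dataset iterators.
sources = {
    "wikitext": (f"wiki-{i}" for i in itertools.count()),
    "openwebtext": (f"owt-{i}" for i in itertools.count()),
}
mixed = list(mix(sources, weights, num_samples=1000))
wiki_fraction = sum(1 for name, _ in mixed if name == "wikitext") / len(mixed)
```

Sampling per example rather than per batch keeps the blend close to the requested ratio even over short runs, at the cost of interleaving sources within a batch.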

	## Streaming vs local loading
	- `--streaming` uses HF datasets streaming: low memory footprint, good for very large corpora.
	- Without `--streaming`, the dataset is loaded in full up front and supports normal indexed access. Use this mode for small datasets.
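The practical difference between the two modes is lazy iteration versus materialized, indexable data. The sketch below illustrates that distinction with a plain Python generator. It does not call the `datasets` library, and `stream_samples` is a hypothetical stand-in for a streaming source.

```python
from itertools import islice


def stream_samples(num_samples):
    """Lazily yield samples one at a time, the way a streaming dataset does."""
    for i in range(num_samples):
        yield {"text": f"sample {i}"}


# Streaming style: nothing is built until it is consumed, so a huge corpus
# costs no memory up front, but there is no len() and no random access.
streamed = stream_samples(10**9)
first_three = list(islice(streamed, 3))

# Local style: every sample is materialized, so indexing and len() work.
local = list(stream_samples(1000))
middle = local[500]
```

Capping a stream with `islice` plays a role similar to `--max_samples`: it bounds how much of the corpus a quick experiment touches.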

	## Tokenization parameters
	- `--tokenizer_name` (default: gpt2)
	- `--max_seq_length` sets the length used for both `padding='max_length'` and truncation.
	- Labels are masked on padding positions to avoid loss on padded tokens.
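The padding and label-masking behavior described above can be sketched without any tokenizer. This is an illustration, not the repository's code: `pad_and_mask` is a hypothetical helper, and the ignore value `-100` is an assumption based on the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # assumption: matches PyTorch CrossEntropyLoss's default ignore_index


def pad_and_mask(token_ids, max_seq_length, pad_token_id):
    """Truncate or pad to max_seq_length and mask labels on padded positions."""
    ids = list(token_ids[:max_seq_length])  # truncation
    num_pad = max_seq_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * num_pad,
        "attention_mask": [1] * len(ids) + [0] * num_pad,
        # Mask by position, not by token value, so a real token that happens
        # to share the pad id (e.g. an end-of-text token) still contributes.
        "labels": ids + [IGNORE_INDEX] * num_pad,
    }


# Arbitrary token ids; 50256 is GPT-2's end-of-text id, commonly reused as pad
# because the GPT-2 tokenizer ships without a dedicated pad token.
example = pad_and_mask([11, 22, 33], max_seq_length=5, pad_token_id=50256)
truncated = pad_and_mask([1, 2, 3, 4, 5, 6], max_seq_length=4, pad_token_id=0)
```

Masking by position matters when the pad token doubles as the end-of-text token: masking every occurrence of that id would also hide genuine end-of-text targets from the loss.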

	## Troubleshooting
	- If you see non-finite loss at the start of training:
	  - Lower the learning rate (e.g., `5e-5`) and keep `--max_seq_length 512` initially.
	  - Use `--amp_warmup_steps 200` and `--dre_warmup_steps 500`.
	- C4 deprecation warning: prefer `--dataset c4 --dataset_subset en`, which maps to `allenai/c4` under the hood in some environments.
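When chasing a non-finite loss, it helps to know the first step at which it appears. The following is a minimal sketch of such a check. `first_non_finite_step` is a hypothetical helper that operates on plain floats, and nothing here reflects how the trainer itself reports losses.

```python
import math


def first_non_finite_step(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Made-up loss values for illustration only.
healthy = [10.8, 9.9, 9.1]
diverging = [10.8, 57.3, float("inf"), float("nan")]

bad_step = first_non_finite_step(diverging)
ok_step = first_non_finite_step(healthy)
```

A failure at step 0 or 1 usually points at the data or initialization, while a failure after a few rising values suggests the learning rate or warmup settings listed above.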