# Datasets: Switching, Custom Data, Mixing, Streaming

This guide explains how to select built-in datasets, use your own data, mix multiple datasets, and stream large corpora.

## Built-in datasets
Use `--dataset <name>` and (optionally) `--dataset_subset`.

Supported names (see `src/data/datasets.py:DATASET_CONFIGS`):
- `wikitext` (default subset: `wikitext-2-raw-v1`)
- `wikitext-103` (streaming)
- `openwebtext` (Skylion007 mirror, streaming)
- `slim-pajama` (streaming)
- `pile`, `pile-unc` (streaming)
- `c4` (subset: `en`, streaming)
- `bookcorpus` (open variant, streaming)
- `oscar` (subset: `unshuffled_deduplicated_en`, streaming)
- `wikipedia` (subset: `20231101.en`, streaming)
- `dummy` (small in-memory samples)

Example:
```bash
python train_ultrathink.py \
  --dataset c4 \
  --dataset_subset en \
  --streaming \
  --tokenizer_name gpt2 \
  --max_seq_length 512
```

## Custom dataset from local file(s)
Use `--dataset custom --data_path <path>`.

Accepted formats (via the Hugging Face `datasets` library): JSON/JSONL, plain text, and Parquet. For small local files, line-delimited JSON works well:

`data.jsonl`:
```json
{"text": "First sample"}
{"text": "Second sample"}
```

Command:
```bash
python train_ultrathink.py \
  --dataset custom \
  --data_path /path/to/data.jsonl \
  --text_column text \
  --tokenizer_name gpt2 \
  --max_seq_length 512
```

Notes:
- Set `--max_samples` to cap the number of examples for quick iteration.
- For remote URLs or globs, `TextDataset` auto-selects a datasets builder and can stream.
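The builder auto-selection mentioned above can be sketched as a simple mapping from file extension to a `datasets` builder name. This is an illustrative stand-in, not the actual `TextDataset` code; the helper name and dictionary are assumptions:

```python
# Hypothetical sketch of extension-based builder selection, similar in
# spirit to what TextDataset does for --data_path. Illustrative only.
from pathlib import Path

# Map file extensions to Hugging Face `datasets` builder names.
BUILDERS = {
    ".json": "json",
    ".jsonl": "json",
    ".txt": "text",
    ".parquet": "parquet",
}

def pick_builder(path: str) -> str:
    """Return the datasets builder name for a local file path."""
    suffix = Path(path).suffix.lower()
    try:
        return BUILDERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}")

print(pick_builder("/path/to/data.jsonl"))  # json
```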

## Mixing multiple datasets
Use `--mix_datasets` to blend datasets with weights. This overrides `--dataset`.

Example (50/50 wikitext/openwebtext):
```bash
python train_ultrathink.py \
  --mix_datasets "wikitext:0.5,openwebtext:0.5" \
  --tokenizer_name gpt2 \
  --max_seq_length 512 \
  --streaming
```

Implementation references:
- Mixing logic: `src/data/datasets.py:MixedDataset`
- Parsing and creation: `train_ultrathink.py:UltraThinkTrainer.load_datasets`
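Weighted mixing can be understood as sampling the source stream for each example in proportion to its weight. The following is a minimal sketch of that idea; the actual `MixedDataset` implementation may differ (e.g., in how it handles exhausted sources):

```python
import random

def mix_streams(streams, weights, seed=0):
    """Yield examples by picking a source stream with probability
    proportional to its weight. Exhausted streams are dropped and the
    remaining weights are renormalized implicitly by random.choices."""
    rng = random.Random(seed)
    streams = [iter(s) for s in streams]
    weights = list(weights)
    while streams:
        i = rng.choices(range(len(streams)), weights=weights, k=1)[0]
        try:
            yield next(streams[i])
        except StopIteration:
            del streams[i], weights[i]

# 50/50 blend of two toy "datasets"
wiki = ({"text": f"wiki {n}"} for n in range(3))
owt = ({"text": f"owt {n}"} for n in range(3))
for ex in mix_streams([wiki, owt], [0.5, 0.5]):
    print(ex["text"])
```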

## Streaming vs local loading
- `--streaming` uses HF datasets streaming: low memory footprint, good for very large corpora.
- Non-streaming mode loads examples into memory and supports random access by index. Prefer it for small datasets.
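The memory benefit of streaming comes from lazy iteration: examples are produced one at a time and only as many as you consume are ever materialized. A toy stand-in for a streaming dataset makes this concrete (the generator below is illustrative, not the HF datasets API):

```python
from itertools import islice

def corpus_stream():
    """Stand-in for an HF streaming dataset: yields examples lazily,
    never holding the full corpus in memory."""
    n = 0
    while True:
        yield {"text": f"document {n}"}
        n += 1

# Take only the first 3 examples; the rest are never generated.
for ex in islice(corpus_stream(), 3):
    print(ex["text"])
```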

## Tokenization parameters
- `--tokenizer_name` (default: gpt2)
- `--max_seq_length` controls `padding='max_length'` and truncation.
- Labels are masked on padding positions to avoid loss on padded tokens.
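The padding-mask convention above can be sketched in a few lines: labels copy `input_ids`, but padded positions are set to `-100`, the value PyTorch's cross-entropy loss ignores by default. The pad id and helper below are illustrative, not the repo's actual code:

```python
PAD_ID = 50256       # assumed pad id (gpt2 commonly reuses its eos token)
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default

def make_labels(input_ids, attention_mask):
    """Copy input_ids into labels, masking padded positions so they
    contribute no loss. A sketch of the convention described above."""
    return [tok if m == 1 else IGNORE_INDEX
            for tok, m in zip(input_ids, attention_mask)]

ids  = [314, 588, 8887, PAD_ID, PAD_ID]
mask = [1, 1, 1, 0, 0]
print(make_labels(ids, mask))  # [314, 588, 8887, -100, -100]
```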

## Troubleshooting
- If you see non-finite loss at start:
  - Lower LR (e.g., `5e-5`), keep `--max_seq_length 512` initially.
  - Use `--amp_warmup_steps 200` and `--dre_warmup_steps 500`.
- C4 deprecation warning: prefer `--dataset c4 --dataset_subset en`, which maps to `allenai/c4` under the hood in some environments.