# Datasets: Switching, Custom Data, Mixing, Streaming

This guide explains how to select built-in datasets, use your own data, mix multiple datasets, and stream large corpora.

## Built-in datasets
Use `--dataset <name>` and (optionally) `--dataset_subset`.

Supported names (see `src/data/datasets.py:DATASET_CONFIGS`):
- `wikitext` (default subset: `wikitext-2-raw-v1`)
- `wikitext-103` (streaming)
- `openwebtext` (Skylion007 mirror, streaming)
- `slim-pajama` (streaming)
- `pile`, `pile-unc` (streaming)
- `c4` (subset: `en`, streaming)
- `bookcorpus` (open variant, streaming)
- `oscar` (subset: `unshuffled_deduplicated_en`, streaming)
- `wikipedia` (subset: `20231101.en`, streaming)
- `dummy` (small in-memory samples)

Example:
```bash
python train_ultrathink.py \
  --dataset c4 \
  --dataset_subset en \
  --streaming \
  --tokenizer_name gpt2 \
  --max_seq_length 512
```

## Custom dataset from local file(s)
Use `--dataset custom --data_path <path>`.

Accepted formats (via the Hugging Face `datasets` library): JSON/JSONL, plain text, and Parquet. For small local files, line-delimited JSON works well:

`data.jsonl`:
```json
{"text": "First sample"}
{"text": "Second sample"}
```

Command:
```bash
python train_ultrathink.py \
  --dataset custom \
  --data_path /path/to/data.jsonl \
  --text_column text \
  --tokenizer_name gpt2 \
  --max_seq_length 512
```

Notes:
- Set `--max_samples` to cap the number of examples for quick iteration.
- For remote URLs or globs, `TextDataset` auto-selects a datasets builder and can stream.
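The builder auto-selection mentioned above can be sketched as a simple mapping from file extension to a `datasets` builder name. This is an illustrative stand-in, not the actual `TextDataset` code; the helper name and dictionary are assumptions:

```python
# Hypothetical sketch of extension-based builder selection, similar in
# spirit to what TextDataset does for --data_path. Illustrative only.
from pathlib import Path

# Map file extensions to Hugging Face `datasets` builder names.
BUILDERS = {
    ".json": "json",
    ".jsonl": "json",
    ".txt": "text",
    ".parquet": "parquet",
}

def pick_builder(path: str) -> str:
    """Return the datasets builder name for a local file path."""
    suffix = Path(path).suffix.lower()
    try:
        return BUILDERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix!r}")

print(pick_builder("/path/to/data.jsonl"))  # json
```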

## Mixing multiple datasets
Use `--mix_datasets` to blend datasets with weights. This overrides `--dataset`.

Example (50/50 wikitext/openwebtext):
```bash
python train_ultrathink.py \
  --mix_datasets "wikitext:0.5,openwebtext:0.5" \
  --tokenizer_name gpt2 \
  --max_seq_length 512 \
  --streaming
```

Implementation references:
- Mixing logic: `src/data/datasets.py:MixedDataset`
- Parsing and creation: `train_ultrathink.py:UltraThinkTrainer.load_datasets`
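Weighted mixing can be understood as sampling the source stream for each example in proportion to its weight. The following is a minimal sketch of that idea; the actual `MixedDataset` implementation may differ (e.g., in how it handles exhausted sources):

```python
import random

def mix_streams(streams, weights, seed=0):
    """Yield examples by picking a source stream with probability
    proportional to its weight. Exhausted streams are dropped and the
    remaining weights are renormalized implicitly by random.choices."""
    rng = random.Random(seed)
    streams = [iter(s) for s in streams]
    weights = list(weights)
    while streams:
        i = rng.choices(range(len(streams)), weights=weights, k=1)[0]
        try:
            yield next(streams[i])
        except StopIteration:
            del streams[i], weights[i]

# 50/50 blend of two toy "datasets"
wiki = ({"text": f"wiki {n}"} for n in range(3))
owt = ({"text": f"owt {n}"} for n in range(3))
for ex in mix_streams([wiki, owt], [0.5, 0.5]):
    print(ex["text"])
```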

## Streaming vs local loading
- `--streaming` uses HF datasets streaming: low memory footprint, good for very large corpora.
- Non-streaming mode loads examples into memory and supports random access by index. Prefer it for small datasets.
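The memory benefit of streaming comes from lazy iteration: examples are produced one at a time and only as many as you consume are ever materialized. A toy stand-in for a streaming dataset makes this concrete (the generator below is illustrative, not the HF datasets API):

```python
from itertools import islice

def corpus_stream():
    """Stand-in for an HF streaming dataset: yields examples lazily,
    never holding the full corpus in memory."""
    n = 0
    while True:
        yield {"text": f"document {n}"}
        n += 1

# Take only the first 3 examples; the rest are never generated.
for ex in islice(corpus_stream(), 3):
    print(ex["text"])
```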

## Tokenization parameters
- `--tokenizer_name` (default: gpt2)
- `--max_seq_length` controls `padding='max_length'` and truncation.
- Labels are masked on padding positions to avoid loss on padded tokens.
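The padding-mask convention above can be sketched in a few lines: labels copy `input_ids`, but padded positions are set to `-100`, the value PyTorch's cross-entropy loss ignores by default. The pad id and helper below are illustrative, not the repo's actual code:

```python
PAD_ID = 50256       # assumed pad id (gpt2 commonly reuses its eos token)
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default

def make_labels(input_ids, attention_mask):
    """Copy input_ids into labels, masking padded positions so they
    contribute no loss. A sketch of the convention described above."""
    return [tok if m == 1 else IGNORE_INDEX
            for tok, m in zip(input_ids, attention_mask)]

ids  = [314, 588, 8887, PAD_ID, PAD_ID]
mask = [1, 1, 1, 0, 0]
print(make_labels(ids, mask))  # [314, 588, 8887, -100, -100]
```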

## Troubleshooting
- If you see non-finite loss at start:
  - Lower LR (e.g., `5e-5`), keep `--max_seq_length 512` initially.
  - Use `--amp_warmup_steps 200` and `--dre_warmup_steps 500`.
- C4 deprecation warning: prefer `--dataset c4 --dataset_subset en`, which maps to `allenai/c4` under the hood in some environments.