The automatic tokenizer download was blocked, so fetch the GPT-2 tokenizer files manually:
mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
So download the tokenizer before running the train script.
Write here if you have any issues.
I will fix the GPT-2 issue, then fix GPT-3 and publish the train script.
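Before launching training, it can help to confirm all four files actually landed in `tokenizer/`. This is a minimal sketch (not part of the original scripts; the function name is hypothetical):

```python
from pathlib import Path

# Hedged sketch: check that the four GPT-2 tokenizer files downloaded
# above are present before launching the training script.
REQUIRED = ["tokenizer.json", "vocab.json", "merges.txt", "tokenizer_config.json"]

def missing_tokenizer_files(tok_dir="tokenizer"):
    """Return the list of required tokenizer files not found in tok_dir."""
    return [f for f in REQUIRED if not (Path(tok_dir) / f).is_file()]

missing = missing_tokenizer_files()
if missing:
    print("Download these before training:", ", ".join(missing))
else:
    print("Tokenizer directory is complete.")
```

If anything is listed as missing, re-run the corresponding `wget` command.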
(rocm_py310) root@jirack1:/home/kgrabko/jirackkit/src/main/python# mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
--2025-12-04 12:54:53-- https://huggingface.co/gpt2/resolve/main/tokenizer.json
Resolving huggingface.co (huggingface.co)... 3.168.73.111, 3.168.73.38, 3.168.73.129, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.111|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1355256 (1.3M) [text/plain]
Saving to: ‘tokenizer/tokenizer.json’
tokenizer/tokenizer.json 100%[===========================================================================================>] 1.29M 3.26MB/s in 0.4s
2025-12-04 12:54:54 (3.26 MB/s) - ‘tokenizer/tokenizer.json’ saved [1355256/1355256]
--2025-12-04 12:54:54-- https://huggingface.co/gpt2/resolve/main/vocab.json
Resolving huggingface.co (huggingface.co)... 3.168.73.106, 3.168.73.129, 3.168.73.111, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [text/plain]
Saving to: ‘tokenizer/vocab.json’
tokenizer/vocab.json 100%[===========================================================================================>] 1018K 2.74MB/s in 0.4s
2025-12-04 12:54:54 (2.74 MB/s) - ‘tokenizer/vocab.json’ saved [1042301/1042301]
--2025-12-04 12:54:54-- https://huggingface.co/gpt2/resolve/main/merges.txt
Resolving huggingface.co (huggingface.co)... 3.168.73.38, 3.168.73.129, 3.168.73.111, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘tokenizer/merges.txt’
tokenizer/merges.txt 100%[===========================================================================================>] 445.62K 2.87MB/s in 0.2s
2025-12-04 12:54:55 (2.87 MB/s) - ‘tokenizer/merges.txt’ saved [456318/456318]
--2025-12-04 12:54:55-- https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
Resolving huggingface.co (huggingface.co)... 3.168.73.106, 3.168.73.111, 3.168.73.38, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26 [text/plain]
Saving to: ‘tokenizer/tokenizer_config.json’
tokenizer/tokenizer_config.json 100%[===========================================================================================>] 26 --.-KB/s in 0s
2025-12-04 12:54:55 (19.7 MB/s) - ‘tokenizer/tokenizer_config.json’ saved [26/26]
(rocm_py310) root@jirack1:/home/kgrabko/jirackkit/src/main/python# python3 fine_tune_jit_with_validation_1b.py
Using device: cuda
Using existing cleaned dataset → datasets/dialogues_text_clean.txt
Loading model...
Starting from base JIT model: models/gpt_modern_1b_class.script.pt
⚠️ Warning: model.gradient_checkpointing_enable() not found on JIT model. Training will proceed without GC.
Loading and tokenizing text from datasets/dialogues_text_clean.txt
Token indices sequence length is longer than the specified maximum sequence length for this model (359379 > 1024). Running this sequence through the model will result in indexing errors
Lazy dataset: 1,333 sequences for train split (from 1,403 total)
Loading and tokenizing text from datasets/dialogues_text_clean.txt
Token indices sequence length is longer than the specified maximum sequence length for this model (359379 > 1024). Running this sequence through the model will result in indexing errors
Lazy dataset: 70 sequences for val split (from 1,403 total)
=== BEGINNING LONG-TERM TRAINING ===
Epochs: 1 | Steps (Train): 1333 | Examples (Train): 1333
Batch Size (Effective): 1 | Precision: FP32
--- Epoch 1/1 ---
Epoch 1 [TRAIN]: 33%|█████████████████████████████ | 446/1333 [09:45<26:36, 1.80s/it, loss=5.767, ppl=319.5, step=446]
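The "gradient_checkpointing_enable() not found" warning in the log above comes from the model being a loaded TorchScript module, which does not carry the usual `transformers` convenience method. A sketch of the guard presumably behind that warning (the `DummyJitModel` stand-in and the helper name are hypothetical, not from the actual script):

```python
# Hedged sketch: a torch.jit-loaded ScriptModule may not expose
# gradient_checkpointing_enable(), so probe for it before calling.
class DummyJitModel:
    pass  # stand-in with no gradient_checkpointing_enable, like the JIT model

def enable_gc_if_available(model):
    """Enable gradient checkpointing if the model supports it."""
    fn = getattr(model, "gradient_checkpointing_enable", None)
    if callable(fn):
        fn()
        return True
    print("Warning: gradient_checkpointing_enable() not found on JIT model. "
          "Training will proceed without gradient checkpointing.")
    return False

enable_gc_if_available(DummyJitModel())
```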
The log above shows how the ML script runs; just adapt the ML options to your setup.
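The "Token indices sequence length is longer than the specified maximum" message in the log is harmless here: the whole file is tokenized once (359,379 tokens) and then sliced into 1024-token training sequences. A minimal sketch of such chunking (the function name is hypothetical; the actual script may use a stride or per-dialogue splitting, which would explain its different sequence counts):

```python
def chunk_tokens(token_ids, block_size=1024):
    """Split one long token stream into fixed-size training blocks.

    Hypothetical helper: non-overlapping blocks, trailing remainder dropped.
    """
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]

blocks = chunk_tokens(list(range(359379)))
print(len(blocks))  # 350 full 1024-token blocks; the 979-token tail is dropped
```

Each block then fits within the model's 1024-token context, so the warning never turns into an indexing error at train time.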