The automatic tokenizer download was blocked, so fetch the GPT-2 tokenizer files manually:
mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
So download the tokenizer before running the train script.
Write here if you have any issues.
I will fix the GPT-2 issue, then fix GPT-3 and publish the train script.
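Before launching training, it can help to confirm all four files actually landed in `tokenizer/`. This is a minimal sketch (not part of the original scripts; the function name is hypothetical):

```python
from pathlib import Path

# Hedged sketch: check that the four GPT-2 tokenizer files downloaded
# above are present before launching the training script.
REQUIRED = ["tokenizer.json", "vocab.json", "merges.txt", "tokenizer_config.json"]

def missing_tokenizer_files(tok_dir="tokenizer"):
    """Return the list of required tokenizer files not found in tok_dir."""
    return [f for f in REQUIRED if not (Path(tok_dir) / f).is_file()]

missing = missing_tokenizer_files()
if missing:
    print("Download these before training:", ", ".join(missing))
else:
    print("Tokenizer directory is complete.")
```

If anything is listed as missing, re-run the corresponding `wget` command.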
(rocm_py310) root@jirack1:/home/kgrabko/jirackkit/src/main/python# mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
--2025-12-04 12:54:53-- https://huggingface.co/gpt2/resolve/main/tokenizer.json
Resolving huggingface.co (huggingface.co)... 3.168.73.111, 3.168.73.38, 3.168.73.129, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.111|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1355256 (1.3M) [text/plain]
Saving to: ‘tokenizer/tokenizer.json’
tokenizer/tokenizer.json 100%[===========================================================================================>] 1.29M 3.26MB/s in 0.4s
2025-12-04 12:54:54 (3.26 MB/s) - ‘tokenizer/tokenizer.json’ saved [1355256/1355256]
--2025-12-04 12:54:54-- https://huggingface.co/gpt2/resolve/main/vocab.json
Resolving huggingface.co (huggingface.co)... 3.168.73.106, 3.168.73.129, 3.168.73.111, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [text/plain]
Saving to: ‘tokenizer/vocab.json’
tokenizer/vocab.json 100%[===========================================================================================>] 1018K 2.74MB/s in 0.4s
2025-12-04 12:54:54 (2.74 MB/s) - ‘tokenizer/vocab.json’ saved [1042301/1042301]
--2025-12-04 12:54:54-- https://huggingface.co/gpt2/resolve/main/merges.txt
Resolving huggingface.co (huggingface.co)... 3.168.73.38, 3.168.73.129, 3.168.73.111, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘tokenizer/merges.txt’
tokenizer/merges.txt 100%[===========================================================================================>] 445.62K 2.87MB/s in 0.2s
2025-12-04 12:54:55 (2.87 MB/s) - ‘tokenizer/merges.txt’ saved [456318/456318]
--2025-12-04 12:54:55-- https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
Resolving huggingface.co (huggingface.co)... 3.168.73.106, 3.168.73.111, 3.168.73.38, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26 [text/plain]
Saving to: ‘tokenizer/tokenizer_config.json’
tokenizer/tokenizer_config.json 100%[===========================================================================================>] 26 --.-KB/s in 0s
2025-12-04 12:54:55 (19.7 MB/s) - ‘tokenizer/tokenizer_config.json’ saved [26/26]
(rocm_py310) root@jirack1:/home/kgrabko/jirackkit/src/main/python# python3 fine_tune_jit_with_validation_1b.py
Using device: cuda
Using existing cleaned dataset → datasets/dialogues_text_clean.txt
Loading model...
Starting from base JIT model: models/gpt_modern_1b_class.script.pt
⚠️ Warning: model.gradient_checkpointing_enable() not found on JIT model. Training will proceed without GC.
Loading and tokenizing text from datasets/dialogues_text_clean.txt
Token indices sequence length is longer than the specified maximum sequence length for this model (359379 > 1024). Running this sequence through the model will result in indexing errors
Lazy dataset: 1,333 sequences for train split (from 1,403 total)
Loading and tokenizing text from datasets/dialogues_text_clean.txt
Token indices sequence length is longer than the specified maximum sequence length for this model (359379 > 1024). Running this sequence through the model will result in indexing errors
Lazy dataset: 70 sequences for val split (from 1,403 total)
=== BEGINNING LONG-TERM TRAINING ===
Epochs: 1 | Steps (Train): 1333 | Examples (Train): 1333
Batch Size (Effective): 1 | Precision: FP32
--- Epoch 1/1 ---
Epoch 1 [TRAIN]: 33%|█████████████████████████████ | 446/1333 [09:45<26:36, 1.80s/it, loss=5.767, ppl=319.5, step=446]
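The "gradient_checkpointing_enable() not found" warning in the log above comes from the model being a loaded TorchScript module, which does not carry the usual `transformers` convenience method. A sketch of the guard presumably behind that warning (the `DummyJitModel` stand-in and the helper name are hypothetical, not from the actual script):

```python
# Hedged sketch: a torch.jit-loaded ScriptModule may not expose
# gradient_checkpointing_enable(), so probe for it before calling.
class DummyJitModel:
    pass  # stand-in with no gradient_checkpointing_enable, like the JIT model

def enable_gc_if_available(model):
    """Enable gradient checkpointing if the model supports it."""
    fn = getattr(model, "gradient_checkpointing_enable", None)
    if callable(fn):
        fn()
        return True
    print("Warning: gradient_checkpointing_enable() not found on JIT model. "
          "Training will proceed without gradient checkpointing.")
    return False

enable_gc_if_available(DummyJitModel())
```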
The log above shows how the ML script runs; just adapt the ML options to your setup.
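The "Token indices sequence length is longer than the specified maximum" message in the log is harmless here: the whole file is tokenized once (359,379 tokens) and then sliced into 1024-token training sequences. A minimal sketch of such chunking (the function name is hypothetical; the actual script may use a stride or per-dialogue splitting, which would explain its different sequence counts):

```python
def chunk_tokens(token_ids, block_size=1024):
    """Split one long token stream into fixed-size training blocks.

    Hypothetical helper: non-overlapping blocks, trailing remainder dropped.
    """
    return [token_ids[i:i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)]

blocks = chunk_tokens(list(range(359379)))
print(len(blocks))  # 350 full 1024-token blocks; the 979-token tail is dropped
```

Each block then fits within the model's 1024-token context, so the warning never turns into an indexing error at train time.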