Auto-download of tokenizer was blocked

#1
by kgrabko - opened
Center Business Solutions inc org

mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
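After the four downloads finish, a quick sanity check that the local tokenizer directory is complete can save a failed training run. This is a minimal sketch; the directory name `tokenizer` and the four file names simply mirror the wget commands above:

```python
from pathlib import Path

# The four files fetched by the wget commands above.
REQUIRED_FILES = [
    "tokenizer.json",
    "vocab.json",
    "merges.txt",
    "tokenizer_config.json",
]

def missing_tokenizer_files(tokenizer_dir: str) -> list[str]:
    """Return the names of any required tokenizer files not found in tokenizer_dir."""
    base = Path(tokenizer_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

If the returned list is empty, the directory has everything GPT-2's tokenizer needs and loading it from the local path should no longer require network access.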


So, download the tokenizer files before running the train script.
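If the script still tries to reach huggingface.co after the files are in place, transformers can be forced to use only local files via its offline environment variables (this assumes the train script loads the tokenizer with `from_pretrained` pointed at the local `tokenizer/` directory):

```shell
# Tell transformers / huggingface_hub to use only locally cached files.
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
```

With these set, any attempt to auto-download will fail fast with a clear error instead of hanging on a blocked connection.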


Write here if you have any issues.


I will fix the GPT-2 issue, then fix GPT-3 and publish the train script.


(rocm_py310) root@jirack1:/home/kgrabko/jirackkit/src/main/python# mkdir -p tokenizer
wget -O tokenizer/tokenizer.json https://huggingface.co/gpt2/resolve/main/tokenizer.json
wget -O tokenizer/vocab.json https://huggingface.co/gpt2/resolve/main/vocab.json
wget -O tokenizer/merges.txt https://huggingface.co/gpt2/resolve/main/merges.txt
wget -O tokenizer/tokenizer_config.json https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
--2025-12-04 12:54:53-- https://huggingface.co/gpt2/resolve/main/tokenizer.json
Resolving huggingface.co (huggingface.co)... 3.168.73.111, 3.168.73.38, 3.168.73.129, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.111|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1355256 (1.3M) [text/plain]
Saving to: ‘tokenizer/tokenizer.json’

tokenizer/tokenizer.json 100%[===========================================================================================>] 1.29M 3.26MB/s in 0.4s

2025-12-04 12:54:54 (3.26 MB/s) - ‘tokenizer/tokenizer.json’ saved [1355256/1355256]

--2025-12-04 12:54:54-- https://huggingface.co/gpt2/resolve/main/vocab.json
Resolving huggingface.co (huggingface.co)... 3.168.73.106, 3.168.73.129, 3.168.73.111, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1042301 (1018K) [text/plain]
Saving to: ‘tokenizer/vocab.json’

tokenizer/vocab.json 100%[===========================================================================================>] 1018K 2.74MB/s in 0.4s

2025-12-04 12:54:54 (2.74 MB/s) - ‘tokenizer/vocab.json’ saved [1042301/1042301]

--2025-12-04 12:54:54-- https://huggingface.co/gpt2/resolve/main/merges.txt
Resolving huggingface.co (huggingface.co)... 3.168.73.38, 3.168.73.129, 3.168.73.111, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.38|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 456318 (446K) [text/plain]
Saving to: ‘tokenizer/merges.txt’

tokenizer/merges.txt 100%[===========================================================================================>] 445.62K 2.87MB/s in 0.2s

2025-12-04 12:54:55 (2.87 MB/s) - ‘tokenizer/merges.txt’ saved [456318/456318]

--2025-12-04 12:54:55-- https://huggingface.co/gpt2/resolve/main/tokenizer_config.json
Resolving huggingface.co (huggingface.co)... 3.168.73.106, 3.168.73.111, 3.168.73.38, ...
Connecting to huggingface.co (huggingface.co)|3.168.73.106|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26 [text/plain]
Saving to: ‘tokenizer/tokenizer_config.json’

tokenizer/tokenizer_config.json 100%[===========================================================================================>] 26 --.-KB/s in 0s

2025-12-04 12:54:55 (19.7 MB/s) - ‘tokenizer/tokenizer_config.json’ saved [26/26]

(rocm_py310) root@jirack1:/home/kgrabko/jirackkit/src/main/python# python3 fine_tune_jit_with_validation_1b.py
Using device: cuda
Using existing cleaned dataset → datasets/dialogues_text_clean.txt
Loading model...
Starting from base JIT model: models/gpt_modern_1b_class.script.pt
⚠️ Warning: model.gradient_checkpointing_enable() not found on JIT model. Training will proceed without GC.
Loading and tokenizing text from datasets/dialogues_text_clean.txt
Token indices sequence length is longer than the specified maximum sequence length for this model (359379 > 1024). Running this sequence through the model will result in indexing errors
Lazy dataset: 1,333 sequences for train split (from 1,403 total)
Loading and tokenizing text from datasets/dialogues_text_clean.txt
Token indices sequence length is longer than the specified maximum sequence length for this model (359379 > 1024). Running this sequence through the model will result in indexing errors
Lazy dataset: 70 sequences for val split (from 1,403 total)

=== BEGINNING LONG-TERM TRAINING ===
Epochs: 1 | Steps (Train): 1333 | Examples (Train): 1333
Batch Size (Effective): 1 | Precision: FP32

--- Epoch 1/1 ---
Epoch 1 [TRAIN]: 33%|█████████████████████████████ | 446/1333 [09:45<26:36, 1.80s/it, loss=5.767, ppl=319.5, step=446]
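The "Token indices sequence length ... (359379 > 1024)" warning in the log above is expected at this stage: the whole cleaned file is tokenized as one long sequence and only afterwards split into model-sized blocks, so the full sequence is never fed to the model directly. A minimal sketch of that block-splitting step (the block size 1024 is GPT-2's context length; dropping the short tail is an assumption, and the actual script evidently uses its own split, since it reports 1,403 sequences):

```python
def split_into_blocks(token_ids: list[int], block_size: int = 1024) -> list[list[int]]:
    """Split one long token sequence into fixed-size training blocks.

    The final partial block is dropped so every example has exactly
    block_size tokens (a common choice for causal-LM fine-tuning).
    """
    n_blocks = len(token_ids) // block_size
    return [
        token_ids[i * block_size:(i + 1) * block_size]
        for i in range(n_blocks)
    ]
```

As long as some such split happens before the forward pass, the warning is harmless and training proceeds normally, as the progress bar above shows.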


See above for how to run the ML script; just adapt the ML options to your setup.
