| [launch] PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| [launch] pip install bitsandbytes==0.43.1 |
| WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the |
| [notice] A new release of pip is available: 24.2 -> 26.1.1 |
| [notice] To update, run: python -m pip install |
| [launch] bitsandbytes=0.43.1 |
| [launch] pip install transformers |
| The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`. |
| [launch] transformers=4.44.0 |
| [launch] pip install peft |
| [launch] peft=0.12.0 |
| [launch] accelerate=1.13.0 |
| [launch] python3 -u /workspace/p21hr/train_p21h_v3.py |
| [P21H] device=cuda dtype=torch.bfloat16 |
| [mix] multi-wiki=28,986,824 chars, anima=224,198,521 chars |
| [mix] multi-wiki chunks=28,308, anima chunks=218,944 |
| [mix] kept multi-wiki=28,308 (50.2MB), anima=1 (0.0MB) |
| [mix] wrote 28,309 records to /workspace/p21hr/mixed_corpus_v3.jsonl (50.2MB, wiki=28308, anima=1) |
| [mix] sha256=7e62fd32034ced9f5ab5652ad9ed211b513ebc917b230a8fc4466adaf3c32d22 |
| [P21H][init=qwen] V3β - Qwen warm-start from Qwen/Qwen2.5-1.5B |
| [from_qwen] loading Qwen/Qwen2.5-1.5B |
| [from_qwen] qwen: vocab=151936 d=1536 L=28 n_head=12 n_kv_head=2 -> v3_n_kv_head=4 rope_base=1000000.0 |
| [from_qwen] init OK - total params 2999.7M |
| [P21H] model params total=2,999,735,296 (2999.74M) |
| [P21H] BEFORE-train per-lang OOD eval (greedy) |
| [P21H] BEFORE en greedy: {'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} |
| [P21H] BEFORE ko greedy: {'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} |
| [P21H] BEFORE zh greedy: {'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} |
| [P21H] BEFORE ru greedy: {'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} |
| [P21H] BEFORE ja greedy: {'GENERALIZE': 9, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 1, 'ERROR': 0} |
| [P21H] optimizer=PagedAdamW8bit (bnb 0.43.1) |
| Token indices sequence length is longer than the specified maximum sequence length for this model (13836908 > 131072). Running this sequence through the model will result in indexing errors |
| [P21H] tokenized mixed corpus: 6,000,000 tokens (block=512) |
| [P21H] steps=5000 bsz=2 block=512 peak_lr=5e-05 warmup=100 lambda_mitosis=0.0 init=qwen mitosis_max=16 ckpt_every=500 osc_thr=0.5 es_patience=8 |
| [P21H] step= 1 lr=5.00e-07 CE=14.7927 total=14.7927 pool=2 splits=0 phi=0.7120 t=5s |
| [P21H] ckpt save (best) step=1 CE=14.7927 -> /workspace/p21hr/out_main/ckpt_best.pt (5735.8 MB) |
| [P21H] step= 125 lr=5.00e-05 CE=8.1634 total=8.1634 pool=16 splits=14 phi=0.6579 t=76s |
| [P21H] ckpt save (best) step=125 CE=8.1634 -> /workspace/p21hr/out_main/ckpt_best.pt (5735.8 MB) |
| [P21H] step= 250 lr=4.99e-05 CE=7.3439 total=7.3439 pool=16 splits=14 phi=0.6579 t=147s |
| [P21H] ckpt save (best) step=250 CE=7.3439 -> /workspace/p21hr/out_main/ckpt_best.pt (5735.8 MB) |
| [P21H] step= 375 lr=4.97e-05 CE=7.1829 total=7.1829 pool=16 splits=14 phi=0.6579 t=219s |
| [P21H] ckpt save (best) step=375 CE=7.1829 -> /workspace/p21hr/out_main/ckpt_best.pt (5735.8 MB) |
| [P21H] step= 500 lr=4.93e-05 CE=6.8499 total=6.8499 pool=16 splits=14 phi=0.6579 t=293s |
| [P21H] ckpt save (best) step=500 CE=6.8499 -> /workspace/p21hr/out_main/ckpt_best.pt (5735.8 MB) |
| [P21H] ckpt save (step500) step=500 CE=6.8499 -> /workspace/p21hr/out_main/ckpt_step500.pt (5735.8 MB) |
| [P21H] kosmos anchor written: v3_emit_step500_ru_ru_factual_geo.kosmos |
| [P21H] step= 625 lr=4.87e-05 CE=7.4549 total=7.4549 pool=16 splits=14 phi=0.6579 t=367s |
| [P21H] step= 750 lr=4.81e-05 CE=5.6695 total=5.6695 pool=16 splits=14 phi=0.6579 t=432s |
| [P21H] ckpt save (best) step=750 CE=5.6695 -> /workspace/p21hr/out_main/ckpt_best.pt (5735.8 MB) |
| [P21H] step= 875 lr=4.73e-05 CE=6.3453 total=6.3453 pool=16 splits=14 phi=0.6579 t=503s |
| [P21H] step= 1000 lr=4.64e-05 CE=6.4940 total=6.4940 pool=16 splits=14 phi=0.6579 t=567s |
| [P21H] ckpt save (step1000) step=1000 CE=6.4940 -> /workspace/p21hr/out_main/ckpt_step1000.pt (5735.8 MB) |
| [P21H] kosmos anchor written: v3_emit_step1000_ru_ru_factual_geo.kosmos |
| [P21H] step= 1125 lr=4.53e-05 CE=6.5491 total=6.5491 pool=16 splits=14 phi=0.6579 t=637s |
| [P21H] ckpt save (osc_step1125) step=1125 CE=6.5491 -> /workspace/p21hr/out_main/ckpt_osc_step1125.pt (5735.8 MB) |
| [P21H] EARLY STOP: CE re-divergence: recent_mean=6.4628 > best_CE=5.6695+0.5 (mode collapse @ step 1125) |
| [P21H] TRAIN DONE wall=641.2s final_CE=6.5491 actual_steps=1125/5000 early_stopped=True reason=CE re-divergence: recent_mean=6.4628 > best_CE=5.6695+0.5 (mode collapse @ step 1125) best_CE=5.6695@step750 |
| [P21H] ckpt saved -> /workspace/p21hr/out_main/ckpt.pt (5735.8 MB) |
| [P21H] AFTER per-lang OOD eval (greedy + sample) |
| [P21H] AFTER en: greedy={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} sample={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} verdict=WEAK score=0/20 gen=20 coh=0 |
| [P21H] AFTER ko: greedy={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} sample={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} verdict=PARTIAL score=15/20 gen=20 coh=15 |
| [P21H] AFTER zh: greedy={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} sample={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} verdict=WEAK score=2/20 gen=20 coh=2 |
| [P21H] AFTER ru: greedy={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} sample={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} verdict=WEAK score=5/20 gen=20 coh=5 |
| [P21H] AFTER ja: greedy={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} sample={'GENERALIZE': 10, 'MEMORIZE': 0, 'MEM_PARTIAL': 0, 'EMPTY': 0, 'ERROR': 0} verdict=WEAK score=6/20 gen=20 coh=6 |
| [P21H] AFTER anima Eval1 |
| [P21H] AGG: STRONG=0 PARTIAL=1 WEAK=4 PURE_MEMORIZE=0 -> FAIL |
| [P21H] anima_register_hits=0/20 register_regress=True |
| [P21H] KOSMOS anchors total=7 (during_train=2 final=5) |
| [P21H] DONE -> /workspace/p21hr/out_main |