OpenTransformer/AGILLM4-diffusionblocks

Scott/Codex committed 30 days ago

Commit 14e0da5

Parent: 690cf55

Set official DBlock run to 90-day token target

Files changed (2)

README.md +4 -0
relaunch_agillm4_dblock_sg2.sh +3 -3

README.md CHANGED

@@ -89,4 +89,8 @@ Sublinear coverage update 2026-05-29: the saved AGILLM-4 trainer snapshot now ma
 Profiling/speed update 2026-05-29: added in-process DBlock profiling (`--profile_steps`, `--profile_log_every`) after external ptrace profiling was blocked on Vast. The profile showed the bottleneck is transformer recompute/backward, not fused CE or the optimizer: at B=2 full checkpointing, AR backward averaged ~605 ms/step, AR forward ~184 ms, CE ~4.5 ms, optimizer ~17 ms. Tested speed levers live: no checkpointing OOMed at B=2 and fell to B=1, selective checkpoint stride=2 fit but hugged VRAM and reached ~2.94k tok/s, B=5/6 hit a memory-pressure cliff, while B=4 with full DBlock checkpointing was the best stable official setting (~3.0k tok/s warm window, ~13.2 GB tensor peak / ~17.6 GB reserved, ETA ~269-275 days). The live relaunch now uses `--batch_size 4 --grad_checkpoint --dblock_checkpoint_stride 1` and leaves selective checkpointing available for future context/batch tradeoffs.
+90-day target update 2026-05-29: the live Vast line now uses a compute-bounded 35 tokens/parameter target (`TOKEN_PARAM_RATIO=${TOKEN_PARAM_RATIO:-35}` in `relaunch_agillm4_dblock_sg2.sh`) instead of the earlier 100 tokens/parameter target. With 716,595,202 trainable parameters this sets the finish line to 25,080,832,070 tokens. At the observed B=4 DBlock throughput (~3.04k tok/s shortly after restart, improving toward ~3.08k tok/s), the remaining ETA is under 90 days while preserving the same low-VRAM DBlock/sublinear/tied-head training line. This is a deliberately compute-bounded official run; the ratio can be raised later if evaluations show continued strong returns.
 License: Apache-2.0 (matching the upstream method).
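
The token target quoted in the README addition can be checked with a few lines of shell. This is a minimal sketch, assuming bash with 64-bit integer arithmetic; the helper names `token_target` and `eta_days` are hypothetical and do not exist in the repository, and the tokens-already-processed argument is a placeholder because the commit does not state that number.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the README arithmetic; not part of the repo.

# token_target PARAMS RATIO -> total training tokens (PARAMS * RATIO)
token_target() {
  echo $(( $1 * $2 ))
}

# eta_days TARGET TOKENS_DONE TOK_PER_SEC -> whole days remaining (rounded down)
eta_days() {
  local remaining=$(( $1 - $2 ))
  if (( remaining < 0 )); then
    remaining=0
  fi
  echo $(( remaining / $3 / 86400 ))
}

PARAMS=716595202                              # trainable parameters, from the README
TOKEN_PARAM_RATIO="${TOKEN_PARAM_RATIO:-35}"  # same default as the relaunch script
TARGET="$(token_target "$PARAMS" "$TOKEN_PARAM_RATIO")"
echo "target tokens: $TARGET"
echo "days from zero tokens at 3040 tok/s: $(eta_days "$TARGET" 0 3040)"
```

With the default ratio the target comes out to 25,080,832,070 tokens, matching the README. Starting from zero tokens at ~3.04k tok/s the same arithmetic gives roughly 95 days, which suggests the "under 90 days" figure counts only the tokens still to be processed after the resume.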

relaunch_agillm4_dblock_sg2.sh CHANGED

@@ -8,9 +8,10 @@ export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,expandable_segments:True
 export AGILLM_ATTN_BACKEND=sublinear
 [ -f /root/.cache/huggingface/token ] && { export HF_TOKEN="$(tr -d '\r\n' </root/.cache/huggingface/token)"; export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"; }
 SAVE_DIR=/workspace/agillm4_4090_ckpts
+TOKEN_PARAM_RATIO="${TOKEN_PARAM_RATIO:-35}"
 CKPT="$(ls -1t "$SAVE_DIR"/pretrain_step*.pt 2>/dev/null | head -1)"
 exec >> /workspace/agillm4_floor_train.log 2>&1
-echo "RELAUNCH_AGILLM4_DBLOCK_SG2 $(date -u +%Y-%m-%dT%H:%M:%SZ) resume=$CKPT (batch4 official speed-optimized + sublinear v2)"
+echo "RELAUNCH_AGILLM4_DBLOCK_SG2 $(date -u +%Y-%m-%dT%H:%M:%SZ) resume=$CKPT (90day ratio35 + batch4 + sublinear v2)"
 exec python -u nB300_agillm4.py train --preset agillm4_floor --resume "$CKPT" \
   --dblock --dblock_blocks 4 --dblock_schedule loss_balanced --dblock_warmup_steps 16 \
   --dblock_sigma_curriculum_steps 2000 --dblock_log_every 25 --dblock_objective_mode stochastic \
@@ -20,6 +21,5 @@ exec python -u nB300_agillm4.py train --preset agillm4_floor --resume "$CKPT" \
   --sublinear_window 128 --sublinear_stride 128 --sublinear_max_anchors 128 --sublinear_chunk 128 \
   --sublinear_sinks 4 --sublinear_recent_anchors 64 --no-sublinear_pooled_landmarks \
   --grad_checkpoint --dblock_checkpoint_stride 1 --optimizer paged_adamw8bit --sat_every 4 --nat_every 4 --nat_max_tokens 768 --nat_mask_ratio 0.5 \
-  --token_param_ratio 100 --save_dir "$SAVE_DIR" --save_every_sec 86400 --heartbeat_every_sec 300 \
-  \
+  --token_param_ratio "$TOKEN_PARAM_RATIO" --save_dir "$SAVE_DIR" --save_every_sec 86400 --heartbeat_every_sec 300 \
   --empty_cache_every_steps 0 --delta_every_steps 25000 --delta_max_keep 1 --max_ckpts 1
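
The `${TOKEN_PARAM_RATIO:-35}` expansion added in this commit falls back to 35 only when the variable is unset or empty, so a caller can change the ratio from the environment without editing the script. The sketch below illustrates that behavior; `resolve_ratio` is a hypothetical helper, and its integer validation is an addition for illustration (the committed script uses the bare expansion and does no validation).

```shell
#!/usr/bin/env bash
# Hypothetical helper; the committed script uses the bare expansion only.

# resolve_ratio [VALUE] -> VALUE if set and non-empty, else 35.
# Returns status 1 unless the result is a positive integer without leading zeros.
resolve_ratio() {
  local ratio="${1:-35}"
  case "$ratio" in
    ''|*[!0-9]*|0*)
      echo "invalid TOKEN_PARAM_RATIO: $ratio" >&2
      return 1
      ;;
  esac
  echo "$ratio"
}

TOKEN_PARAM_RATIO="$(resolve_ratio "${TOKEN_PARAM_RATIO:-}")" || exit 1
echo "token_param_ratio=$TOKEN_PARAM_RATIO"
```

Under the committed script, running `TOKEN_PARAM_RATIO=100 bash relaunch_agillm4_dblock_sg2.sh` would restore the earlier 100 tokens/parameter target, while a plain invocation uses 35.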