Scott/Codex committed on
Commit 14e0da5 · 1 Parent(s): 690cf55

Set official DBlock run to 90-day token target

Files changed (2)
  1. README.md +4 -0
  2. relaunch_agillm4_dblock_sg2.sh +3 -3
README.md CHANGED
@@ -89,4 +89,8 @@ Sublinear coverage update 2026-05-29: the saved AGILLM-4 trainer snapshot now ma
  Profiling/speed update 2026-05-29: added in-process DBlock profiling (`--profile_steps`, `--profile_log_every`) after external ptrace profiling was blocked on Vast. The profile showed the bottleneck is transformer recompute/backward, not fused CE or the optimizer: at B=2 full checkpointing, AR backward averaged ~605 ms/step, AR forward ~184 ms, CE ~4.5 ms, optimizer ~17 ms. Tested speed levers live: no checkpointing OOMed at B=2 and fell to B=1, selective checkpoint stride=2 fit but hugged VRAM and reached ~2.94k tok/s, B=5/6 hit a memory-pressure cliff, while B=4 with full DBlock checkpointing was the best stable official setting (~3.0k tok/s warm window, ~13.2 GB tensor peak / ~17.6 GB reserved, ETA ~269-275 days). The live relaunch now uses `--batch_size 4 --grad_checkpoint --dblock_checkpoint_stride 1` and leaves selective checkpointing available for future context/batch tradeoffs.
 
 
+
+ 90-day target update 2026-05-29: the live Vast line now uses a compute-bounded 35 tokens/parameter target (`TOKEN_PARAM_RATIO=${TOKEN_PARAM_RATIO:-35}` in `relaunch_agillm4_dblock_sg2.sh`) instead of the earlier 100 tokens/parameter target. With 716,595,202 trainable parameters this sets the finish line to 25,080,832,070 tokens. At the observed B=4 DBlock throughput (~3.04k tok/s shortly after restart, improving toward ~3.08k tok/s), the remaining ETA is under 90 days while preserving the same low-VRAM DBlock/sublinear/tied-head training line. This is a deliberately compute-bounded official run; the ratio can be raised later if evaluations show continued strong returns.
+
+
  License: Apache-2.0 (matching the upstream method).
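The finish-line arithmetic in the README note can be checked with a few lines of shell. This is a minimal sketch, not part of the commit: the variable names are illustrative, `TOK_PER_SEC` uses the ~3.04k tok/s figure quoted in the note, and `TOKENS_SEEN` is a hypothetical placeholder because the commit does not state how many tokens the run had already consumed at relaunch.

```shell
#!/usr/bin/env bash
# Sketch only: reproduces the README's token-target arithmetic.
set -euo pipefail

PARAMS=716595202                               # trainable parameters (from the README note)
TOKEN_PARAM_RATIO="${TOKEN_PARAM_RATIO:-35}"   # same default as the relaunch script
TOK_PER_SEC="${TOK_PER_SEC:-3040}"             # observed B=4 throughput, ~3.04k tok/s
TOKENS_SEEN="${TOKENS_SEEN:-0}"                # hypothetical: tokens already trained on

# Finish line = parameters x tokens-per-parameter (fits in bash 64-bit integers).
TARGET_TOKENS=$(( PARAMS * TOKEN_PARAM_RATIO ))
REMAINING=$(( TARGET_TOKENS - TOKENS_SEEN ))
# Integer days, rounded down.
ETA_DAYS=$(( REMAINING / TOK_PER_SEC / 86400 ))

echo "target_tokens=$TARGET_TOKENS remaining=$REMAINING eta_days~$ETA_DAYS"
```

With the defaults this prints a target of 25080832070 tokens, matching the README, and about 95 days for the full token budget from zero. Read that way, the note's "under 90 days" remaining figure presupposes progress already banked before the relaunch (roughly 1.4B tokens at 3.04k tok/s), or the slightly higher ~3.08k tok/s throughput, or both; pass a real `TOKENS_SEEN` value to get the remaining ETA for a given checkpoint.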
relaunch_agillm4_dblock_sg2.sh CHANGED
@@ -8,9 +8,10 @@ export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,expandable_segments:True
  export AGILLM_ATTN_BACKEND=sublinear
  [ -f /root/.cache/huggingface/token ] && { export HF_TOKEN="$(tr -d '\r\n' </root/.cache/huggingface/token)"; export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"; }
  SAVE_DIR=/workspace/agillm4_4090_ckpts
+ TOKEN_PARAM_RATIO="${TOKEN_PARAM_RATIO:-35}"
  CKPT="$(ls -1t "$SAVE_DIR"/pretrain_step*.pt 2>/dev/null | head -1)"
  exec >> /workspace/agillm4_floor_train.log 2>&1
- echo "RELAUNCH_AGILLM4_DBLOCK_SG2 $(date -u +%Y-%m-%dT%H:%M:%SZ) resume=$CKPT (batch4 official speed-optimized + sublinear v2)"
+ echo "RELAUNCH_AGILLM4_DBLOCK_SG2 $(date -u +%Y-%m-%dT%H:%M:%SZ) resume=$CKPT (90day ratio35 + batch4 + sublinear v2)"
  exec python -u nB300_agillm4.py train --preset agillm4_floor --resume "$CKPT" \
  --dblock --dblock_blocks 4 --dblock_schedule loss_balanced --dblock_warmup_steps 16 \
  --dblock_sigma_curriculum_steps 2000 --dblock_log_every 25 --dblock_objective_mode stochastic \
@@ -20,6 +21,5 @@ exec python -u nB300_agillm4.py train --preset agillm4_floor --resume "$CKPT" \
  --sublinear_window 128 --sublinear_stride 128 --sublinear_max_anchors 128 --sublinear_chunk 128 \
  --sublinear_sinks 4 --sublinear_recent_anchors 64 --no-sublinear_pooled_landmarks \
  --grad_checkpoint --dblock_checkpoint_stride 1 --optimizer paged_adamw8bit --sat_every 4 --nat_every 4 --nat_max_tokens 768 --nat_mask_ratio 0.5 \
- --token_param_ratio 100 --save_dir "$SAVE_DIR" --save_every_sec 86400 --heartbeat_every_sec 300 \
- \
+ --token_param_ratio "$TOKEN_PARAM_RATIO" --save_dir "$SAVE_DIR" --save_every_sec 86400 --heartbeat_every_sec 300 \
  --empty_cache_every_steps 0 --delta_every_steps 25000 --delta_max_keep 1 --max_ckpts 1
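
The `${TOKEN_PARAM_RATIO:-35}` expansion added in this change means a value supplied through the environment wins when it is set and non-empty, and 35 is used otherwise. A minimal sketch of that behaviour follows. `resolve_ratio` is a hypothetical helper written for illustration, and the positive-integer check is an added safeguard that the relaunch script itself does not perform (whether the trainer accepts non-integer ratios is not stated in the commit).

```shell
#!/usr/bin/env bash
# Sketch only: shows how "${VAR:-default}" resolves, plus an optional sanity check.

resolve_ratio() {
  # Use the caller's TOKEN_PARAM_RATIO if set and non-empty, else fall back to 35.
  local ratio="${TOKEN_PARAM_RATIO:-35}"
  # Hypothetical safeguard (not in the script): accept positive integers only.
  if ! [[ "$ratio" =~ ^[1-9][0-9]*$ ]]; then
    echo "invalid TOKEN_PARAM_RATIO: $ratio" >&2
    return 1
  fi
  echo "$ratio"
}

unset TOKEN_PARAM_RATIO
DEFAULT_RATIO="$(resolve_ratio)"                          # unset: falls back to 35
OVERRIDE_RATIO="$(TOKEN_PARAM_RATIO=100 resolve_ratio)"   # environment value wins
EMPTY_RATIO="$(TOKEN_PARAM_RATIO= resolve_ratio)"         # empty: also falls back to 35

echo "default=$DEFAULT_RATIO override=$OVERRIDE_RATIO empty=$EMPTY_RATIO"
```

Under this pattern, restoring the earlier 100 tokens/parameter target for a single launch would look like `TOKEN_PARAM_RATIO=100 bash relaunch_agillm4_dblock_sg2.sh`, with no edit to the file.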