Scott/Codex committed on
Commit · 14e0da5
1 Parent(s): 690cf55
Set official DBlock run to 90-day token target
- README.md +4 -0
- relaunch_agillm4_dblock_sg2.sh +3 -3
README.md
CHANGED
@@ -89,4 +89,8 @@ Sublinear coverage update 2026-05-29: the saved AGILLM-4 trainer snapshot now ma
 Profiling/speed update 2026-05-29: added in-process DBlock profiling (`--profile_steps`, `--profile_log_every`) after external ptrace profiling was blocked on Vast. The profile showed the bottleneck is transformer recompute/backward, not fused CE or the optimizer: at B=2 full checkpointing, AR backward averaged ~605 ms/step, AR forward ~184 ms, CE ~4.5 ms, optimizer ~17 ms. Tested speed levers live: no checkpointing OOMed at B=2 and fell to B=1, selective checkpoint stride=2 fit but hugged VRAM and reached ~2.94k tok/s, B=5/6 hit a memory-pressure cliff, while B=4 with full DBlock checkpointing was the best stable official setting (~3.0k tok/s warm window, ~13.2 GB tensor peak / ~17.6 GB reserved, ETA ~269-275 days). The live relaunch now uses `--batch_size 4 --grad_checkpoint --dblock_checkpoint_stride 1` and leaves selective checkpointing available for future context/batch tradeoffs.


+
+90-day target update 2026-05-29: the live Vast line now uses a compute-bounded 35 tokens/parameter target (`TOKEN_PARAM_RATIO=${TOKEN_PARAM_RATIO:-35}` in `relaunch_agillm4_dblock_sg2.sh`) instead of the earlier 100 tokens/parameter target. With 716,595,202 trainable parameters this sets the finish line to 25,080,832,070 tokens. At the observed B=4 DBlock throughput (~3.04k tok/s shortly after restart, improving toward ~3.08k tok/s), the remaining ETA is under 90 days while preserving the same low-VRAM DBlock/sublinear/tied-head training line. This is a deliberately compute-bounded official run; the ratio can be raised later if evaluations show continued strong returns.
+
+
 License: Apache-2.0 (matching the upstream method).
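The token target quoted in the 90-day update is plain integer arithmetic, so it can be checked in the shell. The sketch below is illustrative and not part of the commit: the variable names are invented here, the parameter count and ratio are the figures quoted in the README text, and 3040 tok/s is the post-restart throughput quoted there. The day count it derives covers the whole token budget from zero. The README's "under 90 days" figure is the remaining time for a run that resumes from an existing checkpoint, and the number of tokens already consumed is not stated on this page.

```shell
#!/usr/bin/env bash
# Illustrative check of the README arithmetic; all names are hypothetical.
params=716595202      # trainable parameters (figure from the README text)
ratio=35              # tokens per parameter (new default in the script)
toks_per_sec=3040     # observed B=4 throughput shortly after restart

# Finish line in tokens: parameters times tokens-per-parameter.
target_tokens=$(( params * ratio ))

# Wall-clock time to process the full budget from zero (integer division).
full_run_secs=$(( target_tokens / toks_per_sec ))
full_run_days=$(( full_run_secs / 86400 ))

echo "target_tokens=$target_tokens"
echo "full_run_days=$full_run_days"
```

The product matches the 25,080,832,070-token finish line stated in the README. The full-budget estimate comes out slightly above 90 days at 3.04k tok/s, which is consistent with the "remaining ETA" wording only because the run resumes partway through.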
relaunch_agillm4_dblock_sg2.sh
CHANGED
@@ -8,9 +8,10 @@ export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,expandable_segments:True
 export AGILLM_ATTN_BACKEND=sublinear
 [ -f /root/.cache/huggingface/token ] && { export HF_TOKEN="$(tr -d '\r\n' </root/.cache/huggingface/token)"; export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"; }
 SAVE_DIR=/workspace/agillm4_4090_ckpts
+TOKEN_PARAM_RATIO="${TOKEN_PARAM_RATIO:-35}"
 CKPT="$(ls -1t "$SAVE_DIR"/pretrain_step*.pt 2>/dev/null | head -1)"
 exec >> /workspace/agillm4_floor_train.log 2>&1
-echo "RELAUNCH_AGILLM4_DBLOCK_SG2 $(date -u +%Y-%m-%dT%H:%M:%SZ) resume=$CKPT (
+echo "RELAUNCH_AGILLM4_DBLOCK_SG2 $(date -u +%Y-%m-%dT%H:%M:%SZ) resume=$CKPT (90day ratio35 + batch4 + sublinear v2)"
 exec python -u nB300_agillm4.py train --preset agillm4_floor --resume "$CKPT" \
 --dblock --dblock_blocks 4 --dblock_schedule loss_balanced --dblock_warmup_steps 16 \
 --dblock_sigma_curriculum_steps 2000 --dblock_log_every 25 --dblock_objective_mode stochastic \
@@ -20,6 +21,5 @@ exec python -u nB300_agillm4.py train --preset agillm4_floor --resume "$CKPT" \
 --sublinear_window 128 --sublinear_stride 128 --sublinear_max_anchors 128 --sublinear_chunk 128 \
 --sublinear_sinks 4 --sublinear_recent_anchors 64 --no-sublinear_pooled_landmarks \
 --grad_checkpoint --dblock_checkpoint_stride 1 --optimizer paged_adamw8bit --sat_every 4 --nat_every 4 --nat_max_tokens 768 --nat_mask_ratio 0.5 \
---token_param_ratio
-\
+--token_param_ratio "$TOKEN_PARAM_RATIO" --save_dir "$SAVE_DIR" --save_every_sec 86400 --heartbeat_every_sec 300 \
 --empty_cache_every_steps 0 --delta_every_steps 25000 --delta_max_keep 1 --max_ckpts 1
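The added `TOKEN_PARAM_RATIO="${TOKEN_PARAM_RATIO:-35}"` line uses the shell's default-value expansion: the variable keeps whatever the caller exported, and falls back to 35 when it is unset or empty. A minimal sketch of that behavior follows. `resolve_ratio` is a hypothetical helper written for this illustration, and the integer check is an assumption about what the trainer accepts, not something the diff shows.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the script's default-with-override pattern.
resolve_ratio() {
  # ${VAR:-default} expands to the default when VAR is unset or empty.
  local ratio="${TOKEN_PARAM_RATIO:-35}"
  # Assumption for this sketch: only non-negative integers are valid ratios.
  case "$ratio" in
    ''|*[!0-9]*)
      echo "TOKEN_PARAM_RATIO must be an integer, got: $ratio" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$ratio"
}

# With nothing set by the caller, the default applies.
unset TOKEN_PARAM_RATIO
default_ratio="$(resolve_ratio)"

# A caller-supplied value takes precedence over the default.
override_ratio="$(TOKEN_PARAM_RATIO=100 resolve_ratio)"

echo "default=$default_ratio override=$override_ratio"
```

Under the same pattern, launching the relaunch script with `TOKEN_PARAM_RATIO=100` in the environment would presumably restore the earlier 100 tokens/parameter target without editing the file.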