AGILLM4_floor_4090_defaults_and_warmstart

Files changed (5)

AGILLM-4.md +30 -34
README.md +6 -2
run_agillm4_4090_longblock.sh +19 -8
run_agillm4_4090_sublinear_probe.sh +13 -3
run_agillm4_main_b200_b300.sh +9 -2

AGILLM-4.md CHANGED

@@ -9,7 +9,7 @@ AGILLM-4 should not be "AGILLM-3 with a larger `--block`." The next useful versi
 The near-term codebase target is a measurable long-context trainer that can survive on current hardware:
-- RTX 4090: production 24GB long-block lane for the current-size model family, aiming above the current ~1200-token ceiling while preserving AR+SAT+NAT.
 - B200 180GB: serious 2k-16k experiments.
 - B300 262GB: 8k-64k experiments, then memory-augmented tests.
 - Multi-GPU/cluster: ring or sequence parallel experiments later.
@@ -26,36 +26,31 @@ Implemented presets:
 | `agillm4_main` | d=1536, L=32, H=24, rank=192 | ~1.5B | main target |
 | `agillm4_big` | d=1792, L=36, H=28, rank=224 | ~2.1B | stretch target after memory works |
-Default recommendation: train `agillm4_main` if B200/B300 availability is good. Use `agillm4_floor` only for debugging the new long-context/memory stack, not as the named release target.
 ## 4090 Production Long-Block Plan
-The RTX 4090 lane is production. Its job is to keep useful AGILLM training moving when only 24GB VRAM is available, and specifically to break past the current ~1200-token block ceiling.
-Production recipe for >1200 block on 4090:
 ```bash
-python -u /workspace/agillm-4/nB300_agillm4.py train \
-  --preset large \
-  --batch_size 1 \
-  --block 1536 \
-  --amp \
-  --attn_backend sdpa \
-  --grad_checkpoint \
-  --optimizer paged_adamw8bit \
-  --sat_every 1 \
-  --nat_every 4 \
-  --nat_max_tokens 768
 ```
 Important: `--sat_every 1 --nat_every 4` keeps SAT trained every step and NAT active on a cadence that fits 24GB cards. On B200/B300 use `--nat_every 1` for full AR+SAT+NAT every step. The AGILLM-4 code now backprops AR, SAT, and NAT sequentially, so the objective remains joint while peak VRAM is lower than holding all activation graphs at once.
 Escalation ladder on 4090:
-1. `block=1280`
-2. `block=1536`
-3. `block=1792`
-4. `block=2048`
 If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
@@ -67,9 +62,9 @@ seed:
 ```bash
 python /workspace/agillm-4/build_v4_seed.py \
-  --from-ckpt /workspace/ckpts_sft_math_v1/final.pt \
-  --v4-preset main \
-  --out /workspace/agillm-4/agillm4_seed_from_v3.pt
 ```
 Construction rules (v3 d=1024 / H=16 / r=128 / L=24 → v4 d=1536 / H=24 / r=192 / L=32):
@@ -98,14 +93,15 @@ Use with `--warmstart_from`:
 ```bash
 python /workspace/agillm-4/nB300_agillm4.py train \
-  --preset large \
-  --warmstart_from /workspace/agillm-4/agillm4_seed_from_v3.pt \
-  --batch_size 1 --block 1536 --amp --grad_checkpoint --sat_every 1 --nat_every 1 \
-  --anchor_memory --anchor_stride 256 --anchor_max 2048
 ```
-The seed file is ~6 GB on disk (fp32 tensors). Rebuild it whenever a newer v3
-SFT final is preferred (e.g. swap chat-v2 for the math-v1 final).
 ## First Implemented Scaffold
@@ -135,13 +131,13 @@ This gives real VRAM and throughput data before committing to long training.
 ```bash
 python /workspace/agillm-4/profile_agillm4.py \
-  --preset large \
-  --block 1536 \
   --batch_size 1 \
   --backends sdpa,sublinear \
   --grad_checkpoint \
   --amp \
-  --json_out /workspace/agillm4_profile_1536.json
 ```
 Use this before changing architecture. The profiler reports AR core forward,
@@ -177,8 +173,8 @@ On a 4090 production lane, first probe:
 ```bash
 python /workspace/agillm-4/block_sweep_agillm4.py \
-  --preset large \
-  --blocks 1280,1536,1792,2048,3072 \
   --batch_size 1 \
   --attn_backend sublinear \
   --sublinear_window 256 \
@@ -277,7 +273,7 @@ Status: wired into `Encoder` as a single `AnchorMemoryLayer` inserted after a co
 ```bash
 python /workspace/agillm-4/nB300_agillm4.py train \
-  --preset large --batch_size 1 --block 1536 --amp --grad_checkpoint \
   --anchor_memory --anchor_stride 256 --anchor_max 2048
 ```

 The near-term codebase target is a measurable long-context trainer that can survive on current hardware:
+- RTX 4090: production 24GB lane for the >1B AGILLM-4 floor shape, then block-size growth while preserving AR+SAT+NAT.
 - B200 180GB: serious 2k-16k experiments.
 - B300 262GB: 8k-64k experiments, then memory-augmented tests.
 - Multi-GPU/cluster: ring or sequence parallel experiments later.
 | `agillm4_main` | d=1536, L=32, H=24, rank=192 | ~1.5B | main target |
 | `agillm4_big` | d=1792, L=36, H=28, rank=224 | ~2.1B | stretch target after memory works |
+Default recommendation: train `agillm4_main` if B200/B300 availability is good. On a 24GB 4090, start with `agillm4_floor` so the run is still larger than AGILLM-3 while leaving enough VRAM for AR+SAT+NAT.
 ## 4090 Production Long-Block Plan
+The RTX 4090 lane is production. Its job is to keep useful AGILLM-4 training moving when only 24GB of VRAM is available, without accidentally dropping back to the AGILLM-3-sized `large` preset.
+Production first-run recipe on 4090:
 ```bash
+AGILLM4_4090_WARMSTART_FROM=/workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
+AGILLM4_4090_PRESET=agillm4_floor \
+AGILLM4_4090_BLOCK=512 \
+AGILLM4_4090_TOKEN_PARAM_RATIO=100 \
+bash /workspace/agillm-4/run_agillm4_4090_longblock.sh
 ```
 Important: `--sat_every 1 --nat_every 4` keeps SAT trained every step and NAT active on a cadence that fits 24GB cards. On B200/B300 use `--nat_every 1` for full AR+SAT+NAT every step. The AGILLM-4 code now backprops AR, SAT, and NAT sequentially, so the objective remains joint while peak VRAM is lower than holding all activation graphs at once.
 Escalation ladder on 4090:
+1. `block=512`
+2. `block=640`
+3. `block=768`
+4. `block=1024`
+5. `block=1280+` only after measured VRAM headroom
 If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
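The escalation ladder above can be sketched as a small helper that returns the next block size to try. This is a minimal sketch and not part of the repo: `next_block` and `LADDER` are hypothetical names, the rung values are copied from the list, and the judgment about whether measured VRAM headroom is sufficient is left to the operator.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): walk the 4090 escalation ladder.
# Rung values come from the list above; 1280 stands in for the "1280+" rung.
LADDER=(512 640 768 1024 1280)

# Print the rung that follows the given block size.
# Prints nothing and returns 1 if the value is the last rung or not on the ladder.
next_block() {
  local current="$1" i
  for ((i = 0; i < ${#LADDER[@]} - 1; i++)); do
    if [ "${LADDER[i]}" = "$current" ]; then
      echo "${LADDER[i + 1]}"
      return 0
    fi
  done
  return 1
}
```

Under these assumptions, `AGILLM4_4090_BLOCK="$(next_block 512)" bash /workspace/agillm-4/run_agillm4_4090_longblock.sh` would launch the 640 rung once the 512 run is stable.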
 ```bash
 python /workspace/agillm-4/build_v4_seed.py \
+  --from-ckpt /workspace/ckpts_sat_fixed_polish_v47_20260523/final.pt \
+  --v4-preset floor \
+  --out /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt
 ```
 Construction rules (v3 d=1024 / H=16 / r=128 / L=24 → v4 d=1536 / H=24 / r=192 / L=32):
 ```bash
 python /workspace/agillm-4/nB300_agillm4.py train \
+  --preset agillm4_floor \
+  --warmstart_from /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
+  --batch_size 1 --block 512 --amp --grad_checkpoint --sat_every 1 --nat_every 4 \
+  --token_param_ratio 100
 ```
+The seed file is several GB on disk (fp32 tensors). Rebuild it whenever a newer v3
+final is preferred. For B200/B300, use `--v4-preset main` and train
+`agillm4_main`; for 4090, use `--v4-preset floor`.
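The pairing between seed preset and training preset described above can be written down as a lookup. This is only a sketch: `presets_for_gpu` and the GPU-class labels (`4090`, `b200`, `b300`) are hypothetical, and the preset names are the ones used in this section.

```bash
#!/usr/bin/env bash
# Sketch only: presets_for_gpu and the GPU-class labels are hypothetical.
# Prints "<build_v4_seed.py --v4-preset value> <trainer --preset value>".
presets_for_gpu() {
  case "$1" in
    4090)      echo "floor agillm4_floor" ;;
    b200|b300) echo "main agillm4_main" ;;
    *)         echo "unknown GPU class: $1" >&2; return 1 ;;
  esac
}

read -r V4_PRESET TRAIN_PRESET <<< "$(presets_for_gpu 4090)"
echo "seed: --v4-preset $V4_PRESET  train: --preset $TRAIN_PRESET"
```

Keeping both values in one place makes it harder to warm-start one preset from a seed built for the other shape.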
 ## First Implemented Scaffold
 ```bash
 python /workspace/agillm-4/profile_agillm4.py \
+  --preset agillm4_floor \
+  --block 512 \
   --batch_size 1 \
   --backends sdpa,sublinear \
   --grad_checkpoint \
   --amp \
+  --json_out /workspace/agillm4_floor_profile_512.json
 ```
 Use this before changing architecture. The profiler reports AR core forward,
 ```bash
 python /workspace/agillm-4/block_sweep_agillm4.py \
+  --preset agillm4_floor \
+  --blocks 512,640,768,1024,1280 \
   --batch_size 1 \
   --attn_backend sublinear \
   --sublinear_window 256 \
 ```bash
 python /workspace/agillm-4/nB300_agillm4.py train \
+  --preset agillm4_floor --batch_size 1 --block 512 --amp --grad_checkpoint \
   --anchor_memory --anchor_stride 256 --anchor_max 2048
 ```

README.md CHANGED

@@ -14,8 +14,8 @@ AGILLM-4 is the next training target after AGILLM-3. The current code is a
 production-oriented starting point, copied from the proven single-file trainer
 and extended for:
-- ~1.5B parameter main preset (`agillm4_main`)
-- 100 tokens per parameter target ratio
 - longer block-size work on 24GB, B200, and B300 class GPUs
 - AR+SAT+NAT training, with sequential backward to reduce peak VRAM
 - SDPA and experimental sublinear local+landmark attention backends
@@ -28,4 +28,8 @@ Start with [AGILLM-4.md](AGILLM-4.md) for the training plan and command
 recipes. The current sublinear backend is intentionally experimental: profile it
 against SDPA before using it for a real run.
 Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).

 production-oriented starting point, copied from the proven single-file trainer
 and extended for:
+- >1B parameter floor preset (`agillm4_floor`) and ~1.5B main preset (`agillm4_main`)
+- 100 tokens per parameter target ratio, above the AGILLM-3 training ratio
 - longer block-size work on 24GB, B200, and B300 class GPUs
 - AR+SAT+NAT training, with sequential backward to reduce peak VRAM
 - SDPA and experimental sublinear local+landmark attention backends
 recipes. The current sublinear backend is intentionally experimental: profile it
 against SDPA before using it for a real run.
+On RTX 4090-class 24GB cards, `run_agillm4_4090_longblock.sh` now defaults to
+`agillm4_floor` instead of the AGILLM-3-sized `large` preset. Override
+`AGILLM4_4090_BLOCK` upward only after the first floor run is stable.
 Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).

run_agillm4_4090_longblock.sh CHANGED

@@ -12,16 +12,27 @@ if [ -f /root/.cache/huggingface/token ]; then
   export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
 fi
-mkdir -p /workspace/agillm4_4090_ckpts
 echo "START_AGILLM4_4090_LONG_BLOCK $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
-echo "This is production AGILLM long-block training on 24GB, not a local toy test."
-echo "preset=${AGILLM4_4090_PRESET:-large} block=${AGILLM4_4090_BLOCK:-1536} sat_every=1 nat_every=${AGILLM4_4090_NAT_EVERY:-4}"
 exec python -u /workspace/agillm-4/nB300_agillm4.py train \
-  --preset "${AGILLM4_4090_PRESET:-large}" \
   --batch_size 1 \
-  --block "${AGILLM4_4090_BLOCK:-1536}" \
   --amp \
   --attn_backend sdpa \
   --grad_checkpoint \
@@ -30,9 +41,9 @@ exec python -u /workspace/agillm-4/nB300_agillm4.py train \
   --nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
   --nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
-  --nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-768}" \
-  --token_param_ratio 100 \
-  --save_dir /workspace/agillm4_4090_ckpts \
   --save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-21600}" \
   --delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
   --delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-8}" \

   export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
 fi
+PRESET="${AGILLM4_4090_PRESET:-agillm4_floor}"
+BLOCK="${AGILLM4_4090_BLOCK:-512}"
+TOKEN_PARAM_RATIO="${AGILLM4_4090_TOKEN_PARAM_RATIO:-100}"
+SAVE_DIR="${AGILLM4_4090_SAVE_DIR:-/workspace/agillm4_4090_ckpts}"
+WARMSTART_ARGS=()
+if [ -n "${AGILLM4_4090_WARMSTART_FROM:-}" ]; then
+  WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_4090_WARMSTART_FROM")
+fi
+mkdir -p "$SAVE_DIR"
 echo "START_AGILLM4_4090_LONG_BLOCK $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
+echo "This is production AGILLM-4 training on 24GB, not a local toy test."
+echo "preset=$PRESET block=$BLOCK token_param_ratio=$TOKEN_PARAM_RATIO sat_every=1 nat_every=${AGILLM4_4090_NAT_EVERY:-4} warmstart=${AGILLM4_4090_WARMSTART_FROM:-none}"
 exec python -u /workspace/agillm-4/nB300_agillm4.py train \
+  --preset "$PRESET" \
+  "${WARMSTART_ARGS[@]}" \
   --batch_size 1 \
+  --block "$BLOCK" \
   --amp \
   --attn_backend sdpa \
   --grad_checkpoint \
   --nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
   --nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
+  --nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-512}" \
+  --token_param_ratio "$TOKEN_PARAM_RATIO" \
+  --save_dir "$SAVE_DIR" \
   --save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-21600}" \
   --delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
   --delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-8}" \

run_agillm4_4090_sublinear_probe.sh CHANGED

@@ -3,10 +3,20 @@ set -euo pipefail
 cd "$(dirname "$0")"
 python -u ./nB300_agillm4.py train \
-  --preset "${AGILLM4_PRESET:-large}" \
   --batch_size "${AGILLM4_BATCH:-1}" \
-  --block "${AGILLM4_BLOCK:-2048}" \
   --amp \
   --attn_backend sublinear \
   --sublinear_window "${AGILLM4_SUBLINEAR_WINDOW:-256}" \
@@ -20,5 +30,5 @@ python -u ./nB300_agillm4.py train \
   --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
   --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-768}" \
-  --token_param_ratio 100 \
   --save_dir "${AGILLM4_SAVE_DIR:-/workspace/ckpts_agillm4_sublinear_4090}"

 cd "$(dirname "$0")"
+PRESET="${AGILLM4_PRESET:-agillm4_floor}"
+BLOCK="${AGILLM4_BLOCK:-768}"
+TOKEN_PARAM_RATIO="${AGILLM4_TOKEN_PARAM_RATIO:-100}"
+WARMSTART_ARGS=()
+if [ -n "${AGILLM4_WARMSTART_FROM:-}" ]; then
+  WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_WARMSTART_FROM")
+fi
 python -u ./nB300_agillm4.py train \
+  --preset "$PRESET" \
+  "${WARMSTART_ARGS[@]}" \
   --batch_size "${AGILLM4_BATCH:-1}" \
+  --block "$BLOCK" \
   --amp \
   --attn_backend sublinear \
   --sublinear_window "${AGILLM4_SUBLINEAR_WINDOW:-256}" \
   --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
   --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-768}" \
+  --token_param_ratio "$TOKEN_PARAM_RATIO" \
   --save_dir "${AGILLM4_SAVE_DIR:-/workspace/ckpts_agillm4_sublinear_4090}"
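The env-var defaults and the optional warm-start array used by the updated launch scripts can be exercised in isolation. The sketch below is an assumption-laden illustration, not the script itself: `build_train_args` is a hypothetical name, and the command line is printed rather than executed so it runs without a GPU or the trainer. One point worth noting: under `set -u`, bash releases older than 4.4 treat the expansion of an empty array as an unbound variable, so the sketch uses the `${arr[@]+"${arr[@]}"}` form, which expands to nothing when the array is empty.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Sketch only: build_train_args is a hypothetical name. It prints the argument
# list instead of launching the trainer.
build_train_args() {
  local preset="${AGILLM4_PRESET:-agillm4_floor}"
  local block="${AGILLM4_BLOCK:-768}"
  local warmstart_args=()
  if [ -n "${AGILLM4_WARMSTART_FROM:-}" ]; then
    warmstart_args+=(--warmstart_from "$AGILLM4_WARMSTART_FROM")
  fi
  # Expands to zero words when the array is empty, which also avoids an
  # "unbound variable" error on bash older than 4.4 under `set -u`.
  echo train --preset "$preset" \
    ${warmstart_args[@]+"${warmstart_args[@]}"} \
    --block "$block"
}

# No warm start: the flag is omitted entirely.
build_train_args
# Warm start requested: the flag and its path are passed as two separate words.
(export AGILLM4_WARMSTART_FROM=/tmp/seed.pt; build_train_args)
```

The array form keeps a path containing spaces as a single argument, which a plain string variable would not.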

run_agillm4_main_b200_b300.sh CHANGED

@@ -12,13 +12,20 @@ if [ -f /root/.cache/huggingface/token ]; then
   export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
 fi
 mkdir -p /workspace/agillm4_ckpts
 echo "START_AGILLM4_MAIN $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
-echo "preset=agillm4_main target_tokens=150000000000 token_param_ratio=100 block=${AGILLM4_BLOCK:-2048} sat_every=1 nat_every=${AGILLM4_NAT_EVERY:-1}"
 exec python -u /workspace/agillm-4/nB300_agillm4.py train \
   --preset agillm4_main \
   --batch_size "${AGILLM4_BATCH:-1}" \
   --block "${AGILLM4_BLOCK:-2048}" \
   --amp \
@@ -30,7 +37,7 @@ exec python -u /workspace/agillm-4/nB300_agillm4.py train \
   --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
   --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-0}" \
-  --token_param_ratio 100 \
   --target_tokens 150000000000 \
   --save_dir /workspace/agillm4_ckpts \
   --save_every_sec "${AGILLM4_SAVE_EVERY_SEC:-21600}" \

   export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
 fi
+TOKEN_PARAM_RATIO="${AGILLM4_TOKEN_PARAM_RATIO:-100}"
+WARMSTART_ARGS=()
+if [ -n "${AGILLM4_WARMSTART_FROM:-}" ]; then
+  WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_WARMSTART_FROM")
+fi
 mkdir -p /workspace/agillm4_ckpts
 echo "START_AGILLM4_MAIN $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
+echo "preset=agillm4_main target_tokens=150000000000 token_param_ratio=$TOKEN_PARAM_RATIO block=${AGILLM4_BLOCK:-2048} sat_every=1 nat_every=${AGILLM4_NAT_EVERY:-1} warmstart=${AGILLM4_WARMSTART_FROM:-none}"
 exec python -u /workspace/agillm-4/nB300_agillm4.py train \
   --preset agillm4_main \
+  "${WARMSTART_ARGS[@]}" \
   --batch_size "${AGILLM4_BATCH:-1}" \
   --block "${AGILLM4_BLOCK:-2048}" \
   --amp \
   --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
   --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-0}" \
+  --token_param_ratio "$TOKEN_PARAM_RATIO" \
   --target_tokens 150000000000 \
   --save_dir /workspace/agillm4_ckpts \
   --save_every_sec "${AGILLM4_SAVE_EVERY_SEC:-21600}" \
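The fixed `--target_tokens 150000000000` in this script matches the `~1.5B` parameter figure for `agillm4_main` multiplied by the default ratio of 100. The arithmetic is sketched below, assuming the rounded parameter count from the preset table (`PARAMS` is an approximation, not an exact count reported by the trainer).

```bash
#!/usr/bin/env bash
# Sketch: token budget = parameter count x token_param_ratio.
# PARAMS is the rounded ~1.5B figure for agillm4_main, an approximation.
PARAMS=1500000000
TOKEN_PARAM_RATIO="${AGILLM4_TOKEN_PARAM_RATIO:-100}"
TARGET_TOKENS=$(( PARAMS * TOKEN_PARAM_RATIO ))
echo "target_tokens=$TARGET_TOKENS"
```

After this change the ratio can be overridden through `AGILLM4_TOKEN_PARAM_RATIO`, while `--target_tokens` stays hard-coded; how the trainer reconciles the two flags when they disagree is not shown in this diff.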