OpenTransformer committed
Commit c2b5995 · verified · 1 Parent(s): 528e0d6

AGILLM4_floor_4090_defaults_and_warmstart

AGILLM-4.md CHANGED
@@ -9,7 +9,7 @@ AGILLM-4 should not be "AGILLM-3 with a larger `--block`." The next useful versi
9
 
10
  The near-term codebase target is a measurable long-context trainer that can survive on current hardware:
11
 
12
- - RTX 4090: production 24GB long-block lane for the current-size model family, aiming above the current ~1200-token ceiling while preserving AR+SAT+NAT.
13
  - B200 180GB: serious 2k-16k experiments.
14
  - B300 262GB: 8k-64k experiments, then memory-augmented tests.
15
  - Multi-GPU/cluster: ring or sequence parallel experiments later.
@@ -26,36 +26,31 @@ Implemented presets:
26
  | `agillm4_main` | d=1536, L=32, H=24, rank=192 | ~1.5B | main target |
27
  | `agillm4_big` | d=1792, L=36, H=28, rank=224 | ~2.1B | stretch target after memory works |
28
 
29
- Default recommendation: train `agillm4_main` if B200/B300 availability is good. Use `agillm4_floor` only for debugging the new long-context/memory stack, not as the named release target.
30
 
31
  ## 4090 Production Long-Block Plan
32
 
33
- The RTX 4090 lane is production. Its job is to keep useful AGILLM training moving when only 24GB VRAM is available, and specifically to break past the current ~1200-token block ceiling.
34
 
35
- Production recipe for >1200 block on 4090:
36
 
37
  ```bash
38
- python -u /workspace/agillm-4/nB300_agillm4.py train \
39
- --preset large \
40
- --batch_size 1 \
41
- --block 1536 \
42
- --amp \
43
- --attn_backend sdpa \
44
- --grad_checkpoint \
45
- --optimizer paged_adamw8bit \
46
- --sat_every 1 \
47
- --nat_every 4 \
48
- --nat_max_tokens 768
49
  ```
50
 
51
  Important: `--sat_every 1 --nat_every 4` keeps SAT trained every step and NAT active on a cadence that fits 24GB cards. On B200/B300 use `--nat_every 1` for full AR+SAT+NAT every step. The AGILLM-4 code now backprops AR, SAT, and NAT sequentially, so the objective remains joint while peak VRAM is lower than holding all activation graphs at once.
52
 
53
  Escalation ladder on 4090:
54
 
55
- 1. `block=1280`
56
- 2. `block=1536`
57
- 3. `block=1792`
58
- 4. `block=2048`
 
59
 
60
  If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
61
 
@@ -67,9 +62,9 @@ seed:
67
 
68
  ```bash
69
  python /workspace/agillm-4/build_v4_seed.py \
70
- --from-ckpt /workspace/ckpts_sft_math_v1/final.pt \
71
- --v4-preset main \
72
- --out /workspace/agillm-4/agillm4_seed_from_v3.pt
73
  ```
74
 
75
  Construction rules (v3 d=1024 / H=16 / r=128 / L=24 → v4 d=1536 / H=24 / r=192 / L=32):
@@ -98,14 +93,15 @@ Use with `--warmstart_from`:
98
 
99
  ```bash
100
  python /workspace/agillm-4/nB300_agillm4.py train \
101
- --preset large \
102
- --warmstart_from /workspace/agillm-4/agillm4_seed_from_v3.pt \
103
- --batch_size 1 --block 1536 --amp --grad_checkpoint --sat_every 1 --nat_every 1 \
104
- --anchor_memory --anchor_stride 256 --anchor_max 2048
105
  ```
106
 
107
- The seed file is ~6 GB on disk (fp32 tensors). Rebuild it whenever a newer v3
108
- SFT final is preferred (e.g. swap chat-v2 for the math-v1 final).
 
109
 
110
  ## First Implemented Scaffold
111
 
@@ -135,13 +131,13 @@ This gives real VRAM and throughput data before committing to long training.
135
 
136
  ```bash
137
  python /workspace/agillm-4/profile_agillm4.py \
138
- --preset large \
139
- --block 1536 \
140
  --batch_size 1 \
141
  --backends sdpa,sublinear \
142
  --grad_checkpoint \
143
  --amp \
144
- --json_out /workspace/agillm4_profile_1536.json
145
  ```
146
 
147
  Use this before changing architecture. The profiler reports AR core forward,
@@ -177,8 +173,8 @@ On a 4090 production lane, first probe:
177
 
178
  ```bash
179
  python /workspace/agillm-4/block_sweep_agillm4.py \
180
- --preset large \
181
- --blocks 1280,1536,1792,2048,3072 \
182
  --batch_size 1 \
183
  --attn_backend sublinear \
184
  --sublinear_window 256 \
@@ -277,7 +273,7 @@ Status: wired into `Encoder` as a single `AnchorMemoryLayer` inserted after a co
277
 
278
  ```bash
279
  python /workspace/agillm-4/nB300_agillm4.py train \
280
- --preset large --batch_size 1 --block 1536 --amp --grad_checkpoint \
281
  --anchor_memory --anchor_stride 256 --anchor_max 2048
282
  ```
283
 
 
9
 
10
  The near-term codebase target is a measurable long-context trainer that can survive on current hardware:
11
 
12
+ - RTX 4090: production 24GB lane for the >1B AGILLM-4 floor shape, then block-size growth while preserving AR+SAT+NAT.
13
  - B200 180GB: serious 2k-16k experiments.
14
  - B300 262GB: 8k-64k experiments, then memory-augmented tests.
15
  - Multi-GPU/cluster: ring or sequence parallel experiments later.
 
26
  | `agillm4_main` | d=1536, L=32, H=24, rank=192 | ~1.5B | main target |
27
  | `agillm4_big` | d=1792, L=36, H=28, rank=224 | ~2.1B | stretch target after memory works |
28
 
29
+ Default recommendation: train `agillm4_main` if B200/B300 availability is good. On a 24GB 4090, start with `agillm4_floor` so the run is still larger than AGILLM-3 while leaving enough VRAM for AR+SAT+NAT.
30
 
31
  ## 4090 Production Long-Block Plan
32
 
33
+ The RTX 4090 lane is production. Its job is to keep useful AGILLM-4 training moving when only 24GB VRAM is available without accidentally dropping back to the AGILLM-3-sized `large` preset.
34
 
35
+ Production first-run recipe on 4090:
36
 
37
  ```bash
38
+ AGILLM4_4090_WARMSTART_FROM=/workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
39
+ AGILLM4_4090_PRESET=agillm4_floor \
40
+ AGILLM4_4090_BLOCK=512 \
41
+ AGILLM4_4090_TOKEN_PARAM_RATIO=100 \
42
+ bash /workspace/agillm-4/run_agillm4_4090_longblock.sh
43
  ```
44
 
45
  Important: `--sat_every 1 --nat_every 4` keeps SAT trained every step and NAT active on a cadence that fits 24GB cards. On B200/B300 use `--nat_every 1` for full AR+SAT+NAT every step. The AGILLM-4 code now backprops AR, SAT, and NAT sequentially, so the objective remains joint while peak VRAM is lower than holding all activation graphs at once.
46
 
47
  Escalation ladder on 4090:
48
 
49
+ 1. `block=512`
50
+ 2. `block=640`
51
+ 3. `block=768`
52
+ 4. `block=1024`
53
+ 5. `block=1280+` only after measured VRAM headroom
54
 
55
  If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
56
 
 
62
 
63
  ```bash
64
  python /workspace/agillm-4/build_v4_seed.py \
65
+ --from-ckpt /workspace/ckpts_sat_fixed_polish_v47_20260523/final.pt \
66
+ --v4-preset floor \
67
+ --out /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt
68
  ```
69
 
70
  Construction rules (v3 d=1024 / H=16 / r=128 / L=24 → v4 d=1536 / H=24 / r=192 / L=32):
 
93
 
94
  ```bash
95
  python /workspace/agillm-4/nB300_agillm4.py train \
96
+ --preset agillm4_floor \
97
+ --warmstart_from /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
98
+ --batch_size 1 --block 512 --amp --grad_checkpoint --sat_every 1 --nat_every 4 \
99
+ --token_param_ratio 100
100
  ```
101
 
102
+ The seed file is several GB on disk (fp32 tensors). Rebuild it whenever a newer v3
103
+ final is preferred. For B200/B300, use `--v4-preset main` and train
104
+ `agillm4_main`; for 4090, use `--v4-preset floor`.
105
 
106
  ## First Implemented Scaffold
107
 
 
131
 
132
  ```bash
133
  python /workspace/agillm-4/profile_agillm4.py \
134
+ --preset agillm4_floor \
135
+ --block 512 \
136
  --batch_size 1 \
137
  --backends sdpa,sublinear \
138
  --grad_checkpoint \
139
  --amp \
140
+ --json_out /workspace/agillm4_floor_profile_512.json
141
  ```
142
 
143
  Use this before changing architecture. The profiler reports AR core forward,
 
173
 
174
  ```bash
175
  python /workspace/agillm-4/block_sweep_agillm4.py \
176
+ --preset agillm4_floor \
177
+ --blocks 512,640,768,1024,1280 \
178
  --batch_size 1 \
179
  --attn_backend sublinear \
180
  --sublinear_window 256 \
 
273
 
274
  ```bash
275
  python /workspace/agillm-4/nB300_agillm4.py train \
276
+ --preset agillm4_floor --batch_size 1 --block 512 --amp --grad_checkpoint \
277
  --anchor_memory --anchor_stride 256 --anchor_max 2048
278
  ```
279
 
README.md CHANGED
@@ -14,8 +14,8 @@ AGILLM-4 is the next training target after AGILLM-3. The current code is a
14
  production-oriented starting point, copied from the proven single-file trainer
15
  and extended for:
16
 
17
- - ~1.5B parameter main preset (`agillm4_main`)
18
- - 100 tokens per parameter target ratio
19
  - longer block-size work on 24GB, B200, and B300 class GPUs
20
  - AR+SAT+NAT training, with sequential backward to reduce peak VRAM
21
  - SDPA and experimental sublinear local+landmark attention backends
@@ -28,4 +28,8 @@ Start with [AGILLM-4.md](AGILLM-4.md) for the training plan and command
28
  recipes. The current sublinear backend is intentionally experimental: profile it
29
  against SDPA before using it for a real run.
30
 
 
 
 
 
31
  Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).
 
14
  production-oriented starting point, copied from the proven single-file trainer
15
  and extended for:
16
 
17
+ - >1B parameter floor preset (`agillm4_floor`) and ~1.5B main preset (`agillm4_main`)
18
+ - 100 tokens per parameter target ratio, above the AGILLM-3 training ratio
19
  - longer block-size work on 24GB, B200, and B300 class GPUs
20
  - AR+SAT+NAT training, with sequential backward to reduce peak VRAM
21
  - SDPA and experimental sublinear local+landmark attention backends
 
28
  recipes. The current sublinear backend is intentionally experimental: profile it
29
  against SDPA before using it for a real run.
30
 
31
+ On RTX 4090-class 24GB cards, `run_agillm4_4090_longblock.sh` now defaults to
32
+ `agillm4_floor` instead of the AGILLM-3-sized `large` preset. Override
33
+ `AGILLM4_4090_BLOCK` upward only after the first floor run is stable.
34
+
35
  Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).
run_agillm4_4090_longblock.sh CHANGED
@@ -12,16 +12,27 @@ if [ -f /root/.cache/huggingface/token ]; then
12
  export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
13
  fi
14
 
15
- mkdir -p /workspace/agillm4_4090_ckpts
16
 
17
  echo "START_AGILLM4_4090_LONG_BLOCK $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
18
- echo "This is production AGILLM long-block training on 24GB, not a local toy test."
19
- echo "preset=${AGILLM4_4090_PRESET:-large} block=${AGILLM4_4090_BLOCK:-1536} sat_every=1 nat_every=${AGILLM4_4090_NAT_EVERY:-4}"
20
 
21
  exec python -u /workspace/agillm-4/nB300_agillm4.py train \
22
- --preset "${AGILLM4_4090_PRESET:-large}" \
 
23
  --batch_size 1 \
24
- --block "${AGILLM4_4090_BLOCK:-1536}" \
25
  --amp \
26
  --attn_backend sdpa \
27
  --grad_checkpoint \
@@ -30,9 +41,9 @@ exec python -u /workspace/agillm-4/nB300_agillm4.py train \
30
  --nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
31
  --nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
32
  --nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
33
- --nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-768}" \
34
- --token_param_ratio 100 \
35
- --save_dir /workspace/agillm4_4090_ckpts \
36
  --save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-21600}" \
37
  --delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
38
  --delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-8}" \
 
12
  export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
13
  fi
14
 
15
+ PRESET="${AGILLM4_4090_PRESET:-agillm4_floor}"
16
+ BLOCK="${AGILLM4_4090_BLOCK:-512}"
17
+ TOKEN_PARAM_RATIO="${AGILLM4_4090_TOKEN_PARAM_RATIO:-100}"
18
+ SAVE_DIR="${AGILLM4_4090_SAVE_DIR:-/workspace/agillm4_4090_ckpts}"
19
+
20
+ WARMSTART_ARGS=()
21
+ if [ -n "${AGILLM4_4090_WARMSTART_FROM:-}" ]; then
22
+ WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_4090_WARMSTART_FROM")
23
+ fi
24
+
25
+ mkdir -p "$SAVE_DIR"
26
 
27
  echo "START_AGILLM4_4090_LONG_BLOCK $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
28
+ echo "This is production AGILLM-4 training on 24GB, not a local toy test."
29
+ echo "preset=$PRESET block=$BLOCK token_param_ratio=$TOKEN_PARAM_RATIO sat_every=1 nat_every=${AGILLM4_4090_NAT_EVERY:-4} warmstart=${AGILLM4_4090_WARMSTART_FROM:-none}"
30
 
31
  exec python -u /workspace/agillm-4/nB300_agillm4.py train \
32
+ --preset "$PRESET" \
33
+ "${WARMSTART_ARGS[@]}" \
34
  --batch_size 1 \
35
+ --block "$BLOCK" \
36
  --amp \
37
  --attn_backend sdpa \
38
  --grad_checkpoint \
 
41
  --nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
42
  --nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
43
  --nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
44
+ --nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-512}" \
45
+ --token_param_ratio "$TOKEN_PARAM_RATIO" \
46
+ --save_dir "$SAVE_DIR" \
47
  --save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-21600}" \
48
  --delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
49
  --delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-8}" \
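
The ladder in AGILLM-4.md gates `block=1280+` on measured VRAM headroom, and the README asks that `AGILLM4_4090_BLOCK` be raised only after the first floor run is stable. A minimal sketch of such a gate follows. The helper names `headroom_mib` and `may_escalate` and the 2048 MiB margin are assumptions made for illustration; none of them appear in this commit, and the margin is a placeholder rather than a measured value.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of run_agillm4_4090_longblock.sh.

# Takes one "used, total" line in MiB and prints the free MiB.
headroom_mib() {
  local used total
  # Split on the comma and the space that follows it.
  IFS=', ' read -r used total <<< "$1"
  echo $(( total - used ))
}

# Succeeds when free memory is at least the requested margin
# (second argument, placeholder default of 2048 MiB).
may_escalate() {
  local free
  free="$(headroom_mib "$1")"
  [ "$free" -ge "${2:-2048}" ]
}
```

The input line can be sampled while the current block size is training, for example with `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | head -n1`. That query reports current rather than peak usage, so it is best read several times during a run before exporting a larger `AGILLM4_4090_BLOCK`.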
run_agillm4_4090_sublinear_probe.sh CHANGED
@@ -3,10 +3,20 @@ set -euo pipefail
3
 
4
  cd "$(dirname "$0")"
5
 
6
  python -u ./nB300_agillm4.py train \
7
- --preset "${AGILLM4_PRESET:-large}" \
 
8
  --batch_size "${AGILLM4_BATCH:-1}" \
9
- --block "${AGILLM4_BLOCK:-2048}" \
10
  --amp \
11
  --attn_backend sublinear \
12
  --sublinear_window "${AGILLM4_SUBLINEAR_WINDOW:-256}" \
@@ -20,5 +30,5 @@ python -u ./nB300_agillm4.py train \
20
  --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
21
  --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
22
  --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-768}" \
23
- --token_param_ratio 100 \
24
  --save_dir "${AGILLM4_SAVE_DIR:-/workspace/ckpts_agillm4_sublinear_4090}"
 
3
 
4
  cd "$(dirname "$0")"
5
 
6
+ PRESET="${AGILLM4_PRESET:-agillm4_floor}"
7
+ BLOCK="${AGILLM4_BLOCK:-768}"
8
+ TOKEN_PARAM_RATIO="${AGILLM4_TOKEN_PARAM_RATIO:-100}"
9
+
10
+ WARMSTART_ARGS=()
11
+ if [ -n "${AGILLM4_WARMSTART_FROM:-}" ]; then
12
+ WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_WARMSTART_FROM")
13
+ fi
14
+
15
  python -u ./nB300_agillm4.py train \
16
+ --preset "$PRESET" \
17
+ "${WARMSTART_ARGS[@]}" \
18
  --batch_size "${AGILLM4_BATCH:-1}" \
19
+ --block "$BLOCK" \
20
  --amp \
21
  --attn_backend sublinear \
22
  --sublinear_window "${AGILLM4_SUBLINEAR_WINDOW:-256}" \
 
30
  --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
31
  --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
32
  --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-768}" \
33
+ --token_param_ratio "$TOKEN_PARAM_RATIO" \
34
  --save_dir "${AGILLM4_SAVE_DIR:-/workspace/ckpts_agillm4_sublinear_4090}"
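
The probe script runs under `set -euo pipefail` and expands `"${WARMSTART_ARGS[@]}"`, which is empty whenever `AGILLM4_WARMSTART_FROM` is unset. On bash older than 4.4, expanding an empty array under `set -u` is reported as an unbound variable. If the scripts ever need to run on such a shell, a defensive expansion could look like the sketch below. `show_args` and `build_and_show` are hypothetical stand-ins for the trainer call, used here only so the argument list can be inspected.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the trainer: prints each argument on its own line.
show_args() {
  printf '%s\n' "$@"
}

# Builds the optional warm-start arguments the same way the runner scripts do.
build_and_show() {
  local warmstart_from="${1:-}"
  local -a warmstart_args=()
  if [ -n "$warmstart_from" ]; then
    warmstart_args+=(--warmstart_from "$warmstart_from")
  fi
  # ${arr[@]+"${arr[@]}"} expands to nothing for an empty array, so it does
  # not trip `set -u` on older bash, and to the quoted elements otherwise.
  show_args --preset agillm4_floor \
    ${warmstart_args[@]+"${warmstart_args[@]}"} \
    --block 768
}
```

With no argument the helper passes only `--preset` and `--block`; with a path it inserts `--warmstart_from <path>` between them, matching the order used in the scripts.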
run_agillm4_main_b200_b300.sh CHANGED
@@ -12,13 +12,20 @@ if [ -f /root/.cache/huggingface/token ]; then
12
  export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
13
  fi
14
 
15
  mkdir -p /workspace/agillm4_ckpts
16
 
17
  echo "START_AGILLM4_MAIN $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
18
- echo "preset=agillm4_main target_tokens=150000000000 token_param_ratio=100 block=${AGILLM4_BLOCK:-2048} sat_every=1 nat_every=${AGILLM4_NAT_EVERY:-1}"
19
 
20
  exec python -u /workspace/agillm-4/nB300_agillm4.py train \
21
  --preset agillm4_main \
 
22
  --batch_size "${AGILLM4_BATCH:-1}" \
23
  --block "${AGILLM4_BLOCK:-2048}" \
24
  --amp \
@@ -30,7 +37,7 @@ exec python -u /workspace/agillm-4/nB300_agillm4.py train \
30
  --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
31
  --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
32
  --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-0}" \
33
- --token_param_ratio 100 \
34
  --target_tokens 150000000000 \
35
  --save_dir /workspace/agillm4_ckpts \
36
  --save_every_sec "${AGILLM4_SAVE_EVERY_SEC:-21600}" \
 
12
  export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
13
  fi
14
 
15
+ TOKEN_PARAM_RATIO="${AGILLM4_TOKEN_PARAM_RATIO:-100}"
16
+ WARMSTART_ARGS=()
17
+ if [ -n "${AGILLM4_WARMSTART_FROM:-}" ]; then
18
+ WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_WARMSTART_FROM")
19
+ fi
20
+
21
  mkdir -p /workspace/agillm4_ckpts
22
 
23
  echo "START_AGILLM4_MAIN $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
24
+ echo "preset=agillm4_main target_tokens=150000000000 token_param_ratio=$TOKEN_PARAM_RATIO block=${AGILLM4_BLOCK:-2048} sat_every=1 nat_every=${AGILLM4_NAT_EVERY:-1} warmstart=${AGILLM4_WARMSTART_FROM:-none}"
25
 
26
  exec python -u /workspace/agillm-4/nB300_agillm4.py train \
27
  --preset agillm4_main \
28
+ "${WARMSTART_ARGS[@]}" \
29
  --batch_size "${AGILLM4_BATCH:-1}" \
30
  --block "${AGILLM4_BLOCK:-2048}" \
31
  --amp \
 
37
  --nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
38
  --nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
39
  --nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-0}" \
40
+ --token_param_ratio "$TOKEN_PARAM_RATIO" \
41
  --target_tokens 150000000000 \
42
  --save_dir /workspace/agillm4_ckpts \
43
  --save_every_sec "${AGILLM4_SAVE_EVERY_SEC:-21600}" \
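
The main runner still passes a fixed `--target_tokens 150000000000` while `--token_param_ratio` is now overridable, so the two values can drift apart when `AGILLM4_TOKEN_PARAM_RATIO` is changed. Assuming the budget is meant to equal parameter count times ratio (the commit does not show how the trainer combines the two flags), a quick consistency check might look like this. `token_budget` is a hypothetical helper, and the 1.5B parameter count is rounded from the `~1.5B` entry for `agillm4_main` in the preset table.

```shell
#!/usr/bin/env bash
# Hypothetical helper: token budget = parameter count * tokens-per-parameter ratio.
token_budget() {
  local params="$1" ratio="$2"
  echo $(( params * ratio ))
}

# ~1.5B parameters (approximate) at the default ratio of 100.
token_budget 1500000000 100   # prints 150000000000
```

At the default ratio the result matches the hardcoded `--target_tokens` value; any other ratio would call for updating `--target_tokens` by hand.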