AGILLM4_floor_4090_defaults_and_warmstart
- AGILLM-4.md +30 -34
- README.md +6 -2
- run_agillm4_4090_longblock.sh +19 -8
- run_agillm4_4090_sublinear_probe.sh +13 -3
- run_agillm4_main_b200_b300.sh +9 -2
AGILLM-4.md
CHANGED
|
@@ -9,7 +9,7 @@ AGILLM-4 should not be "AGILLM-3 with a larger `--block`." The next useful versi
|
|
| 9 |
|
| 10 |
The near-term codebase target is a measurable long-context trainer that can survive on current hardware:
|
| 11 |
|
| 12 |
-
- RTX 4090: production 24GB
|
| 13 |
- B200 180GB: serious 2k-16k experiments.
|
| 14 |
- B300 262GB: 8k-64k experiments, then memory-augmented tests.
|
| 15 |
- Multi-GPU/cluster: ring or sequence parallel experiments later.
|
|
@@ -26,36 +26,31 @@ Implemented presets:
|
|
| 26 |
| `agillm4_main` | d=1536, L=32, H=24, rank=192 | ~1.5B | main target |
|
| 27 |
| `agillm4_big` | d=1792, L=36, H=28, rank=224 | ~2.1B | stretch target after memory works |
|
| 28 |
|
| 29 |
-
Default recommendation: train `agillm4_main` if B200/B300 availability is good.
|
| 30 |
|
| 31 |
## 4090 Production Long-Block Plan
|
| 32 |
|
| 33 |
-
The RTX 4090 lane is production. Its job is to keep useful AGILLM training moving when only 24GB VRAM is available
|
| 34 |
|
| 35 |
-
Production recipe
|
| 36 |
|
| 37 |
```bash
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
--attn_backend sdpa \
|
| 44 |
-
--grad_checkpoint \
|
| 45 |
-
--optimizer paged_adamw8bit \
|
| 46 |
-
--sat_every 1 \
|
| 47 |
-
--nat_every 4 \
|
| 48 |
-
--nat_max_tokens 768
|
| 49 |
```
|
| 50 |
|
| 51 |
Important: `--sat_every 1 --nat_every 4` keeps SAT trained every step and NAT active on a cadence that fits 24GB cards. On B200/B300 use `--nat_every 1` for full AR+SAT+NAT every step. The AGILLM-4 code now backprops AR, SAT, and NAT sequentially, so the objective remains joint while peak VRAM is lower than holding all activation graphs at once.
|
| 52 |
|
| 53 |
Escalation ladder on 4090:
|
| 54 |
|
| 55 |
-
1. `block=
|
| 56 |
-
2. `block=
|
| 57 |
-
3. `block=
|
| 58 |
-
4. `block=
|
|
|
|
| 59 |
|
| 60 |
If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
|
| 61 |
|
|
@@ -67,9 +62,9 @@ seed:
|
|
| 67 |
|
| 68 |
```bash
|
| 69 |
python /workspace/agillm-4/build_v4_seed.py \
|
| 70 |
-
--from-ckpt /workspace/
|
| 71 |
-
--v4-preset
|
| 72 |
-
--out /workspace/agillm-4/
|
| 73 |
```
|
| 74 |
|
| 75 |
Construction rules (v3 d=1024 / H=16 / r=128 / L=24 → v4 d=1536 / H=24 / r=192 / L=32):
|
|
@@ -98,14 +93,15 @@ Use with `--warmstart_from`:
|
|
| 98 |
|
| 99 |
```bash
|
| 100 |
python /workspace/agillm-4/nB300_agillm4.py train \
|
| 101 |
-
--preset
|
| 102 |
-
--warmstart_from /workspace/agillm-4/
|
| 103 |
-
--batch_size 1 --block
|
| 104 |
-
--
|
| 105 |
```
|
| 106 |
|
| 107 |
-
The seed file is
|
| 108 |
-
|
|
|
|
| 109 |
|
| 110 |
## First Implemented Scaffold
|
| 111 |
|
|
@@ -135,13 +131,13 @@ This gives real VRAM and throughput data before committing to long training.
|
|
| 135 |
|
| 136 |
```bash
|
| 137 |
python /workspace/agillm-4/profile_agillm4.py \
|
| 138 |
-
--preset
|
| 139 |
-
--block
|
| 140 |
--batch_size 1 \
|
| 141 |
--backends sdpa,sublinear \
|
| 142 |
--grad_checkpoint \
|
| 143 |
--amp \
|
| 144 |
-
--json_out /workspace/
|
| 145 |
```
|
| 146 |
|
| 147 |
Use this before changing architecture. The profiler reports AR core forward,
|
|
@@ -177,8 +173,8 @@ On a 4090 production lane, first probe:
|
|
| 177 |
|
| 178 |
```bash
|
| 179 |
python /workspace/agillm-4/block_sweep_agillm4.py \
|
| 180 |
-
--preset
|
| 181 |
-
--blocks
|
| 182 |
--batch_size 1 \
|
| 183 |
--attn_backend sublinear \
|
| 184 |
--sublinear_window 256 \
|
|
@@ -277,7 +273,7 @@ Status: wired into `Encoder` as a single `AnchorMemoryLayer` inserted after a co
|
|
| 277 |
|
| 278 |
```bash
|
| 279 |
python /workspace/agillm-4/nB300_agillm4.py train \
|
| 280 |
-
--preset
|
| 281 |
--anchor_memory --anchor_stride 256 --anchor_max 2048
|
| 282 |
```
|
| 283 |
|
|
|
|
| 9 |
|
| 10 |
The near-term codebase target is a measurable long-context trainer that can survive on current hardware:
|
| 11 |
|
| 12 |
+
- RTX 4090: production 24GB lane for the >1B AGILLM-4 floor shape, then block-size growth while preserving AR+SAT+NAT.
|
| 13 |
- B200 180GB: serious 2k-16k experiments.
|
| 14 |
- B300 262GB: 8k-64k experiments, then memory-augmented tests.
|
| 15 |
- Multi-GPU/cluster: ring or sequence parallel experiments later.
|
|
|
|
| 26 |
| `agillm4_main` | d=1536, L=32, H=24, rank=192 | ~1.5B | main target |
|
| 27 |
| `agillm4_big` | d=1792, L=36, H=28, rank=224 | ~2.1B | stretch target after memory works |
|
| 28 |
|
| 29 |
+
Default recommendation: train `agillm4_main` if B200/B300 availability is good. On a 24GB 4090, start with `agillm4_floor` so the run is still larger than AGILLM-3 while leaving enough VRAM for AR+SAT+NAT.
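The recommendation above amounts to a choice by available VRAM. A minimal sketch of that choice follows; `choose_preset` is a hypothetical helper and the 100000 MiB cutoff is an assumption picked only to separate 24GB cards from B200/B300-class cards, not a value taken from the repository.

```bash
# Hypothetical helper: map total GPU memory (MiB) to an AGILLM-4 preset name.
# The 100000 MiB cutoff is an assumption, not a repository setting.
choose_preset() {
  local total_mib="$1"
  if [ "$total_mib" -ge 100000 ]; then
    echo "agillm4_main"
  else
    echo "agillm4_floor"
  fi
}

# Example: a 24GB card reports roughly 24564 MiB.
choose_preset 24564
```

On a real machine the argument could come from `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`.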
|
| 30 |
|
| 31 |
## 4090 Production Long-Block Plan
|
| 32 |
|
| 33 |
+
The RTX 4090 lane is production. Its job is to keep useful AGILLM-4 training moving when only 24GB VRAM is available without accidentally dropping back to the AGILLM-3-sized `large` preset.
|
| 34 |
|
| 35 |
+
Production first-run recipe on 4090:
|
| 36 |
|
| 37 |
```bash
|
| 38 |
+
AGILLM4_4090_WARMSTART_FROM=/workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
|
| 39 |
+
AGILLM4_4090_PRESET=agillm4_floor \
|
| 40 |
+
AGILLM4_4090_BLOCK=512 \
|
| 41 |
+
AGILLM4_4090_TOKEN_PARAM_RATIO=100 \
|
| 42 |
+
bash /workspace/agillm-4/run_agillm4_4090_longblock.sh
|
| 43 |
```
|
| 44 |
|
| 45 |
Important: `--sat_every 1 --nat_every 4` keeps SAT trained every step and NAT active on a cadence that fits 24GB cards. On B200/B300 use `--nat_every 1` for full AR+SAT+NAT every step. The AGILLM-4 code now backprops AR, SAT, and NAT sequentially, so the objective remains joint while peak VRAM is lower than holding all activation graphs at once.
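To make the cadence concrete, the sketch below lists which objectives would be active on each step. It assumes that an objective with cadence N runs when `step % N == 0` and that AR runs on every step; the trainer's actual scheduling rule is not shown in this change, so treat `active_objectives` as illustrative only.

```bash
# Assumption: an objective with cadence N runs on steps where step % N == 0.
# AR is assumed to run on every step.
active_objectives() {
  local step="$1" sat_every="$2" nat_every="$3"
  local out="AR"
  if [ $(( step % sat_every )) -eq 0 ]; then out="$out SAT"; fi
  if [ $(( step % nat_every )) -eq 0 ]; then out="$out NAT"; fi
  echo "$out"
}

# With --sat_every 1 --nat_every 4, NAT joins on every fourth step.
for step in 1 2 3 4; do
  echo "step $step: $(active_objectives "$step" 1 4)"
done
```

Under the same assumption, `--nat_every 1` activates all three objectives on every step.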
|
| 46 |
|
| 47 |
Escalation ladder on 4090:
|
| 48 |
|
| 49 |
+
1. `block=512`
|
| 50 |
+
2. `block=640`
|
| 51 |
+
3. `block=768`
|
| 52 |
+
4. `block=1024`
|
| 53 |
+
5. `block=1280+` only after measured VRAM headroom
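The ladder above can be written down as a dry run that prints one launch command per rung. This is a sketch, not part of the repository: `print_ladder` is a hypothetical helper, nothing is executed, and each rung is meant to be launched by hand once the previous one has run stably. The `1280+` rung is left out because it depends on measured VRAM headroom.

```bash
# Dry run: print one launch command per rung of the ladder; nothing is executed.
LAUNCHER=/workspace/agillm-4/run_agillm4_4090_longblock.sh

print_ladder() {
  local block
  for block in 512 640 768 1024; do
    echo "AGILLM4_4090_BLOCK=$block bash $LAUNCHER"
  done
}

print_ladder
```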
|
| 54 |
|
| 55 |
If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
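A pre-flight check can enforce the "install rather than fall back" rule. The sketch below is an assumption about how such a guard might look: `require_8bit_optimizer` is a hypothetical helper that tries to import `bitsandbytes` with the given interpreter and refuses to continue when the import fails.

```bash
# Hypothetical guard: succeed only if bitsandbytes can be imported.
# $1 is the Python interpreter to probe (defaults to python3).
require_8bit_optimizer() {
  local py="${1:-python3}"
  if "$py" -c 'import bitsandbytes' >/dev/null 2>&1; then
    echo "paged_adamw8bit"
  else
    echo "bitsandbytes missing: run 'pip install bitsandbytes' before launching" >&2
    return 1
  fi
}
```

A launcher could call `require_8bit_optimizer || exit 1` before starting the run, so a missing package stops the job instead of silently changing the optimizer.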
|
| 56 |
|
|
|
|
| 62 |
|
| 63 |
```bash
|
| 64 |
python /workspace/agillm-4/build_v4_seed.py \
|
| 65 |
+
--from-ckpt /workspace/ckpts_sat_fixed_polish_v47_20260523/final.pt \
|
| 66 |
+
--v4-preset floor \
|
| 67 |
+
--out /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt
|
| 68 |
```
|
| 69 |
|
| 70 |
Construction rules (v3 d=1024 / H=16 / r=128 / L=24 → v4 d=1536 / H=24 / r=192 / L=32):
|
|
|
|
| 93 |
|
| 94 |
```bash
|
| 95 |
python /workspace/agillm-4/nB300_agillm4.py train \
|
| 96 |
+
--preset agillm4_floor \
|
| 97 |
+
--warmstart_from /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
|
| 98 |
+
--batch_size 1 --block 512 --amp --grad_checkpoint --sat_every 1 --nat_every 4 \
|
| 99 |
+
--token_param_ratio 100
|
| 100 |
```
|
| 101 |
|
| 102 |
+
The seed file is several GB on disk (fp32 tensors). Rebuild it whenever a newer v3
|
| 103 |
+
final is preferred. For B200/B300, use `--v4-preset main` and train
|
| 104 |
+
`agillm4_main`; for 4090, use `--v4-preset floor`.
|
| 105 |
|
| 106 |
## First Implemented Scaffold
|
| 107 |
|
|
|
|
| 131 |
|
| 132 |
```bash
|
| 133 |
python /workspace/agillm-4/profile_agillm4.py \
|
| 134 |
+
--preset agillm4_floor \
|
| 135 |
+
--block 512 \
|
| 136 |
--batch_size 1 \
|
| 137 |
--backends sdpa,sublinear \
|
| 138 |
--grad_checkpoint \
|
| 139 |
--amp \
|
| 140 |
+
--json_out /workspace/agillm4_floor_profile_512.json
|
| 141 |
```
|
| 142 |
|
| 143 |
Use this before changing architecture. The profiler reports AR core forward,
|
|
|
|
| 173 |
|
| 174 |
```bash
|
| 175 |
python /workspace/agillm-4/block_sweep_agillm4.py \
|
| 176 |
+
--preset agillm4_floor \
|
| 177 |
+
--blocks 512,640,768,1024,1280 \
|
| 178 |
--batch_size 1 \
|
| 179 |
--attn_backend sublinear \
|
| 180 |
--sublinear_window 256 \
|
|
|
|
| 273 |
|
| 274 |
```bash
|
| 275 |
python /workspace/agillm-4/nB300_agillm4.py train \
|
| 276 |
+
--preset agillm4_floor --batch_size 1 --block 512 --amp --grad_checkpoint \
|
| 277 |
--anchor_memory --anchor_stride 256 --anchor_max 2048
|
| 278 |
```
|
| 279 |
|
README.md
CHANGED
|
@@ -14,8 +14,8 @@ AGILLM-4 is the next training target after AGILLM-3. The current code is a
|
|
| 14 |
production-oriented starting point, copied from the proven single-file trainer
|
| 15 |
and extended for:
|
| 16 |
|
| 17 |
-
- ~1.5B
|
| 18 |
-
- 100 tokens per parameter target ratio
|
| 19 |
- longer block-size work on 24GB, B200, and B300 class GPUs
|
| 20 |
- AR+SAT+NAT training, with sequential backward to reduce peak VRAM
|
| 21 |
- SDPA and experimental sublinear local+landmark attention backends
|
|
@@ -28,4 +28,8 @@ Start with [AGILLM-4.md](AGILLM-4.md) for the training plan and command
|
|
| 28 |
recipes. The current sublinear backend is intentionally experimental: profile it
|
| 29 |
against SDPA before using it for a real run.
|
| 30 |
|
| 31 |
Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).
|
|
|
|
| 14 |
production-oriented starting point, copied from the proven single-file trainer
|
| 15 |
and extended for:
|
| 16 |
|
| 17 |
+
- >1B parameter floor preset (`agillm4_floor`) and ~1.5B main preset (`agillm4_main`)
|
| 18 |
+
- 100 tokens per parameter target ratio, above the AGILLM-3 training ratio
|
| 19 |
- longer block-size work on 24GB, B200, and B300 class GPUs
|
| 20 |
- AR+SAT+NAT training, with sequential backward to reduce peak VRAM
|
| 21 |
- SDPA and experimental sublinear local+landmark attention backends
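The tokens-per-parameter target in the list above translates directly into a token budget: parameter count times ratio. A small worked example, with `token_budget` as a hypothetical helper and 1.5B taken as the approximate size of `agillm4_main`:

```bash
# Token budget = parameter count x tokens-per-parameter ratio.
token_budget() {
  local params="$1" ratio="$2"
  echo $(( params * ratio ))
}

# ~1.5B parameters at a ratio of 100:
token_budget 1500000000 100   # prints 150000000000
```

The result matches the `--target_tokens 150000000000` value passed in `run_agillm4_main_b200_b300.sh`.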
|
|
|
|
| 28 |
recipes. The current sublinear backend is intentionally experimental: profile it
|
| 29 |
against SDPA before using it for a real run.
|
| 30 |
|
| 31 |
+
On RTX 4090-class 24GB cards, `run_agillm4_4090_longblock.sh` now defaults to
|
| 32 |
+
`agillm4_floor` instead of the AGILLM-3-sized `large` preset. Override
|
| 33 |
+
`AGILLM4_4090_BLOCK` upward only after the first floor run is stable.
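The override mechanism is ordinary shell defaulting: `${VAR:-default}` keeps the default unless the caller sets the variable. The sketch below shows the idea, together with an optional warm-start flag that is only added when a path is supplied. `build_train_args` is a hypothetical helper; it uses positional parameters instead of the launcher's array so that it also runs under plain `sh`.

```bash
# Hypothetical helper: assemble trainer arguments from AGILLM4_4090_* overrides.
build_train_args() {
  local preset="${AGILLM4_4090_PRESET:-agillm4_floor}"
  local block="${AGILLM4_4090_BLOCK:-512}"

  # Start with the always-present flags.
  set -- --preset "$preset" --block "$block"

  # Append the optional flag only when a warm-start path was provided.
  if [ -n "${AGILLM4_4090_WARMSTART_FROM:-}" ]; then
    set -- "$@" --warmstart_from "$AGILLM4_4090_WARMSTART_FROM"
  fi

  printf '%s\n' "$*"
}

build_train_args
```

For example, `AGILLM4_4090_BLOCK=768 build_train_args` changes only the `--block` value and leaves the preset at its default.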
|
| 34 |
+
|
| 35 |
Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).
|
run_agillm4_4090_longblock.sh
CHANGED
|
@@ -12,16 +12,27 @@ if [ -f /root/.cache/huggingface/token ]; then
|
|
| 12 |
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 13 |
fi
|
| 14 |
|
| 15 |
-
|
| 16 |
|
| 17 |
echo "START_AGILLM4_4090_LONG_BLOCK $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
|
| 18 |
-
echo "This is production AGILLM
|
| 19 |
-
echo "preset=$
|
| 20 |
|
| 21 |
exec python -u /workspace/agillm-4/nB300_agillm4.py train \
|
| 22 |
-
--preset "$
|
|
|
|
| 23 |
--batch_size 1 \
|
| 24 |
-
--block "$
|
| 25 |
--amp \
|
| 26 |
--attn_backend sdpa \
|
| 27 |
--grad_checkpoint \
|
|
@@ -30,9 +41,9 @@ exec python -u /workspace/agillm-4/nB300_agillm4.py train \
|
|
| 30 |
--nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
|
| 31 |
--nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
|
| 32 |
--nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
|
| 33 |
-
--nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-
|
| 34 |
-
--token_param_ratio
|
| 35 |
-
--save_dir
|
| 36 |
--save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-21600}" \
|
| 37 |
--delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
|
| 38 |
--delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-8}" \
|
|
|
|
| 12 |
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 13 |
fi
|
| 14 |
|
| 15 |
+
PRESET="${AGILLM4_4090_PRESET:-agillm4_floor}"
|
| 16 |
+
BLOCK="${AGILLM4_4090_BLOCK:-512}"
|
| 17 |
+
TOKEN_PARAM_RATIO="${AGILLM4_4090_TOKEN_PARAM_RATIO:-100}"
|
| 18 |
+
SAVE_DIR="${AGILLM4_4090_SAVE_DIR:-/workspace/agillm4_4090_ckpts}"
|
| 19 |
+
|
| 20 |
+
WARMSTART_ARGS=()
|
| 21 |
+
if [ -n "${AGILLM4_4090_WARMSTART_FROM:-}" ]; then
|
| 22 |
+
WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_4090_WARMSTART_FROM")
|
| 23 |
+
fi
|
| 24 |
+
|
| 25 |
+
mkdir -p "$SAVE_DIR"
|
| 26 |
|
| 27 |
echo "START_AGILLM4_4090_LONG_BLOCK $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
|
| 28 |
+
echo "This is production AGILLM-4 training on 24GB, not a local toy test."
|
| 29 |
+
echo "preset=$PRESET block=$BLOCK token_param_ratio=$TOKEN_PARAM_RATIO sat_every=1 nat_every=${AGILLM4_4090_NAT_EVERY:-4} warmstart=${AGILLM4_4090_WARMSTART_FROM:-none}"
|
| 30 |
|
| 31 |
exec python -u /workspace/agillm-4/nB300_agillm4.py train \
|
| 32 |
+
--preset "$PRESET" \
|
| 33 |
+
"${WARMSTART_ARGS[@]}" \
|
| 34 |
--batch_size 1 \
|
| 35 |
+
--block "$BLOCK" \
|
| 36 |
--amp \
|
| 37 |
--attn_backend sdpa \
|
| 38 |
--grad_checkpoint \
|
|
|
|
| 41 |
--nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
|
| 42 |
--nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
|
| 43 |
--nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
|
| 44 |
+
--nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-512}" \
|
| 45 |
+
--token_param_ratio "$TOKEN_PARAM_RATIO" \
|
| 46 |
+
--save_dir "$SAVE_DIR" \
|
| 47 |
--save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-21600}" \
|
| 48 |
--delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
|
| 49 |
--delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-8}" \
|
run_agillm4_4090_sublinear_probe.sh
CHANGED
|
@@ -3,10 +3,20 @@ set -euo pipefail
|
|
| 3 |
|
| 4 |
cd "$(dirname "$0")"
|
| 5 |
|
| 6 |
python -u ./nB300_agillm4.py train \
|
| 7 |
-
--preset "$
|
|
|
|
| 8 |
--batch_size "${AGILLM4_BATCH:-1}" \
|
| 9 |
-
--block "$
|
| 10 |
--amp \
|
| 11 |
--attn_backend sublinear \
|
| 12 |
--sublinear_window "${AGILLM4_SUBLINEAR_WINDOW:-256}" \
|
|
@@ -20,5 +30,5 @@ python -u ./nB300_agillm4.py train \
|
|
| 20 |
--nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
|
| 21 |
--nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
|
| 22 |
--nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-768}" \
|
| 23 |
-
--token_param_ratio
|
| 24 |
--save_dir "${AGILLM4_SAVE_DIR:-/workspace/ckpts_agillm4_sublinear_4090}"
|
|
|
|
| 3 |
|
| 4 |
cd "$(dirname "$0")"
|
| 5 |
|
| 6 |
+
PRESET="${AGILLM4_PRESET:-agillm4_floor}"
|
| 7 |
+
BLOCK="${AGILLM4_BLOCK:-768}"
|
| 8 |
+
TOKEN_PARAM_RATIO="${AGILLM4_TOKEN_PARAM_RATIO:-100}"
|
| 9 |
+
|
| 10 |
+
WARMSTART_ARGS=()
|
| 11 |
+
if [ -n "${AGILLM4_WARMSTART_FROM:-}" ]; then
|
| 12 |
+
WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_WARMSTART_FROM")
|
| 13 |
+
fi
|
| 14 |
+
|
| 15 |
python -u ./nB300_agillm4.py train \
|
| 16 |
+
--preset "$PRESET" \
|
| 17 |
+
"${WARMSTART_ARGS[@]}" \
|
| 18 |
--batch_size "${AGILLM4_BATCH:-1}" \
|
| 19 |
+
--block "$BLOCK" \
|
| 20 |
--amp \
|
| 21 |
--attn_backend sublinear \
|
| 22 |
--sublinear_window "${AGILLM4_SUBLINEAR_WINDOW:-256}" \
|
|
|
|
| 30 |
--nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
|
| 31 |
--nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
|
| 32 |
--nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-768}" \
|
| 33 |
+
--token_param_ratio "$TOKEN_PARAM_RATIO" \
|
| 34 |
--save_dir "${AGILLM4_SAVE_DIR:-/workspace/ckpts_agillm4_sublinear_4090}"
|
run_agillm4_main_b200_b300.sh
CHANGED
|
@@ -12,13 +12,20 @@ if [ -f /root/.cache/huggingface/token ]; then
|
|
| 12 |
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 13 |
fi
|
| 14 |
|
| 15 |
mkdir -p /workspace/agillm4_ckpts
|
| 16 |
|
| 17 |
echo "START_AGILLM4_MAIN $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
|
| 18 |
-
echo "preset=agillm4_main target_tokens=150000000000 token_param_ratio=
|
| 19 |
|
| 20 |
exec python -u /workspace/agillm-4/nB300_agillm4.py train \
|
| 21 |
--preset agillm4_main \
|
|
|
|
| 22 |
--batch_size "${AGILLM4_BATCH:-1}" \
|
| 23 |
--block "${AGILLM4_BLOCK:-2048}" \
|
| 24 |
--amp \
|
|
@@ -30,7 +37,7 @@ exec python -u /workspace/agillm-4/nB300_agillm4.py train \
|
|
| 30 |
--nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
|
| 31 |
--nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
|
| 32 |
--nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-0}" \
|
| 33 |
-
--token_param_ratio
|
| 34 |
--target_tokens 150000000000 \
|
| 35 |
--save_dir /workspace/agillm4_ckpts \
|
| 36 |
--save_every_sec "${AGILLM4_SAVE_EVERY_SEC:-21600}" \
|
|
|
|
| 12 |
export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
|
| 13 |
fi
|
| 14 |
|
| 15 |
+
TOKEN_PARAM_RATIO="${AGILLM4_TOKEN_PARAM_RATIO:-100}"
|
| 16 |
+
WARMSTART_ARGS=()
|
| 17 |
+
if [ -n "${AGILLM4_WARMSTART_FROM:-}" ]; then
|
| 18 |
+
WARMSTART_ARGS+=(--warmstart_from "$AGILLM4_WARMSTART_FROM")
|
| 19 |
+
fi
|
| 20 |
+
|
| 21 |
mkdir -p /workspace/agillm4_ckpts
|
| 22 |
|
| 23 |
echo "START_AGILLM4_MAIN $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(hostname)"
|
| 24 |
+
echo "preset=agillm4_main target_tokens=150000000000 token_param_ratio=$TOKEN_PARAM_RATIO block=${AGILLM4_BLOCK:-2048} sat_every=1 nat_every=${AGILLM4_NAT_EVERY:-1} warmstart=${AGILLM4_WARMSTART_FROM:-none}"
|
| 25 |
|
| 26 |
exec python -u /workspace/agillm-4/nB300_agillm4.py train \
|
| 27 |
--preset agillm4_main \
|
| 28 |
+
"${WARMSTART_ARGS[@]}" \
|
| 29 |
--batch_size "${AGILLM4_BATCH:-1}" \
|
| 30 |
--block "${AGILLM4_BLOCK:-2048}" \
|
| 31 |
--amp \
|
|
|
|
| 37 |
--nat_loss_weight "${AGILLM4_NAT_LOSS_WEIGHT:-1.0}" \
|
| 38 |
--nat_expand "${AGILLM4_NAT_EXPAND:-2}" \
|
| 39 |
--nat_max_tokens "${AGILLM4_NAT_MAX_TOKENS:-0}" \
|
| 40 |
+
--token_param_ratio "$TOKEN_PARAM_RATIO" \
|
| 41 |
--target_tokens 150000000000 \
|
| 42 |
--save_dir /workspace/agillm4_ckpts \
|
| 43 |
--save_every_sec "${AGILLM4_SAVE_EVERY_SEC:-21600}" \
|