MediaStreamAI committed on
Commit
809df71
·
verified ·
1 Parent(s): 5112f48

Correct training hyperparameters: SEQ=4096 (not 2048), GRAD_ACCUM_STEPS=32 (not 8). Training is at full architecture context length; no RoPE extrapolation needed for inference.

  1. README.md +3 -3
README.md CHANGED
@@ -298,8 +298,8 @@ Forward return:
  |---|---|
  | Learning rate | 1e-5 |
  | Gradient clip | 10.0 |
- | Effective batch size | 8 (BATCH_PHYSICAL=1 × GRAD_ACCUM_STEPS=8) |
- | Sequence length (training) | 2048 |
+ | Effective batch size | 32 (BATCH_PHYSICAL=1 × GRAD_ACCUM_STEPS=32) |
+ | Sequence length (training) | 4096 |
  | Optimiser | AdamW (β₁=0.9, β₂=0.95) |
  | Weight decay | 0.1 |
  | Warmup steps | 100 |
@@ -307,7 +307,7 @@ Forward return:
  | Hardware | NVIDIA GB10 Blackwell (Grace–Blackwell unified memory, 128GB) |
  | Training site | MSAI Wright Avenue, Dundee — sovereign UK infrastructure |
  
- Training was performed at sequence length **2048** using physical microbatches of 1 with gradient accumulation of 8 (effective batch = 8). The architecture supports 4,096-token inference; 2048 → 4096 is a modest RoPE extrapolation, but long-context behaviour at full 4096 has not been benchmarked at this checkpoint.
+ Training was performed at the full architecture sequence length of **4096** using physical microbatches of 1 with gradient accumulation of 32 (effective batch = 32). Because training and inference share the same context length, no RoPE extrapolation is required for 4096-token inference. Long-context behaviour at full 4096 has been exposed during training but not formally benchmarked at this checkpoint.
  
  ---
 
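
The arithmetic behind the corrected figures can be sketched as below. This is a minimal illustration, not the project's training code: the helper names `effective_batch` and `tokens_per_optimizer_step` are hypothetical, the constants are copied from the README table in this commit, and the token count assumes every microbatch is filled to the full sequence length (an upper bound if sequences are shorter or padded).

```python
# Hyperparameters as stated in the README after this commit.
BATCH_PHYSICAL = 1
GRAD_ACCUM_STEPS = 32
SEQ = 4096

# Values the README stated before this commit, kept for comparison.
OLD_GRAD_ACCUM_STEPS = 8
OLD_SEQ = 2048


def effective_batch(batch_physical: int, grad_accum_steps: int) -> int:
    """Sequences contributing to one optimizer step (hypothetical helper)."""
    return batch_physical * grad_accum_steps


def tokens_per_optimizer_step(batch_physical: int, grad_accum_steps: int, seq_len: int) -> int:
    """Upper bound on tokens per optimizer step, assuming full-length sequences."""
    return effective_batch(batch_physical, grad_accum_steps) * seq_len


if __name__ == "__main__":
    new_tokens = tokens_per_optimizer_step(BATCH_PHYSICAL, GRAD_ACCUM_STEPS, SEQ)
    old_tokens = tokens_per_optimizer_step(BATCH_PHYSICAL, OLD_GRAD_ACCUM_STEPS, OLD_SEQ)
    print("effective batch:", effective_batch(BATCH_PHYSICAL, GRAD_ACCUM_STEPS))
    print("tokens per optimizer step (corrected):", new_tokens)
    print("tokens per optimizer step (previously stated):", old_tokens)
```

Under these assumptions, the corrected values give 32 × 4096 = 131,072 tokens per optimizer step, versus 8 × 2048 = 16,384 for the figures the README previously stated.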