IvanHU committed
Commit ec3be61 (verified) · 1 Parent(s): 5a7e3bf

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.

Files changed (50):
  1. .gitattributes +8 -0
  2. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/8k-100.sh +65 -0
  3. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/config.json +57 -0
  4. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_1B.json +29 -0
  5. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_340M.json +26 -0
  6. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_1B.json +22 -0
  7. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_340M.json +22 -0
  8. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_1_340M.json +50 -0
  9. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_340M.json +24 -0
  10. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_7B.json +25 -0
  11. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gsa_340M.json +29 -0
  12. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/hgrn2_340M.json +20 -0
  13. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_1B.json +32 -0
  14. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_340M.json +32 -0
  15. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_6_1_340M.json +50 -0
  16. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_1B.json +30 -0
  17. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_340M.json +30 -0
  18. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/samba_1B.json +52 -0
  19. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/sba_340m.json +18 -0
  20. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_1B.json +22 -0
  21. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_340M.json +18 -0
  22. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_7B.json +21 -0
  23. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/generation_config.json +7 -0
  24. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/0/error.json +1 -0
  25. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/0/stderr.log +463 -0
  26. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/0/stdout.log +0 -0
  27. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/1/error.json +1 -0
  28. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/1/stderr.log +387 -0
  29. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/1/stdout.log +0 -0
  30. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/2/error.json +1 -0
  31. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/2/stderr.log +387 -0
  32. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/2/stdout.log +0 -0
  33. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/3/error.json +1 -0
  34. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/3/stderr.log +387 -0
  35. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/3/stdout.log +0 -0
  36. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/4/error.json +1 -0
  37. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/4/stderr.log +387 -0
  38. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/4/stdout.log +0 -0
  39. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/5/error.json +1 -0
  40. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/5/stderr.log +387 -0
  41. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/5/stdout.log +0 -0
  42. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/6/error.json +1 -0
  43. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/6/stderr.log +387 -0
  44. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/6/stdout.log +0 -0
  45. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/7/error.json +1 -0
  46. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/7/stderr.log +387 -0
  47. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/7/stdout.log +0 -0
  48. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_77qh1j5t/attempt_0/0/error.json +1 -0
  49. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_77qh1j5t/attempt_0/0/stderr.log +467 -0
  50. mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_77qh1j5t/attempt_0/0/stdout.log +0 -0
.gitattributes CHANGED
@@ -57,3 +57,11 @@ gdn_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn
  gdn_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_z0tiim1_/attempt_0/5/stderr.log filter=lfs diff=lfs merge=lfs -text
  gdn_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_z0tiim1_/attempt_0/6/stderr.log filter=lfs diff=lfs merge=lfs -text
  gdn_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_z0tiim1_/attempt_0/7/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/0/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/1/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/2/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/3/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/4/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/5/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/6/stderr.log filter=lfs diff=lfs merge=lfs -text
+ mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_v3h3fbcf/attempt_0/7/stderr.log filter=lfs diff=lfs merge=lfs -text
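These attribute lines route the eight new per-rank stderr logs through Git LFS and mark them as binary (`filter=lfs diff=lfs merge=lfs -text`); they are the entries `git lfs track "<path>"` would append, and huggingface_hub appears to add them automatically when uploaded files exceed the Hub's LFS size threshold.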
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/8k-100.sh ADDED
@@ -0,0 +1,65 @@
+ FLAME_PATH=/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame
+ DATASET_ROOT=/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset
+ TOKENIZER=/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer
+
+ cd $FLAME_PATH
+ source .venv/bin/activate
+
+ # =========== train config ===========
+ CONFIG=${1:-transformer_340M.json}
+ SEQ_LEN=8192
+ WARMUP_STEPS=100
+ STEPS=95366
+ LR=3e-4
+ BATCH_SIZE=8
+ GAS=2
+ DECAY_TYPE=linear
+ DECAY_RATIO=1
+ NNODE=1
+ NGPU=8
+ LOG_RANK=0
+ # ====================================
+
+ # if jq command is not found, install it
+ if ! command -v jq &> /dev/null; then
+     echo "jq could not be found, installing it..."
+     sudo yum install -y jq
+ fi
+
+ EXP_NAME=$(basename $CONFIG | sed 's/\.config//')-ctx${SEQ_LEN}-steps${STEPS}-lr${LR}-decay_type${DECAY_TYPE}-decay_ratio${DECAY_RATIO}-bs${BATCH_SIZE}-nn${NNODE}-gas${GAS}
+
+ bash train.sh \
+     --job.config_file flame/models/fla.toml \
+     --job.dump_folder $FLAME_PATH/exp/$EXP_NAME \
+     --model.config $FLAME_PATH/configs/$CONFIG \
+     --model.tokenizer_path $TOKENIZER \
+     --optimizer.name AdamW \
+     --optimizer.eps 1e-8 \
+     --optimizer.lr $LR \
+     --lr_scheduler.warmup_steps $WARMUP_STEPS \
+     --lr_scheduler.lr_min 0.01 \
+     --lr_scheduler.decay_type $DECAY_TYPE \
+     --lr_scheduler.decay_ratio $DECAY_RATIO \
+     --training.batch_size $BATCH_SIZE \
+     --training.seq_len $SEQ_LEN \
+     --training.context_len $SEQ_LEN \
+     --training.gradient_accumulation_steps $GAS \
+     --training.steps $STEPS \
+     --training.max_norm 1.0 \
+     --training.skip_nan_inf \
+     --training.dataset $DATASET_ROOT/fineweb-edu-sample,$DATASET_ROOT/small_repos_20B_sample_merged,$DATASET_ROOT/megamath-web-pro \
+     --training.data_probs 0.55,0.3,0.15 \
+     --training.dataset_split train,train,train \
+     --training.dataset_name default,default,default \
+     --training.streaming \
+     --training.num_workers 32 \
+     --training.prefetch_factor 2 \
+     --training.seed 42 \
+     --training.compile \
+     --checkpoint.interval 8192 \
+     --checkpoint.load_step -1 \
+     --checkpoint.keep_latest_k 100 \
+     --metrics.log_freq 1 \
+     --metrics.enable_tensorboard \
+     --training.streaming
+
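For reference, this experiment corresponds to invoking the script as `bash 8k-100.sh mamba2_6_1_340M.json` (inferred from the repository name, not stated in the diff): EXP_NAME then expands to mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2, i.e. this repository's folder name. The `.json` extension survives in the name because `sed 's/\.config//'` strips a `.config` suffix these config files do not carry; note also that `--training.streaming` is passed twice, which is redundant but harmless.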
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "architectures": [
+     "Mamba2ForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "norm_eps": 1e-05,
+   "num_heads": 32,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_limit": [
+     0.0,
+     Infinity
+   ],
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.3",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "use_l2warp": false,
+   "vocab_size": 32000
+ }
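The config above can be rebuilt into a model object for inspection. A minimal sketch, assuming the flash-linear-attention package (imported as `fla`, as in the tracebacks below) is installed and that importing it registers this custom `mamba2` config/model pair with transformers' Auto classes:

    import fla  # noqa: F401 -- assumed side effect: registers fla configs/models with transformers
    from transformers import AutoConfig, AutoModelForCausalLM

    # Hypothetical local path: the directory holding the config.json shown above.
    repo = "mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2"
    config = AutoConfig.from_pretrained(repo)
    print(config.model_type, config.num_hidden_layers)  # mamba2 48
    print(config.attn["layers"])                        # [5, 11, 17, 23], the hybrid attention blocks
    model = AutoModelForCausalLM.from_config(config)    # randomly initialized; load weights separately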
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_1B.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "attn": null,
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "delta_net",
+   "norm_eps": 1e-06,
+   "num_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 2,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": false,
+   "use_output_norm": true,
+   "use_short_conv": true
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_340M.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "delta_net",
+   "norm_eps": 1e-06,
+   "num_heads": 8,
+   "num_hidden_layers": 24,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": false,
+   "use_output_norm": true,
+   "use_short_conv": true
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_1B.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_v": 2,
+   "fuse_cross_entropy": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "num_heads": 6,
+   "num_hidden_layers": 21,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gate": true,
+   "use_short_conv": true
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_340M.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_v": 2,
+   "fuse_cross_entropy": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "num_heads": 6,
+   "num_hidden_layers": 21,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gate": true,
+   "use_short_conv": true
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_1_340M.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "architectures": [
+     "GatedDeltaNetForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "max_position_embeddings": 8192,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "norm_first": false,
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": true,
+   "use_output_norm": true,
+   "use_short_conv": true,
+   "vocab_size": 32000
+ }
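A note on the `6_1` naming (an inference from the indices, not documented in the diff): with 24 layers and softmax attention at layers 5, 11, 17, and 23, every sixth block is an attention layer and the other five are gated-DeltaNet blocks:

    # Hypothetical check of the assumed layout: one attention block per six layers.
    attn_layers = [i for i in range(24) if i % 6 == 5]
    print(attn_layers)  # [5, 11, 17, 23] -- matches "attn.layers" in the config above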
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_340M.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "clamp_min": null,
+   "eos_token_id": 2,
+   "expand_k": 0.5,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gla",
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gk": true,
+   "use_gv": false,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_7B.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "attn": null,
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "expand_k": 0.5,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 11008,
+   "model_type": "gla",
+   "norm_eps": 1e-06,
+   "num_heads": 16,
+   "num_hidden_layers": 32,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gk": true,
+   "use_gv": false,
+   "use_output_gate": true,
+   "use_short_conv": false
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gsa_340M.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "elementwise_affine": false,
+   "feature_map": "swish",
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "gate_logit_normalizer": 4,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gsa",
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "num_slots": 64,
+   "norm_eps": 1e-06,
+   "share_conv_kernel": true,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_norm": true,
+   "use_output_gate": true,
+   "use_rope": false,
+   "use_short_conv": false
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/hgrn2_340M.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "expand_ratio": 128,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "hgrn2",
+   "num_heads": 8,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_1B.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "norm_eps": 1e-05,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_340M.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "norm_eps": 1e-05,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_6_1_340M.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "architectures": [
+     "Mamba2ForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "norm_eps": 1e-05,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_1B.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token_id": 1,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "model_type": "mamba",
+   "norm_eps": 1e-05,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": false,
+   "residual_in_fp32": false,
+   "state_size": 16,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_init_scheme": "random",
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "time_step_scale": 1.0,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_340M.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token_id": 1,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "model_type": "mamba",
+   "norm_eps": 1e-05,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": false,
+   "residual_in_fp32": false,
+   "state_size": 16,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_init_scheme": "random",
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "time_step_scale": 1.0,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/samba_1B.json ADDED
@@ -0,0 +1,52 @@
+ {
+   "attn": {
+     "layers": [
+       1,
+       3,
+       5,
+       7,
+       9,
+       11,
+       13,
+       15,
+       17
+     ],
+     "num_heads": 18,
+     "num_kv_heads": 18,
+     "qkv_bias": false,
+     "rope_theta": 10000.0,
+     "window_size": 2048
+   },
+   "bos_token_id": 1,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2304,
+   "initializer_range": 0.02,
+   "intermediate_size": 4608,
+   "max_position_embeddings": 2048,
+   "model_type": "samba",
+   "norm_eps": 1e-05,
+   "num_hidden_layers": 18,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": false,
+   "residual_in_fp32": false,
+   "state_size": 16,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_init_scheme": "random",
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 144,
+   "time_step_scale": 1.0,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/sba_340m.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "attention_bias": false,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_size": 1024,
+   "initializer_range": 0.006,
+   "max_position_embeddings": 8192,
+   "model_type": "sba",
+   "num_heads": 16,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_1B.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "bos_token_id": 1,
+   "elementwise_affine": true,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "max_position_embeddings": 8192,
+   "model_type": "transformer",
+   "norm_eps": 1e-06,
+   "num_heads": 32,
+   "num_hidden_layers": 24,
+   "num_kv_heads": null,
+   "pad_token_id": 2,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": false
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_340M.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "attention_bias": false,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 8192,
+   "model_type": "transformer",
+   "num_heads": 16,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_7B.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "attention_bias": false,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "model_type": "transformer",
+   "norm_eps": 1e-06,
+   "num_heads": 32,
+   "num_hidden_layers": 32,
+   "num_kv_heads": 8,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "window_size": null
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "transformers_version": "4.53.3"
+ }
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/0/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1114693 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, 
**kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1114693 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
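The stderr log below shows why the allocation is so large: with `selective_state_update` and `causal_conv1d` missing, the mamba2 layer falls back to the naive `torch_forward` path, whose `G_intermediate = C[...] * B[...]` broadcast materializes a huge chunked tensor. A back-of-the-envelope check (the exact shape layout is an assumption) reproduces the 256 GiB figure from this run's settings:

    # Assumed broadcast shape: (batch, num_chunks, chunk, chunk, num_heads, state), float32.
    batch, seq_len, chunk = 8, 8192, 256   # batch size, sequence length, chunk size from this run
    heads, state = 32, 128                 # num_heads and state_size from config.json above
    num_chunks = seq_len // chunk          # 32 chunks per sequence
    elems = batch * num_chunks * chunk * chunk * heads * state
    print(elems * 4 / 2**30)               # -> 256.0, the GiB the allocator asked for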
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/0/stderr.log ADDED
@@ -0,0 +1,463 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [titan] 2025-07-23 14:27:46,756 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-23 14:27:46,756 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "float32",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-23 14:27:46,756 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-23 14:27:46,757 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-23 14:27:46,772 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-23 14:27:46,951 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-23 14:27:46,951 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-23 14:27:46,951 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-23 14:27:47,501 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-23 14:27:47,998 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ `trust_remote_code` is not supported anymore.
147
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
148
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
149
+ [titan] 2025-07-23 14:27:47,998 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
150
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
151
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
152
+ [titan] 2025-07-23 14:27:48,644 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
153
+ IterableDataset({
154
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
155
+ num_shards: 140
156
+ })
157
+ [titan] 2025-07-23 14:27:48,644 - root - INFO - Shuffling the dataset with seed 42
158
+ [titan] 2025-07-23 14:27:48,645 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
159
+ `trust_remote_code` is not supported anymore.
160
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
161
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
162
+ [titan] 2025-07-23 14:27:48,645 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
163
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
164
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
165
+ `trust_remote_code` is not supported anymore.
166
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
167
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
168
+ [titan] 2025-07-23 14:28:39,750 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
169
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
170
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
171
+ [titan] 2025-07-23 14:28:39,881 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
172
+ IterableDataset({
173
+ features: ['repo', 'content'],
174
+ num_shards: 1
175
+ })
176
+ [titan] 2025-07-23 14:28:39,881 - root - INFO - Shuffling the dataset with seed 42
177
+ [titan] 2025-07-23 14:28:39,882 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
178
+ `trust_remote_code` is not supported anymore.
179
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
180
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
181
+ [titan] 2025-07-23 14:28:39,882 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
182
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
183
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
184
+ `trust_remote_code` is not supported anymore.
185
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
186
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
187
+ [titan] 2025-07-23 14:28:40,150 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
188
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
189
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
190
+ [titan] 2025-07-23 14:28:40,316 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
191
+ IterableDataset({
192
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
193
+ num_shards: 100
194
+ })
195
+ [titan] 2025-07-23 14:28:40,316 - root - INFO - Shuffling the dataset with seed 42
196
+ [titan] 2025-07-23 14:28:40,316 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
197
+ `trust_remote_code` is not supported anymore.
198
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
199
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
200
+ [titan] 2025-07-23 14:28:40,316 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
201
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
202
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
203
+ [titan] 2025-07-23 14:28:46,507 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
204
+ [titan] 2025-07-23 14:28:47,196 - root - INFO - IterableDataset({
205
+ features: ['text', 'content'],
206
+ num_shards: 256
207
+ })
208
+ [titan] 2025-07-23 14:28:47,310 - root - INFO - Building dataloader...
209
+ [titan] 2025-07-23 14:28:47,312 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
210
+ [titan] 2025-07-23 14:28:47,314 - root - INFO - Building model from the config
211
+ Mamba2Config {
212
+ "architectures": [
213
+ "Mamba2ForCausalLM"
214
+ ],
215
+ "attn": {
216
+ "layers": [
217
+ 5,
218
+ 11,
219
+ 17,
220
+ 23
221
+ ],
222
+ "num_heads": 16,
223
+ "num_kv_heads": 8,
224
+ "qkv_bias": false,
225
+ "rope_theta": 160000.0,
226
+ "window_size": null
227
+ },
228
+ "attn_mode": "chunk",
229
+ "bos_token_id": 1,
230
+ "chunk_size": 256,
231
+ "conv_kernel": 4,
232
+ "eos_token_id": 2,
233
+ "expand": 2,
234
+ "fuse_cross_entropy": true,
235
+ "fuse_norm": true,
236
+ "fuse_swiglu": true,
237
+ "head_dim": 64,
238
+ "hidden_act": "silu",
239
+ "hidden_size": 1024,
240
+ "initializer_range": 0.02,
241
+ "model_type": "mamba2",
242
+ "n_groups": 1,
243
+ "norm_eps": 1e-05,
244
+ "num_heads": 32,
245
+ "num_hidden_layers": 48,
246
+ "pad_token_id": 0,
247
+ "rescale_prenorm_residual": true,
248
+ "residual_in_fp32": true,
249
+ "rms_norm": true,
250
+ "state_size": 128,
251
+ "tie_word_embeddings": false,
252
+ "time_step_floor": 0.0001,
253
+ "time_step_limit": [
254
+ 0.0,
255
+ Infinity
256
+ ],
257
+ "time_step_max": 0.1,
258
+ "time_step_min": 0.001,
259
+ "time_step_rank": 128,
260
+ "transformers_version": "4.53.3",
261
+ "use_bias": false,
262
+ "use_cache": true,
263
+ "use_conv_bias": true,
264
+ "use_l2warp": false,
265
+ "vocab_size": 32000
266
+ }
267
+ 
268
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
269
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
270
+ [titan] 2025-07-23 14:28:50,265 - root - INFO - 
271
+ Mamba2ForCausalLM(
272
+ (backbone): Mamba2Model(
273
+ (embeddings): Embedding(32000, 1024)
274
+ (layers): ModuleList(
275
+ (0-47): 48 x Mamba2Block(
276
+ (norm): RMSNorm(1024, eps=1e-05)
277
+ (mixer): Mamba2(
278
+ (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
279
+ (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
280
+ (norm): RMSNormGated()
281
+ (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
282
+ )
283
+ )
284
+ )
285
+ (norm_f): RMSNorm(1024, eps=1e-05)
286
+ )
287
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
288
+ (criterion): FusedLinearCrossEntropyLoss()
289
+ )
290
+
291
+ [titan] 2025-07-23 14:28:50,317 - root - INFO - Compiling each block with torch.compile
292
+ [titan] 2025-07-23 14:28:50,318 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
293
+ [titan] 2025-07-23 14:28:50,318 - root - WARNING - No norm found in model
294
+ [titan] 2025-07-23 14:28:50,318 - root - INFO - Compiling the entire model with torch.compile
295
+ [titan] 2025-07-23 14:28:50,540 - root - INFO - Applied FSDP to the model
296
+ [titan] 2025-07-23 14:28:50,884 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
297
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
298
+ [titan] 2025-07-23 14:28:51,272 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
299
+ [titan] 2025-07-23 14:28:51,273 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
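The adjustment above is plain clamping: warmup plus decay may not exceed total steps, so decay absorbs the overlap.

    warmup_steps, decay_steps, total_steps = 100, 95_366, 95_366
    if warmup_steps + decay_steps > total_steps:
        decay_steps = total_steps - warmup_steps  # 95_366 - 100 = 95_266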
300
+ [titan] 2025-07-23 14:28:51,297 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
301
+ wandb: Network error (InvalidURL), entering retry loop.
302
+ wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
303
+ wandb: Network error (InvalidURL), entering retry loop.
304
+ [titan] 2025-07-23 14:30:44,436 - root - ERROR - Failed to create WandB logger: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
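The W&B failure is a plain network timeout, and the error text names its own workaround, quoted here as code:

    import wandb

    # Straight from the message above: raise the init timeout.
    run = wandb.init(settings=wandb.Settings(init_timeout=120))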
305
+ [titan] 2025-07-23 14:30:44,442 - root - INFO - TensorBoard logging enabled. Logs will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/tb/20250723-1428
306
+ [titan] 2025-07-23 14:30:44,442 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
307
+ [titan] 2025-07-23 14:30:44,527 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
308
+ [titan] 2025-07-23 14:30:50,781 - root - INFO - ***** Running training *****
309
+ [titan] 2025-07-23 14:30:50,784 - root - INFO -  Training starts at step 1
310
+ [titan] 2025-07-23 14:30:50,784 - root - INFO -  Number of tokens per sequence = 8,192
311
+ [titan] 2025-07-23 14:30:50,784 - root - INFO -  Gradient Accumulation steps = 2
312
+ [titan] 2025-07-23 14:30:50,785 - root - INFO -  Instantaneous batch size (per device) = 8
313
+ [titan] 2025-07-23 14:30:50,785 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
314
+ [titan] 2025-07-23 14:30:50,785 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
315
+ [titan] 2025-07-23 14:30:50,785 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
316
+ [titan] 2025-07-23 14:30:50,785 - root - INFO -  Number of parameters = 382,387,712 
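The token counts in this banner follow from the batch geometry and check out exactly (8 dp_shard ranks, per the device mesh reported in the log):

    per_device, dp_ranks, grad_accum, seq_len = 8, 8, 2, 8192
    global_batch = per_device * dp_ranks * grad_accum  # 128 sequences
    tokens_per_step = global_batch * seq_len           # 1,048,576
    total_tokens = tokens_per_step * 95_366            # 99,998,498,816 (~100B)
    warmup_tokens = tokens_per_step * 100              # 104,857,600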
317
+ [titan] 2025-07-23 14:30:50,785 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
318
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
319
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
320
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
321
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
322
+ Traceback (most recent call last):
323
+ File "<frozen runpy>", line 198, in _run_module_as_main
324
+ File "<frozen runpy>", line 88, in _run_code
325
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
326
+ main(config)
327
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
328
+ return f(*args, **kwargs)
329
+ ^^^^^^^^^^^^^^^^^^
330
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
331
+ output = model(
332
+ ^^^^^^
333
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
334
+ return self._call_impl(*args, **kwargs)
335
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
336
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
337
+ return inner()
338
+ ^^^^^^^
339
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
340
+ result = forward_call(*args, **kwargs)
341
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
342
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
343
+ return func(*args, **kwargs)
344
+ ^^^^^^^^^^^^^^^^^^^^^
345
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
346
+ outputs = self.backbone(
347
+ ^^^^^^^^^^^^^^
348
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
349
+ return self._call_impl(*args, **kwargs)
350
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
351
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
352
+ return forward_call(*args, **kwargs)
353
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
354
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
355
+ hidden_states = mixer_block(
356
+ ^^^^^^^^^^^^
357
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
358
+ return self._call_impl(*args, **kwargs)
359
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
360
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
361
+ return inner()
362
+ ^^^^^^^
363
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
364
+ result = forward_call(*args, **kwargs)
365
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
366
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
367
+ return fn(*args, **kwargs)
368
+ ^^^^^^^^^^^^^^^^^^^
369
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
370
+ return self._call_impl(*args, **kwargs)
371
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
372
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
373
+ return forward_call(*args, **kwargs)
374
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
375
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
376
+ hidden_states = self.norm(hidden_states)
377
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
378
+ hidden_states = self.mixer(
379
+ ^^^^^^^^^^^
380
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
381
+ return self._call_impl(*args, **kwargs)
382
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
383
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
384
+ return forward_call(*args, **kwargs)
385
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
386
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
387
+ return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
388
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
389
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
390
+ G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
391
+ ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
392
+ torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1114693 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
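The 256.00 GiB request is exactly what the naive chunked scan materializes at this shape. Reading the layout off the indexing pattern, `(batch, n_chunks, chunk, chunk, heads, state)` after broadcasting, computed in float32 by the slow path, the count works out exactly (layout assumed from the indexing, arithmetic exact):

    batch, n_chunks, chunk, heads, state = 8, 8192 // 256, 256, 32, 128
    elements = batch * n_chunks * chunk * chunk * heads * state  # 68,719,476,736
    print(elements * 4 / 2**30)  # 256.0 GiB in float32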
393
+ [rank0]: Traceback (most recent call last):
394
+ [rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
395
+ [rank0]: File "<frozen runpy>", line 88, in _run_code
396
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
397
+ [rank0]: main(config)
398
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
399
+ [rank0]: return f(*args, **kwargs)
400
+ [rank0]: ^^^^^^^^^^^^^^^^^^
401
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
402
+ [rank0]: output = model(
403
+ [rank0]: ^^^^^^
404
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
405
+ [rank0]: return self._call_impl(*args, **kwargs)
406
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
407
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
408
+ [rank0]: return inner()
409
+ [rank0]: ^^^^^^^
410
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
411
+ [rank0]: result = forward_call(*args, **kwargs)
412
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
413
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
414
+ [rank0]: return func(*args, **kwargs)
415
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^
416
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
417
+ [rank0]: outputs = self.backbone(
418
+ [rank0]: ^^^^^^^^^^^^^^
419
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
420
+ [rank0]: return self._call_impl(*args, **kwargs)
421
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
422
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
423
+ [rank0]: return forward_call(*args, **kwargs)
424
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
425
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
426
+ [rank0]: hidden_states = mixer_block(
427
+ [rank0]: ^^^^^^^^^^^^
428
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
429
+ [rank0]: return self._call_impl(*args, **kwargs)
430
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
431
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
432
+ [rank0]: return inner()
433
+ [rank0]: ^^^^^^^
434
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
435
+ [rank0]: result = forward_call(*args, **kwargs)
436
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
437
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
438
+ [rank0]: return fn(*args, **kwargs)
439
+ [rank0]: ^^^^^^^^^^^^^^^^^^^
440
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
441
+ [rank0]: return self._call_impl(*args, **kwargs)
442
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
443
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
444
+ [rank0]: return forward_call(*args, **kwargs)
445
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
446
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
447
+ [rank0]: hidden_states = self.norm(hidden_states)
448
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
449
+ [rank0]: hidden_states = self.mixer(
450
+ [rank0]: ^^^^^^^^^^^
451
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
452
+ [rank0]: return self._call_impl(*args, **kwargs)
453
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
454
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
455
+ [rank0]: return forward_call(*args, **kwargs)
456
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
457
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
458
+ [rank0]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
459
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
460
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
461
+ [rank0]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
462
+ [rank0]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
463
+ [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1114693 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
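Note that the allocator hint quoted in the message cannot rescue this run: a single 256 GiB request never fits in 95 GiB, so the practical fixes are installing the fused kernels (avoiding `torch_forward` entirely) or shrinking per-device batch size or sequence length. For completeness, the quoted setting is an environment variable that must be in place before the first CUDA allocation:

    import os

    # From the error text; mitigates fragmentation, not oversized requests.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"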
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/0/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/1/error.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 1 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003901 has 316.00 MiB memory in use. Process 696029 has 316.00 MiB memory in use. Process 1114694 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, 
**kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 1 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003901 has 316.00 MiB memory in use. Process 696029 has 316.00 MiB memory in use. Process 1114694 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/1/stderr.log ADDED
@@ -0,0 +1,387 @@
 
 
1
+ [titan] 2025-07-23 14:27:46,323 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-23 14:27:46,323 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "float32",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-23 14:27:46,324 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-23 14:27:47,255 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-23 14:27:47,258 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-23 14:27:47,324 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-23 14:27:47,324 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
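Because the H20's peak is unknown to the framework, MFU is computed against the A100 BF16 figure (3.120e14 FLOPS). The ratio itself is simple; a generic sketch with hypothetical throughput numbers:

    peak_flops = 3.120e14               # A100 BF16 fallback, per the log
    flops_per_token = 6 * 382_387_712   # rough 6N rule of thumb for training
    tokens_per_sec = 100_000            # hypothetical measured throughput
    mfu = flops_per_token * tokens_per_sec / peak_flops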
137
+ [titan] 2025-07-23 14:27:47,324 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-23 14:27:47,411 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
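The repr above is a stock Llama tokenizer; reloading it is a single call (path shortened for illustration):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("flame/tokenizer")  # LlamaTokenizerFast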
145
+ [titan] 2025-07-23 14:27:47,998 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ `trust_remote_code` is not supported anymore.
147
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
148
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
149
+ [titan] 2025-07-23 14:27:47,998 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
150
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
151
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
152
+ [titan] 2025-07-23 14:27:48,493 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
153
+ IterableDataset({
154
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
155
+ num_shards: 140
156
+ })
157
+ [titan] 2025-07-23 14:27:48,493 - root - INFO - Shuffling the dataset with seed 42
158
+ [titan] 2025-07-23 14:27:48,493 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
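This warning fires for every subset with fewer than 256 shards (8 data parallel workers × 32 dataloader workers). A hedged sketch of the equivalent `datasets` operation:

    from datasets import load_dataset

    # Load non-streaming, then re-expose with enough shards for
    # 8 ranks x 32 workers; mirrors the logged resharding.
    ds = load_dataset("parquet", data_files="/data/subset/*.parquet", split="train")
    ds = ds.to_iterable_dataset(num_shards=256)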
159
+ `trust_remote_code` is not supported anymore.
160
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
161
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
162
+ [titan] 2025-07-23 14:27:48,493 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
163
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
164
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
165
+ `trust_remote_code` is not supported anymore.
166
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
167
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
168
+ [titan] 2025-07-23 14:28:41,064 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
169
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
170
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
171
+ [titan] 2025-07-23 14:28:41,096 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
172
+ IterableDataset({
173
+ features: ['repo', 'content'],
174
+ num_shards: 1
175
+ })
176
+ [titan] 2025-07-23 14:28:41,096 - root - INFO - Shuffling the dataset with seed 42
177
+ [titan] 2025-07-23 14:28:41,096 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
178
+ `trust_remote_code` is not supported anymore.
179
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
180
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
181
+ [titan] 2025-07-23 14:28:41,097 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
182
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
183
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
184
+ `trust_remote_code` is not supported anymore.
185
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
186
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
187
+ [titan] 2025-07-23 14:28:41,357 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
188
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
189
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
190
+ [titan] 2025-07-23 14:28:41,441 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
191
+ IterableDataset({
192
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
193
+ num_shards: 100
194
+ })
195
+ [titan] 2025-07-23 14:28:41,441 - root - INFO - Shuffling the dataset with seed 42
196
+ [titan] 2025-07-23 14:28:41,441 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
197
+ `trust_remote_code` is not supported anymore.
198
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
199
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
200
+ [titan] 2025-07-23 14:28:41,441 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
201
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
202
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
203
+ [titan] 2025-07-23 14:28:47,757 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
204
+ [titan] 2025-07-23 14:28:48,445 - root - INFO - IterableDataset({
205
+ features: ['text', 'content'],
206
+ num_shards: 256
207
+ })
208
+ [titan] 2025-07-23 14:28:48,560 - root - INFO - Building dataloader...
209
+ [titan] 2025-07-23 14:28:48,562 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
210
+ [titan] 2025-07-23 14:28:48,564 - root - INFO - Building model from the config
211
+ Mamba2Config {
212
+ "architectures": [
213
+ "Mamba2ForCausalLM"
214
+ ],
215
+ "attn": {
216
+ "layers": [
217
+ 5,
218
+ 11,
219
+ 17,
220
+ 23
221
+ ],
222
+ "num_heads": 16,
223
+ "num_kv_heads": 8,
224
+ "qkv_bias": false,
225
+ "rope_theta": 160000.0,
226
+ "window_size": null
227
+ },
228
+ "attn_mode": "chunk",
229
+ "bos_token_id": 1,
230
+ "chunk_size": 256,
231
+ "conv_kernel": 4,
232
+ "eos_token_id": 2,
233
+ "expand": 2,
234
+ "fuse_cross_entropy": true,
235
+ "fuse_norm": true,
236
+ "fuse_swiglu": true,
237
+ "head_dim": 64,
238
+ "hidden_act": "silu",
239
+ "hidden_size": 1024,
240
+ "initializer_range": 0.02,
241
+ "model_type": "mamba2",
242
+ "n_groups": 1,
243
+ "norm_eps": 1e-05,
244
+ "num_heads": 32,
245
+ "num_hidden_layers": 48,
246
+ "pad_token_id": 0,
247
+ "rescale_prenorm_residual": true,
248
+ "residual_in_fp32": true,
249
+ "rms_norm": true,
250
+ "state_size": 128,
251
+ "tie_word_embeddings": false,
252
+ "time_step_floor": 0.0001,
253
+ "time_step_limit": [
254
+ 0.0,
255
+ Infinity
256
+ ],
257
+ "time_step_max": 0.1,
258
+ "time_step_min": 0.001,
259
+ "time_step_rank": 128,
260
+ "transformers_version": "4.53.3",
261
+ "use_bias": false,
262
+ "use_cache": true,
263
+ "use_conv_bias": true,
264
+ "use_l2warp": false,
265
+ "vocab_size": 32000
266
+ }
267
+ 
268
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
269
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
270
+ [titan] 2025-07-23 14:28:50,264 - root - INFO - 
271
+ Mamba2ForCausalLM(
272
+ (backbone): Mamba2Model(
273
+ (embeddings): Embedding(32000, 1024)
274
+ (layers): ModuleList(
275
+ (0-47): 48 x Mamba2Block(
276
+ (norm): RMSNorm(1024, eps=1e-05)
277
+ (mixer): Mamba2(
278
+ (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
279
+ (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
280
+ (norm): RMSNormGated()
281
+ (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
282
+ )
283
+ )
284
+ )
285
+ (norm_f): RMSNorm(1024, eps=1e-05)
286
+ )
287
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
288
+ (criterion): FusedLinearCrossEntropyLoss()
289
+ )
290
+
291
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling each block with torch.compile
292
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
293
+ [titan] 2025-07-23 14:28:50,317 - root - WARNING - No norm found in model
294
+ [titan] 2025-07-23 14:28:50,317 - root - INFO - Compiling the entire model with torch.compile
295
+ [titan] 2025-07-23 14:28:50,541 - root - INFO - Applied FSDP to the model
296
+ [titan] 2025-07-23 14:28:50,886 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
297
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
298
+ [titan] 2025-07-23 14:28:51,271 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
299
+ [titan] 2025-07-23 14:28:51,273 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
300
+ [titan] 2025-07-23 14:28:51,297 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
301
+ [titan] 2025-07-23 14:28:51,302 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
302
+ [titan] 2025-07-23 14:28:51,429 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
303
+ [titan] 2025-07-23 14:28:58,662 - root - INFO - ***** Running training *****
304
+ [titan] 2025-07-23 14:28:58,667 - root - INFO -  Training starts at step 1
305
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Number of tokens per sequence = 8,192
306
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Gradient Accumulation steps = 2
307
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Instantaneous batch size (per device) = 8
308
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
309
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
310
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
311
+ [titan] 2025-07-23 14:28:58,671 - root - INFO -  Number of parameters = 382,387,712 
312
+ [titan] 2025-07-23 14:28:58,671 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
313
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
314
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
315
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
316
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
317
+ [rank1]: Traceback (most recent call last):
318
+ [rank1]: File "<frozen runpy>", line 198, in _run_module_as_main
319
+ [rank1]: File "<frozen runpy>", line 88, in _run_code
320
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
321
+ [rank1]: main(config)
322
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
323
+ [rank1]: return f(*args, **kwargs)
324
+ [rank1]: ^^^^^^^^^^^^^^^^^^
325
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
326
+ [rank1]: output = model(
327
+ [rank1]: ^^^^^^
328
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
329
+ [rank1]: return self._call_impl(*args, **kwargs)
330
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
331
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
332
+ [rank1]: return inner()
333
+ [rank1]: ^^^^^^^
334
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
335
+ [rank1]: result = forward_call(*args, **kwargs)
336
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
337
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
338
+ [rank1]: return func(*args, **kwargs)
339
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^
340
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
341
+ [rank1]: outputs = self.backbone(
342
+ [rank1]: ^^^^^^^^^^^^^^
343
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
344
+ [rank1]: return self._call_impl(*args, **kwargs)
345
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
346
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
347
+ [rank1]: return forward_call(*args, **kwargs)
348
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
349
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
350
+ [rank1]: hidden_states = mixer_block(
351
+ [rank1]: ^^^^^^^^^^^^
352
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
353
+ [rank1]: return self._call_impl(*args, **kwargs)
354
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
355
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
356
+ [rank1]: return inner()
357
+ [rank1]: ^^^^^^^
358
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
359
+ [rank1]: result = forward_call(*args, **kwargs)
360
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
361
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
362
+ [rank1]: return fn(*args, **kwargs)
363
+ [rank1]: ^^^^^^^^^^^^^^^^^^^
364
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
365
+ [rank1]: return self._call_impl(*args, **kwargs)
366
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
367
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
368
+ [rank1]: return forward_call(*args, **kwargs)
369
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
370
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
371
+ [rank1]: hidden_states = self.norm(hidden_states)
372
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
373
+ [rank1]: hidden_states = self.mixer(
374
+ [rank1]: ^^^^^^^^^^^
375
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
376
+ [rank1]: return self._call_impl(*args, **kwargs)
377
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
378
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
379
+ [rank1]: return forward_call(*args, **kwargs)
380
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
381
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
382
+ [rank1]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
383
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
384
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
385
+ [rank1]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
386
+ [rank1]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
387
+ [rank1]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 1 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003901 has 316.00 MiB memory in use. Process 696029 has 316.00 MiB memory in use. Process 1114694 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/1/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/2/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 2 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003900 has 316.00 MiB memory in use. Process 696030 has 316.00 MiB memory in use. Process 1114695 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, 
**kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 2 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003900 has 316.00 MiB memory in use. Process 696030 has 316.00 MiB memory in use. Process 1114695 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
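The allocator hint repeated in these messages can be applied before relaunching, though it only mitigates fragmentation; a single 256 GiB request can never fit on a 95 GiB device, so the per-device workload (batch size, context length) or the fallback code path itself still has to change. A minimal sketch that sets the variable the log names before CUDA is initialized:

```python
# Hedged sketch: apply the PYTORCH_CUDA_ALLOC_CONF hint from the OOM message.
# This only reduces fragmentation; it cannot satisfy a 256 GiB allocation.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import torch  # import after setting the variable so the allocator sees it
print(torch.cuda.is_available())
```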
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/2/stderr.log ADDED
@@ -0,0 +1,387 @@
+ [titan] 2025-07-23 14:27:46,385 - root - INFO - Starting job: default job
+ [titan] 2025-07-23 14:27:46,386 - root - INFO - {
+ "activation_checkpoint": {
+ "mode": "none",
+ "selective_ac_option": "2"
+ },
+ "activation_offload": {
+ "mode": "none"
+ },
+ "checkpoint": {
+ "async_mode": "disabled",
+ "create_seed_checkpoint": false,
+ "enable_checkpoint": true,
+ "exclude_from_loading": [],
+ "export_dtype": "float32",
+ "folder": "checkpoint",
+ "interval": 8192,
+ "interval_type": "steps",
+ "keep_latest_k": 100,
+ "load_step": -1,
+ "model_weights_only": false
+ },
+ "comm": {
+ "init_timeout_seconds": 300,
+ "trace_buf_size": 20000,
+ "train_timeout_seconds": 100
+ },
+ "experimental": {
+ "context_parallel_degree": 1,
+ "context_parallel_rotate_method": "allgather",
+ "custom_model_path": "",
+ "enable_async_tensor_parallel": false,
+ "enable_compiled_autograd": false,
+ "pipeline_parallel_degree": 1,
+ "pipeline_parallel_microbatches": null,
+ "pipeline_parallel_schedule": "1F1B",
+ "pipeline_parallel_schedule_csv": "",
+ "pipeline_parallel_split_points": []
+ },
+ "fault_tolerance": {
+ "enable": false,
+ "group_size": 0,
+ "min_replica_size": 1,
+ "replica_id": 0
+ },
+ "float8": {
+ "enable_fsdp_float8_all_gather": false,
+ "force_recompute_fp8_weight_in_bwd": false,
+ "precompute_float8_dynamic_scale_for_fsdp": false,
+ "recipe_name": null
+ },
+ "job": {
+ "config_file": "flame/models/fla.toml",
+ "description": "default job",
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
+ "print_args": true,
+ "use_for_integration_test": false
+ },
+ "lr_scheduler": {
+ "decay_ratio": 1.0,
+ "decay_type": "linear",
+ "lr_min": 0.01,
+ "warmup_steps": 100
+ },
+ "memory_estimation": {
+ "disable_fake_mode": false,
+ "enabled": false
+ },
+ "metrics": {
+ "disable_color_printing": false,
+ "enable_tensorboard": true,
+ "enable_wandb": true,
+ "log_freq": 1,
+ "save_for_all_ranks": false,
+ "save_tb_folder": "tb"
+ },
+ "model": {
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
+ "converters": [],
+ "name": "fla",
+ "print_after_conversion": false,
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
+ },
+ "optimizer": {
+ "early_step_in_backward": false,
+ "eps": 1e-08,
+ "implementation": "fused",
+ "lr": 0.0003,
+ "name": "AdamW"
+ },
+ "profiling": {
+ "enable_memory_snapshot": false,
+ "enable_profiling": true,
+ "profile_freq": 512,
+ "save_memory_snapshot_folder": "memory_snapshot",
+ "save_traces_folder": "profile_trace"
+ },
+ "training": {
+ "batch_size": 8,
+ "compile": true,
+ "context_len": 8192,
+ "data_dir": null,
+ "data_files": null,
+ "data_parallel_replicate_degree": 1,
+ "data_parallel_shard_degree": -1,
+ "data_probs": "0.55,0.3,0.15",
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
+ "dataset_name": "default,default,default",
+ "dataset_split": "train,train,train",
+ "deterministic": false,
+ "disable_loss_parallel": false,
+ "enable_cpu_offload": false,
+ "fsdp_reshard_after_forward": "default",
+ "gc_freq": 50,
+ "gradient_accumulation_steps": 2,
+ "max_norm": 1.0,
+ "mixed_precision_param": "bfloat16",
+ "mixed_precision_reduce": "float32",
+ "num_workers": 32,
+ "persistent_workers": false,
+ "pin_memory": false,
+ "prefetch_factor": 2,
+ "seed": 42,
+ "seq_len": 8192,
+ "skip_nan_inf": true,
+ "steps": 95366,
+ "streaming": true,
+ "tensor_parallel_degree": 1,
+ "varlen": false
+ }
+ }
+ [titan] 2025-07-23 14:27:46,386 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
+ [titan] 2025-07-23 14:27:47,325 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
+ [titan] 2025-07-23 14:27:47,327 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-23 14:27:47,375 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-23 14:27:47,375 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
+ [titan] 2025-07-23 14:27:47,376 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
+ [titan] 2025-07-23 14:27:47,418 - root - INFO - Loading tokenizer...
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ }
+ )
+ [titan] 2025-07-23 14:27:47,998 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:27:47,998 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:27:48,543 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
+ IterableDataset({
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
+ num_shards: 140
+ })
+ [titan] 2025-07-23 14:27:48,543 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:27:48,544 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:27:48,544 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:39,999 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,032 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
+ IterableDataset({
+ features: ['repo', 'content'],
+ num_shards: 1
+ })
+ [titan] 2025-07-23 14:28:40,032 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:28:40,032 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,032 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,289 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,382 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
+ IterableDataset({
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
+ num_shards: 100
+ })
+ [titan] 2025-07-23 14:28:40,382 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:28:40,382 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,382 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:46,791 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
+ [titan] 2025-07-23 14:28:47,494 - root - INFO - IterableDataset({
+ features: ['text', 'content'],
+ num_shards: 256
+ })
+ [titan] 2025-07-23 14:28:47,612 - root - INFO - Building dataloader...
+ [titan] 2025-07-23 14:28:47,614 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
+ [titan] 2025-07-23 14:28:47,616 - root - INFO - Building model from the config
+ Mamba2Config {
+ "architectures": [
+ "Mamba2ForCausalLM"
+ ],
+ "attn": {
+ "layers": [
+ 5,
+ 11,
+ 17,
+ 23
+ ],
+ "num_heads": 16,
+ "num_kv_heads": 8,
+ "qkv_bias": false,
+ "rope_theta": 160000.0,
+ "window_size": null
+ },
+ "attn_mode": "chunk",
+ "bos_token_id": 1,
+ "chunk_size": 256,
+ "conv_kernel": 4,
+ "eos_token_id": 2,
+ "expand": 2,
+ "fuse_cross_entropy": true,
+ "fuse_norm": true,
+ "fuse_swiglu": true,
+ "head_dim": 64,
+ "hidden_act": "silu",
+ "hidden_size": 1024,
+ "initializer_range": 0.02,
+ "model_type": "mamba2",
+ "n_groups": 1,
+ "norm_eps": 1e-05,
+ "num_heads": 32,
+ "num_hidden_layers": 48,
+ "pad_token_id": 0,
+ "rescale_prenorm_residual": true,
+ "residual_in_fp32": true,
+ "rms_norm": true,
+ "state_size": 128,
+ "tie_word_embeddings": false,
+ "time_step_floor": 0.0001,
+ "time_step_limit": [
+ 0.0,
+ Infinity
+ ],
+ "time_step_max": 0.1,
+ "time_step_min": 0.001,
+ "time_step_rank": 128,
+ "transformers_version": "4.53.3",
+ "use_bias": false,
+ "use_cache": true,
+ "use_conv_bias": true,
+ "use_l2warp": false,
+ "vocab_size": 32000
+ }
+ 
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
+ [titan] 2025-07-23 14:28:50,148 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
+ [titan] 2025-07-23 14:28:50,264 - root - INFO - 
+ Mamba2ForCausalLM(
+ (backbone): Mamba2Model(
+ (embeddings): Embedding(32000, 1024)
+ (layers): ModuleList(
+ (0-47): 48 x Mamba2Block(
+ (norm): RMSNorm(1024, eps=1e-05)
+ (mixer): Mamba2(
+ (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
+ (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
+ (norm): RMSNormGated()
+ (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ )
+ )
+ )
+ (norm_f): RMSNorm(1024, eps=1e-05)
+ )
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
+ (criterion): FusedLinearCrossEntropyLoss()
+ )
+
+ [titan] 2025-07-23 14:28:50,315 - root - INFO - Compiling each block with torch.compile
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
+ [titan] 2025-07-23 14:28:50,316 - root - WARNING - No norm found in model
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling the entire model with torch.compile
+ [titan] 2025-07-23 14:28:50,539 - root - INFO - Applied FSDP to the model
+ [titan] 2025-07-23 14:28:50,882 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
+ [titan] 2025-07-23 14:28:51,274 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
+ [titan] 2025-07-23 14:28:51,275 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
+ [titan] 2025-07-23 14:28:51,299 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
+ [titan] 2025-07-23 14:28:51,307 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-23 14:28:51,458 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-23 14:28:58,658 - root - INFO - ***** Running training *****
+ [titan] 2025-07-23 14:28:58,660 - root - INFO -  Training starts at step 1
+ [titan] 2025-07-23 14:28:58,664 - root - INFO -  Number of tokens per sequence = 8,192
+ [titan] 2025-07-23 14:28:58,664 - root - INFO -  Gradient Accumulation steps = 2
+ [titan] 2025-07-23 14:28:58,665 - root - INFO -  Instantaneous batch size (per device) = 8
+ [titan] 2025-07-23 14:28:58,665 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
+ [titan] 2025-07-23 14:28:58,665 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
+ [titan] 2025-07-23 14:28:58,665 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
+ [titan] 2025-07-23 14:28:58,666 - root - INFO -  Number of parameters = 382,387,712 
+ [titan] 2025-07-23 14:28:58,666 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
+ [rank2]: Traceback (most recent call last):
+ [rank2]: File "<frozen runpy>", line 198, in _run_module_as_main
+ [rank2]: File "<frozen runpy>", line 88, in _run_code
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
+ [rank2]: main(config)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+ [rank2]: return f(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
+ [rank2]: output = model(
+ [rank2]: ^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank2]: return inner()
+ [rank2]: ^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank2]: result = forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
+ [rank2]: return func(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
+ [rank2]: outputs = self.backbone(
+ [rank2]: ^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank2]: return forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
+ [rank2]: hidden_states = mixer_block(
+ [rank2]: ^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank2]: return inner()
+ [rank2]: ^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank2]: result = forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
+ [rank2]: return fn(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank2]: return forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
+ [rank2]: hidden_states = self.norm(hidden_states)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
+ [rank2]: hidden_states = self.mixer(
+ [rank2]: ^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank2]: return forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
+ [rank2]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
+ [rank2]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
+ [rank2]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
+ [rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 2 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003900 has 316.00 MiB memory in use. Process 696030 has 316.00 MiB memory in use. Process 1114695 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
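The two `fla.layers.mamba2` warnings earlier in this log explain why execution reaches the memory-hungry `torch_forward` at all: with `selective_state_update` and `causal_conv1d` unavailable, the layer falls back to the naive implementation whose broadcast fails above. A quick probe for the optional kernels, assuming the import paths of the upstream packages linked in the warnings (an assumption; adjust to the installed versions):

```python
# Hedged availability check for the fused kernels named in the warnings.
try:
    from mamba_ssm.ops.triton.selective_state_update import selective_state_update
except ImportError:
    selective_state_update = None  # -> "The fast path is not available" warning
try:
    from causal_conv1d import causal_conv1d_fn
except ImportError:
    causal_conv1d_fn = None        # -> fallback to the Triton conv backend
print("selective_state_update:", "ok" if selective_state_update else "missing")
print("causal_conv1d_fn:", "ok" if causal_conv1d_fn else "missing")
```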
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/2/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/3/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 3 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003902 has 316.00 MiB memory in use. Process 696032 has 316.00 MiB memory in use. Process 1114696 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, 
**kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 3 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003902 has 316.00 MiB memory in use. Process 696032 has 316.00 MiB memory in use. Process 1114696 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/3/stderr.log ADDED
@@ -0,0 +1,387 @@
1
+ [titan] 2025-07-23 14:27:46,466 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-23 14:27:46,466 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "float32",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-23 14:27:46,466 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-23 14:27:47,421 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-23 14:27:47,423 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-23 14:27:47,484 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-23 14:27:47,485 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-23 14:27:47,485 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-23 14:27:47,493 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-23 14:27:47,998 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ `trust_remote_code` is not supported anymore.
147
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
148
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
149
+ [titan] 2025-07-23 14:27:47,999 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
150
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
151
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
152
+ [titan] 2025-07-23 14:27:48,494 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
153
+ IterableDataset({
154
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
155
+ num_shards: 140
156
+ })
157
+ [titan] 2025-07-23 14:27:48,494 - root - INFO - Shuffling the dataset with seed 42
158
+ [titan] 2025-07-23 14:27:48,494 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
159
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:27:48,494 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:39,997 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,028 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
+ IterableDataset({
+ features: ['repo', 'content'],
+ num_shards: 1
+ })
+ [titan] 2025-07-23 14:28:40,029 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:28:40,029 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,029 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,298 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,391 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
+ IterableDataset({
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
+ num_shards: 100
+ })
+ [titan] 2025-07-23 14:28:40,392 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:28:40,392 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,392 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:46,760 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
+ [titan] 2025-07-23 14:28:47,557 - root - INFO - IterableDataset({
+ features: ['text', 'content'],
+ num_shards: 256
+ })
+ [titan] 2025-07-23 14:28:47,677 - root - INFO - Building dataloader...
+ [titan] 2025-07-23 14:28:47,680 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
+ [titan] 2025-07-23 14:28:47,681 - root - INFO - Building model from the config
+ Mamba2Config {
+ "architectures": [
+ "Mamba2ForCausalLM"
+ ],
+ "attn": {
+ "layers": [
+ 5,
+ 11,
+ 17,
+ 23
+ ],
+ "num_heads": 16,
+ "num_kv_heads": 8,
+ "qkv_bias": false,
+ "rope_theta": 160000.0,
+ "window_size": null
+ },
+ "attn_mode": "chunk",
+ "bos_token_id": 1,
+ "chunk_size": 256,
+ "conv_kernel": 4,
+ "eos_token_id": 2,
+ "expand": 2,
+ "fuse_cross_entropy": true,
+ "fuse_norm": true,
+ "fuse_swiglu": true,
+ "head_dim": 64,
+ "hidden_act": "silu",
+ "hidden_size": 1024,
+ "initializer_range": 0.02,
+ "model_type": "mamba2",
+ "n_groups": 1,
+ "norm_eps": 1e-05,
+ "num_heads": 32,
+ "num_hidden_layers": 48,
+ "pad_token_id": 0,
+ "rescale_prenorm_residual": true,
+ "residual_in_fp32": true,
+ "rms_norm": true,
+ "state_size": 128,
+ "tie_word_embeddings": false,
+ "time_step_floor": 0.0001,
+ "time_step_limit": [
+ 0.0,
+ Infinity
+ ],
+ "time_step_max": 0.1,
+ "time_step_min": 0.001,
+ "time_step_rank": 128,
+ "transformers_version": "4.53.3",
+ "use_bias": false,
+ "use_cache": true,
+ "use_conv_bias": true,
+ "use_l2warp": false,
+ "vocab_size": 32000
+ }
+
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
+ [titan] 2025-07-23 14:28:50,148 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
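These two warnings mean the fused `mamba-ssm`/`causal-conv1d` kernels are not installed, so the mixer runs fla's pure-PyTorch `torch_forward`; that is the same frame that raises the out-of-memory error further down. A sketch of the availability probe behind the warnings (the package names come from the URLs in the log; the exact import guard inside fla may differ):

```python
# Probe for the optional fused kernels, mirroring the two warnings above.
try:
    from mamba_ssm.ops.triton.selective_state_update import selective_state_update
except ImportError:          # -> "The fast path is not available ..."
    selective_state_update = None

try:
    import causal_conv1d     # -> "The CUDA backend is not available ..."
except ImportError:
    causal_conv1d = None

# With both missing, the layer falls back to the naive implementation; installing
# them (`pip install mamba-ssm causal-conv1d`) restores the memory-efficient path.
```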
+ [titan] 2025-07-23 14:28:50,265 - root - INFO -
+ Mamba2ForCausalLM(
+ (backbone): Mamba2Model(
+ (embeddings): Embedding(32000, 1024)
+ (layers): ModuleList(
+ (0-47): 48 x Mamba2Block(
+ (norm): RMSNorm(1024, eps=1e-05)
+ (mixer): Mamba2(
+ (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
+ (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
+ (norm): RMSNormGated()
+ (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ )
+ )
+ )
+ (norm_f): RMSNorm(1024, eps=1e-05)
+ )
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
+ (criterion): FusedLinearCrossEntropyLoss()
+ )
+
+ [titan] 2025-07-23 14:28:50,322 - root - INFO - Compiling each block with torch.compile
+ [titan] 2025-07-23 14:28:50,322 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
+ [titan] 2025-07-23 14:28:50,322 - root - WARNING - No norm found in model
+ [titan] 2025-07-23 14:28:50,323 - root - INFO - Compiling the entire model with torch.compile
+ [titan] 2025-07-23 14:28:50,540 - root - INFO - Applied FSDP to the model
+ [titan] 2025-07-23 14:28:50,882 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
+ [titan] 2025-07-23 14:28:51,272 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
+ [titan] 2025-07-23 14:28:51,273 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
+ [titan] 2025-07-23 14:28:51,297 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
+ [titan] 2025-07-23 14:28:51,324 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-23 14:28:51,477 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-23 14:28:58,659 - root - INFO - ***** Running training *****
+ [titan] 2025-07-23 14:28:58,661 - root - INFO - Training starts at step 1
+ [titan] 2025-07-23 14:28:58,663 - root - INFO - Number of tokens per sequence = 8,192
+ [titan] 2025-07-23 14:28:58,663 - root - INFO - Gradient Accumulation steps = 2
+ [titan] 2025-07-23 14:28:58,664 - root - INFO - Instantaneous batch size (per device) = 8
+ [titan] 2025-07-23 14:28:58,664 - root - INFO - Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
+ [titan] 2025-07-23 14:28:58,664 - root - INFO - Total optimization steps = 95,366 (99,998,498,816 tokens)
+ [titan] 2025-07-23 14:28:58,664 - root - INFO - Warmup steps = 100 (104,857,600 tokens)
+ [titan] 2025-07-23 14:28:58,664 - root - INFO - Number of parameters = 382,387,712
+ [titan] 2025-07-23 14:28:58,665 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
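The token counts in this header follow directly from the run config; a quick check with the numbers from the log:

```python
per_device_bs, dp_ranks, grad_accum, seq_len = 8, 8, 2, 8192   # from the job config

global_bs = per_device_bs * dp_ranks * grad_accum   # 128 sequences per step
tokens_per_step = global_bs * seq_len               # 1,048,576 tokens per step
assert tokens_per_step == 1_048_576

assert 95_366 * tokens_per_step == 99_998_498_816   # total steps -> ~100B tokens
assert 100 * tokens_per_step == 104_857_600         # warmup steps
```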
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
+ [rank3]: Traceback (most recent call last):
+ [rank3]: File "<frozen runpy>", line 198, in _run_module_as_main
+ [rank3]: File "<frozen runpy>", line 88, in _run_code
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
+ [rank3]: main(config)
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+ [rank3]: return f(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
+ [rank3]: output = model(
+ [rank3]: ^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank3]: return self._call_impl(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank3]: return inner()
+ [rank3]: ^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank3]: result = forward_call(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
+ [rank3]: return func(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
+ [rank3]: outputs = self.backbone(
+ [rank3]: ^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank3]: return self._call_impl(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank3]: return forward_call(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
+ [rank3]: hidden_states = mixer_block(
+ [rank3]: ^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank3]: return self._call_impl(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank3]: return inner()
+ [rank3]: ^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank3]: result = forward_call(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
+ [rank3]: return fn(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank3]: return self._call_impl(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank3]: return forward_call(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
+ [rank3]: hidden_states = self.norm(hidden_states)
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
+ [rank3]: hidden_states = self.mixer(
+ [rank3]: ^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank3]: return self._call_impl(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank3]: return forward_call(*args, **kwargs)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
+ [rank3]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
+ [rank3]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
+ [rank3]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
+ [rank3]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 3 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003902 has 316.00 MiB memory in use. Process 696032 has 316.00 MiB memory in use. Process 1114696 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
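The 256.00 GiB request is exactly the float32 buffer this broadcast materializes. A back-of-the-envelope check, with the shape layout inferred from the indexing in fla/layers/mamba2.py and the config above (seq_len 8192, chunk_size 256, 32 heads, state_size 128):

```python
batch, seq_len, chunk, heads, state = 8, 8192, 256, 32, 128
num_chunks = seq_len // chunk    # 32

# C[:, :, :, None, :, :] * B[:, :, None, :, :, :] broadcasts to a tensor of shape
# [batch, num_chunks, chunk, chunk, heads, state]:
elements = batch * num_chunks * chunk * chunk * heads * state   # 68_719_476_736 = 2**36
assert elements * 4 / 2**30 == 256.0                            # float32 -> 256.00 GiB
```

Every rank materializes the same buffer, which is why the per-rank error.json files in this commit all report the identical 256 GiB failure.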
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/3/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/4/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 4 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003903 has 316.00 MiB memory in use. Process 696034 has 316.00 MiB memory in use. Process 1114697 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 4 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003903 has 316.00 MiB memory in use. Process 696034 has 316.00 MiB memory in use. Process 1114697 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/4/stderr.log ADDED
@@ -0,0 +1,387 @@
+ [titan] 2025-07-23 14:27:46,452 - root - INFO - Starting job: default job
+ [titan] 2025-07-23 14:27:46,453 - root - INFO - {
+ "activation_checkpoint": {
+ "mode": "none",
+ "selective_ac_option": "2"
+ },
+ "activation_offload": {
+ "mode": "none"
+ },
+ "checkpoint": {
+ "async_mode": "disabled",
+ "create_seed_checkpoint": false,
+ "enable_checkpoint": true,
+ "exclude_from_loading": [],
+ "export_dtype": "float32",
+ "folder": "checkpoint",
+ "interval": 8192,
+ "interval_type": "steps",
+ "keep_latest_k": 100,
+ "load_step": -1,
+ "model_weights_only": false
+ },
+ "comm": {
+ "init_timeout_seconds": 300,
+ "trace_buf_size": 20000,
+ "train_timeout_seconds": 100
+ },
+ "experimental": {
+ "context_parallel_degree": 1,
+ "context_parallel_rotate_method": "allgather",
+ "custom_model_path": "",
+ "enable_async_tensor_parallel": false,
+ "enable_compiled_autograd": false,
+ "pipeline_parallel_degree": 1,
+ "pipeline_parallel_microbatches": null,
+ "pipeline_parallel_schedule": "1F1B",
+ "pipeline_parallel_schedule_csv": "",
+ "pipeline_parallel_split_points": []
+ },
+ "fault_tolerance": {
+ "enable": false,
+ "group_size": 0,
+ "min_replica_size": 1,
+ "replica_id": 0
+ },
+ "float8": {
+ "enable_fsdp_float8_all_gather": false,
+ "force_recompute_fp8_weight_in_bwd": false,
+ "precompute_float8_dynamic_scale_for_fsdp": false,
+ "recipe_name": null
+ },
+ "job": {
+ "config_file": "flame/models/fla.toml",
+ "description": "default job",
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
+ "print_args": true,
+ "use_for_integration_test": false
+ },
+ "lr_scheduler": {
+ "decay_ratio": 1.0,
+ "decay_type": "linear",
+ "lr_min": 0.01,
+ "warmup_steps": 100
+ },
+ "memory_estimation": {
+ "disable_fake_mode": false,
+ "enabled": false
+ },
+ "metrics": {
+ "disable_color_printing": false,
+ "enable_tensorboard": true,
+ "enable_wandb": true,
+ "log_freq": 1,
+ "save_for_all_ranks": false,
+ "save_tb_folder": "tb"
+ },
+ "model": {
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
+ "converters": [],
+ "name": "fla",
+ "print_after_conversion": false,
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
+ },
+ "optimizer": {
+ "early_step_in_backward": false,
+ "eps": 1e-08,
+ "implementation": "fused",
+ "lr": 0.0003,
+ "name": "AdamW"
+ },
+ "profiling": {
+ "enable_memory_snapshot": false,
+ "enable_profiling": true,
+ "profile_freq": 512,
+ "save_memory_snapshot_folder": "memory_snapshot",
+ "save_traces_folder": "profile_trace"
+ },
+ "training": {
+ "batch_size": 8,
+ "compile": true,
+ "context_len": 8192,
+ "data_dir": null,
+ "data_files": null,
+ "data_parallel_replicate_degree": 1,
+ "data_parallel_shard_degree": -1,
+ "data_probs": "0.55,0.3,0.15",
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
+ "dataset_name": "default,default,default",
+ "dataset_split": "train,train,train",
+ "deterministic": false,
+ "disable_loss_parallel": false,
+ "enable_cpu_offload": false,
+ "fsdp_reshard_after_forward": "default",
+ "gc_freq": 50,
+ "gradient_accumulation_steps": 2,
+ "max_norm": 1.0,
+ "mixed_precision_param": "bfloat16",
+ "mixed_precision_reduce": "float32",
+ "num_workers": 32,
+ "persistent_workers": false,
+ "pin_memory": false,
+ "prefetch_factor": 2,
+ "seed": 42,
+ "seq_len": 8192,
+ "skip_nan_inf": true,
+ "steps": 95366,
+ "streaming": true,
+ "tensor_parallel_degree": 1,
+ "varlen": false
+ }
+ }
+ [titan] 2025-07-23 14:27:46,453 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
+ [titan] 2025-07-23 14:27:47,428 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
+ [titan] 2025-07-23 14:27:47,431 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-23 14:27:47,487 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-23 14:27:47,487 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
+ [titan] 2025-07-23 14:27:47,487 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
+ [titan] 2025-07-23 14:27:47,494 - root - INFO - Loading tokenizer...
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ }
+ )
+ [titan] 2025-07-23 14:27:47,998 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:27:47,998 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:27:48,492 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
+ IterableDataset({
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
+ num_shards: 140
+ })
+ [titan] 2025-07-23 14:27:48,492 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:27:48,492 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:27:48,492 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:39,720 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:39,830 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
+ IterableDataset({
+ features: ['repo', 'content'],
+ num_shards: 1
+ })
+ [titan] 2025-07-23 14:28:39,831 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:28:39,831 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:39,831 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,087 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,312 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
+ IterableDataset({
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
+ num_shards: 100
+ })
+ [titan] 2025-07-23 14:28:40,313 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 14:28:40,313 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:40,313 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 14:28:46,672 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
+ [titan] 2025-07-23 14:28:47,386 - root - INFO - IterableDataset({
+ features: ['text', 'content'],
+ num_shards: 256
+ })
+ [titan] 2025-07-23 14:28:47,512 - root - INFO - Building dataloader...
+ [titan] 2025-07-23 14:28:47,515 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
+ [titan] 2025-07-23 14:28:47,516 - root - INFO - Building model from the config
+ Mamba2Config {
+ "architectures": [
+ "Mamba2ForCausalLM"
+ ],
+ "attn": {
+ "layers": [
+ 5,
+ 11,
+ 17,
+ 23
+ ],
+ "num_heads": 16,
+ "num_kv_heads": 8,
+ "qkv_bias": false,
+ "rope_theta": 160000.0,
+ "window_size": null
+ },
+ "attn_mode": "chunk",
+ "bos_token_id": 1,
+ "chunk_size": 256,
+ "conv_kernel": 4,
+ "eos_token_id": 2,
+ "expand": 2,
+ "fuse_cross_entropy": true,
+ "fuse_norm": true,
+ "fuse_swiglu": true,
+ "head_dim": 64,
+ "hidden_act": "silu",
+ "hidden_size": 1024,
+ "initializer_range": 0.02,
+ "model_type": "mamba2",
+ "n_groups": 1,
+ "norm_eps": 1e-05,
+ "num_heads": 32,
+ "num_hidden_layers": 48,
+ "pad_token_id": 0,
+ "rescale_prenorm_residual": true,
+ "residual_in_fp32": true,
+ "rms_norm": true,
+ "state_size": 128,
+ "tie_word_embeddings": false,
+ "time_step_floor": 0.0001,
+ "time_step_limit": [
+ 0.0,
+ Infinity
+ ],
+ "time_step_max": 0.1,
+ "time_step_min": 0.001,
+ "time_step_rank": 128,
+ "transformers_version": "4.53.3",
+ "use_bias": false,
+ "use_cache": true,
+ "use_conv_bias": true,
+ "use_l2warp": false,
+ "vocab_size": 32000
+ }
+
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
+ [titan] 2025-07-23 14:28:50,265 - root - INFO -
+ Mamba2ForCausalLM(
+ (backbone): Mamba2Model(
+ (embeddings): Embedding(32000, 1024)
+ (layers): ModuleList(
+ (0-47): 48 x Mamba2Block(
+ (norm): RMSNorm(1024, eps=1e-05)
+ (mixer): Mamba2(
+ (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
+ (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
+ (norm): RMSNormGated()
+ (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ )
+ )
+ )
+ (norm_f): RMSNorm(1024, eps=1e-05)
+ )
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
+ (criterion): FusedLinearCrossEntropyLoss()
+ )
+
+ [titan] 2025-07-23 14:28:50,317 - root - INFO - Compiling each block with torch.compile
+ [titan] 2025-07-23 14:28:50,317 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
+ [titan] 2025-07-23 14:28:50,317 - root - WARNING - No norm found in model
+ [titan] 2025-07-23 14:28:50,317 - root - INFO - Compiling the entire model with torch.compile
+ [titan] 2025-07-23 14:28:50,541 - root - INFO - Applied FSDP to the model
+ [titan] 2025-07-23 14:28:50,886 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
+ [titan] 2025-07-23 14:28:51,272 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
+ [titan] 2025-07-23 14:28:51,274 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
+ [titan] 2025-07-23 14:28:51,298 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
+ [titan] 2025-07-23 14:28:51,315 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-23 14:28:51,473 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-23 14:28:58,659 - root - INFO - ***** Running training *****
+ [titan] 2025-07-23 14:28:58,667 - root - INFO - Training starts at step 1
+ [titan] 2025-07-23 14:28:58,668 - root - INFO - Number of tokens per sequence = 8,192
+ [titan] 2025-07-23 14:28:58,668 - root - INFO - Gradient Accumulation steps = 2
+ [titan] 2025-07-23 14:28:58,668 - root - INFO - Instantaneous batch size (per device) = 8
+ [titan] 2025-07-23 14:28:58,668 - root - INFO - Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
+ [titan] 2025-07-23 14:28:58,668 - root - INFO - Total optimization steps = 95,366 (99,998,498,816 tokens)
+ [titan] 2025-07-23 14:28:58,668 - root - INFO - Warmup steps = 100 (104,857,600 tokens)
+ [titan] 2025-07-23 14:28:58,668 - root - INFO - Number of parameters = 382,387,712
+ [titan] 2025-07-23 14:28:58,669 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
+ [rank4]: Traceback (most recent call last):
+ [rank4]: File "<frozen runpy>", line 198, in _run_module_as_main
+ [rank4]: File "<frozen runpy>", line 88, in _run_code
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
+ [rank4]: main(config)
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+ [rank4]: return f(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
+ [rank4]: output = model(
+ [rank4]: ^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank4]: return self._call_impl(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank4]: return inner()
+ [rank4]: ^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank4]: result = forward_call(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
+ [rank4]: return func(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
+ [rank4]: outputs = self.backbone(
+ [rank4]: ^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank4]: return self._call_impl(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank4]: return forward_call(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
+ [rank4]: hidden_states = mixer_block(
+ [rank4]: ^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank4]: return self._call_impl(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank4]: return inner()
+ [rank4]: ^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank4]: result = forward_call(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
+ [rank4]: return fn(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank4]: return self._call_impl(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank4]: return forward_call(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
+ [rank4]: hidden_states = self.norm(hidden_states)
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
+ [rank4]: hidden_states = self.mixer(
+ [rank4]: ^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank4]: return self._call_impl(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank4]: return forward_call(*args, **kwargs)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
+ [rank4]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
+ [rank4]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
+ [rank4]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
+ [rank4]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 4 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003903 has 316.00 MiB memory in use. Process 696034 has 316.00 MiB memory in use. Process 1114697 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
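The allocator hint quoted in the message is worth knowing, but it cannot rescue this run: a single 256 GiB request exceeds the H20's 95 GiB capacity outright, so the buffer itself has to shrink (fused kernels, a shorter context, or activation checkpointing, which is set to "none" in this job config). For reference, the setting only takes effect if it is in the environment before CUDA is initialized:

```python
import os

# Reduces fragmentation for workloads with many variable-sized allocations;
# must be set before the first CUDA call (or exported in the launch script).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```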
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/4/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/5/error.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 5 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003905 has 316.00 MiB memory in use. Process 696036 has 316.00 MiB memory in use. Process 1114698 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 5 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003905 has 316.00 MiB memory in use. Process 696036 has 316.00 MiB memory in use. Process 1114698 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/5/stderr.log ADDED
@@ -0,0 +1,387 @@
1
+ [titan] 2025-07-23 14:27:46,350 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-23 14:27:46,350 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "float32",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-23 14:27:46,350 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-23 14:27:47,258 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-23 14:27:47,260 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-23 14:27:47,324 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-23 14:27:47,324 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-23 14:27:47,324 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-23 14:27:47,411 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-23 14:27:47,999 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ `trust_remote_code` is not supported anymore.
147
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
148
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
149
+ [titan] 2025-07-23 14:27:47,999 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
150
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
151
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
152
+ [titan] 2025-07-23 14:27:48,496 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
153
+ IterableDataset({
154
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
155
+ num_shards: 140
156
+ })
157
+ [titan] 2025-07-23 14:27:48,496 - root - INFO - Shuffling the dataset with seed 42
158
+ [titan] 2025-07-23 14:27:48,496 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
159
+ `trust_remote_code` is not supported anymore.
160
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
161
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
162
+ [titan] 2025-07-23 14:27:48,496 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
163
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
164
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
165
+ `trust_remote_code` is not supported anymore.
166
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
167
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
168
+ [titan] 2025-07-23 14:28:39,812 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
169
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
170
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
171
+ [titan] 2025-07-23 14:28:39,845 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
172
+ IterableDataset({
173
+ features: ['repo', 'content'],
174
+ num_shards: 1
175
+ })
176
+ [titan] 2025-07-23 14:28:39,845 - root - INFO - Shuffling the dataset with seed 42
177
+ [titan] 2025-07-23 14:28:39,845 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
178
+ `trust_remote_code` is not supported anymore.
179
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
180
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
181
+ [titan] 2025-07-23 14:28:39,845 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
182
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
183
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
184
+ `trust_remote_code` is not supported anymore.
185
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
186
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
187
+ [titan] 2025-07-23 14:28:40,105 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
188
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
189
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
190
+ [titan] 2025-07-23 14:28:40,312 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
191
+ IterableDataset({
192
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
193
+ num_shards: 100
194
+ })
195
+ [titan] 2025-07-23 14:28:40,312 - root - INFO - Shuffling the dataset with seed 42
196
+ [titan] 2025-07-23 14:28:40,312 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
197
+ `trust_remote_code` is not supported anymore.
198
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
199
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
200
+ [titan] 2025-07-23 14:28:40,312 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
201
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
202
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
203
+ [titan] 2025-07-23 14:28:46,706 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
204
+ [titan] 2025-07-23 14:28:47,415 - root - INFO - IterableDataset({
205
+ features: ['text', 'content'],
206
+ num_shards: 256
207
+ })
208
+ [titan] 2025-07-23 14:28:47,539 - root - INFO - Building dataloader...
209
+ [titan] 2025-07-23 14:28:47,541 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
210
+ [titan] 2025-07-23 14:28:47,543 - root - INFO - Building model from the config
211
+ Mamba2Config {
212
+ "architectures": [
213
+ "Mamba2ForCausalLM"
214
+ ],
215
+ "attn": {
216
+ "layers": [
217
+ 5,
218
+ 11,
219
+ 17,
220
+ 23
221
+ ],
222
+ "num_heads": 16,
223
+ "num_kv_heads": 8,
224
+ "qkv_bias": false,
225
+ "rope_theta": 160000.0,
226
+ "window_size": null
227
+ },
228
+ "attn_mode": "chunk",
229
+ "bos_token_id": 1,
230
+ "chunk_size": 256,
231
+ "conv_kernel": 4,
232
+ "eos_token_id": 2,
233
+ "expand": 2,
234
+ "fuse_cross_entropy": true,
235
+ "fuse_norm": true,
236
+ "fuse_swiglu": true,
237
+ "head_dim": 64,
238
+ "hidden_act": "silu",
239
+ "hidden_size": 1024,
240
+ "initializer_range": 0.02,
241
+ "model_type": "mamba2",
242
+ "n_groups": 1,
243
+ "norm_eps": 1e-05,
244
+ "num_heads": 32,
245
+ "num_hidden_layers": 48,
246
+ "pad_token_id": 0,
247
+ "rescale_prenorm_residual": true,
248
+ "residual_in_fp32": true,
249
+ "rms_norm": true,
250
+ "state_size": 128,
251
+ "tie_word_embeddings": false,
252
+ "time_step_floor": 0.0001,
253
+ "time_step_limit": [
254
+ 0.0,
255
+ Infinity
256
+ ],
257
+ "time_step_max": 0.1,
258
+ "time_step_min": 0.001,
259
+ "time_step_rank": 128,
260
+ "transformers_version": "4.53.3",
261
+ "use_bias": false,
262
+ "use_cache": true,
263
+ "use_conv_bias": true,
264
+ "use_l2warp": false,
265
+ "vocab_size": 32000
266
+ }
267
+ 
268
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
269
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
270
+ [titan] 2025-07-23 14:28:50,264 - root - INFO - 
271
+ Mamba2ForCausalLM(
272
+ (backbone): Mamba2Model(
273
+ (embeddings): Embedding(32000, 1024)
274
+ (layers): ModuleList(
275
+ (0-47): 48 x Mamba2Block(
276
+ (norm): RMSNorm(1024, eps=1e-05)
277
+ (mixer): Mamba2(
278
+ (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
279
+ (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
280
+ (norm): RMSNormGated()
281
+ (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
282
+ )
283
+ )
284
+ )
285
+ (norm_f): RMSNorm(1024, eps=1e-05)
286
+ )
287
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
288
+ (criterion): FusedLinearCrossEntropyLoss()
289
+ )
290
+
291
+ [titan] 2025-07-23 14:28:50,315 - root - INFO - Compiling each block with torch.compile
292
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
293
+ [titan] 2025-07-23 14:28:50,316 - root - WARNING - No norm found in model
294
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling the entire model with torch.compile
295
+ [titan] 2025-07-23 14:28:50,539 - root - INFO - Applied FSDP to the model
296
+ [titan] 2025-07-23 14:28:50,884 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
297
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
298
+ [titan] 2025-07-23 14:28:51,273 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
299
+ [titan] 2025-07-23 14:28:51,275 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
300
+ [titan] 2025-07-23 14:28:51,299 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
301
+ [titan] 2025-07-23 14:28:51,333 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
302
+ [titan] 2025-07-23 14:28:51,479 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
303
+ [titan] 2025-07-23 14:28:59,100 - root - INFO - ***** Running training *****
304
+ [titan] 2025-07-23 14:28:59,101 - root - INFO -  Training starts at step 1
305
+ [titan] 2025-07-23 14:28:59,153 - root - INFO -  Number of tokens per sequence = 8,192
306
+ [titan] 2025-07-23 14:28:59,170 - root - INFO -  Gradient Accumulation steps = 2
307
+ [titan] 2025-07-23 14:28:59,171 - root - INFO -  Instantaneous batch size (per device) = 8
308
+ [titan] 2025-07-23 14:28:59,171 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
309
+ [titan] 2025-07-23 14:28:59,172 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
310
+ [titan] 2025-07-23 14:28:59,172 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
311
+ [titan] 2025-07-23 14:28:59,172 - root - INFO -  Number of parameters = 382,387,712 
312
+ [titan] 2025-07-23 14:28:59,173 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
313
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
314
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
315
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
316
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
317
+ [rank5]: Traceback (most recent call last):
318
+ [rank5]: File "<frozen runpy>", line 198, in _run_module_as_main
319
+ [rank5]: File "<frozen runpy>", line 88, in _run_code
320
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
321
+ [rank5]: main(config)
322
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
323
+ [rank5]: return f(*args, **kwargs)
324
+ [rank5]: ^^^^^^^^^^^^^^^^^^
325
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
326
+ [rank5]: output = model(
327
+ [rank5]: ^^^^^^
328
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
329
+ [rank5]: return self._call_impl(*args, **kwargs)
330
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
331
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
332
+ [rank5]: return inner()
333
+ [rank5]: ^^^^^^^
334
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
335
+ [rank5]: result = forward_call(*args, **kwargs)
336
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
337
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
338
+ [rank5]: return func(*args, **kwargs)
339
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^
340
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
341
+ [rank5]: outputs = self.backbone(
342
+ [rank5]: ^^^^^^^^^^^^^^
343
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
344
+ [rank5]: return self._call_impl(*args, **kwargs)
345
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
346
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
347
+ [rank5]: return forward_call(*args, **kwargs)
348
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
349
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
350
+ [rank5]: hidden_states = mixer_block(
351
+ [rank5]: ^^^^^^^^^^^^
352
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
353
+ [rank5]: return self._call_impl(*args, **kwargs)
354
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
355
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
356
+ [rank5]: return inner()
357
+ [rank5]: ^^^^^^^
358
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
359
+ [rank5]: result = forward_call(*args, **kwargs)
360
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
361
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
362
+ [rank5]: return fn(*args, **kwargs)
363
+ [rank5]: ^^^^^^^^^^^^^^^^^^^
364
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
365
+ [rank5]: return self._call_impl(*args, **kwargs)
366
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
367
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
368
+ [rank5]: return forward_call(*args, **kwargs)
369
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
370
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
371
+ [rank5]: hidden_states = self.norm(hidden_states)
372
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
373
+ [rank5]: hidden_states = self.mixer(
374
+ [rank5]: ^^^^^^^^^^^
375
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
376
+ [rank5]: return self._call_impl(*args, **kwargs)
377
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
378
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
379
+ [rank5]: return forward_call(*args, **kwargs)
380
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
381
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
382
+ [rank5]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
383
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
384
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
385
+ [rank5]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
386
+ [rank5]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
387
+ [rank5]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 5 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003905 has 316.00 MiB memory in use. Process 696036 has 316.00 MiB memory in use. Process 1114698 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
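
Note on the traceback above: the earlier warnings in this log show that `selective_state_update` and `causal_conv1d` are not installed, so the Mamba2 layer falls back to the naive `torch_forward` path, where `G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]` materializes the entire broadcast product as a single tensor. Below is a minimal sketch of estimating such an allocation before running it; the shapes are hypothetical placeholders (the real dimensions depend on the layer's batch, chunk, group, and state sizes), not values taken from this run.

import math
import torch

# Hypothetical shapes for illustration only; the actual values are derived
# from the Mamba2 layer config (batch, n_chunks, chunk_len, n_groups, state_size).
C = torch.empty(8, 32, 256, 1, 128)
B = torch.empty(8, 32, 256, 1, 128)

a = C[:, :, :, None, :, :]  # None adds a broadcast dim; still a view, no copy yet
b = B[:, :, None, :, :, :]
shape = torch.broadcast_shapes(a.shape, b.shape)  # (8, 32, 256, 256, 1, 128)
gib = math.prod(shape) * 4 / 2**30  # the multiply materializes all of this at once (fp32)
print(shape, f"{gib:.2f} GiB")

Even these toy shapes imply an 8 GiB intermediate; the shapes in this run produced the 256.00 GiB request logged above. Installing the fused kernels linked in the warnings (state-spaces/mamba and Dao-AILab/causal-conv1d) avoids materializing this intermediate entirely, whereas lowering batch_size only shrinks it proportionally.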
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/5/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/6/error.json ADDED
@@ -0,0 +1 @@
1
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 6 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003904 has 316.00 MiB memory in use. Process 696035 has 316.00 MiB memory in use. Process 1114699 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 6 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003904 has 316.00 MiB memory in use. Process 696035 has 316.00 MiB memory in use. Process 1114699 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/6/stderr.log ADDED
@@ -0,0 +1,387 @@
1
+ [titan] 2025-07-23 14:27:46,733 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-23 14:27:46,734 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "float32",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-23 14:27:46,734 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-23 14:27:47,466 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-23 14:27:47,468 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-23 14:27:47,519 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-23 14:27:47,519 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-23 14:27:47,519 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-23 14:27:47,528 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-23 14:27:47,998 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ `trust_remote_code` is not supported anymore.
147
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
148
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
149
+ [titan] 2025-07-23 14:27:47,998 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
150
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
151
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
152
+ [titan] 2025-07-23 14:27:48,495 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
153
+ IterableDataset({
154
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
155
+ num_shards: 140
156
+ })
157
+ [titan] 2025-07-23 14:27:48,495 - root - INFO - Shuffling the dataset with seed 42
158
+ [titan] 2025-07-23 14:27:48,495 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
159
+ `trust_remote_code` is not supported anymore.
160
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
161
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
162
+ [titan] 2025-07-23 14:27:48,495 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
163
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
164
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
165
+ `trust_remote_code` is not supported anymore.
166
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
167
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
168
+ [titan] 2025-07-23 14:28:39,968 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
169
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
170
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
171
+ [titan] 2025-07-23 14:28:40,002 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
172
+ IterableDataset({
173
+ features: ['repo', 'content'],
174
+ num_shards: 1
175
+ })
176
+ [titan] 2025-07-23 14:28:40,002 - root - INFO - Shuffling the dataset with seed 42
177
+ [titan] 2025-07-23 14:28:40,002 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
178
+ `trust_remote_code` is not supported anymore.
179
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
180
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
181
+ [titan] 2025-07-23 14:28:40,002 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
182
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
183
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
184
+ `trust_remote_code` is not supported anymore.
185
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
186
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
187
+ [titan] 2025-07-23 14:28:40,261 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
188
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
189
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
190
+ [titan] 2025-07-23 14:28:40,356 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
191
+ IterableDataset({
192
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
193
+ num_shards: 100
194
+ })
195
+ [titan] 2025-07-23 14:28:40,357 - root - INFO - Shuffling the dataset with seed 42
196
+ [titan] 2025-07-23 14:28:40,357 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
197
+ `trust_remote_code` is not supported anymore.
198
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
199
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
200
+ [titan] 2025-07-23 14:28:40,357 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
201
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
202
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
203
+ [titan] 2025-07-23 14:28:46,633 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
204
+ [titan] 2025-07-23 14:28:47,337 - root - INFO - IterableDataset({
205
+ features: ['text', 'content'],
206
+ num_shards: 256
207
+ })
208
+ [titan] 2025-07-23 14:28:47,461 - root - INFO - Building dataloader...
209
+ [titan] 2025-07-23 14:28:47,463 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
210
+ [titan] 2025-07-23 14:28:47,465 - root - INFO - Building model from the config
+ Mamba2Config {
+   "architectures": [
+     "Mamba2ForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "norm_eps": 1e-05,
+   "num_heads": 32,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_limit": [
+     0.0,
+     Infinity
+   ],
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.53.3",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "use_l2warp": false,
+   "vocab_size": 32000
+ }
+ 
268
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
269
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
270
+ [titan] 2025-07-23 14:28:50,263 - root - INFO - 
+ Mamba2ForCausalLM(
+   (backbone): Mamba2Model(
+     (embeddings): Embedding(32000, 1024)
+     (layers): ModuleList(
+       (0-47): 48 x Mamba2Block(
+         (norm): RMSNorm(1024, eps=1e-05)
+         (mixer): Mamba2(
+           (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
+           (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
+           (norm): RMSNormGated()
+           (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
+         )
+       )
+     )
+     (norm_f): RMSNorm(1024, eps=1e-05)
+   )
+   (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
+   (criterion): FusedLinearCrossEntropyLoss()
+ )
+
291
+ [titan] 2025-07-23 14:28:50,315 - root - INFO - Compiling each block with torch.compile
292
+ [titan] 2025-07-23 14:28:50,315 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
293
+ [titan] 2025-07-23 14:28:50,316 - root - WARNING - No norm found in model
294
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling the entire model with torch.compile
295
+ [titan] 2025-07-23 14:28:50,542 - root - INFO - Applied FSDP to the model
296
+ [titan] 2025-07-23 14:28:50,885 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
297
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
298
+ [titan] 2025-07-23 14:28:51,273 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
299
+ [titan] 2025-07-23 14:28:51,274 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
300
+ [titan] 2025-07-23 14:28:51,299 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
301
+ [titan] 2025-07-23 14:28:51,320 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
302
+ [titan] 2025-07-23 14:28:51,477 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
303
+ [titan] 2025-07-23 14:28:58,659 - root - INFO - ***** Running training *****
304
+ [titan] 2025-07-23 14:28:58,665 - root - INFO -  Training starts at step 1
305
+ [titan] 2025-07-23 14:28:58,665 - root - INFO -  Number of tokens per sequence = 8,192
306
+ [titan] 2025-07-23 14:28:58,667 - root - INFO -  Gradient Accumulation steps = 2
307
+ [titan] 2025-07-23 14:28:58,669 - root - INFO -  Instantaneous batch size (per device) = 8
308
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
309
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
310
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
311
+ [titan] 2025-07-23 14:28:58,670 - root - INFO -  Number of parameters = 382,387,712 
312
+ [titan] 2025-07-23 14:28:58,670 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
313
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
314
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
315
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
316
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
317
+ [rank6]: Traceback (most recent call last):
318
+ [rank6]: File "<frozen runpy>", line 198, in _run_module_as_main
319
+ [rank6]: File "<frozen runpy>", line 88, in _run_code
320
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
321
+ [rank6]: main(config)
322
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
323
+ [rank6]: return f(*args, **kwargs)
324
+ [rank6]: ^^^^^^^^^^^^^^^^^^
325
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
326
+ [rank6]: output = model(
327
+ [rank6]: ^^^^^^
328
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
329
+ [rank6]: return self._call_impl(*args, **kwargs)
330
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
331
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
332
+ [rank6]: return inner()
333
+ [rank6]: ^^^^^^^
334
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
335
+ [rank6]: result = forward_call(*args, **kwargs)
336
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
337
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
338
+ [rank6]: return func(*args, **kwargs)
339
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^
340
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
341
+ [rank6]: outputs = self.backbone(
342
+ [rank6]: ^^^^^^^^^^^^^^
343
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
344
+ [rank6]: return self._call_impl(*args, **kwargs)
345
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
346
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
347
+ [rank6]: return forward_call(*args, **kwargs)
348
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
349
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
350
+ [rank6]: hidden_states = mixer_block(
351
+ [rank6]: ^^^^^^^^^^^^
352
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
353
+ [rank6]: return self._call_impl(*args, **kwargs)
354
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
355
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
356
+ [rank6]: return inner()
357
+ [rank6]: ^^^^^^^
358
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
359
+ [rank6]: result = forward_call(*args, **kwargs)
360
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
361
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
362
+ [rank6]: return fn(*args, **kwargs)
363
+ [rank6]: ^^^^^^^^^^^^^^^^^^^
364
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
365
+ [rank6]: return self._call_impl(*args, **kwargs)
366
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
367
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
368
+ [rank6]: return forward_call(*args, **kwargs)
369
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
370
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
371
+ [rank6]: hidden_states = self.norm(hidden_states)
372
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
373
+ [rank6]: hidden_states = self.mixer(
374
+ [rank6]: ^^^^^^^^^^^
375
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
376
+ [rank6]: return self._call_impl(*args, **kwargs)
377
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
378
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
379
+ [rank6]: return forward_call(*args, **kwargs)
380
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
381
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
382
+ [rank6]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
383
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
384
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
385
+ [rank6]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
386
+ [rank6]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
387
+ [rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 6 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003904 has 316.00 MiB memory in use. Process 696035 has 316.00 MiB memory in use. Process 1114699 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
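The two fla.layers.mamba2 warnings earlier in this log point at the likely root cause of the failure above: with the fused kernels absent, the layer falls back to the naive torch_forward, whose chunked outer product G_intermediate is what requests 256.00 GiB. A back-of-the-envelope sketch, assuming a float32 intermediate and the shapes from the job config and Mamba2Config printed above, reproduces the reported allocation:

# Size of the naive-path intermediate
#   G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
# Shapes are taken from the configs above; float32 (4 bytes per element)
# is an assumption that happens to match the reported number exactly.
batch, seq_len, chunk = 8, 8192, 256   # batch_size / seq_len / chunk_size
heads, state = 32, 128                 # num_heads / state_size
n_chunks = seq_len // chunk            # 32 chunks of 256 tokens each
elems = batch * n_chunks * chunk * chunk * heads * state
print(f"{elems * 4 / 2**30:.2f} GiB")  # -> 256.00 GiB, the failed allocation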
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/6/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/7/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 7 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003906 has 316.00 MiB memory in use. Process 696037 has 316.00 MiB memory in use. Process 1114702 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, 
**kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 7 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003906 has 316.00 MiB memory in use. Process 696037 has 316.00 MiB memory in use. Process 1114702 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. 
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753252283"}}}
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/7/stderr.log ADDED
@@ -0,0 +1,387 @@
+ [titan] 2025-07-23 14:27:46,256 - root - INFO - Starting job: default job
+ [titan] 2025-07-23 14:27:46,256 - root - INFO - {
+   "activation_checkpoint": {
+     "mode": "none",
+     "selective_ac_option": "2"
+   },
+   "activation_offload": {
+     "mode": "none"
+   },
+   "checkpoint": {
+     "async_mode": "disabled",
+     "create_seed_checkpoint": false,
+     "enable_checkpoint": true,
+     "exclude_from_loading": [],
+     "export_dtype": "float32",
+     "folder": "checkpoint",
+     "interval": 8192,
+     "interval_type": "steps",
+     "keep_latest_k": 100,
+     "load_step": -1,
+     "model_weights_only": false
+   },
+   "comm": {
+     "init_timeout_seconds": 300,
+     "trace_buf_size": 20000,
+     "train_timeout_seconds": 100
+   },
+   "experimental": {
+     "context_parallel_degree": 1,
+     "context_parallel_rotate_method": "allgather",
+     "custom_model_path": "",
+     "enable_async_tensor_parallel": false,
+     "enable_compiled_autograd": false,
+     "pipeline_parallel_degree": 1,
+     "pipeline_parallel_microbatches": null,
+     "pipeline_parallel_schedule": "1F1B",
+     "pipeline_parallel_schedule_csv": "",
+     "pipeline_parallel_split_points": []
+   },
+   "fault_tolerance": {
+     "enable": false,
+     "group_size": 0,
+     "min_replica_size": 1,
+     "replica_id": 0
+   },
+   "float8": {
+     "enable_fsdp_float8_all_gather": false,
+     "force_recompute_fp8_weight_in_bwd": false,
+     "precompute_float8_dynamic_scale_for_fsdp": false,
+     "recipe_name": null
+   },
+   "job": {
+     "config_file": "flame/models/fla.toml",
+     "description": "default job",
+     "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
+     "print_args": true,
+     "use_for_integration_test": false
+   },
+   "lr_scheduler": {
+     "decay_ratio": 1.0,
+     "decay_type": "linear",
+     "lr_min": 0.01,
+     "warmup_steps": 100
+   },
+   "memory_estimation": {
+     "disable_fake_mode": false,
+     "enabled": false
+   },
+   "metrics": {
+     "disable_color_printing": false,
+     "enable_tensorboard": true,
+     "enable_wandb": true,
+     "log_freq": 1,
+     "save_for_all_ranks": false,
+     "save_tb_folder": "tb"
+   },
+   "model": {
+     "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
+     "converters": [],
+     "name": "fla",
+     "print_after_conversion": false,
+     "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
+   },
+   "optimizer": {
+     "early_step_in_backward": false,
+     "eps": 1e-08,
+     "implementation": "fused",
+     "lr": 0.0003,
+     "name": "AdamW"
+   },
+   "profiling": {
+     "enable_memory_snapshot": false,
+     "enable_profiling": true,
+     "profile_freq": 512,
+     "save_memory_snapshot_folder": "memory_snapshot",
+     "save_traces_folder": "profile_trace"
+   },
+   "training": {
+     "batch_size": 8,
+     "compile": true,
+     "context_len": 8192,
+     "data_dir": null,
+     "data_files": null,
+     "data_parallel_replicate_degree": 1,
+     "data_parallel_shard_degree": -1,
+     "data_probs": "0.55,0.3,0.15",
+     "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
+     "dataset_name": "default,default,default",
+     "dataset_split": "train,train,train",
+     "deterministic": false,
+     "disable_loss_parallel": false,
+     "enable_cpu_offload": false,
+     "fsdp_reshard_after_forward": "default",
+     "gc_freq": 50,
+     "gradient_accumulation_steps": 2,
+     "max_norm": 1.0,
+     "mixed_precision_param": "bfloat16",
+     "mixed_precision_reduce": "float32",
+     "num_workers": 32,
+     "persistent_workers": false,
+     "pin_memory": false,
+     "prefetch_factor": 2,
+     "seed": 42,
+     "seq_len": 8192,
+     "skip_nan_inf": true,
+     "steps": 95366,
+     "streaming": true,
+     "tensor_parallel_degree": 1,
+     "varlen": false
+   }
+ }
+ [titan] 2025-07-23 14:27:46,257 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-23 14:27:47,217 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-23 14:27:47,220 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-23 14:27:47,268 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-23 14:27:47,268 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-23 14:27:47,268 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-23 14:27:47,411 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-23 14:27:47,997 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-23 14:27:47,998 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ `trust_remote_code` is not supported anymore.
147
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
148
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
149
+ [titan] 2025-07-23 14:27:47,998 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
150
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
151
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
152
+ [titan] 2025-07-23 14:27:48,594 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
153
+ IterableDataset({
154
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
155
+ num_shards: 140
156
+ })
157
+ [titan] 2025-07-23 14:27:48,594 - root - INFO - Shuffling the dataset with seed 42
158
+ [titan] 2025-07-23 14:27:48,594 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
159
+ `trust_remote_code` is not supported anymore.
160
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
161
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
162
+ [titan] 2025-07-23 14:27:48,594 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
163
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
164
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
165
+ `trust_remote_code` is not supported anymore.
166
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
167
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
168
+ [titan] 2025-07-23 14:28:40,263 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
169
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
170
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
171
+ [titan] 2025-07-23 14:28:40,297 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
172
+ IterableDataset({
173
+ features: ['repo', 'content'],
174
+ num_shards: 1
175
+ })
176
+ [titan] 2025-07-23 14:28:40,298 - root - INFO - Shuffling the dataset with seed 42
177
+ [titan] 2025-07-23 14:28:40,298 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
178
+ `trust_remote_code` is not supported anymore.
179
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
180
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
181
+ [titan] 2025-07-23 14:28:40,298 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
182
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
183
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
184
+ `trust_remote_code` is not supported anymore.
185
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
186
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
187
+ [titan] 2025-07-23 14:28:40,563 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
188
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
189
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
190
+ [titan] 2025-07-23 14:28:40,649 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
191
+ IterableDataset({
192
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
193
+ num_shards: 100
194
+ })
195
+ [titan] 2025-07-23 14:28:40,649 - root - INFO - Shuffling the dataset with seed 42
196
+ [titan] 2025-07-23 14:28:40,649 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
197
+ `trust_remote_code` is not supported anymore.
198
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
199
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
200
+ [titan] 2025-07-23 14:28:40,649 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
201
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
202
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
203
+ [titan] 2025-07-23 14:28:46,975 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
204
+ [titan] 2025-07-23 14:28:47,679 - root - INFO - IterableDataset({
205
+ features: ['text', 'content'],
206
+ num_shards: 256
207
+ })
208
+ [titan] 2025-07-23 14:28:47,795 - root - INFO - Building dataloader...
209
+ [titan] 2025-07-23 14:28:47,797 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
210
+ [titan] 2025-07-23 14:28:47,799 - root - INFO - Building model from the config
+ Mamba2Config {
+   "architectures": [
+     "Mamba2ForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "norm_eps": 1e-05,
+   "num_heads": 32,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_limit": [
+     0.0,
+     Infinity
+   ],
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.53.3",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "use_l2warp": false,
+   "vocab_size": 32000
+ }
+ 
268
+ [titan] 2025-07-23 14:28:50,147 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
269
+ [titan] 2025-07-23 14:28:50,148 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
270
+ [titan] 2025-07-23 14:28:50,264 - root - INFO - 
+ Mamba2ForCausalLM(
+   (backbone): Mamba2Model(
+     (embeddings): Embedding(32000, 1024)
+     (layers): ModuleList(
+       (0-47): 48 x Mamba2Block(
+         (norm): RMSNorm(1024, eps=1e-05)
+         (mixer): Mamba2(
+           (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
+           (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
+           (norm): RMSNormGated()
+           (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
+         )
+       )
+     )
+     (norm_f): RMSNorm(1024, eps=1e-05)
+   )
+   (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
+   (criterion): FusedLinearCrossEntropyLoss()
+ )
+
291
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling each block with torch.compile
292
+ [titan] 2025-07-23 14:28:50,316 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
293
+ [titan] 2025-07-23 14:28:50,317 - root - WARNING - No norm found in model
294
+ [titan] 2025-07-23 14:28:50,317 - root - INFO - Compiling the entire model with torch.compile
295
+ [titan] 2025-07-23 14:28:50,540 - root - INFO - Applied FSDP to the model
296
+ [titan] 2025-07-23 14:28:50,884 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
297
+ [titan] 2025-07-23 14:28:51,042 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
298
+ [titan] 2025-07-23 14:28:51,273 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
299
+ [titan] 2025-07-23 14:28:51,275 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
300
+ [titan] 2025-07-23 14:28:51,299 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
301
+ [titan] 2025-07-23 14:28:51,311 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
302
+ [titan] 2025-07-23 14:28:51,462 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
303
+ [titan] 2025-07-23 14:28:58,659 - root - INFO - ***** Running training *****
304
+ [titan] 2025-07-23 14:28:58,661 - root - INFO -  Training starts at step 1
305
+ [titan] 2025-07-23 14:28:58,662 - root - INFO -  Number of tokens per sequence = 8,192
306
+ [titan] 2025-07-23 14:28:58,662 - root - INFO -  Gradient Accumulation steps = 2
307
+ [titan] 2025-07-23 14:28:58,667 - root - INFO -  Instantaneous batch size (per device) = 8
308
+ [titan] 2025-07-23 14:28:58,667 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
309
+ [titan] 2025-07-23 14:28:58,668 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
310
+ [titan] 2025-07-23 14:28:58,669 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
311
+ [titan] 2025-07-23 14:28:58,669 - root - INFO -  Number of parameters = 382,387,712 
312
+ [titan] 2025-07-23 14:28:58,669 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
313
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
314
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
315
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
316
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
317
+ [rank7]: Traceback (most recent call last):
318
+ [rank7]: File "<frozen runpy>", line 198, in _run_module_as_main
319
+ [rank7]: File "<frozen runpy>", line 88, in _run_code
320
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
321
+ [rank7]: main(config)
322
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
323
+ [rank7]: return f(*args, **kwargs)
324
+ [rank7]: ^^^^^^^^^^^^^^^^^^
325
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
326
+ [rank7]: output = model(
327
+ [rank7]: ^^^^^^
328
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
329
+ [rank7]: return self._call_impl(*args, **kwargs)
330
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
331
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
332
+ [rank7]: return inner()
333
+ [rank7]: ^^^^^^^
334
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
335
+ [rank7]: result = forward_call(*args, **kwargs)
336
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
337
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
338
+ [rank7]: return func(*args, **kwargs)
339
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^
340
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
341
+ [rank7]: outputs = self.backbone(
342
+ [rank7]: ^^^^^^^^^^^^^^
343
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
344
+ [rank7]: return self._call_impl(*args, **kwargs)
345
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
346
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
347
+ [rank7]: return forward_call(*args, **kwargs)
348
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
349
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
350
+ [rank7]: hidden_states = mixer_block(
351
+ [rank7]: ^^^^^^^^^^^^
352
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
353
+ [rank7]: return self._call_impl(*args, **kwargs)
354
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
355
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
356
+ [rank7]: return inner()
357
+ [rank7]: ^^^^^^^
358
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
359
+ [rank7]: result = forward_call(*args, **kwargs)
360
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
361
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
362
+ [rank7]: return fn(*args, **kwargs)
363
+ [rank7]: ^^^^^^^^^^^^^^^^^^^
364
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
365
+ [rank7]: return self._call_impl(*args, **kwargs)
366
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
367
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
368
+ [rank7]: return forward_call(*args, **kwargs)
369
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
370
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
371
+ [rank7]: hidden_states = self.norm(hidden_states)
372
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
373
+ [rank7]: hidden_states = self.mixer(
374
+ [rank7]: ^^^^^^^^^^^
375
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
376
+ [rank7]: return self._call_impl(*args, **kwargs)
377
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
378
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
379
+ [rank7]: return forward_call(*args, **kwargs)
380
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
381
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
382
+ [rank7]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
383
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
384
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
385
+ [rank7]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
386
+ [rank7]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
387
+ [rank7]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 7 has a total capacity of 95.00 GiB of which 85.09 GiB is free. Process 2003906 has 316.00 MiB memory in use. Process 696037 has 316.00 MiB memory in use. Process 1114702 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
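Note that the PYTORCH_CUDA_ALLOC_CONF hint in the message cannot help here: 85 GiB was free against a 256 GiB request, so the problem is the naive fallback path, not fragmentation. Per the install URLs in the fla.layers.mamba2 warnings above, a minimal sketch for checking whether the fused kernels are importable; the PyPI package names mamba-ssm and causal-conv1d are assumptions taken from those repos, not confirmed by these logs:

# fla.layers.mamba2 probes for these modules; if either import is missing
# it logs the warnings seen above and uses the naive torch_forward.
# Install sketch (assumed package names): pip install mamba-ssm causal-conv1d
import importlib.util

for mod in ("mamba_ssm", "causal_conv1d"):
    found = importlib.util.find_spec(mod) is not None
    print(f"{mod}: {'ok' if found else 'missing -> naive fallback'}")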
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_07su7ijp/attempt_0/7/stdout.log ADDED
File without changes
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_77qh1j5t/attempt_0/0/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 63.94 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1850004 has 21.15 GiB memory in use. Process 2711975 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 487, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 526, in forward\n outputs = self.backbone(\n ^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 405, in forward\n hidden_states = mixer_block(\n ^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in 
_wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 161, in forward\n hidden_states = self.norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py\", line 165, in torch_dynamo_resume_in_forward_at_161\n hidden_states = self.mixer(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 601, in forward\n return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py\", line 528, in torch_forward\n G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]\n ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 63.94 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1850004 has 21.15 GiB memory in use. Process 2711975 has 9.27 GiB memory in use. 
Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)\n", "timestamp": "1753242220"}}}
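The 256 GiB request in this error is fully explained by the tensor shapes of this run. A minimal back-of-envelope sketch, assuming the broadcast in the naive `torch_forward` path materializes a (batch, n_chunks, chunk_size, chunk_size, num_heads, state_size) tensor in fp32 (all values taken from the config dumped in the stderr log below; the exact axis layout is inferred from the indexing and is an assumption):

    # Rough size of G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
    batch, seq_len, chunk, heads, state = 8, 8192, 256, 32, 128
    n_chunks = seq_len // chunk                      # 32
    elements = batch * n_chunks * chunk * chunk * heads * state
    print(elements)                                  # 68719476736
    print(elements * 4 / 2**30, "GiB")               # 256.0 GiB in fp32 -- matches the error exactly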
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_77qh1j5t/attempt_0/0/stderr.log ADDED
@@ -0,0 +1,467 @@
+ [titan] 2025-07-23 11:13:04,987 - root - INFO - Starting job: default job
+ [titan] 2025-07-23 11:13:04,987 - root - INFO - {
+ "activation_checkpoint": {
+ "mode": "none",
+ "selective_ac_option": "2"
+ },
+ "activation_offload": {
+ "mode": "none"
+ },
+ "checkpoint": {
+ "async_mode": "disabled",
+ "create_seed_checkpoint": false,
+ "enable_checkpoint": true,
+ "exclude_from_loading": [],
+ "export_dtype": "float32",
+ "folder": "checkpoint",
+ "interval": 8192,
+ "interval_type": "steps",
+ "keep_latest_k": 100,
+ "load_step": -1,
+ "model_weights_only": false
+ },
+ "comm": {
+ "init_timeout_seconds": 300,
+ "trace_buf_size": 20000,
+ "train_timeout_seconds": 100
+ },
+ "experimental": {
+ "context_parallel_degree": 1,
+ "context_parallel_rotate_method": "allgather",
+ "custom_model_path": "",
+ "enable_async_tensor_parallel": false,
+ "enable_compiled_autograd": false,
+ "pipeline_parallel_degree": 1,
+ "pipeline_parallel_microbatches": null,
+ "pipeline_parallel_schedule": "1F1B",
+ "pipeline_parallel_schedule_csv": "",
+ "pipeline_parallel_split_points": []
+ },
+ "fault_tolerance": {
+ "enable": false,
+ "group_size": 0,
+ "min_replica_size": 1,
+ "replica_id": 0
+ },
+ "float8": {
+ "enable_fsdp_float8_all_gather": false,
+ "force_recompute_fp8_weight_in_bwd": false,
+ "precompute_float8_dynamic_scale_for_fsdp": false,
+ "recipe_name": null
+ },
+ "job": {
+ "config_file": "flame/models/fla.toml",
+ "description": "default job",
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
+ "print_args": true,
+ "use_for_integration_test": false
+ },
+ "lr_scheduler": {
+ "decay_ratio": 1.0,
+ "decay_type": "linear",
+ "lr_min": 0.01,
+ "warmup_steps": 100
+ },
+ "memory_estimation": {
+ "disable_fake_mode": false,
+ "enabled": false
+ },
+ "metrics": {
+ "disable_color_printing": false,
+ "enable_tensorboard": true,
+ "enable_wandb": true,
+ "log_freq": 1,
+ "save_for_all_ranks": false,
+ "save_tb_folder": "tb"
+ },
+ "model": {
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json",
+ "converters": [],
+ "name": "fla",
+ "print_after_conversion": false,
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
+ },
+ "optimizer": {
+ "early_step_in_backward": false,
+ "eps": 1e-08,
+ "implementation": "fused",
+ "lr": 0.0003,
+ "name": "AdamW"
+ },
+ "profiling": {
+ "enable_memory_snapshot": false,
+ "enable_profiling": true,
+ "profile_freq": 512,
+ "save_memory_snapshot_folder": "memory_snapshot",
+ "save_traces_folder": "profile_trace"
+ },
+ "training": {
+ "batch_size": 8,
+ "compile": true,
+ "context_len": 8192,
+ "data_dir": null,
+ "data_files": null,
+ "data_parallel_replicate_degree": 1,
+ "data_parallel_shard_degree": -1,
+ "data_probs": "0.55,0.3,0.15",
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
+ "dataset_name": "default,default,default",
+ "dataset_split": "train,train,train",
+ "deterministic": false,
+ "disable_loss_parallel": false,
+ "enable_cpu_offload": false,
+ "fsdp_reshard_after_forward": "default",
+ "gc_freq": 50,
+ "gradient_accumulation_steps": 2,
+ "max_norm": 1.0,
+ "mixed_precision_param": "bfloat16",
+ "mixed_precision_reduce": "float32",
+ "num_workers": 32,
+ "persistent_workers": false,
+ "pin_memory": false,
+ "prefetch_factor": 2,
+ "seed": 42,
+ "seq_len": 8192,
+ "skip_nan_inf": true,
+ "steps": 95366,
+ "streaming": true,
+ "tensor_parallel_degree": 1,
+ "varlen": false
+ }
+ }
+ [titan] 2025-07-23 11:13:04,988 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
+ [titan] 2025-07-23 11:13:04,988 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
+ [titan] 2025-07-23 11:13:05,005 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-23 11:13:05,098 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-23 11:13:05,098 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
+ [titan] 2025-07-23 11:13:05,099 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
+ [titan] 2025-07-23 11:13:07,424 - root - INFO - Loading tokenizer...
+ [titan] 2025-07-23 11:13:07,703 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ }
+ )
+ [titan] 2025-07-23 11:13:07,704 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:13:07,704 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:13:08,426 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
+ IterableDataset({
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
+ num_shards: 140
+ })
+ [titan] 2025-07-23 11:13:08,426 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 11:13:08,426 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
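The 256-shard floor in this warning is simply the data-parallel world size times the per-rank dataloader worker count, both visible earlier in the log; a quick cross-check with this run's numbers:

    dp_shard_ranks = 8      # from "Building 1-D device mesh with ['dp_shard'], [8]" above
    num_workers = 32        # "num_workers" in the training config above
    print(dp_shard_ranks * num_workers)  # 256 -- each dataloader worker needs its own shard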
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:13:08,427 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:35:37,776 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:35:37,867 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
+ IterableDataset({
+ features: ['repo', 'content'],
+ num_shards: 1
+ })
+ [titan] 2025-07-23 11:35:37,867 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 11:35:37,867 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:35:37,868 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ Setting num_proc from 32 back to 1 for the train split to disable multiprocessing as it only contains one shard.
+ [titan] 2025-07-23 11:35:37,949 - datasets.builder - WARNING - Setting num_proc from 32 back to 1 for the train split to disable multiprocessing as it only contains one shard.
+
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:36:07,557 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:36:08,012 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
+ IterableDataset({
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
+ num_shards: 100
+ })
+ [titan] 2025-07-23 11:36:08,012 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-23 11:36:08,013 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+ [titan] 2025-07-23 11:36:08,013 - datasets.load - ERROR - `trust_remote_code` is not supported anymore.
+ Please check that the Hugging Face dataset '/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro' isn't based on a loading script and remove `trust_remote_code`.
+ If the dataset is based on a loading script, please ask the dataset author to remove it and convert it to a standard format like Parquet.
+
+ [titan] 2025-07-23 11:40:36,057 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
+ [titan] 2025-07-23 11:40:36,957 - root - INFO - IterableDataset({
+ features: ['text', 'content'],
+ num_shards: 256
+ })
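The mixing step reported above can be reproduced with the public `datasets` API. A sketch under the assumption that flame wires it up roughly like this (the paths are the three dataset directories from the config; flame's actual dataloader code may differ):

    from datasets import load_dataset, interleave_datasets

    paths = [
        "dataset/fineweb-edu-sample",
        "dataset/small_repos_20B_sample_merged",
        "dataset/megamath-web-pro",
    ]
    subsets = [load_dataset(p, split="train", streaming=True) for p in paths]
    # Probabilities and seed as in the log; the result's features are the
    # union of the subsets' columns, hence ['text', 'content'] above.
    mixed = interleave_datasets(subsets, probabilities=[0.55, 0.30, 0.15], seed=42)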
+ [titan] 2025-07-23 11:40:37,082 - root - INFO - Building dataloader...
+ [titan] 2025-07-23 11:40:37,085 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/mamba2_6_1_340M.json
+ [titan] 2025-07-23 11:40:37,088 - root - INFO - Building model from the config
+ Mamba2Config {
+ "architectures": [
+ "Mamba2ForCausalLM"
+ ],
+ "attn": {
+ "layers": [
+ 5,
+ 11,
+ 17,
+ 23
+ ],
+ "num_heads": 16,
+ "num_kv_heads": 8,
+ "qkv_bias": false,
+ "rope_theta": 160000.0,
+ "window_size": null
+ },
+ "attn_mode": "chunk",
+ "bos_token_id": 1,
+ "chunk_size": 256,
+ "conv_kernel": 4,
+ "eos_token_id": 2,
+ "expand": 2,
+ "fuse_cross_entropy": true,
+ "fuse_norm": true,
+ "fuse_swiglu": true,
+ "head_dim": 64,
+ "hidden_act": "silu",
+ "hidden_size": 1024,
+ "initializer_range": 0.02,
+ "model_type": "mamba2",
+ "n_groups": 1,
+ "norm_eps": 1e-05,
+ "num_heads": 32,
+ "num_hidden_layers": 48,
+ "pad_token_id": 0,
+ "rescale_prenorm_residual": true,
+ "residual_in_fp32": true,
+ "rms_norm": true,
+ "state_size": 128,
+ "tie_word_embeddings": false,
+ "time_step_floor": 0.0001,
+ "time_step_limit": [
+ 0.0,
+ Infinity
+ ],
+ "time_step_max": 0.1,
+ "time_step_min": 0.001,
+ "time_step_rank": 128,
+ "transformers_version": "4.53.3",
+ "use_bias": false,
+ "use_cache": true,
+ "use_conv_bias": true,
+ "use_l2warp": false,
+ "vocab_size": 32000
+ }
+
+ [titan] 2025-07-23 11:40:39,687 - fla.layers.mamba2 - WARNING - The fast path is not available because one of `(selective_state_update)` is None. Falling back to the naive implementation. To install follow https://github.com/state-spaces/mamba/#installation
+ [titan] 2025-07-23 11:40:39,687 - fla.layers.mamba2 - WARNING - The CUDA backend is not available because `causal_conv1d` is None. Falling back to the Triton backend. To install follow https://github.com/Dao-AILab/causal-conv1d
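These two warnings are the root of the failure further down: with the fused kernels missing, `fla.layers.mamba2` falls back to the naive `torch_forward`, which materializes the huge chunked-scan intermediate. A sketch of the availability probe, using the import paths the HF Mamba2 port uses (treat these exact paths as an assumption for fla's probe):

    try:
        from mamba_ssm.ops.triton.selective_state_update import selective_state_update
    except ImportError:
        selective_state_update = None
    try:
        from causal_conv1d import causal_conv1d_fn
    except ImportError:
        causal_conv1d_fn = None
    # Both are None on this machine, hence the two fallback warnings above.
    print(selective_state_update is None, causal_conv1d_fn is None)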
+ [titan] 2025-07-23 11:40:39,804 - root - INFO -
+ Mamba2ForCausalLM(
+ (backbone): Mamba2Model(
+ (embeddings): Embedding(32000, 1024)
+ (layers): ModuleList(
+ (0-47): 48 x Mamba2Block(
+ (norm): RMSNorm(1024, eps=1e-05)
+ (mixer): Mamba2(
+ (conv1d): Conv1d(2304, 2304, kernel_size=(4,), stride=(1,), padding=(3,), groups=2304)
+ (in_proj): Linear(in_features=1024, out_features=4384, bias=False)
+ (norm): RMSNormGated()
+ (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ )
+ )
+ )
+ (norm_f): RMSNorm(1024, eps=1e-05)
+ )
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
+ (criterion): FusedLinearCrossEntropyLoss()
+ )
+
+ [titan] 2025-07-23 11:40:39,857 - root - INFO - Compiling each block with torch.compile
+ [titan] 2025-07-23 11:40:39,857 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
+ [titan] 2025-07-23 11:40:39,857 - root - WARNING - No norm found in model
+ [titan] 2025-07-23 11:40:39,858 - root - INFO - Compiling the entire model with torch.compile
+ [titan] 2025-07-23 11:40:40,108 - root - INFO - Applied FSDP to the model
+ [titan] 2025-07-23 11:40:40,431 - fla.models.mamba2.modeling_mamba2 - WARNING - `A_log` is a DTensor, skipping initialization
+ [titan] 2025-07-23 11:40:40,596 - fla.models.mamba2.modeling_mamba2 - WARNING - `dt_bias` is a DTensor, skipping initialization
+ [titan] 2025-07-23 11:40:40,842 - root - INFO - CUDA memory usage for model: 0.19GiB(0.20%)
+ [titan] 2025-07-23 11:40:40,845 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
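The adjustment above keeps warmup + decay within the step budget. A sketch of the resulting schedule, assuming flame treats the config's `lr_min` as a fraction of the base lr (0.01 would not make sense as an absolute floor, since it exceeds the base lr of 3e-4):

    base_lr = 3e-4
    lr_min_ratio = 0.01          # assumed to be a fraction of base_lr
    warmup, total = 100, 95_366
    decay = total - warmup       # 95_266, as the warning reports

    def lr_at(step: int) -> float:
        if step <= warmup:                      # linear warmup from 0 to base_lr
            return base_lr * step / warmup
        done = (step - warmup) / decay          # linear decay toward the floor
        return base_lr * (1.0 - done * (1.0 - lr_min_ratio))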
+ [titan] 2025-07-23 11:40:40,873 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
+ wandb: Network error (InvalidURL), entering retry loop.
+ wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
+ wandb: Network error (InvalidURL), entering retry loop.
+ [titan] 2025-07-23 11:42:33,611 - root - ERROR - Failed to create WandB logger: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
+ [titan] 2025-07-23 11:42:33,728 - root - INFO - TensorBoard logging enabled. Logs will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/tb/20250723-1140
+ [titan] 2025-07-23 11:42:33,729 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-23 11:42:33,774 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-23 11:42:43,730 - root - INFO - ***** Running training *****
+ [titan] 2025-07-23 11:42:43,732 - root - INFO -  Training starts at step 1
+ [titan] 2025-07-23 11:42:43,732 - root - INFO -  Number of tokens per sequence = 8,192
+ [titan] 2025-07-23 11:42:43,732 - root - INFO -  Gradient Accumulation steps = 2
+ [titan] 2025-07-23 11:42:43,732 - root - INFO -  Instantaneous batch size (per device) = 8
+ [titan] 2025-07-23 11:42:43,733 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
+ [titan] 2025-07-23 11:42:43,733 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
+ [titan] 2025-07-23 11:42:43,733 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
+ [titan] 2025-07-23 11:42:43,733 - root - INFO -  Number of parameters = 382,387,712
+ [titan] 2025-07-23 11:42:43,733 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
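Every figure in the block above follows from the config; a quick cross-check of the batch and token arithmetic:

    per_device, grad_accum, dp_ranks, seq_len = 8, 2, 8, 8192
    global_batch = per_device * grad_accum * dp_ranks    # 128 sequences
    tokens_per_step = global_batch * seq_len             # 1_048_576
    print(tokens_per_step * 95_366)                      # 99_998_498_816 (~100B tokens total)
    print(tokens_per_step * 100)                         # 104_857_600 warmup tokens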
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
+ Traceback (most recent call last):
+ File "<frozen runpy>", line 198, in _run_module_as_main
+ File "<frozen runpy>", line 88, in _run_code
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
+ main(config)
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+ return f(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
+ output = model(
+ ^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ return inner()
+ ^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ result = forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
+ return func(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
+ outputs = self.backbone(
+ ^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
+ hidden_states = mixer_block(
+ ^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ return inner()
+ ^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ result = forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
+ return fn(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
+ hidden_states = self.norm(hidden_states)
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
+ hidden_states = self.mixer(
+ ^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ return self._call_impl(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ return forward_call(*args, **kwargs)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
+ return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
+ G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
+ ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
+ torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 63.94 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1850004 has 21.15 GiB memory in use. Process 2711975 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
+ [rank0]: Traceback (most recent call last):
+ [rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
+ [rank0]: File "<frozen runpy>", line 88, in _run_code
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 615, in <module>
+ [rank0]: main(config)
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+ [rank0]: return f(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 487, in main
+ [rank0]: output = model(
+ [rank0]: ^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank0]: return self._call_impl(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank0]: return inner()
+ [rank0]: ^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank0]: result = forward_call(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
+ [rank0]: return func(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 526, in forward
+ [rank0]: outputs = self.backbone(
+ [rank0]: ^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank0]: return self._call_impl(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank0]: return forward_call(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 405, in forward
+ [rank0]: hidden_states = mixer_block(
+ [rank0]: ^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank0]: return self._call_impl(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank0]: return inner()
+ [rank0]: ^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank0]: result = forward_call(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
+ [rank0]: return fn(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank0]: return self._call_impl(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank0]: return forward_call(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 161, in forward
+ [rank0]: hidden_states = self.norm(hidden_states)
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/mamba2/modeling_mamba2.py", line 165, in torch_dynamo_resume_in_forward_at_161
+ [rank0]: hidden_states = self.mixer(
+ [rank0]: ^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank0]: return self._call_impl(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank0]: return forward_call(*args, **kwargs)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 601, in forward
+ [rank0]: return self.torch_forward(hidden_states, cache_params, cache_position, attention_mask)
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/mamba2.py", line 528, in torch_forward
+ [rank0]: G_intermediate = C[:, :, :, None, :, :] * B[:, :, None, :, :, :]
+ [rank0]: ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
+ [rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 GiB. GPU 0 has a total capacity of 95.00 GiB of which 63.94 GiB is free. Process 2003896 has 316.00 MiB memory in use. Process 696027 has 316.00 MiB memory in use. Process 1850004 has 21.15 GiB memory in use. Process 2711975 has 9.27 GiB memory in use. Of the allocated memory 7.99 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
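Given the shape arithmetic above, the `expandable_segments` hint in the message cannot help here: a single 256 GiB request can never fit on a 95 GiB H20. The intermediate itself has to shrink, or vanish entirely by installing the fused kernels flagged in the earlier warnings; even dropping the per-device batch from 8 to 1 would still leave a 32 GiB tensor:

    # Same shape arithmetic as before, with per-device batch reduced 8 -> 1.
    print(1 * 32 * 256 * 256 * 32 * 128 * 4 / 2**30)  # 32.0 GiB, still far too large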
mamba2_6_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_77qh1j5t/attempt_0/0/stdout.log ADDED
File without changes