FC-JSON-SFT-1_5B / debug.log
Training in progress, step 69
de6682d verified
Loading dataset from disk: 100%|████████████████████████████████████████████████| 208/208 [00:00<00:00, 237133.80it/s]
Loading weights: 100%|█████████████████████| 146/146 [00:00<00:00, 3341.97it/s, Materializing param=model.norm.weight]
[2026-02-10 15:44:22,256] [WARNING] [torchao.<module>:39] [PID:14756] Skipping import of cpp extensions due to incompatible torch version 2.9.1+cu128 for torchao version 0.13.0
[2026-02-10 15:44:41,826] [WARNING] [accelerate.utils.dataclasses.__post_init__:1962] [PID:14756] sharding_strategy is deprecated in favor of reshard_after_forward. This will be removed in a future version of Accelerate.
[2026-02-10 16:12:36,776] [WARNING] [py.warnings._showwarnmsg:110] [PID:14756] /root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:675: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
warnings.warn(