Trainer._get_train_sampler replaced with custom implementation.
[2025-11-08 21:13:46,422] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /home/tiger/.triton/autotune: No such file or directory
[2025-11-08 21:14:11,571] [INFO] [comm.py:658:init_distributed] cdb=None
[2025-11-08 21:14:11,572] [INFO] [comm.py:689:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Warning: FlashAttention 3 is not available, falling back to PyTorch's scaled_dot_product_attention
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Loading checkpoint shards: 0%| | 0/2 [00:00