[rank0]: Traceback (most recent call last):
[rank0]:   File "/workspace/rl4phyx/RL4Phyx/SFT/train_sft.py", line 312, in <module>
[rank0]:     main()
[rank0]:   File "/workspace/rl4phyx/RL4Phyx/SFT/train_sft.py", line 285, in main
[rank0]:     trainer.train()
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2241, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2548, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 3740, in training_step
[rank0]:     self.accelerator.backward(loss, **kwargs)
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/accelerate/accelerator.py", line 2852, in backward
[rank0]:     loss.backward(**kwargs)
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/_tensor.py", line 521, in backward
[rank0]:     torch.autograd.backward(
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/autograd/function.py", line 306, in apply
[rank0]:     return user_fn(self, *args)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 313, in backward
[rank0]:     torch.autograd.backward(outputs_with_grad, args_with_grad)
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 289, in backward
[rank0]:     _engine_run_backward(
[rank0]:   File "/opt/conda/lib/python3.11/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
[rank0]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 695 with name base_model.model.model.layers.35.mlp.down_proj.lora_B.default.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
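
The `torch/utils/checkpoint.py ... in backward` frame suggests the run is using reentrant gradient checkpointing, which matches reason 2) in the message above, and the `lora_B` parameter name points to a PEFT/LoRA-wrapped model under DDP. A minimal sketch of settings that are commonly tried in this situation follows. It assumes that `train_sft.py` builds a `transformers.TrainingArguments` (not shown in the log), and the helper name `ddp_safe_checkpointing_kwargs` is hypothetical. Whether either setting resolves this particular run is untested here.

```python
def ddp_safe_checkpointing_kwargs() -> dict:
    """Keyword arguments intended to be passed to transformers.TrainingArguments.

    Assumption: the training script enables gradient checkpointing through
    TrainingArguments rather than by calling torch.utils.checkpoint directly.
    """
    return {
        # Keep activation checkpointing enabled to save memory.
        "gradient_checkpointing": True,
        # Ask for the non-reentrant checkpoint implementation, which avoids
        # the nested (reentrant) backward pass seen in the traceback.
        "gradient_checkpointing_kwargs": {"use_reentrant": False},
        # Skip DDP's unused-parameter search; often suggested alongside the
        # setting above, though either one alone may be enough.
        "ddp_find_unused_parameters": False,
    }


kwargs = ddp_safe_checkpointing_kwargs()

# Hypothetical usage inside the training script (output_dir is a placeholder):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **kwargs)
```

The error text itself also names `_set_static_graph()` on the DDP-wrapped module as a workaround when the module graph does not change between iterations.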