| nohup: ignoring input |
| W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] |
| W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] ***************************************** |
| W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. |
| W1030 19:02:35.769000 8870 site-packages/torch/distributed/run.py:793] ***************************************** |
| /root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory |
| /root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory |
| /root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory |
| /root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory |
| /root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory |
| /root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory |
| /root/miniconda3/envs/meditron/bin/python: can't open file '/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py': [Errno 2] No such file or directory |
| E1030 19:02:35.887000 8870 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 2) local_rank: 0 (pid: 8935) of binary: /root/miniconda3/envs/meditron/bin/python |
| Traceback (most recent call last): |
| File "/root/miniconda3/envs/meditron/bin/torchrun", line 8, in <module> |
| sys.exit(main()) |
| File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper |
| return f(*args, **kwargs) |
| File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main |
| run(args) |
| File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run |
| elastic_launch( |
| File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ |
| return launch_agent(self._config, self._entrypoint, list(args)) |
| File "/root/miniconda3/envs/meditron/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent |
| raise ChildFailedError( |
| torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
| ============================================================ |
| tune_combined.py FAILED |
| ------------------------------------------------------------ |
| Failures: |
| [1]: |
| time : 2024-10-30_19:02:35 |
| host : 3f2e085cdcba |
| rank : 1 (local_rank: 1) |
| exitcode : 2 (pid: 8936) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| [2]: |
| time : 2024-10-30_19:02:35 |
| host : 3f2e085cdcba |
| rank : 2 (local_rank: 2) |
| exitcode : 2 (pid: 8937) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| [3]: |
| time : 2024-10-30_19:02:35 |
| host : 3f2e085cdcba |
| rank : 3 (local_rank: 3) |
| exitcode : 2 (pid: 8938) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| [4]: |
| time : 2024-10-30_19:02:35 |
| host : 3f2e085cdcba |
| rank : 4 (local_rank: 4) |
| exitcode : 2 (pid: 8939) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| [5]: |
| time : 2024-10-30_19:02:35 |
| host : 3f2e085cdcba |
| rank : 5 (local_rank: 5) |
| exitcode : 2 (pid: 8940) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| [6]: |
| time : 2024-10-30_19:02:35 |
| host : 3f2e085cdcba |
| rank : 6 (local_rank: 6) |
| exitcode : 2 (pid: 8941) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| ------------------------------------------------------------ |
| Root Cause (first observed failure): |
| [0]: |
| time : 2024-10-30_19:02:35 |
| host : 3f2e085cdcba |
| rank : 0 (local_rank: 0) |
| exitcode : 2 (pid: 8935) |
| error_file: <N/A> |
| traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
| ============================================================ |
| |
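The log above shows all seven workers (ranks 0 to 6) exiting with code 2 because the interpreter could not open `/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py`. The elastic agent's traceback adds nothing beyond that. A preflight check before launching would catch this earlier and with a clearer message. The following is a minimal sketch, not part of the original run: the helper name `resolve_entrypoint` and its `workdir` parameter are hypothetical, and the actual launch command is not visible in the log, so none is assumed here.

```python
from pathlib import Path


def resolve_entrypoint(script: str, workdir: str = ".") -> Path:
    """Return the absolute path of `script`, or raise if it is not an existing file.

    Hypothetical helper: relative paths are resolved against `workdir`,
    which stands in for the directory the launcher would be started from.
    """
    path = Path(script)
    if not path.is_absolute():
        path = Path(workdir) / path
    path = path.resolve()
    if not path.is_file():
        # Fail once, up front, instead of once per worker process.
        raise FileNotFoundError(f"training script not found: {path}")
    return path
```

Calling `resolve_entrypoint("/mnt/dg_tunning/lora4_meditron_7b/tune_combined.py")` on the host from the log would presumably raise `FileNotFoundError`, which points directly at the things worth checking: the spelling of the directory and file name, whether the volume is mounted inside the container, and the working directory from which the job was started.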