| 1: 2023-04-27 15:54:46.865633: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 1: 2023-04-27 15:54:46.865658: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 1: 2023-04-27 15:54:46.865639: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 1: 2023-04-27 15:54:46.865670: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 1: 2023-04-27 15:54:46.865698: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 1: 2023-04-27 15:54:46.865712: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 1: 2023-04-27 15:54:46.865722: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 1: 2023-04-27 15:54:46.865732: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866334: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866381: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866405: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866422: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866423: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866376: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866429: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:46.866443: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
| 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
| 0: 2023-04-27 15:54:54.275706: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.275726: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.275736: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.275754: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.275770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.275778: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.275774: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.275780: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:54:54.280268: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:54:54.280299: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:54:54.280321: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:54:54.280348: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:54:54.280358: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:54:54.280366: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:54:54.280378: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:54:54.280589: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.295518: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.295543: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.295595: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.295601: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.295558: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.295598: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.295613: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.295641: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:54:54.296314: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.296327: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.296339: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.296344: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.296367: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.296370: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.296376: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 1: 2023-04-27 15:54:54.296384: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
| 0: 2023-04-27 15:55:19.995941: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.995971: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.995987: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.996017: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.996024: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.996036: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.996037: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.996243: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997343: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997350: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997349: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997356: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997367: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997395: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: 2023-04-27 15:55:19.997394: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: 2023-04-27 15:55:19.997395: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: 2023-04-27 15:55:19.997396: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: 2023-04-27 15:55:19.997384: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997405: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: 2023-04-27 15:55:19.997429: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: 2023-04-27 15:55:19.997453: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997475: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: 2023-04-27 15:55:19.997484: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 0: 2023-04-27 15:55:19.997506: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.004632: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.004664: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.004687: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.004697: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.004716: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.004722: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.004736: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.004964: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006347: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006350: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006351: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006349: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006354: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006352: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
| 1: 2023-04-27 15:55:20.006367: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.006368: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.006370: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.006369: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.006369: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.006371: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.006372: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 1: 2023-04-27 15:55:20.006372: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
| 0: Loading extension module scaled_upper_triang_masked_softmax_cuda... |
| 0: Successfully preprocessed all matching files. |
| 0: Detected CUDA files, patching ldflags |
| 0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... |
| 0: Building extension module scaled_masked_softmax_cuda... |
| 0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
| 0: Loading extension module scaled_masked_softmax_cuda... |
| 0: Successfully preprocessed all matching files. |
| 0: Detected CUDA files, patching ldflags |
| 0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... |
| 0: Building extension module fused_mix_prec_layer_norm_cuda... |
| 0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
| 0: Loading extension module fused_mix_prec_layer_norm_cuda... |
| 0: Successfully preprocessed all matching files. |
| 0: Successfully preprocessed all matching files. |
| 1: Successfully preprocessed all matching files. |
| 1: Successfully preprocessed all matching files. |
| 1: Successfully preprocessed all matching files. |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 1: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
| 0: warnings.warn( |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Emitting ninja build file /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu/utils/build.ninja... |
| 0: Building extension module utils... |
| 0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
| 0: Loading extension module utils... |
| 0: Loading extension module utils... |
| 0: Loading extension module utils... |
| 0: Loading extension module utils... |
| 0: Loading extension module utils... |
| 0: Loading extension module utils... |
| 0: Loading extension module utils... |
| 0: Loading extension module utils... |
| 1: Loading extension module utils... |
| 1: Loading extension module utils... |
| 1: Loading extension module utils... |
| 1: Loading extension module utils... |
| 1: Loading extension module utils... |
| 1: Loading extension module utils... |
| 1: Loading extension module utils... |
| 1: Loading extension module utils... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 1: No modifications detected for re-loaded extension module utils, skipping build step... |
| 1: Loading extension module utils... |
| 0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
| 0: No modifications detected for re-loaded extension module utils, skipping build step... |
| 0: Loading extension module utils... |
| 1: Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: Traceback (most recent call last): |
| 1: main() |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: Traceback (most recent call last): |
| 1: Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: main() |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: main() |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: main() |
| 1: main() |
| 1: Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: main() |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: main() |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: main() |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: return f(*args, **kwargs) |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 0: Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: Traceback (most recent call last): |
| 0: Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: main() |
| 0: Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 42284) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python |
| 1: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 71550) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python |
| 1: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/0/error.json) |
| 1: Traceback (most recent call last): |
| 1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main |
| 1: return _run_code(code, main_globals, None, |
| 1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code |
| 1: exec(code, run_globals) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in <module> |
| 1: main() |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main |
| 1: run(args) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run |
| 0: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/0/error.json) |
| 1: elastic_launch( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
| 0: Traceback (most recent call last): |
| 0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main |
| 0: return _run_code(code, main_globals, None, |
| 0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code |
| 0: exec(code, run_globals) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in <module> |
| 1: return launch_agent(self._config, self._entrypoint, list(args)) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent |
| 1: raise ChildFailedError( |
| 1: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
| 1: ============================================================ |
| 1: Megatron-DeepSpeed/pretrain_gpt.py FAILED |
| 1: ------------------------------------------------------------ |
| 1: Failures: |
| 1: [1]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 9 (local_rank: 1) |
| 1: exitcode : 1 (pid: 71551) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/1/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: [2]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 10 (local_rank: 2) |
| 1: exitcode : 1 (pid: 71552) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/2/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: main() |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: [3]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 11 (local_rank: 3) |
| 1: exitcode : 1 (pid: 71553) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/3/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: [4]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 12 (local_rank: 4) |
| 1: exitcode : 1 (pid: 71554) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/4/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: [5]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 13 (local_rank: 5) |
| 1: exitcode : 1 (pid: 71555) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/5/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: [6]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 14 (local_rank: 6) |
| 1: exitcode : 1 (pid: 71556) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/6/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: [7]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 15 (local_rank: 7) |
| 1: exitcode : 1 (pid: 71557) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/7/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 0: run(args) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: ------------------------------------------------------------ |
| 1: Root Cause (first observed failure): |
| 1: [0]: |
| 1: time : 2023-04-27_15:57:30 |
| 1: host : nid007281 |
| 1: rank : 8 (local_rank: 0) |
| 1: exitcode : 1 (pid: 71550) |
| 1: error_file: /tmp/torchelastic_8ia3se9x/none_chz6nne8/attempt_0/0/error.json |
| 1: traceback : Traceback (most recent call last): |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 1: return f(*args, **kwargs) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 1: success = self._load_zero_checkpoint( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 1: self.optimizer.load_state_dict( |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 1: self._load_legacy_checkpoint(state_dict_list, |
| 0: elastic_launch( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
| 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
| 1: current_rank_sd = state_dict_list[dp_rank] |
| 1: IndexError: list index out of range |
| 1: |
| 1: ============================================================ |
| 0: return launch_agent(self._config, self._entrypoint, list(args)) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent |
| 0: raise ChildFailedError( |
| 0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
| 0: ============================================================ |
| 0: Megatron-DeepSpeed/pretrain_gpt.py FAILED |
| 0: ------------------------------------------------------------ |
| 0: Failures: |
| 0: [1]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 1 (local_rank: 1) |
| 0: exitcode : 1 (pid: 42285) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/1/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: [2]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 2 (local_rank: 2) |
| 0: exitcode : 1 (pid: 42286) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/2/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: [3]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 3 (local_rank: 3) |
| 0: exitcode : 1 (pid: 42287) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/3/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: [4]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 4 (local_rank: 4) |
| 0: exitcode : 1 (pid: 42288) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/4/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: [5]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 5 (local_rank: 5) |
| 0: exitcode : 1 (pid: 42289) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/5/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: [6]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 6 (local_rank: 6) |
| 0: exitcode : 1 (pid: 42290) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/6/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: [7]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 7 (local_rank: 7) |
| 0: exitcode : 1 (pid: 42291) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/7/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: ------------------------------------------------------------ |
| 0: Root Cause (first observed failure): |
| 0: [0]: |
| 0: time : 2023-04-27_15:57:31 |
| 0: host : nid007280 |
| 0: rank : 0 (local_rank: 0) |
| 0: exitcode : 1 (pid: 42284) |
| 0: error_file: /tmp/torchelastic_w2odsqyk/none_2mmj7eag/attempt_0/0/error.json |
| 0: traceback : Traceback (most recent call last): |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
| 0: return f(*args, **kwargs) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
| 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
| 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
| 0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
| 0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
| 0: success = self._load_zero_checkpoint( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
| 0: self.optimizer.load_state_dict( |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
| 0: self._load_legacy_checkpoint(state_dict_list, |
| 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
| 0: current.data.copy_(src_tensor.data) |
| 0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
| 0: |
| 0: ============================================================ |
| srun: error: nid007281: task 1: Exited with exit code 1 |
| srun: launch/slurm: _step_signal: Terminating StepId=3423781.0 |
| srun: error: nid007280: task 0: Exited with exit code 1 |
| |