"torch.cuda.is_available() is False", "Can't load the configuration of 'path/to/pretrained_model/" and... in hyperparam_optimiz_for_disease_classifier.py
Thank you for your great work in geneformer!
However, when I reran the disease classification, I ran into a series of problems:
- "File "/home/.../anaconda3/envs/geneformer3/lib/python3.9/site-packages/torch/serialization.py", line 166, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False."
Another:
"File "/home/.../anaconda3/envs/geneformer3/lib/python3.9/site-packages/ray/worker.py", line 1833, in get
raise value
ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."
It is really strange because I do have a CUDA device.
"
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 55W / 300W | 903MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 27C P0 51W / 300W | 25MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 196221 C python 900MiB |
| 1 N/A N/A 196692 C ray::ImplicitFunc.train() 24MiB |
+-----------------------------------------------------------------------------+
"
"During handling of the above exception, another exception occurred:
...
File "/home/.../envs/geneformer3/lib/python3.9/site-packages/transformers/configuration_utils.py", line 693, in _get_config_dict
raise EnvironmentError(
OSError: Can't load the configuration of 'path/to/pretrained_model/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'path/to/pretrained_model/' is the correct path to a directory containing a config.json file"
It's also strange because I can load the pretrained model in the cell classification and gene classification tasks using the same "path/to".
"File "/home/jdhan_pkuhpc/profiles/yeyaxuan/lustre2/anaconda3/envs/geneformer3/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'path/to/pretrained_model/'. Use `repo_type` argument if needed."
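One possible explanation for the OSError above, offered as an assumption rather than a confirmed diagnosis: a relative model path is resolved against the current working directory, and a process launched by Ray Tune may not run in the directory the script was started from. The sketch below uses a hypothetical helper, `check_model_dir`, to print what a given path resolves to and whether a config.json file is actually there. Passing the resolved absolute path to the loader is one way to rule this out.

```python
import os
from pathlib import Path


def check_model_dir(model_dir):
    """Resolve model_dir against the current working directory and
    report whether it is a directory containing config.json."""
    resolved = Path(model_dir).expanduser().resolve()
    return {
        "cwd": os.getcwd(),
        "resolved": str(resolved),
        "is_dir": resolved.is_dir(),
        "has_config": (resolved / "config.json").is_file(),
    }


if __name__ == "__main__":
    # "path/to/pretrained_model/" is the placeholder from the traceback;
    # substitute the real directory.
    print(check_model_dir("path/to/pretrained_model/"))
```

Running the same check once in the main script and once inside the training function would show whether the two processes resolve the path differently.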
I wonder whether this is caused by a version conflict. I am currently using Ray version 1.13.0.
I would appreciate it if you can offer me some help!
Thank you for your interest in Geneformer! Raytune can be sensitive to various highly environment-specific factors, especially when jobs are distributed across multiple GPUs. If you are able to load the model normally but have trouble loading it with Raytune, it's possible that the environment is not being replicated correctly for each parallel job when the work is distributed to multiple GPUs. This script was run with Raytune version 1.9.2 (conda-forge) at the time. You could consider trying that version, or following up with Raytune to help troubleshoot by checking their documentation and prior closed issues/discussions or by opening a new issue.
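To test whether each parallel job really sees the same environment, one option is to log what the process itself can see of the GPUs, since nvidia-smi only shows what the driver sees. The sketch below is a minimal, hypothetical helper (`cuda_report` is not part of Geneformer or Ray). It assumes that `CUDA_VISIBLE_DEVICES` is the variable that restricts GPU visibility for a worker, and it tolerates PyTorch being absent.

```python
import os


def cuda_report():
    """Collect what the current process can see of the GPU setup."""
    report = {
        "pid": os.getpid(),
        # None means unset; an empty string hides every GPU from this process.
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    report["torch_cuda_build"] = torch.version.cuda
    report["cuda_available"] = torch.cuda.is_available()
    report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    print(cuda_report())
```

Calling this once in the main script and once at the start of the function that Raytune runs for each trial would show whether the worker has an empty `CUDA_VISIBLE_DEVICES` or a CPU-only PyTorch build, either of which would be consistent with the "torch.cuda.is_available() is False" error.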
Which version of Python did you use?