"torch.cuda.is_available() is False", "Can't load the configuration of 'path/to/pretrained_model/" and... in hyperparam_optimiz_for_disease_classifier.py

#212
by kamisama0101 - opened

Thank you for your great work on Geneformer!
However, when I re-ran the disease classification, I ran into a series of problems:

  1. "File "/home/.../anaconda3/envs/geneformer3/lib/python3.9/site-packages/torch/serialization.py", line 166, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
    RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False."
    Another:
    "File "/home/.../anaconda3/envs/geneformer3/lib/python3.9/site-packages/ray/worker.py", line 1833, in get
    raise value
    ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU."

It is really strange because I do have a CUDA device.

"
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:3B:00.0 Off | 0 |
| N/A 30C P0 55W / 300W | 903MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 27C P0 51W / 300W | 25MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 196221 C python 900MiB |
| 1 N/A N/A 196692 C ray::ImplicitFunc.train() 24MiB |
+-----------------------------------------------------------------------------+
"

  2. "During handling of the above exception, another exception occurred:
    ...
    File "/home/.../envs/geneformer3/lib/python3.9/site-packages/transformers/configuration_utils.py", line 693, in _get_config_dict
    raise EnvironmentError(
    OSError: Can't load the configuration of 'path/to/pretrained_model/'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'path/to/pretrained_model/' is the correct path to a directory containing a config.json file"
    It's also strange because I can load the pretrained model in the cell classification and gene classification tasks using the same path.
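For what it's worth, `from_pretrained` treats its argument either as a Hub repo id or as a local directory containing a `config.json`; the `'path/to/pretrained_model/'` in the traceback is a placeholder that was never replaced with a real directory, or the path is not visible from the worker process. A minimal sketch of loading a config from a local directory, using a throwaway directory and a hypothetical minimal `config.json` (not Geneformer's real configuration):

```python
import json
import os
import tempfile

from transformers import AutoConfig

# Create a stand-in model directory containing a minimal config.json.
model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, "config.json"), "w") as f:
    json.dump({"model_type": "bert", "hidden_size": 256}, f)

# Passing an absolute local path avoids any ambiguity with Hub repo ids
# (which must look like 'repo_name' or 'namespace/repo_name').
config = AutoConfig.from_pretrained(os.path.abspath(model_dir))
print(config.hidden_size)  # 256
```

The third error (`HFValidationError: Repo id must be in the form ...`) is consistent with this: when the string is not an existing local directory, `transformers` falls back to interpreting it as a Hub repo id, and `'path/to/pretrained_model/'` fails that validation.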

  3. "File "/home/jdhan_pkuhpc/profiles/yeyaxuan/lustre2/anaconda3/envs/geneformer3/lib/python3.9/site-packages/huggingface_hub/utils/_validators.py", line 158, in validate_repo_id
    raise HFValidationError(
    huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'path/to/pretrained_model/'. Use repo_type argument if needed."

I wonder whether this is caused by a version conflict. I am currently using Ray version 1.13.0.
I would appreciate it if you could offer me some help!

Thank you for your interest in Geneformer! Ray Tune can be sensitive to various highly environment-specific factors, especially when jobs are distributed across multiple GPUs. If you can load the model normally but have trouble when loading it through Ray Tune, it is possible that the distributed job is not replicating the environment correctly for each parallel worker. This script was run with Ray Tune version 1.9.2 (conda-forge) at the time, so you could consider trialing that version. You could also follow up with Ray Tune to help troubleshoot, by checking their documentation and previously closed issues/discussions or by opening a new issue.

ctheodoris changed discussion status to closed

Which version of Python did you use?