
Issue with dependencies

#5
by garelicu - opened

Hello,

Thanks for giving access to the CRPS version of the model!

I successfully managed to run AIFS (the aifs-single-1.0 checkpoint) using the notebook you provide, anemoi, and all the necessary dependencies.
Now, however, starting from the Docker image I built for AIFS, I am having issues running this aifs-ens-1.0 checkpoint.

I realize, of course, that some dependencies may differ, but I am following the list of required packages you mention in the card:

  • torch==2.5.0 anemoi-inference[huggingface]==0.6.0 anemoi-models==0.6.0 anemoi-graphs==0.6.0 anemoi-datasets==0.5.23
  • earthkit-regrid==0.4.0 'ecmwf-opendata>=0.3.19'
  • flash_attn

I am using torch==2.6.0 because I was having CUDA issues with torch==2.5.0. The new Docker image builds fine with these dependencies, but when I try to run the model I run into issues with flash_attn and the Python version the checkpoint was saved with, as can be seen from the error I am getting:

“

ImportError Traceback (most recent call last)
File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/anemoi/inference/runner.py:452, in Runner.model(self)
451 try:
--> 452 model = torch.load(self.checkpoint.path, map_location=self.device, weights_only=False).to(self.device)
453 except Exception as e: # Wildcard exception to catch all errors

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/torch/serialization.py:1471, in load(f, map_location, pickle_module, weights_only, mmap, **pickle_load_args)
1470 raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
-> 1471 return _load(
1472 opened_zipfile,
1473 map_location,
1474 pickle_module,
1475 overall_storage=overall_storage,
1476 **pickle_load_args,
1477 )
1478 if mmap:

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/torch/serialization.py:1964, in _load(zip_file, map_location, pickle_module, pickle_file, overall_storage, **pickle_load_args)
1963 _serialization_tls.map_location = map_location
-> 1964 result = unpickler.load()
1965 _serialization_tls.map_location = None

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/torch/serialization.py:1953, in _load.<locals>.UnpicklerWrapper.find_class(self, mod_name, name)
1952 mod_name = load_module_mapping.get(mod_name, mod_name)
-> 1953 return super().find_class(mod_name, name)

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/flash_attn/__init__.py:3
1 __version__ = "2.8.0.post2"
----> 3 from flash_attn.flash_attn_interface import (
4 flash_attn_func,
5 flash_attn_kvpacked_func,
6 flash_attn_qkvpacked_func,
7 flash_attn_varlen_func,
8 flash_attn_varlen_kvpacked_func,
9 flash_attn_varlen_qkvpacked_func,
10 flash_attn_with_kvcache,
11 )

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/flash_attn/flash_attn_interface.py:15
14 else:
---> 15 import flash_attn_2_cuda as flash_attn_gpu
17 # isort: on

ImportError: /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/flash_attn_2_cuda.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

The above exception was the direct cause of the following exception:

RuntimeError Traceback (most recent call last)
Cell In[20], line 1
----> 1 for state in runner_ens.run(input_state=input_state, lead_time=6*5):
2 print_state(state)

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/anemoi/inference/runner.py:222, in Runner.run(self, input_state, lead_time)
219 input_tensor = self.prepare_input_tensor(input_state)
221 try:
--> 222 yield from self.forecast(lead_time, input_tensor, input_state)
223 except (TypeError, ModuleNotFoundError, AttributeError):
224 if self.report_error:

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/anemoi/inference/runner.py:541, in Runner.forecast(self, lead_time, input_tensor_numpy, input_state)
522 def forecast(
523 self, lead_time: str, input_tensor_numpy: FloatArray, input_state: State
524 ) -> Generator[State, None, None]:
525 """Forecast the future states.
526
527 Parameters
(...) 539 The forecasted state.
540 """
--> 541 self.model.eval()
543 torch.set_grad_enabled(False)
545 # Create pytorch input tensor

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/functools.py:993, in cached_property.__get__(self, instance, owner)
991 val = cache.get(self.attrname, _NOT_FOUND)
992 if val is _NOT_FOUND:
--> 993 val = self.func(instance)
994 try:
995 cache[self.attrname] = val

File /usr/local/pyenv/versions/3.12.7/lib/python3.12/site-packages/anemoi/inference/runner.py:458, in Runner.model(self)
456 validation_result = self.checkpoint.validate_environment(on_difference="return")
457 error_msg = f"Error loading model - {validation_result}"
--> 458 raise RuntimeError(error_msg) from e
459 # model.set_inference_options(**self.inference_options)
460 assert getattr(model, "runner", None) is None, model.runner

RuntimeError: Error loading model - Environment validation failed. The following issues were found:
python:
Python version mismatch: 3.11.6 != 3.12.7
”

(It also seemed that anemoi-utils==0.4.22 is needed, but this is an easy fix.)

Was the checkpoint actually saved with Python 3.11.6? Could you please share the flash_attn version you are using and any other related packages? (I was hoping that leaving it unpinned in the requirements would resolve the conflict, but I am now doubting that.)
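As a side note, I can reproduce the flash_attn failure without anemoi at all, by importing the module in isolation; a minimal sketch (the `import_ok` helper is my own, not part of any of these packages):

```python
import importlib


def import_ok(module: str) -> bool:
    """True if the module imports cleanly, False on ImportError
    (which is the class of error the broken flash_attn wheel raises)."""
    try:
        importlib.import_module(module)
        return True
    except ImportError:
        return False


# In my image, import_ok("flash_attn") returns False with the same
# undefined-symbol error, so the problem is the flash_attn build
# itself and not the anemoi checkpoint loading.
```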

Thanks!

Hi,
Yes, the model was trained with Python 3.11.6, but that is not too important.
The flash_attn version was https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl.
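The wheel filename encodes the environment it was built for, so you can check those tags against your runtime before installing. A small sketch (the parser is my own helper, based only on the naming scheme of that release asset):

```python
import re


def wheel_env_tags(wheel_name: str) -> dict:
    """Parse the build tags out of a flash-attn release wheel filename,
    e.g. flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
    """
    m = re.search(r"\+cu(\d+)torch([\d.]+)cxx11abi(TRUE|FALSE)-cp(\d+)", wheel_name)
    if m is None:
        raise ValueError(f"unrecognised wheel name: {wheel_name}")
    digits = m.group(4)
    return {
        "cuda": m.group(1),                     # CUDA major version, e.g. "12"
        "torch": m.group(2),                    # torch the wheel was compiled against
        "cxx11abi": m.group(3) == "TRUE",       # must match torch's C++ ABI setting
        "python": f"{digits[0]}.{digits[1:]}",  # CPython tag, e.g. "3.12"
    }


tags = wheel_env_tags(
    "flash_attn-2.7.4.post1+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl"
)
print(tags)
```

Running a wheel compiled against torch 2.5 under torch 2.6 is typically exactly what produces an undefined `c10::Error` symbol like the one in your traceback, since the prebuilt extension links against torch's C++ ABI.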

With this Dockerfile, I was able to run the notebook in Coiled on a GCP a2-ultragpu-1g GPU instance; torch==2.5.0 seems to be crucial.

FROM nvidia/cuda:12.9.1-cudnn-devel-ubuntu24.04

# Install Python and build essentials
RUN apt-get update && apt-get install -y \
    python3 python3-pip python3-dev \
    build-essential \
    git \
    python3-venv \
    && rm -rf /var/lib/apt/lists/*

# Create and activate virtual environment
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Install prerequisites
RUN pip install packaging ninja

# Install PyTorch first (specific version as requested)
RUN pip install torch==2.5.0

# Install Flash Attention
RUN MAX_JOBS=4 pip install flash-attn --no-build-isolation

# Install the anemoi packages and other requirements
RUN pip install \
    anemoi-inference[huggingface]==0.6.0 \
    anemoi-models==0.6.0 \
    anemoi-graphs==0.6.0 \
    anemoi-datasets==0.5.23 \
    earthkit-regrid==0.4.0 \
    'ecmwf-opendata>=0.3.19'

# Install notebook environment
RUN pip install jupyter coiled notebook

# Set working directory
WORKDIR /workspace

# Expose Jupyter port
EXPOSE 8888

# Default command
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
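Once the image is built, the installed versions can be sanity-checked from inside the container before loading the checkpoint. A minimal sketch, assuming the pins from the model card above (the helper itself is my own, not part of anemoi):

```python
from importlib import metadata

# Version pins taken from the model card requirements.
PINS = {
    "torch": "2.5.0",
    "anemoi-inference": "0.6.0",
    "anemoi-models": "0.6.0",
    "anemoi-graphs": "0.6.0",
    "anemoi-datasets": "0.5.23",
    "earthkit-regrid": "0.4.0",
}


def mismatches(pins, get_version=metadata.version):
    """Return a human-readable message for every pin that is not satisfied."""
    problems = []
    for name, want in pins.items():
        try:
            have = get_version(name)
        except Exception:
            problems.append(f"{name}: not installed (want {want})")
            continue
        if have != want:
            problems.append(f"{name}: {have} != {want}")
    return problems


if __name__ == "__main__":
    for msg in mismatches(PINS):
        print(msg)
```

An empty output means every pinned package matches; anything printed is a candidate cause for the checkpoint environment-validation failure.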

As I have not heard from you, I trust this has resolved your issue.
I will close it now, feel free to reopen if you need more assistance.

hcookie129 changed discussion status to closed
