Reducing GPU Memory Requirements for AIFS-ENS Inference (Current Peak ~50 GB on A100)
First, thank you to the ECMWF team for releasing the AIFS-ENS v1.0 model and providing clear inference examples in the accompanying notebook. Using the provided workflow, we were able to integrate the model into an operational pipeline and participate in the ECMWF AI Weather Quest challenge (our implementation: https://github.com/icpac-igad/ea-aifs), representing the Fahamu / ICPAC team (https://aiweatherquest.ecmwf.int/team/fahamu/).
During inference (https://github.com/icpac-igad/ea-aifs/blob/main/multi_run_AIFS_ENS_v1.py), we observed that full-resolution AIFS-ENS inference currently requires around 50–51 GB of GPU memory, as shown in the attached screenshot of the Coiled notebook dashboard.
Because of this high memory requirement, inference is only feasible on large A100/H100-class GPUs, such as the a2-ultragpu-1g machine type on Google Cloud Platform, which provides 80 GB of GPU memory.
Our goal is to explore ways to make inference possible on more widely available cloud GPUs such as the NVIDIA L4 / T4 / A10G / N4, which offer 24 GB or less of VRAM. These GPUs are available on services such as Cloud Run and standard GCP/AWS GPU VMs.
We are evaluating the following memory-reduction strategies, but would appreciate guidance from the team on which of these are officially supported or recommended:
1. Mixed precision inference (precision="half")
Setting precision="half" when constructing the SimpleRunner appears to cut memory usage significantly (both model weights and activations). Example:
from anemoi.inference.runners.simple import SimpleRunner

runner = SimpleRunner(
    checkpoint,  # checkpoint spec as in the example notebook
    device="cuda",
    precision="half",  # proposed change: FP16 inference
)
We would like to confirm whether FP16/BF16 inference is fully supported for AIFS-ENS and whether this affects output quality.
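To quantify the saving on our side, we wrap the runner call in a small helper that reads the standard torch.cuda peak-memory counters. This is a minimal sketch only: the helper name is ours, and checkpoint / input_state are assumed to be prepared as in the example notebook.

import torch
from anemoi.inference.runners.simple import SimpleRunner

def peak_memory_gb(checkpoint, input_state, precision=None, lead_time=12):
    """Run one forecast and return the peak allocated GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    kwargs = {"device": "cuda"}
    if precision is not None:
        kwargs["precision"] = precision  # e.g. "half" for FP16
    runner = SimpleRunner(checkpoint, **kwargs)
    for state in runner.run(input_state=input_state, lead_time=lead_time):
        pass  # we only care about memory here, not the outputs
    return torch.cuda.max_memory_allocated() / 1024**3

# Example comparison (checkpoint and input_state as in the notebook):
# print(peak_memory_gb(checkpoint, input_state))                    # FP32 baseline
# print(peak_memory_gb(checkpoint, input_state, precision="half"))  # FP16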
2. Inference chunking
The ANEMOI_INFERENCE_NUM_CHUNKS environment variable appears to offer further memory reduction at some cost to runtime:
export ANEMOI_INFERENCE_NUM_CHUNKS=16
# or 32 for larger reductions
We would like to verify:
- the recommended range (8, 16, 32, 64?);
- whether chunking is officially supported for AIFS-ENS.
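For completeness, the sketch below shows how we combine the two settings in one script. The environment variable is set before anemoi-inference runs so that it is picked up; the HuggingFace checkpoint id is our assumption and should be replaced with the AIFS-ENS checkpoint you use.

import os

# Must be set before the runner executes, so anemoi-inference sees it.
os.environ["ANEMOI_INFERENCE_NUM_CHUNKS"] = "16"

from anemoi.inference.runners.simple import SimpleRunner

# Checkpoint id assumed here; adjust to the AIFS-ENS v1.0 checkpoint you use.
checkpoint = {"huggingface": "ecmwf/aifs-ens-1.0"}
runner = SimpleRunner(checkpoint, device="cuda", precision="half")

def run_forecast(input_state, lead_time=12):
    """input_state prepared as in the example notebook."""
    for state in runner.run(input_state=input_state, lead_time=lead_time):
        yield state  # post-process / save each step downstream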
Questions
- Is FP16 / BF16 inference officially supported for AIFS-ENS v1.0?
- Is chunked inference (via ANEMOI_INFERENCE_NUM_CHUNKS) recommended, and what ranges are safe?
- Are there plans to support lower-memory inference modes tailored for 24 GB GPUs?
Our target is to reduce peak memory from ~51 GB down to < 24 GB, enabling inference on L4-class GPUs and broader operational deployment.
We appreciate any guidance on the best path forward and whether these optimizations align with the anemoi-inference design.
Thank you again for the release and your continued work on AI-based weather prediction.
Just to follow up on my earlier message: I can confirm that mixed (low) precision FP16 inference and chunking do indeed work with AIFS-ENS v1.0, and run reliably with the current anemoi-inference setup (https://github.com/icpac-igad/ea-aifs/blob/main/pytorch_profile_fp16.py).
Using precision="half" together with moderate chunking (ANEMOI_INFERENCE_NUM_CHUNKS=16), the measured peak GPU usage during full AIFS-ENS inference, as shown in the profile screenshot, is:
- Peak allocated: ~20 GB
- Peak reserved: ~23 GB
These measurements follow the PyTorch memory-profiling workflow described in the presentation (“Demo: Anemoi Profiling”, https://events.ecmwf.int/event/466/timetable/). This is in strong contrast with FP32, where PyTorch and CUDA workspace allocations drive total usage above 34 GB. The profiling scripts are at https://github.com/icpac-igad/ea-aifs/blob/main/pytorch_profile_fp16.py and pytorch_profile_fp32.py.
Although the FP32 PyTorch snapshot visualisation shows only “thin” high peaks (non-PyTorch CUDA workspace allocations do not appear in the snapshot), the aggregated numbers confirm that FP32 cannot run on sub-48 GB GPUs.
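For anyone reproducing these numbers, the core of both profiling scripts follows the standard PyTorch memory-snapshot recipe sketched below. Note that _record_memory_history / _dump_snapshot are semi-private PyTorch APIs (available in PyTorch 2.1+), and the snapshot file is inspected at https://pytorch.org/memory_viz.

import torch

# Start recording allocation history before inference.
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run AIFS-ENS inference here, e.g. the runner.run() loop above ...

# Dump the snapshot and stop recording; drop the .pickle file into
# https://pytorch.org/memory_viz to see the allocation timeline.
torch.cuda.memory._dump_snapshot("aifs_ens_fp16_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)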
These results suggest that AIFS-ENS v1.0 inference can run on commonly available 24 GB cloud GPUs such as L4, A10G, or RTX 4090, which opens the door to more accessible and cost-effective deployment options (e.g., Cloud Run or standard GCP GPU VMs).


