# Earth-2 Checkpoints: HealDA

HealDA is a global ML-based data assimilation (DA) model that maps a short window of satellite and conventional observations to a 1° atmospheric state on the Hierarchical Equal Area isoLatitude Pixelation (HEALPix) grid. The resulting analyses serve as plug-and-play initial conditions for off-the-shelf ML forecast models.

This model is ready for commercial/non-commercial use.

### License/Terms of Use:

**Governing Terms**: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).

### Deployment Geography:

Global

### Use Case:

Industry, academic, and government research teams interested in data assimilation and medium-range weather forecasting.

### Release Date:

Hugging Face: 3/16/2026 [URL](https://huggingface.co/nvidia/healda)

## Reference:

**Papers**:

- [HealDA: Highlighting the importance of initial errors in end-to-end AI weather forecasts](https://arxiv.org/abs/2601.17636)

**Code**:

- [PhysicsNeMo](https://github.com/NVIDIA/physicsnemo)
- [Earth2Studio](https://github.com/NVIDIA/earth2studio)

## Model Architecture

**Architecture Type:** Custom Observation Encoder + HPX Vision Transformer (ViT) backbone <br>
**Network Architecture:** DiT-L adapted to the HEALPix grid, with patch-based encoding/decoding and global self-attention across the sphere. The network is used as a deterministic regression model.

- Observation Encoder: sensor-specific point-cloud embedders with scatter-reduce aggregation onto the HPX64 grid
- HPX ViT backbone: 330M parameters, 24 transformer blocks, embedding dimension 1024
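
The scatter-reduce aggregation can be illustrated with a minimal sketch. It assumes each observation already carries a target pixel index on the HPX64 grid and a single scalar value, and it uses a mean reduction. The function name `scatter_mean`, the choice of reduction, and the toy data are illustrative assumptions, not the model's actual implementation, which would more likely use a vectorized tensor operation.

```python
def scatter_mean(pixel_index, values, npix):
    """Average all values that land in the same pixel; empty pixels stay 0.0.

    Hypothetical helper for illustration only.
    """
    sums = [0.0] * npix
    counts = [0] * npix
    for pix, val in zip(pixel_index, values):
        sums[pix] += val
        counts[pix] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


NSIDE = 64
NPIX = 12 * NSIDE**2  # 49,152 pixels on the HPX64 grid

# Toy point cloud: three observations fall in pixel 7, one in pixel 100.
pixel_index = [7, 100, 7, 7]
values = [1.0, 5.0, 2.0, 3.0]

grid = scatter_mean(pixel_index, values, NPIX)
print(grid[7], grid[100], grid[0])
```

The result is a fixed-size grid regardless of how many observations arrive, which is what lets a variable-length point cloud feed a fixed-size backbone.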

## Input:

**Input Type(s):**

- Tensor (satellite and conventional observations as point-cloud data)
- Tensor (static conditioning fields: orography, land-sea mask)
- Tensor (day of year)
- Tensor (second of day) <br>

**Input Format(s):** PyTorch Tensors <br>
**Input Parameters:**

- Observations: variable-length point cloud (~10M scalar observations per 24-hour window, \[t-21h, t+3h\] relative to the analysis time t)
- Static conditioning: 4D (batch, channels, time, Npix)
- Day of year: 2D (batch, time)
- Second of day: 2D (batch, time)
- The current checkpoint uses a time dimension of size 1 (time=1) <br>
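
As a rough illustration of the two time inputs, the sketch below derives a day of year and a second of day from a UTC analysis time and wraps each in a (batch, time) layout with batch=1 and time=1. The helper name `time_features`, the 1-based day-of-year convention, and the use of nested lists in place of tensors are assumptions made for this example.

```python
from datetime import datetime, timezone


def time_features(t):
    """Return (day_of_year, second_of_day) for a UTC datetime.

    Assumes a 1-based day of year (1 January -> 1); the checkpoint's
    actual convention is not stated in this card.
    """
    day_of_year = t.timetuple().tm_yday
    second_of_day = t.hour * 3600 + t.minute * 60 + t.second
    return day_of_year, second_of_day


analysis_time = datetime(2022, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
doy, sod = time_features(analysis_time)

# (batch, time) layout with batch=1 and time=1, shown as nested lists.
day_of_year_input = [[doy]]
second_of_day_input = [[sod]]
print(day_of_year_input, second_of_day_input)
```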

**Other Properties Related to Input:**

- Observation window: 24 hours, from t-21h to t+3h relative to the target analysis time t
- Microwave sounders: AMSU-A, AMSU-B, ATMS, MHS aboard NOAA-15–20, Metop-A–C, and Suomi-NPP
- Conventional in-situ observations: surface stations, aircraft, radiosondes, buoys (surface pressure, temperature, humidity, u/v winds)
- GNSS Radio Occultation: bending angle and derived temperature/humidity profiles
- Satellite-derived winds: scatterometer (ASCAT) and atmospheric motion vectors (AMVs)
- Observational data is sourced from the [NOAA UFS Replay](https://psl.noaa.gov/data/ufs_replay/) archive
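
The asymmetric observation window can be sketched as follows. The function names and the inclusive treatment of both endpoints are assumptions for illustration; the card does not specify how boundary observations are handled.

```python
from datetime import datetime, timedelta, timezone

WINDOW_BEFORE = timedelta(hours=21)  # window opens at t - 21h
WINDOW_AFTER = timedelta(hours=3)    # window closes at t + 3h


def observation_window(analysis_time):
    """Return the (start, end) of the 24-hour window for an analysis time."""
    return analysis_time - WINDOW_BEFORE, analysis_time + WINDOW_AFTER


def in_window(obs_time, analysis_time):
    """True if an observation time falls in the window (endpoints included)."""
    start, end = observation_window(analysis_time)
    return start <= obs_time <= end


t = datetime(2022, 1, 2, 0, 0, tzinfo=timezone.utc)
start, end = observation_window(t)
print(start.isoformat(), end.isoformat())
```

For an analysis time of 2022-01-02 00:00 UTC, the window runs from 2022-01-01 03:00 UTC to 2022-01-02 03:00 UTC.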

## Output:

**Output Type(s):** Tensor (74-channel atmospheric state) <br>
**Output Format:** PyTorch Tensor <br>
**Output Parameters:** 4D (batch, channels, time, Npix) <br>
**Other Properties Related to Output:**

- Output grid: HEALPix HPX64 (Nside=64), 49,152 pixels, \~1° (\~100 km) resolution
- Output is regridded to 0.25° (721×1440) for downstream forecast model initialization
- Output state variables: `u10m`, `v10m`, `u100m`, `v100m`, `t2m`, `msl`, `tcwv`, `sst`, `sic`,
`u50`, `u100`, `u150`, `u200`, `u250`, `u300`, `u400`, `u500`, `u600`, `u700`, `u850`,
`u925`, `u1000`, `v50`, `v100`, `v150`, `v200`, `v250`, `v300`, `v400`, `v500`, `v600`, `v700`, `v850`,
`v925`, `v1000`, `z50`, `z100`, `z150`, `z200`, `z250`, `z300`, `z400`, `z500`, `z600`, `z700`, `z850`,
`z925`, `z1000`, `t50`, `t100`, `t150`, `t200`, `t250`, `t300`, `t400`, `t500`, `t600`, `t700`, `t850`,
`t925`, `t1000`, `q50`, `q100`, `q150`, `q200`, `q250`, `q300`, `q400`, `q500`, `q600`, `q700`, `q850`,
`q925`, `q1000`
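
The grid and channel figures above can be cross-checked with a short sketch. It assumes the channel order follows the listing (surface variables first, then `u`, `v`, `z`, `t`, `q` across 13 pressure levels) and estimates resolution as the square root of the mean pixel solid angle. The constant names are illustrative and not taken from the model code.

```python
import math

NSIDE = 64
NPIX = 12 * NSIDE**2  # HEALPix pixel count for a given Nside

# Approximate pixel spacing: square root of the per-pixel solid angle.
pixel_area_sr = 4 * math.pi / NPIX
resolution_deg = math.degrees(math.sqrt(pixel_area_sr))

SURFACE = ["u10m", "v10m", "u100m", "v100m", "t2m", "msl", "tcwv", "sst", "sic"]
LEVELS = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]
PRESSURE_VARS = ["u", "v", "z", "t", "q"]

# Assumed ordering: all levels of one variable before the next variable.
CHANNELS = SURFACE + [f"{var}{lev}" for var in PRESSURE_VARS for lev in LEVELS]

# Size of the 0.25-degree latitude-longitude grid (both poles included).
n_lat = int(180 / 0.25) + 1
n_lon = int(360 / 0.25)

print(NPIX, round(resolution_deg, 2), len(CHANNELS), (n_lat, n_lon))
```

The count works out as 9 surface variables plus 5 variables on 13 pressure levels, giving 74 channels.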

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration

**Runtime Engine(s):** PyTorch <br>
**Supported Hardware Microarchitecture Compatibility:** <br>
* NVIDIA Ampere <br>
* NVIDIA Blackwell <br>
* NVIDIA Hopper <br>

**Supported Operating System(s):**
* Linux <br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s):

**Model Version:** v1 <br>

# Training, Testing, and Evaluation Datasets:

## Training Dataset:

**Link:** [ERA5](https://cds.climate.copernicus.eu/) <br>

*Data Collection Method by dataset:* <br>
* Automatic/Sensors <br>

*Labeling Method by dataset:* <br>
* Automatic/Sensors <br>

**Properties:**
6-hourly ERA5 reanalysis data for the period 2000–2021, used as the supervised
training target. ERA5 provides hourly estimates of various atmospheric, land, and
oceanic climate variables. The data covers the Earth on a 30 km grid and resolves the
atmosphere at 137 levels. <br>

**Attribution:**
Contains modified Copernicus Climate Change Service information 2000-2022.

**Link:** [UFS Replay](https://psl.noaa.gov/data/ufs_replay/) <br>

*Data Collection Method by dataset:* <br>
* Automatic/Sensors <br>

*Labeling Method by dataset:* <br>
* Automatic/Sensors <br>

**Properties:**
Observational data from the NOAA UFS Replay archive for the period 2000–2021, used
as model input. The archive contains a wide range of satellite and conventional observations
used by NOAA operational forecast systems, thinned to 1° spatial resolution by the NOAA Gridpoint Statistical Interpolation (GSI) system. HealDA uses a
subset of these observations, including microwave sounder radiances (AMSU-A, AMSU-B,
ATMS, MHS), conventional in-situ measurements (surface stations, aircraft, radiosondes,
buoys), GNSS radio occultation, and satellite-derived wind products (scatterometer and
atmospheric motion vectors). <br>

**Attribution:**
The NOAA Global Ensemble Forecast System (version 13) replay data used in our study was created by the NOAA Physical Sciences Laboratory (NOAA, 2024).

#### Data Processing Description:

**ERA5:** ERA5 data at 0.25° resolution on the latitude-longitude grid is regridded using bilinear interpolation to the HEALPix grid at level 8 (Nside=256), then coarsened by block-averaging to level 6 (Nside=64).
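
The coarsening step can be sketched as below. The sketch assumes the map is stored in HEALPix NESTED ordering, where the 16 level-8 children of each level-6 pixel are contiguous; the card does not state the ordering, and `coarsen_nested` is a hypothetical helper rather than the pipeline's actual code.

```python
def npix(nside):
    """Number of HEALPix pixels for a given Nside."""
    return 12 * nside**2


def coarsen_nested(fine, factor=4):
    """Block-average a NESTED-ordered HEALPix map, reducing Nside by `factor`.

    In NESTED ordering each coarse pixel owns factor**2 consecutive fine
    pixels, so coarsening is a mean over fixed-size contiguous blocks.
    """
    block = factor**2
    if len(fine) % block:
        raise ValueError("map length must be a multiple of factor**2")
    return [sum(fine[i:i + block]) / block for i in range(0, len(fine), block)]


# Toy level-8 map: every fine pixel holds the index of its level-6 parent.
fine_map = [float(i // 16) for i in range(npix(256))]
coarse_map = coarsen_nested(fine_map, factor=4)
print(len(fine_map), len(coarse_map), coarse_map[:3])
```

Going from Nside=256 to Nside=64 reduces 786,432 pixels to 49,152, with each output pixel averaging 16 inputs.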

**UFS Replay:** Raw observation data from the UFS Replay archive (netCDF format) is converted to Parquet format for efficient data loading during training.

## Testing Dataset:

**Link:** [ERA5](https://cds.climate.copernicus.eu/) <br>

*Data Collection Method by dataset:* <br>
* Automatic/Sensors <br>

*Labeling Method by dataset:* <br>
* Automatic/Sensors <br>

**Properties:**
ERA5 reanalysis data for the year 2022, used as the verification reference for analysis
and forecast evaluation. <br>

**Attribution:**
Contains modified Copernicus Climate Change Service information 2000-2022.

**Link:** [UFS Replay](https://psl.noaa.gov/data/ufs_replay/) <br>

*Data Collection Method by dataset:* <br>
* Automatic/Sensors <br>

*Labeling Method by dataset:* <br>
* Automatic/Sensors <br>

**Properties:**
Observational data from the NOAA UFS Replay archive for the year 2022, used as model
input during testing. Same observation types as the training dataset. <br>

**Attribution:**
The NOAA Global Ensemble Forecast System (version 13) replay data used in our study was created by the NOAA Physical Sciences Laboratory (NOAA, 2024).

## Evaluation Dataset:

**Link:** [ERA5](https://cds.climate.copernicus.eu/) <br>

*Data Collection Method by dataset:* <br>
* Automatic/Sensors <br>

*Labeling Method by dataset:* <br>
* Automatic/Sensors <br>

**Properties:**
ERA5 reanalysis data for the year 2022. All verification is conducted on the
HPX64 grid. <br>

**Attribution:**
Contains modified Copernicus Climate Change Service information 2000-2022.

**Link:** [UFS Replay](https://psl.noaa.gov/data/ufs_replay/) <br>

*Data Collection Method by dataset:* <br>
* Automatic/Sensors <br>

*Labeling Method by dataset:* <br>
* Automatic/Sensors <br>

**Properties:**
Observational data from the NOAA UFS Replay archive for the year 2022, used as model
input during evaluation. <br>

**Attribution:**
The NOAA Global Ensemble Forecast System (version 13) replay data used in our study was created by the NOAA Physical Sciences Laboratory (NOAA, 2024).

## Inference:

**Acceleration Engine:** PhysicsNeMo, PyTorch <br>
**Test Hardware:**
* H100 <br>


## Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the [Model Card++ Bias, Explainability, Safety & Security, and Privacy Subcards](https://huggingface.co/nvidia/healda).

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).