kashif/stormcast-regression-conus-v0

@@ -134,7 +134,7 @@ ERA5 data for the date range of 2023/01/01 to 2023/01/11, interpolated to the HR
 | Optimizer | Adam (fused) |
 | Learning rate | 4e-4 |
 | LR rampup steps | 1,000 |
-| Total steps | 9,000 (of planned 16,000) |
+| Total steps | 16,000 |
 | Effective batch size | 4 (gradient accumulation) |
 | Batch size per GPU | 1 |
 | Loss | MSE (regression) |
@@ -148,7 +148,9 @@ ERA5 data for the date range of 2023/01/01 to 2023/01/11, interpolated to the HR
 | GPU | 1x NVIDIA H100 80GB |
 | Peak GPU memory | ~29 GiB |
 | Training speed | ~4.5 s/step (with grad accum) |
-| Training time | ~12 hours (9,000 steps) |
+| Training time | ~21 hours (16,000 steps) |
+| Final train loss | 0.0143 |
+| Final val loss | 0.0125 |
 ## Inference
@@ -191,7 +193,7 @@ class RegressionOnlyStormCast(StormCast):
 # Download and load checkpoint
-ckpt_path = hf_hub_download(REPO_ID, "StormCastUNet.0.9000.mdlus")
+ckpt_path = hf_hub_download(REPO_ID, "StormCastUNet.0.16000.mdlus")
 regression = PhysicsNemoModule.from_checkpoint(ckpt_path)
 diffusion = torch.nn.Identity()