From Hazard Detection to Risk Intelligence: TerraMind’s Path Toward Predictive Modeling
Introduction
Natural hazards such as wildfires, landslides, and floods are among the greatest threats to societies and ecosystems, yet predicting these risks remains a daunting challenge. Traditional models are often tailored to specific regions and struggle to scale globally.
Today’s risk maps are frequently outdated or static. They often fail to account for climate change, urban expansion, or shifts in land use and soil properties, factors that can radically alter exposure to hazards. As a result, they become obsolete quickly. Satellite data, however, provide continually refreshed signals of changing conditions, allowing risk assessments to remain relevant to the threats communities face today.
To address these challenges, our framework follows a three-step strategy:
- Hazard Detection (current focus): Lightweight binary segmentation decoders identify hazard footprints directly from satellite imagery. Each decoder is hazard-specific but built on a shared frozen backbone encoder that converts raw imagery into rich embeddings of the land surface.
- Hazard Prediction (future direction): Detection outputs serve as ground-truth for models trained on temporal windows of pre-event imagery, enabling forecasts of where hazards are likely to occur.
- Risk Assessment (long-term vision): Predictions are integrated with exposure and vulnerability data, turning hazard probabilities into actionable risk intelligence.
Both hazard detection and prediction models use TerraMind embeddings and Thinking-in-Modalities (TiM). Our framework applies most naturally to frequent hazards that leave repeated, observable signatures. These events supply the abundant training data required for machine learning. Rare events, in contrast, lack sufficient examples for effective predictive modeling.
Trained Models for Hazard Detection
At the heart of our framework lies a simple but powerful concept: specialized U-Net decoders trained for individual hazard types, all built on top of TerraMind’s frozen backbone. The frozen backbone acts as a feature extractor, while decoders specialize in detecting footprints such as flooded areas, burn scars, or unstable slopes.
For this phase, we focused on three hazards:
- Floods: Water vs. non-water
- Wildfires: Burned vs. unburned areas
- Landslides: Recent landslide vs. unchanged terrain
All models were trained with TerraMind_v1_base. For the flood detection model, we additionally leveraged TiM. Training used Dice loss, which is well suited to imbalanced segmentation tasks, and the AdamW optimizer for stable convergence.
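As a minimal sketch of this recipe, the snippet below pairs a frozen stand-in encoder with a small decoder head and trains only the decoder with Dice loss and AdamW. The modules, shapes, and hyperparameters are illustrative placeholders, not the actual TerraMind components.

```python
import torch
import torch.nn as nn

def dice_loss(logits: torch.Tensor, targets: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss, well suited to imbalanced binary segmentation."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()

# Stand-ins for the real components: a frozen patch encoder and a light decoder.
backbone = nn.Conv2d(6, 768, kernel_size=16, stride=16)   # placeholder for TerraMind_v1_base
decoder = nn.Sequential(                                  # placeholder for the U-Net decoder
    nn.ConvTranspose2d(768, 64, kernel_size=16, stride=16),
    nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),
)

for p in backbone.parameters():
    p.requires_grad = False  # the backbone stays frozen; only the decoder learns

optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4, weight_decay=1e-2)

def train_step(images: torch.Tensor, masks: torch.Tensor) -> float:
    with torch.no_grad():
        feats = backbone(images)   # reusable embeddings, computed without gradients
    logits = decoder(feats)        # per-pixel hazard logits
    loss = dice_loss(logits, masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example: one step on a random batch (B, C, H, W) with binary masks.
loss = train_step(torch.randn(2, 6, 224, 224), torch.randint(0, 2, (2, 1, 224, 224)).float())
```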
Model Performance
| Hazard Type | Dataset / Inputs | mIoU | F1 Score | Loss | Remarks |
|---|---|---|---|---|---|
| Floods | Sen1Floods11 (S1 GRD + S2 L1C) | 0.884 | 0.936 | 0.131 | Without TiM |
| Floods | Sen1Floods11 (S1 GRD + S2 L1C) | 0.901 | 0.947 | 0.089 | Improved with TiM LULC |
| Burn Scars | HLS Burn Scars (Landsat + Sentinel) | 0.885 | 0.936 | 0.092 | Surpassed benchmark (0.836 mIoU) [1] |
| Landslides | Landslide4Sense (S2 L2A + DEM) | 0.662 | 0.751 | 0.276 | Outperformed Landslide4Sense competition winner (0.745 F1) [2] |
Key Insights
- Despite different modalities (radar, optical, DEMs), the backbone-decoder design remains robust.
- Even with lightweight decoders, performance matches or surpasses existing benchmarks.
- TiM contribution: Flood detection already benefited from TiM LULC features, and future work will further extend TiM integration to enhance wildfire and landslide models (including DEMs).
From Hazard Detection to Predictive Modeling
While current models identify past hazard footprints, their value extends far beyond event mapping [3]. Accurate segmentation enables the creation of large-scale databases of historical events. Running detection pipelines across broad regions produces a catalog of when and where hazards occurred, a foundation for predictive modeling.
The aim is not to predict exact event timing but to provide risk assessments: estimating the likelihood of hazards occurring within a defined temporal window. This is particularly valuable for insurance, disaster planning, and infrastructure resilience.
Predictive models ingest temporal sequences of pre-event imagery, learning patterns that consistently precede hazards. Recurrent U-Net variants or time-series convolutional models, fed with sliding windows of days to months of imagery, can estimate probabilities of hazard occurrence at pixel or regional scale.
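As a minimal sketch of this idea (the architecture, shapes, and window length below are illustrative, not our trained model), a small 3D-convolutional predictor can map a sliding window of per-date embedding maps to a per-pixel hazard probability:

```python
import torch
import torch.nn as nn

class TemporalHazardPredictor(nn.Module):
    """Toy time-series model: per-pixel hazard probability from a sliding
    window of T pre-event embedding maps."""

    def __init__(self, embed_dim: int = 768, window: int = 8):
        super().__init__()
        # A 3D convolution mixes the temporal and spatial axes of the window.
        self.temporal = nn.Conv3d(embed_dim, 256, kernel_size=(window, 3, 3), padding=(0, 1, 1))
        self.head = nn.Conv2d(256, 1, kernel_size=1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (B, C, T, H, W), one embedding map per acquisition date
        x = torch.relu(self.temporal(embeddings)).squeeze(2)  # collapse the time axis
        return torch.sigmoid(self.head(x))  # (B, 1, H, W) hazard probability

model = TemporalHazardPredictor()
window = torch.randn(1, 768, 8, 14, 14)   # 8 dates of 14x14 patch embeddings
prob_map = model(window)                  # per-patch probability of a hazard
```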
Both detection and prediction remain modular, built on TerraMind’s frozen backbone. Detection decoders, temporal predictors, and future risk scorers all leverage the same foundational embeddings and are enhanced with Thinking-in-Modalities (TiM) to better exploit multimodal information. This creates a flexible “AI store” for geospatial risk intelligence where new models can be added seamlessly.
Hazard Dataset Generation
Scaling hazard detection globally requires systematic dataset generation. Below, we outline the steps that automate this process across different hazards and inputs.
Pipeline & Preprocessing
The pipeline begins with multimodal inputs: Sentinel-1 GRD, Sentinel-2 L1C/L2A, or DEMs undergoing cropping, normalization, and co-registration before embedding extraction. Each hazard has tailored preprocessing:
- Flood Mapping (Sen1Floods11): Complex preprocessing to align Sentinel-1 and Sentinel-2 at pixel level.
- Wildfires (Sentinel-2 L2A, HLS-equivalent): Resampled to 30 m resolution to match HLS training data.
- Landslides (S2 L2A + DEM): Required co-registration of DEM and optical layers.
This modularity ensures domain-specific accuracy while maintaining backbone consistency.
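A minimal sketch of this modular design, with illustrative step names; the real chains wrap co-registration and resampling tooling that is only hinted at in comments:

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Per-band standardization before embedding extraction."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True) + 1e-6
    return (x - mean) / std

def center_crop(x: np.ndarray, size: int = 224) -> np.ndarray:
    """Crop a (C, H, W) tile to the model's expected input size."""
    _, h, w = x.shape
    top, left = (h - size) // 2, (w - size) // 2
    return x[:, top:top + size, left:left + size]

# Hazard-specific chains; the heavier steps are noted in comments.
PREPROCESSING = {
    "flood": [center_crop, normalize],        # plus S1/S2 pixel-level alignment
    "wildfire": [center_crop, normalize],     # plus resampling to 30 m (HLS)
    "landslide": [center_crop, normalize],    # plus DEM/optical co-registration
}

def run(hazard: str, tile: np.ndarray) -> np.ndarray:
    for step in PREPROCESSING[hazard]:
        tile = step(tile)
    return tile

tile = run("flood", np.random.rand(6, 256, 256).astype("float32"))
```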
Deployment
Dataset generation is driven by a simple query interface:
- Inputs: region coordinates, temporal window, hazard type
- Process: automatic data retrieval (via AWS registry), preprocessing, temporal sequence construction
- Outputs: segmentation masks of flood extents, burn scars, or unstable slopes
These outputs serve two purposes: immediate event detection and population of the historical hazard archive that fuels predictive modeling.
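A minimal sketch of that query interface, with hypothetical names throughout; the retrieval, preprocessing, and decoding callables are injected so the sketch stays backend-agnostic:

```python
from dataclasses import dataclass

@dataclass
class HazardQuery:
    """Inputs to the dataset-generation pipeline (field names are illustrative)."""
    bbox: tuple          # (min_lon, min_lat, max_lon, max_lat)
    start: str           # e.g. "2022-01-01"
    end: str             # e.g. "2025-10-31"
    hazard: str          # "flood" | "wildfire" | "landslide"

def generate_dataset(query, fetch_scenes, preprocess, decode):
    """Hypothetical orchestration: retrieval (e.g. from the AWS registry),
    preprocessing, and decoding into segmentation masks."""
    masks = []
    for scene in fetch_scenes(query.bbox, query.start, query.end):
        tile = preprocess(query.hazard, scene)
        masks.append(decode(query.hazard, tile))
    return masks  # flood extents, burn scars, or unstable slopes
```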
Scalability and Generalization
A core strength of the architecture is scalability. New decoders can be added without retraining the entire system, allowing growth to new hazards or related geospatial tasks.
The framework also supports generalization. By relying on consistent embeddings across modalities, it adapts from detection (mapping past events) to prediction (anticipating the risk of future ones). This flexibility positions TerraMind as a data-driven compass for navigating risk, adaptable to emerging climate-driven threats.
Update since 1st of October
We have continued to expand the hazard-dataset generation pipeline and refine the detection models. Updated quantitative metrics will be shared once we surpass previous performance benchmarks. Below are two key developments since the beginning of October.
- Operational Insights: Why storing TerraMind’s embeddings matters
Running our full pipeline on a new test region, the area surrounding the Usine hydroélectrique EDF de Salignac in France, highlighted a major structural advantage of using a foundation model.
Averaging across all runs, we observed that:
- More than 95% of total processing time was spent on preprocessing:
  - 20.4%: data download
  - 78.6%: co-registration
- Less than 5% was spent on actual model inference.
This confirms the importance of investing in reusable embeddings and reinforces the value of TerraMind as a foundation model. Once the encoder has produced embeddings for a region, they are fully reusable: the same embeddings that support flood detection also support any other hazard analysis, such as wildfires or landslides, without repeating the heavy preprocessing steps. In practice, deploying the encoder once and reusing its embeddings can reduce the total compute footprint of adding a new detection model by up to 95%; each new decoder requires only a small fraction of the original computation time.
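A minimal sketch of this compute-once, reuse-everywhere pattern (the cache layout, key scheme, and function names are illustrative, not our production store):

```python
import hashlib
from pathlib import Path

import torch

CACHE = Path("embeddings")
CACHE.mkdir(exist_ok=True)

def embedding_key(region_id: str, date: str, modality: str) -> Path:
    """Deterministic cache path per (region, date, modality)."""
    digest = hashlib.sha1(f"{region_id}|{date}|{modality}".encode()).hexdigest()
    return CACHE / f"{digest}.pt"

def get_embeddings(encoder, tile, region_id, date, modality):
    """Encode once, then reuse: later decoders skip download, co-registration,
    and the encoder forward pass entirely."""
    path = embedding_key(region_id, date, modality)
    if path.exists():
        return torch.load(path)      # cache hit: no preprocessing, no encoder
    with torch.no_grad():
        emb = encoder(tile)          # cache miss: encode and persist
    torch.save(emb, path)
    return emb
```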
- From Water Detection to True Flood Mapping
Running the flood model on the Salignac region (120 Sentinel-1 and Sentinel-2 pairs, January 2022 → October 2025) provided a second important insight: the model detects water, not floods. To extract actual flood events from these raw water detections, we applied a post-processing workflow to separate permanent water from unusual water presence.
Floods are defined by their exceptional nature: a pixel is considered flooded only if water appears where it is historically rare. For this study, we followed a simple and widely used rule from global flood mapping: a pixel is flagged as flooded if its historical water frequency is below 5–10%.
This thresholding approach is commonly used when water-depth data is unavailable. It avoids relying on static river or lake masks, which can be outdated or inaccurate.
This aligns with how flood datasets are defined operationally. For example, in the insurance sector, a flood is understood as: “The covering of normally dry land by water from rivers, lakes, the sea, or heavy precipitation, beyond typical seasonal fluctuations.”
Distinguishing normal water extent from abnormal expansion is therefore essential for any credible flood-risk product.
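A minimal sketch of that separation, assuming a boolean stack of per-date water detections and the 5–10% rarity rule above (the threshold and shapes are illustrative):

```python
import numpy as np

def flood_mask(water_masks: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Separate exceptional water from permanent water.

    water_masks: (T, H, W) boolean stack of per-date water detections.
    A pixel in the latest scene is flagged as flooded only if water is present
    now AND historically rare (frequency below the threshold), which excludes
    rivers, lakes, and the sea without relying on static water masks."""
    history, latest = water_masks[:-1], water_masks[-1]
    frequency = history.mean(axis=0)      # fraction of past dates with water
    rarely_wet = frequency < threshold
    return latest & rarely_wet

# Example: 120 detection dates over a region, as in the Salignac run.
stack = np.random.rand(120, 64, 64) > 0.9
flooded = flood_mask(stack)
```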
No flood was detected here: minor overflow from the river is visible, but it does not qualify as a flood under operational definitions.
Second Update: Architectural Shifts for True Embedding Modularity
As we expand TerraMind into a generalized feature extractor for risk intelligence AI models, the reusability of our foundation model's embeddings is critical. Our recent tests proved that preprocessing is the primary computational bottleneck, making the ability to compute embeddings once and reuse them for any downstream task a core requirement.
To achieve this, we have implemented two major architectural changes to our training pipeline, focusing on how modalities are fused and how the decoder processes features.
1. The Challenge of Joint Attention
In standard multimodal transformers (single-stream fusion), all modalities are converted into tokens and concatenated into a single sequence before entering the transformer blocks. Because self-attention allows every token to influence every other token, visual tokens explicitly look at radar tokens for context.
While this yields high performance for a specific combination of inputs, it entangles the modalities and struggles with temporal misalignment. Since satellite images are often acquired on different dates (e.g., a radar image during a storm vs. an optical image days later), fusing them early forces the model to mix conflicting physical states. The resulting 768-dimensional vector for a 16x16 patch becomes a hybrid, time-confused representation (e.g., "Sentinel-1 + Sentinel-2"). If a future task requires only Sentinel-1, or a new combination like "Sentinel-1 + DEM", the previously stored embeddings cannot be used. Generating a unique embedding space for every possible combination of satellite sensors is computationally unscalable and defeats the purpose of a generalized feature extractor.
2. The Solution: "Late Fusion" Encoder
To ensure our frozen backbone generates purely independent, modular embeddings, we implemented a Late Fusion architecture.
During training, each modality is passed through the transformer encoder entirely separately. In practice, we stack the modalities along the batch dimension rather than concatenating them along the token sequence. This coding trick prevents the modalities from attending to each other within the transformer blocks, allowing us to reuse the TerraTorch framework without writing a custom transformer from scratch.
As a result, the encoder outputs pure, independent embeddings for each modality. For our flood model, this yields two distinct 768-dimensional feature maps (one for Sentinel-1, one for Sentinel-2). To prepare this for the downstream task, we apply late fusion: the independent embeddings are concatenated post-encoder (creating 1536 channels) and passed through a learned linear projection head to compress them back to the 768-dimensional format expected by the decoder.
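A minimal sketch of the batch-stacking trick, with a generic transformer encoder standing in for the frozen TerraMind backbone (layer counts and token shapes are illustrative):

```python
import torch
import torch.nn as nn

embed_dim = 768
encoder = nn.TransformerEncoder(  # stand-in for the frozen TerraMind encoder
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)
fusion_head = nn.Linear(2 * embed_dim, embed_dim)  # learned projection, 1536 -> 768

s1_tokens = torch.randn(4, 196, embed_dim)  # Sentinel-1 patch tokens
s2_tokens = torch.randn(4, 196, embed_dim)  # Sentinel-2 patch tokens

# Late fusion: stack along the BATCH axis so self-attention never crosses
# modalities; this is what lets us reuse the framework without a custom transformer.
stacked = torch.cat([s1_tokens, s2_tokens], dim=0)        # (2B, N, C)
encoded = encoder(stacked)
s1_emb, s2_emb = encoded.chunk(2, dim=0)                  # pure, reusable embeddings

# Post-encoder fusion: concatenate channels, then project back to 768 dims.
fused = fusion_head(torch.cat([s1_emb, s2_emb], dim=-1))  # (B, N, 768)
```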
3. Refined U-Net Feature Extraction
Historically, our U-Net decoders tapped into intermediary outputs from varying depths of the transformer blocks to build spatial awareness. However, to fully leverage the newly implemented post-encoder fusion head, we adapted our feature extraction strategy.
Instead of extracting from multiple intermediate layers, we now exclusively sample the final output of the last transformer block. We feed this rich, final representation into the decoder four times. A specialized network neck then reshapes and interpolates this single, highly fused output into the multi-scale feature pyramid required by the U-Net decoder.
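A minimal sketch of such a neck (the scale factors and per-level projections are illustrative, not our exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidNeck(nn.Module):
    """Turn the final transformer output into the multi-scale feature pyramid
    a U-Net decoder expects."""

    def __init__(self, embed_dim: int = 768, scales=(4.0, 2.0, 1.0, 0.5)):
        super().__init__()
        self.scales = scales
        self.proj = nn.ModuleList(nn.Conv2d(embed_dim, embed_dim, 1) for _ in scales)

    def forward(self, tokens: torch.Tensor, grid: int) -> list:
        # tokens: (B, N, C) from the LAST transformer block, reused at every level
        b, n, c = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, c, grid, grid)
        return [
            proj(F.interpolate(fmap, scale_factor=s, mode="bilinear", align_corners=False))
            for proj, s in zip(self.proj, self.scales)
        ]

neck = PyramidNeck()
pyramid = neck(torch.randn(2, 196, 768), grid=14)  # four maps: 56, 28, 14, 7 px
```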
By isolating modalities in the encoder, we have secured the long-term vision for TerraMind. We can now permanently store modality-specific embeddings in a database, knowing they can be universally combined and utilized by any lightweight hazard decoder we build in the future.
Quantitative Impact of Architectural Changes
To validate these structural changes, we evaluated the flood detection model across four architectural iterations. The goal was to measure the performance trade-off of constraining the model to ensure embedding modularity (using Late Fusion) and decoder flexibility (using only the Last Layer).
Table: Performance progression on Sen1Floods11 (Sentinel-1 + Sentinel-2)
| Architecture State | mIoU | F1 Score | Loss | Remarks |
|---|---|---|---|---|
| 1. Initial Baseline (Joint Attention, All Layers) | 0.884 | 0.936 | 0.131 | Original single-stream fusion. Embeddings are entangled. |
| 2. Last Layer Only (Joint Attention, Final Layer) | 0.873 | 0.929 | 0.155 | Expected performance drop due to the U-Net losing spatial cues from intermediate layers. |
| 3. Late Fusion Only (Late Fusion Encoder, All Layers) | 0.902 | 0.947 | 0.123 | Strongest performance. Forcing independent feature extraction before fusion creates highly robust representations. |
| 4. Late Fusion + Last Layer Only (The Final Framework) | 0.891 | 0.940 | 0.127 | Final modular architecture. Outperforms the initial baseline while unlocking 100% reusable embeddings. |
Current Stage and Future Integration
Much of the workflow is already automated. Pipelines for floods, wildfires, and landslides are operational, enabling event mapping. Temporal sequence generation is streamlined: users specify only parameters, while the system manages acquisition, preprocessing, and decoding.
The risk prediction component has been conceptually designed but not yet trained. Immediate next steps are:
- Scaling hazard detection to broader regions to build a comprehensive training database.
- Improving training data fidelity by prioritizing higher-resolution satellite imagery. For example, the burn scars model initially relied on harmonized Landsat–Sentinel (HLS, 30 m) data because it provided the ground-truth labels; future work will instead use native Sentinel-2 L2A imagery (10 m). By superimposing Sentinel-2 inputs onto the HLS-derived ground truth, we retain validated labels while benefiting from higher resolution (see the sketch after this list).
- Leveraging TiM (Thinking-in-Modalities): Flood detection already benefited from TiM LULC features, and future work will further extend TiM integration to enhance wildfire and landslide models.
- Training and validating predictive models using the detection archive as ground truth.
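A minimal sketch of that label-superposition step, assuming the HLS and Sentinel-2 grids are already co-registered (the 3x factor reflects the 30 m to 10 m resolution ratio):

```python
import numpy as np

def upsample_labels(hls_mask: np.ndarray, factor: int = 3) -> np.ndarray:
    """Nearest-neighbour upsampling of 30 m HLS labels onto a 10 m grid, so
    validated ground truth can be paired with native Sentinel-2 tiles."""
    return np.repeat(np.repeat(hls_mask, factor, axis=0), factor, axis=1)

# Example: a 224x224 HLS burn-scar mask becomes 672x672 at Sentinel-2 resolution.
labels_10m = upsample_labels(np.random.randint(0, 2, (224, 224)))
```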
Through this staged approach, TerraMind already delivers actionable hazard detection while evolving toward predictive, interactive risk intelligence.
Models
donia-metaplanet/TerraMind-Blue-Sky-Challenge
References
[1] TerraMind: Large-Scale Generative Multimodality for Earth Observation
[2] The Outcome of the 2022 Landslide4Sense Competition: Advanced Landslide Detection from Multi-Source Satellite Imagery
[3] Mapping Global Floods with 10 Years of Satellite Radar Data
[4] A Global Multimodal Flood Event Dataset with Heterogeneous Text and Multi-Source Remote Sensing Images
[5] A High-Resolution Global Flood Hazard Model
[6] Precipitation-Triggered Landslide Prediction in Nepal Using Machine Learning and Deep Learning
[7] Refined Burned-Area Mapping Protocol Using Sentinel-2 Data Increases Estimate of 2019 Indonesian Burning
[8] Forest Fire Burn Scar Mapping Based on Modified Image Super-Resolution Reconstruction via Sparse Representation
[9] Space-Time Modeling of Cascading Hazards: Chaining Wildfires, Rainfall and Landslide Events Through Machine Learning
[10] Floods in Central-Eastern Europe, September 2024. Copernicus EMS