title: 'Dissecting Groundsource: When LLMs Become Scientific Instruments'
thumbnail: https://huggingface.co/spaces/rdjarbeng/groundsource-analysis
authors:
  - user: rdjarbeng
tags:
  - flood
  - climate
  - disaster
  - geospatial
  - google
  - gemini
  - analysis
  - research

Dissecting Groundsource: When LLMs Become Scientific Instruments

A deep-dive into Google's 2.6-million-event flood dataset: what the data actually shows, which claims hold up, and why the methodology may matter more than the dataset itself.

Resources: Enriched Dataset | Full Interactive Article | Original on Zenodo


What is Groundsource?

In February 2026, Google Research released Groundsource, an open-access global dataset of 2.6 million historical flood events extracted from news articles using Gemini LLMs. The dataset was published on Zenodo with a preprint on EarthArXiv.

Google used Gemini to scan 5 million news articles across 80+ languages and generated 2.6 million geo-tagged flood events spanning 150+ countries. This is the training data behind Google's operational flash flood forecasting system.

The best existing global flash flood database (GDACS) had roughly 10,000 entries. If Groundsource genuinely delivers 2.6 million validated events, that's not an incremental improvement; it's a demonstration that LLMs can turn the world's unstructured text into structured scientific ground truth.

We downloaded the full dataset, decoded every geometry, and checked each claim against the data.

What the Data Actually Shows

The dataset is a single 667 MB Parquet file containing exactly 2,646,302 flood events. Each event has a UUID, polygon boundary (WKB geometry), area in km², start date, and end date.

Key Numbers

| Metric | Value |
| --- | --- |
| Total events | 2,646,302 |
| Null values | 0 |
| Duplicates | 0 |
| Date range | 2000-01-01 to 2026-02-03 |
| Median area | 2.05 km² |
| Peak year | 2024 (402,012 events) |

What's absent

No country column. No language of source article. No confidence score. No link to original news article. No severity classification. The dataset is deliberately minimalist: just polygons, dates, and areas.

Geographic Distribution

We decoded all 2.6M WKB geometries into lat/lon centroids:

| Region | Events | Share |
| --- | --- | --- |
| Europe | 590,603 | 22.3% |
| Southeast Asia | 488,885 | 18.5% |
| South Asia | 484,418 | 18.3% |
| North America | 412,254 | 15.6% |
| South America | 248,652 | 9.4% |
| East Asia | 179,846 | 6.8% |
| Africa | 111,053 | 4.2% |
| Other | 131,591 | 4.9% |
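
To make the decoding step concrete, here is a minimal sketch of turning one WKB polygon into a lon/lat centroid using only the standard library. It assumes plain 2D Polygon WKB (no SRID prefix, no MultiPolygon) and uses the exterior ring only. It also computes a planar centroid in degrees, which is an approximation that is reasonable for polygons a few km across. The function name is ours, and this is not necessarily how the analysis was run; in practice `shapely.wkb.loads(...).centroid` is the more usual route.

```python
import struct


def polygon_centroid_from_wkb(wkb: bytes) -> tuple[float, float]:
    """Return the planar (lon, lat) centroid of a 2D WKB Polygon's exterior ring."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, n_rings = struct.unpack_from(endian + "II", wkb, 1)
    if geom_type != 3:  # 3 is the WKB code for Polygon
        raise ValueError(f"expected Polygon (3), got geometry type {geom_type}")
    if n_rings == 0:
        raise ValueError("polygon has no rings")

    # The exterior ring comes first: a point count, then x/y doubles.
    (n_points,) = struct.unpack_from(endian + "I", wkb, 9)
    if n_points == 0:
        raise ValueError("exterior ring has no points")
    coords = struct.unpack_from(f"{endian}{2 * n_points}d", wkb, 13)
    ring = list(zip(coords[0::2], coords[1::2]))

    # Shoelace-based centroid; WKB rings are closed (first point == last point).
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0.0:
        # Degenerate ring (a point or a line): fall back to the vertex mean.
        xs, ys = zip(*ring)
        return sum(xs) / len(xs), sum(ys) / len(ys)
    return cx / (3.0 * area2), cy / (3.0 * area2)


# Hand-built test polygon: a 2 x 2 degree square (made-up coordinates).
square = [(10.0, 20.0), (12.0, 20.0), (12.0, 22.0), (10.0, 22.0), (10.0, 20.0)]
wkb = (
    struct.pack("<BII", 1, 3, 1)
    + struct.pack("<I", len(square))
    + b"".join(struct.pack("<dd", x, y) for x, y in square)
)
print(polygon_centroid_from_wkb(wkb))  # (11.0, 21.0)
```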

Temporal Growth

| Period | Events | Share |
| --- | --- | --- |
| 2000-2009 | 40,581 | 1.5% |
| 2010-2019 | 876,630 | 33.1% |
| 2020-2026 | 1,729,091 | 65.3% |

About 65% of all events fall in 2020-2026, roughly the last six years. This is likely a compound effect of more digitized global news, improved LLM extraction, and genuinely increasing flood frequency.
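
The period shares can be reproduced directly from the counts in the table above. This short check uses only the numbers quoted in this post; the variable names are ours.

```python
# Event counts per period, copied from the Temporal Growth table.
period_counts = {
    "2000-2009": 40_581,
    "2010-2019": 876_630,
    "2020-2026": 1_729_091,
}

total = sum(period_counts.values())
shares = {period: round(100 * n / total, 1) for period, n in period_counts.items()}

print(f"total events: {total:,}")  # total events: 2,646,302
for period, share in shares.items():
    print(f"{period}: {share}%")
```

The three periods sum to the full 2,646,302 events, so the split accounts for every row in the dataset.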

Claim Verification

✅ "2.6 million geo-tagged events"

CONFIRMED. 2,646,302 events, all with polygon geometry and dates. Zero nulls, zero duplicates.

⚠️ "GDACS had roughly 10,000 entries"

Plausible, but needs context. GDACS tracks significant disasters (affecting 100+ people). EM-DAT covers ~22K total natural disasters since 1900. The 260× scale increase is real, but GDACS events are curated expert assessments while Groundsource captures every reported flood, so the two work at fundamentally different granularities.

⚠️ "5 million articles across 80 languages"

CANNOT VERIFY FROM DATASET. No language column, no source article metadata. The paper needs to provide this evidence directly.

✅ Africa coverage gap

CONFIRMED AND QUANTIFIED. 4.2% of events vs ~17% of world population, a 4× underrepresentation.
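
Both ratios quoted in this section are simple divisions, sketched here with the figures from this post. The 10,000 GDACS count and the 17% population share are the approximate values quoted above, not independently sourced numbers.

```python
total_events = 2_646_302
gdacs_entries = 10_000        # approximate figure quoted in the claim

africa_event_share = 4.2      # percent of Groundsource events
africa_population_share = 17  # percent of world population (approximate)

scale_increase = total_events / gdacs_entries
underrepresentation = africa_population_share / africa_event_share

print(f"scale increase: {scale_increase:.0f}x")            # scale increase: 265x
print(f"underrepresentation: {underrepresentation:.2f}x")  # underrepresentation: 4.05x
```

Using the rounded 2.6 million figure instead of the exact count gives the 260× quoted in the text.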

The Real-Time Question

If the dataset is a static archive of old news, how does it warn about a flood happening tomorrow?

Groundsource is training data, not forecast input. The model studied 2.6 million historical events alongside the weather conditions at each location at the time. It learned the patterns. For daily forecasting, it ingests live feeds from ECMWF, NASA, and NOAA and checks if today's weather matches a learned pattern.

TRAINING: Groundsource labels + Historical weather → Train model
OPERATIONAL: Live weather feeds → Frozen model → "Flash flood likely here tomorrow"

The dataset doesn't need updating for real-time forecasting, just as ImageNet doesn't need daily updates for an image classifier. Periodic retraining would improve performance, but the v1 snapshot is sufficient for operational deployment.
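
The train-once, predict-daily split can be illustrated with a deliberately tiny stand-in. This is not Google's model: the single rainfall feature, the threshold "model", and all numbers below are invented for illustration. The only point is that the labels are used at training time and the frozen model then sees only live weather.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdModel:
    """Toy frozen model: predicts a flood when rainfall reaches a learned threshold."""
    rain_mm_threshold: float

    def predict(self, rain_mm: float) -> bool:
        return rain_mm >= self.rain_mm_threshold


def train(history: list[tuple[float, bool]]) -> ThresholdModel:
    """Pick the threshold that classifies the labeled history best.

    Each history item is (rainfall in mm, whether a flood event was recorded).
    Ties go to the lowest candidate threshold.
    """
    candidates = sorted({rain for rain, _ in history})

    def correct(threshold: float) -> int:
        return sum((rain >= threshold) == flooded for rain, flooded in history)

    return ThresholdModel(max(candidates, key=correct))


# TRAINING: flood labels (the role Groundsource plays) + historical weather.
history = [(5.0, False), (12.0, False), (30.0, False),
           (48.0, True), (55.0, True), (80.0, True)]
model = train(history)

# OPERATIONAL: only live weather goes in; the labels are no longer needed.
live_feed = {"site_a": 20.0, "site_b": 62.0}
alerts = {site: model.predict(rain) for site, rain in live_feed.items()}
print(model.rain_mm_threshold, alerts)
```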

The Africa Gap

Africa represents 4.2% of events despite ~17% of world population. The causes are structural:

  1. Fewer digitized news sources: African outlets are less indexed by Google News, and local radio is invisible to text mining
  2. Language gap: Africa has ~2,000 languages; Gemini handles ~80
  3. Urban reporting bias: rural flash floods may never appear in any outlet
  4. The paradox: the regions with the least monitoring infrastructure are where this methodology works worst

Concrete Approaches to Fix It

  1. Multi-source fusion: Combine satellite SAR (works everywhere) + community platforms + radio monitoring
  2. Satellite-only ground truth: Use SAR flood maps (Kuro Siwo, AI4G-Flood) as labels independent of news
  3. Synthetic augmentation: Generate flood scenarios from physics models (SAGDA shows this works for African agriculture)
  4. Transfer learning: RiverMamba already generalizes from data-rich to data-poor regions
  5. Low-resource language improvement: Fine-tune extraction models for Hausa, Amharic, Swahili

The Methodology Is The Story

The most important thing about Groundsource is not the flood data. It's the demonstration that LLMs can convert unstructured text into structured scientific ground truth at global scale.
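
One practical piece of any text-to-structured-data pipeline is validating what the extraction model emits before it is treated as ground truth. The sketch below is hypothetical: the `FloodEvent` fields, the JSON shape, and the sample strings are our own assumptions, not the schema or prompts Google used, which this post does not describe.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FloodEvent:
    """Hypothetical record an extraction model might be asked to emit."""
    place: str
    start_date: date
    end_date: date


def parse_event(llm_output: str) -> Optional[FloodEvent]:
    """Validate one JSON object from an extraction model; return None if unusable."""
    try:
        raw = json.loads(llm_output)
        event = FloodEvent(
            place=str(raw["place"]).strip(),
            start_date=date.fromisoformat(raw["start_date"]),
            end_date=date.fromisoformat(raw["end_date"]),
        )
    except (KeyError, TypeError, ValueError):
        # Malformed JSON, missing fields, or unparseable dates.
        return None
    if not event.place or event.end_date < event.start_date:
        return None
    return event


# Made-up model outputs for illustration.
good = '{"place": "Accra", "start_date": "2024-06-01", "end_date": "2024-06-03"}'
reversed_dates = '{"place": "Accra", "start_date": "2024-06-03", "end_date": "2024-06-01"}'

print(parse_event(good))
print(parse_event(reversed_dates))  # None
print(parse_event("not json"))      # None
```

Rejecting a record outright rather than repairing it keeps the validator simple. A real pipeline would likely also log the rejects so that extraction errors can be measured.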

Where Else Can This Go?

| Domain | Feasibility | Why |
| --- | --- | --- |
| Disease outbreaks | 🟢 Very high | Already working (ProMED/WHO), F1 up to 0.954 |
| Conflict/displacement | 🟢 High | ACLED exists, news coverage very high |
| Pollution events | 🟡 Medium | Binary events work; continuous levels hard |
| Wildfires | 🟡 Medium | Satellite already strong; text adds context |
| Mining hazards | 🟡 Medium | Rare events, chronic vs acute exposure |
| Drought/agriculture | 🔴 Lower | Slow onset, not event-based |

The methodology works best for binary, acute, widely-reported events paired with continuously-available physical observations.

Tutorial: The Enriched Dataset

We've published an enriched version with decoded coordinates:

from datasets import load_dataset

ds = load_dataset("rdjarbeng/groundsource-enriched")
df = ds['train'].to_pandas()

# Columns: uuid, area_km2, start_date, end_date,
#          longitude, latitude, year, month, duration_days, region

# Analyze Africa gap
africa = df[df['region'] == 'Africa']
print(f"Africa: {len(africa):,} events ({100*len(africa)/len(df):.1f}%)")

# Events per year by region (one row per year, one column per region)
yearly = df.groupby(['year', 'region']).size().unstack(fill_value=0)
print(yearly.tail(5))
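
Because the enriched dataset carries `latitude` and `longitude` columns, a proximity query is a natural next step. The sketch below uses only the standard library and works on plain dict rows, which you could get from the DataFrame with `df.to_dict("records")`. The helper names and the sample rows are ours. For millions of rows a vectorized or spatially indexed approach would be faster.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def events_within(rows: list[dict], lat: float, lon: float, radius_km: float) -> list[dict]:
    """Keep rows whose centroid lies within radius_km of (lat, lon)."""
    return [
        row for row in rows
        if haversine_km(lat, lon, row["latitude"], row["longitude"]) <= radius_km
    ]


# Made-up rows using the enriched dataset's column names.
rows = [
    {"uuid": "a", "latitude": 0.0, "longitude": 0.5},
    {"uuid": "b", "latitude": 0.0, "longitude": 2.0},
]
nearby = events_within(rows, lat=0.0, lon=0.0, radius_km=100.0)
print([row["uuid"] for row in nearby])  # ['a']
```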

Resources

Key Papers

  • RiverMamba: Global flood forecasting with Mamba SSM
  • Epidemic IE: LLMs for epidemic surveillance from ProMED/WHO
  • eKG from WHO DONs: Knowledge graph from disease outbreak news
  • DengueNet: Satellite-based disease prediction for resource-limited countries
  • SAGDA: Synthetic data for Africa's agricultural data gap
  • AirPhyNet: Physics-guided air quality prediction

The original Groundsource dataset is by Google Research, licensed CC-BY 4.0. This analysis and the enriched dataset are by rdjarbeng.