Spaces:

rdjarbeng
/

groundsource-analysis

Running

App Files Files Community

groundsource-analysis / blog_post.md

rdjarbeng

Add markdown version of blog post for community blog publishing

9894750 verified about 1 month ago

preview code

raw

history blame contribute delete

9.13 kB

metadata

title: 'Dissecting Groundsource: When LLMs Become Scientific Instruments'
thumbnail: https://huggingface.co/spaces/rdjarbeng/groundsource-analysis
authors:
  - user: rdjarbeng
tags:
  - flood
  - climate
  - disaster
  - geospatial
  - google
  - gemini
  - analysis
  - research

Dissecting Groundsource: When LLMs Become Scientific Instruments

A deep-dive into Google's 2.6-million-event flood dataset — what the data actually shows, what claims hold up, and why the methodology may matter more than the dataset itself.

Resources: Enriched Dataset | Full Interactive Article | Original on Zenodo

What is Groundsource?

In February 2026, Google Research released Groundsource — an open-access global dataset of 2.6 million historical flood events extracted from news articles using Gemini LLMs. The dataset was published on Zenodo with a preprint on EarthArxiv.

Google used Gemini to scan 5 million news articles across 80+ languages and generated 2.6 million geo-tagged flood events spanning 150+ countries. This is the training data behind Google's operational flash flood forecasting system.

The best existing global flash flood database (GDACS) had roughly 10,000 entries. If Groundsource genuinely delivers 2.6 million validated events, that's not an incremental improvement — it's a demonstration that LLMs can turn the world's unstructured text into structured scientific ground truth.

We downloaded the full dataset, decoded every geometry, and verified the claims.

What the Data Actually Shows

The dataset is a single 667 MB Parquet file containing exactly 2,646,302 flood events. Each event has a UUID, polygon boundary (WKB geometry), area in km², start date, and end date.

Key Numbers

Metric	Value
Total events	2,646,302
Null values	0
Duplicates	0
Date range	2000-01-01 to 2026-02-03
Median area	2.05 km²
Peak year	2024 (402,012 events)

What's absent

No country column. No language of source article. No confidence score. No link to original news article. No severity classification. The dataset is deliberately minimalist — just polygons, dates, and areas.

Geographic Distribution

We decoded all 2.6M WKB geometries into lat/lon centroids:

Region	Events	Share
Europe	590,603	22.3%
Southeast Asia	488,885	18.5%
South Asia	484,418	18.3%
North America	412,254	15.6%
South America	248,652	9.4%
East Asia	179,846	6.8%
Africa	111,053	4.2%
Other	131,591	4.9%

Temporal Growth

Period	Events	Share
2000-2009	40,581	1.5%
2010-2019	876,630	33.1%
2020-2026	1,729,091	65.3%

65% of all data comes from the last 6 years — a compound effect of more digitized global news, improved LLM extraction, and genuinely increasing flood frequency.

Claim Verification

✅ "2.6 million geo-tagged events"

CONFIRMED. 2,646,302 events, all with polygon geometry and dates. Zero nulls, zero duplicates.

⚠️ "GDACS had roughly 10,000 entries"

Plausible, but needs context. GDACS tracks significant disasters (affecting 100+ people). EM-DAT covers ~22K total natural disasters since 1900. The 260× scale increase is real, but GDACS events are curated expert assessments while Groundsource captures every reported flood — fundamentally different granularities.

⚠️ "5 million articles across 80 languages"

CANNOT VERIFY FROM DATASET. No language column, no source article metadata. The paper needs to provide this evidence directly.

✅ Africa coverage gap

CONFIRMED AND QUANTIFIED. 4.2% of events vs ~17% of world population — a 4× underrepresentation.

The Real-Time Question

If the dataset is a static archive of old news, how does it warn about a flood happening tomorrow?

Groundsource is training data, not forecast input. The model studied 2.6 million historical events alongside the weather conditions at each location at the time. It learned the patterns. For daily forecasting, it ingests live feeds from ECMWF, NASA, and NOAA and checks if today's weather matches a learned pattern.

TRAINING: Groundsource labels + Historical weather → Train model
OPERATIONAL: Live weather feeds → Frozen model → "Flash flood likely here tomorrow"

The dataset doesn't need updating for real-time forecasting, just as ImageNet doesn't need daily updates for an image classifier. Periodic retraining would improve performance, but the v1 snapshot is sufficient for operational deployment.

The Africa Gap

Africa represents 4.2% of events despite ~17% of world population. The causes are structural:

Fewer digitized news sources — African outlets less indexed by Google News; local radio invisible to text mining
Language gap — Africa has ~2,000 languages; Gemini handles ~80
Urban reporting bias — rural flash floods may never appear in any outlet
The paradox: the regions with the least monitoring infrastructure are where this methodology works worst

Concrete Approaches to Fix It

Multi-source fusion — Combine satellite SAR (works everywhere) + community platforms + radio monitoring
Satellite-only ground truth — Use SAR flood maps (Kuro Siwo, AI4G-Flood) as labels independent of news
Synthetic augmentation — Generate flood scenarios from physics models (SAGDA shows this works for African agriculture)
Transfer learning — RiverMamba already generalizes from data-rich to data-poor regions
Low-resource language improvement — Fine-tune extraction models for Hausa, Amharic, Swahili

The Methodology Is The Story

The most important thing about Groundsource is not the flood data — it's the demonstration that LLMs can convert unstructured text into structured scientific ground truth at global scale.

Where Else Can This Go?

Domain	Feasibility	Why
Disease outbreaks	🟢 Very high	Already working (ProMED/WHO) — F1 up to 0.954
Conflict/displacement	🟢 High	ACLED exists, news coverage very high
Pollution events	🟡 Medium	Binary events work; continuous levels hard
Wildfires	🟡 Medium	Satellite already strong; text adds context
Mining hazards	🟡 Medium	Rare events, chronic vs acute exposure
Drought/agriculture	🔴 Lower	Slow onset, not event-based

The methodology works best for binary, acute, widely-reported events paired with continuously-available physical observations.

Tutorial: The Enriched Dataset

We've published an enriched version with decoded coordinates:

from datasets import load_dataset

ds = load_dataset("rdjarbeng/groundsource-enriched")
df = ds['train'].to_pandas()

# Columns: uuid, area_km2, start_date, end_date,
#          longitude, latitude, year, month, duration_days, region

# Analyze Africa gap
africa = df[df['region'] == 'Africa']
print(f"Africa: {len(africa):,} events ({100*len(africa)/len(df):.1f}%)")

# Time series by region
monthly = df.groupby(['year', 'region']).size().unstack(fill_value=0)
print(monthly.tail(5))

Resources

📊 Enriched Dataset — Decoded coordinates, region classification
🌐 Full Interactive Article — Complete analysis with diagrams
💾 Zenodo Original — CC-BY 4.0
📄 EarthArxiv Preprint
📰 Google Blog
🔬 Google Research Blog

Key Papers

RiverMamba — Global flood forecasting with Mamba SSM
Epidemic IE — LLMs for epidemic surveillance from ProMED/WHO
eKG from WHO DONs — Knowledge graph from disease outbreak news
DengueNet — Satellite-based disease prediction for resource-limited countries
SAGDA — Synthetic data for Africa's agricultural data gap
AirPhyNet — Physics-guided air quality prediction

The original Groundsource dataset is by Google Research, licensed CC-BY 4.0. This analysis and enriched dataset by rdjarbeng.