title: 'Dissecting Groundsource: When LLMs Become Scientific Instruments'
thumbnail: https://huggingface.co/spaces/rdjarbeng/groundsource-analysis
authors:
- user: rdjarbeng
tags:
- flood
- climate
- disaster
- geospatial
- google
- gemini
- analysis
- research
Dissecting Groundsource: When LLMs Become Scientific Instruments
A deep-dive into Google's 2.6-million-event flood dataset: what the data actually shows, what claims hold up, and why the methodology may matter more than the dataset itself.
Resources: Enriched Dataset | Full Interactive Article | Original on Zenodo
What is Groundsource?
In February 2026, Google Research released Groundsource, an open-access global dataset of 2.6 million historical flood events extracted from news articles using Gemini LLMs. The dataset was published on Zenodo with a preprint on EarthArXiv.
According to Google, Gemini scanned 5 million news articles across 80+ languages to produce 2.6 million geo-tagged flood events spanning 150+ countries. The result is the training data behind Google's operational flash flood forecasting system.
The best existing global flash flood database (GDACS) had roughly 10,000 entries. If Groundsource genuinely delivers 2.6 million validated events, that's not an incremental improvement: it's a demonstration that LLMs can turn the world's unstructured text into structured scientific ground truth.
We downloaded the full dataset, decoded every geometry, and checked which claims can be verified from the data.
What the Data Actually Shows
The dataset is a single 667 MB Parquet file containing exactly 2,646,302 flood events. Each event has a UUID, polygon boundary (WKB geometry), area in km², start date, and end date.
Key Numbers
| Metric | Value |
|---|---|
| Total events | 2,646,302 |
| Null values | 0 |
| Duplicates | 0 |
| Date range | 2000-01-01 to 2026-02-03 |
| Median area | 2.05 km² |
| Peak year | 2024 (402,012 events) |
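The figures in this table can be recomputed with a few pandas calls. The sketch below is one way to do it, not necessarily how the numbers above were produced. It assumes the columns are named `uuid`, `area_km2`, `start_date` and `end_date` (the names used in the enriched dataset; the original Parquet file may label them differently), and it counts duplicates by UUID. It runs on a three-row stand-in frame; for the real file, `pd.read_parquet` on a local copy would supply `df`.

```python
import pandas as pd

def summarize_events(df: pd.DataFrame) -> dict:
    """Recompute headline integrity numbers for a flood-event table.

    Column names are assumptions taken from the enriched dataset.
    """
    start = pd.to_datetime(df["start_date"])
    end = pd.to_datetime(df["end_date"])
    events_per_year = start.dt.year.value_counts()
    return {
        "total_events": len(df),
        "null_values": int(df.isna().sum().sum()),
        "duplicate_uuids": int(df["uuid"].duplicated().sum()),
        "earliest_start": start.min().date().isoformat(),
        "latest_end": end.max().date().isoformat(),
        "median_area_km2": float(df["area_km2"].median()),
        "peak_year": int(events_per_year.idxmax()),
    }

# Hypothetical stand-in rows, only to show the function running.
sample = pd.DataFrame({
    "uuid": ["a", "b", "c"],
    "area_km2": [1.0, 2.0, 9.0],
    "start_date": ["2023-05-01", "2024-06-10", "2024-07-02"],
    "end_date": ["2023-05-02", "2024-06-12", "2024-07-03"],
})
summary = summarize_events(sample)
print(summary)
```

Checking duplicates on `uuid` alone is a design choice; calling `df.duplicated()` with no subset would instead flag rows that are identical in every column.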
What's absent
No country column. No language of source article. No confidence score. No link to original news article. No severity classification. The dataset is deliberately minimalist: just polygons, dates, and areas.
Geographic Distribution
We decoded all 2.6M WKB geometries into lat/lon centroids:
| Region | Events | Share |
|---|---|---|
| Europe | 590,603 | 22.3% |
| Southeast Asia | 488,885 | 18.5% |
| South Asia | 484,418 | 18.3% |
| North America | 412,254 | 15.6% |
| South America | 248,652 | 9.4% |
| East Asia | 179,846 | 6.8% |
| Africa | 111,053 | 4.2% |
| Other | 131,591 | 4.9% |
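For readers unfamiliar with WKB, here is a minimal sketch of what decoding a geometry into a centroid involves. It is an illustration under stated assumptions, not the code used for this analysis: it handles only a plain 2D WKB Polygon, uses the exterior ring, and computes a planar centroid directly in degrees. In practice a library such as shapely (`shapely.wkb.loads`) would be the usual route and would also cover multipolygons and holes. The function name and the test square are hypothetical.

```python
import struct

def polygon_centroid_from_wkb(wkb: bytes) -> tuple[float, float]:
    """Return the planar (lon, lat) centroid of a 2D WKB Polygon's exterior ring."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, n_rings = struct.unpack_from(endian + "II", wkb, 1)
    if geom_type != 3 or n_rings == 0:
        raise ValueError("this sketch only handles non-empty 2D Polygons")
    # The first ring is the exterior ring: a point count, then x/y doubles.
    (n_points,) = struct.unpack_from(endian + "I", wkb, 9)
    coords = struct.unpack_from(f"{endian}{2 * n_points}d", wkb, 13)
    ring = list(zip(coords[0::2], coords[1::2]))

    # Shoelace formula for the area-weighted centroid of a closed ring.
    twice_area = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        # Degenerate ring (zero area): fall back to the vertex mean.
        xs, ys = zip(*ring)
        return sum(xs) / len(xs), sum(ys) / len(ys)
    return cx / (3 * twice_area), cy / (3 * twice_area)

# Hypothetical test geometry: a 2 x 2 degree square with a closed ring.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (0.0, 0.0)]
square_wkb = (
    struct.pack("<BII", 1, 3, 1)            # little-endian, Polygon, 1 ring
    + struct.pack("<I", len(square))         # points in the ring
    + struct.pack("<10d", *[v for point in square for v in point])
)
centroid = polygon_centroid_from_wkb(square_wkb)
print(centroid)
```

With a median event area of about 2 km², treating longitude and latitude as planar coordinates should introduce little error for most polygons, but it is still an approximation.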
Temporal Growth
| Period | Events | Share |
|---|---|---|
| 2000-2009 | 40,581 | 1.5% |
| 2010-2019 | 876,630 | 33.1% |
| 2020-2026 | 1,729,091 | 65.3% |
About 65% of all events fall in the last six years of the record. This is likely a compound effect of more digitized global news, improved LLM extraction, and genuinely increasing flood frequency, though the dataset alone cannot separate these factors.
Claim Verification
✅ "2.6 million geo-tagged events"
CONFIRMED. 2,646,302 events, all with polygon geometry and dates. Zero nulls, zero duplicates.
⚠️ "GDACS had roughly 10,000 entries"
Plausible, but needs context. GDACS tracks significant disasters (affecting 100+ people). EM-DAT covers ~22K total natural disasters since 1900. The 260× scale increase is real, but GDACS events are curated expert assessments while Groundsource captures every reported flood: fundamentally different granularities.
⚠️ "5 million articles across 80 languages"
CANNOT VERIFY FROM DATASET. No language column, no source article metadata. The paper needs to provide this evidence directly.
✅ Africa coverage gap
CONFIRMED AND QUANTIFIED. 4.2% of events vs ~17% of world population, a roughly 4× underrepresentation.
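The two ratios quoted in this section follow directly from numbers already given in the article. A short sketch of the arithmetic, taking the "roughly 10,000" GDACS figure and the ~17% population share at face value:

```python
total_events = 2_646_302          # from the Key Numbers table
gdacs_entries = 10_000            # "roughly 10,000", per the claim being checked
africa_events = 111_053           # from the Geographic Distribution table
africa_population_share = 0.17    # approximate share quoted in the article

scale_factor = total_events / gdacs_entries
africa_event_share = africa_events / total_events
underrepresentation = africa_population_share / africa_event_share

print(f"Scale vs GDACS: about {scale_factor:.0f}x")
print(f"Africa share of events: {100 * africa_event_share:.1f}%")
print(f"Underrepresentation vs population: about {underrepresentation:.1f}x")
```

The scale factor comes out near 265, which rounds to the 260× quoted above when the event count is taken as 2.6 million.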
The Real-Time Question
If the dataset is a static archive of old news, how does it warn about a flood happening tomorrow?
Groundsource is training data, not forecast input. The model studied 2.6 million historical events alongside the weather conditions at each location at the time. It learned the patterns. For daily forecasting, it ingests live feeds from ECMWF, NASA, and NOAA and checks if today's weather matches a learned pattern.
TRAINING: Groundsource labels + Historical weather → Train model
OPERATIONAL: Live weather feeds → Frozen model → "Flash flood likely here tomorrow"
The dataset doesn't need updating for real-time forecasting, just as ImageNet doesn't need daily updates for an image classifier. Periodic retraining would improve performance, but the v1 snapshot is sufficient for operational deployment.
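The split between training and operation can be made concrete with a deliberately tiny toy model. Everything in the sketch below is an assumption made for illustration: the single rainfall feature, the logistic regression, the function names, and the made-up numbers. It says nothing about the architecture or inputs of Google's actual forecasting system. The point is only the shape of the workflow: labels are used once to fit parameters, and afterwards only weather observations are needed.

```python
import math

def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_flood_model(rain_mm, flooded, epochs=2000, lr=1.0):
    """TRAINING: fit P(flood) = sigmoid(w * rain / scale + b) by gradient descent."""
    scale = max(rain_mm)  # normalise rainfall to the range [0, 1]
    w, b, n = 0.0, 0.0, len(rain_mm)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for rain, label in zip(rain_mm, flooded):
            error = _sigmoid(w * rain / scale + b) - label
            grad_w += error * rain / scale
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    # Returned as a tuple so the fitted parameters cannot be modified later.
    return (w, b, scale)

def flood_probability(model, rain_mm: float) -> float:
    """OPERATIONAL: score one live observation with the frozen parameters."""
    w, b, scale = model
    return _sigmoid(w * rain_mm / scale + b)

# Hypothetical history: rainfall at a location, and whether a flood label exists.
historical_rain = [5, 10, 20, 30, 60, 80, 100, 120]
historical_flood = [0, 0, 0, 0, 1, 1, 1, 1]
model = train_flood_model(historical_rain, historical_flood)

# Hypothetical live feed: no flood labels are needed at this stage.
live_feed = {"site_a": 110.0, "site_b": 8.0}
for site, rain in live_feed.items():
    print(site, round(flood_probability(model, rain), 3))
```

Retraining on a newer snapshot would simply mean calling `train_flood_model` again with more labeled history; the operational path stays unchanged.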
The Africa Gap
Africa represents 4.2% of events despite ~17% of world population. The causes are structural:
- Fewer digitized news sources: African outlets less indexed by Google News; local radio invisible to text mining
- Language gap: Africa has ~2,000 languages; Gemini handles ~80
- Urban reporting bias: rural flash floods may never appear in any outlet
- The paradox: the regions with the least monitoring infrastructure are where this methodology works worst
Concrete Approaches to Fix It
- Multi-source fusion: Combine satellite SAR (works everywhere) + community platforms + radio monitoring
- Satellite-only ground truth: Use SAR flood maps (Kuro Siwo, AI4G-Flood) as labels independent of news
- Synthetic augmentation: Generate flood scenarios from physics models (SAGDA shows this works for African agriculture)
- Transfer learning: RiverMamba already generalizes from data-rich to data-poor regions
- Low-resource language improvement: Fine-tune extraction models for Hausa, Amharic, Swahili
The Methodology Is The Story
The most important thing about Groundsource is not the flood data. It's the demonstration that LLMs can convert unstructured text into structured scientific ground truth at global scale.
Where Else Can This Go?
| Domain | Feasibility | Why |
|---|---|---|
| Disease outbreaks | 🟢 Very high | Already working (ProMED/WHO), with F1 up to 0.954 |
| Conflict/displacement | 🟢 High | ACLED exists, news coverage very high |
| Pollution events | 🟡 Medium | Binary events work; continuous levels hard |
| Wildfires | 🟡 Medium | Satellite already strong; text adds context |
| Mining hazards | 🟡 Medium | Rare events, chronic vs acute exposure |
| Drought/agriculture | 🔴 Lower | Slow onset, not event-based |
The methodology works best for binary, acute, widely-reported events paired with continuously-available physical observations.
Tutorial: The Enriched Dataset
We've published an enriched version with decoded coordinates:
```python
from datasets import load_dataset

# Load the enriched dataset from the Hugging Face Hub and convert it to pandas
ds = load_dataset("rdjarbeng/groundsource-enriched")
df = ds["train"].to_pandas()

# Columns: uuid, area_km2, start_date, end_date,
#          longitude, latitude, year, month, duration_days, region

# Quantify the Africa gap
africa = df[df["region"] == "Africa"]
print(f"Africa: {len(africa):,} events ({100 * len(africa) / len(df):.1f}%)")

# Yearly event counts per region (rows: year, columns: region)
yearly = df.groupby(["year", "region"]).size().unstack(fill_value=0)
print(yearly.tail(5))
```
Resources
- Enriched Dataset: Decoded coordinates, region classification
- Full Interactive Article: Complete analysis with diagrams
- Zenodo Original: CC-BY 4.0
- EarthArXiv Preprint
- Google Blog
- Google Research Blog
Key Papers
- RiverMamba: Global flood forecasting with Mamba SSM
- Epidemic IE: LLMs for epidemic surveillance from ProMED/WHO
- eKG from WHO DONs: Knowledge graph from disease outbreak news
- DengueNet: Satellite-based disease prediction for resource-limited countries
- SAGDA: Synthetic data for Africa's agricultural data gap
- AirPhyNet: Physics-guided air quality prediction
The original Groundsource dataset is by Google Research, licensed CC-BY 4.0. This analysis and the enriched dataset are by rdjarbeng.