rdjarbeng commited on
Commit
9894750
·
verified ·
1 Parent(s): 19705f9

Add markdown version of blog post for community blog publishing

Files changed (1)
  1. blog_post.md +182 -0
blog_post.md ADDED
@@ -0,0 +1,182 @@
---
title: "Dissecting Groundsource: When LLMs Become Scientific Instruments"
thumbnail: https://huggingface.co/spaces/rdjarbeng/groundsource-analysis
authors:
- user: rdjarbeng
tags:
- flood
- climate
- disaster
- geospatial
- google
- gemini
- analysis
- research
---

# Dissecting Groundsource: When LLMs Become Scientific Instruments

*A deep-dive into Google's 2.6-million-event flood dataset: what the data actually shows, what claims hold up, and why the methodology may matter more than the dataset itself.*

**Resources:** [Enriched Dataset](https://huggingface.co/datasets/rdjarbeng/groundsource-enriched) | [Full Interactive Article](https://huggingface.co/spaces/rdjarbeng/groundsource-analysis) | [Original on Zenodo](https://zenodo.org/records/18647054)

---

## What is Groundsource?

In February 2026, Google Research released **Groundsource**, an open-access global dataset of 2.6 million historical flood events extracted from news articles using Gemini LLMs. The dataset was published on [Zenodo](https://zenodo.org/records/18647054) with a [preprint on EarthArxiv](https://eartharxiv.org/repository/view/12082/).

Google used Gemini to scan **5 million news articles across 80+ languages** and generated **2.6 million geo-tagged flood events** spanning 150+ countries. This is the training data behind Google's operational flash flood forecasting system.

> The best existing global flash flood database (GDACS) had roughly 10,000 entries. If Groundsource genuinely delivers 2.6 million validated events, that's not an incremental improvement; it's a demonstration that LLMs can turn the world's unstructured text into structured scientific ground truth.

We downloaded the full dataset, decoded every geometry, and checked the claims against it.

## What the Data Actually Shows

The dataset is a single 667 MB Parquet file containing exactly **2,646,302 flood events**. Each event has a UUID, polygon boundary (WKB geometry), area in km², start date, and end date.

### Key Numbers

| Metric | Value |
|--------|-------|
| Total events | 2,646,302 |
| Null values | 0 |
| Duplicates | 0 |
| Date range | 2000-01-01 to 2026-02-03 |
| Median area | 2.05 km² |
| Peak year | 2024 (402,012 events) |
50
+ ### What's absent
51
+
52
+ No country column. No language of source article. No confidence score. No link to original news article. No severity classification. The dataset is deliberately minimalist β€” just polygons, dates, and areas.

### Geographic Distribution

We decoded all 2.6M WKB geometries into lat/lon centroids:

| Region | Events | Share |
|--------|--------|-------|
| Europe | 590,603 | 22.3% |
| Southeast Asia | 488,885 | 18.5% |
| South Asia | 484,418 | 18.3% |
| North America | 412,254 | 15.6% |
| South America | 248,652 | 9.4% |
| East Asia | 179,846 | 6.8% |
| **Africa** | **111,053** | **4.2%** |
| Other | 131,591 | 4.9% |
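
To make the decoding step concrete, here is a minimal pure-Python sketch of turning one WKB polygon into a centroid. It is an illustration, not the code used for the table: it assumes a plain 2D `Polygon` (WKB type 3), uses only the exterior ring, and computes a planar centroid in degrees. In practice a library such as shapely (`shapely.wkb.loads`) handles multipolygons and other cases this sketch rejects.

```python
import struct


def polygon_centroid_from_wkb(wkb: bytes) -> tuple[float, float]:
    """Return the (lon, lat) centroid of the exterior ring of a 2D WKB Polygon."""
    order = "<" if wkb[0] == 1 else ">"  # byte 0: 1 = little endian, 0 = big endian
    (geom_type,) = struct.unpack_from(order + "I", wkb, 1)
    if geom_type != 3:
        raise ValueError(f"expected Polygon (type 3), got type {geom_type}")
    (n_rings,) = struct.unpack_from(order + "I", wkb, 5)
    if n_rings == 0:
        raise ValueError("polygon has no rings")
    (n_points,) = struct.unpack_from(order + "I", wkb, 9)
    if n_points == 0:
        raise ValueError("exterior ring has no points")
    coords = struct.unpack_from(f"{order}{2 * n_points}d", wkb, 13)
    xs, ys = coords[0::2], coords[1::2]

    # Shoelace formula: twice the signed area plus the weighted centroid sums.
    area2 = cx = cy = 0.0
    for i in range(n_points - 1):
        cross = xs[i] * ys[i + 1] - xs[i + 1] * ys[i]
        area2 += cross
        cx += (xs[i] + xs[i + 1]) * cross
        cy += (ys[i] + ys[i + 1]) * cross
    if area2 == 0:
        # Degenerate ring: fall back to the mean of the vertices.
        return sum(xs) / n_points, sum(ys) / n_points
    return cx / (3 * area2), cy / (3 * area2)


# Hand-built WKB for a 2x2 square with corners (0, 0) and (2, 2).
ring = [0.0, 0.0, 2.0, 0.0, 2.0, 2.0, 0.0, 2.0, 0.0, 0.0]
square_wkb = struct.pack("<BII", 1, 3, 1) + struct.pack("<I", 5) + struct.pack("<10d", *ring)
centroid = polygon_centroid_from_wkb(square_wkb)
print(centroid)
```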

### Temporal Growth

| Period | Events | Share |
|--------|--------|-------|
| 2000-2009 | 40,581 | 1.5% |
| 2010-2019 | 876,630 | 33.1% |
| 2020-2026 | 1,729,091 | 65.3% |

About 65% of all events date from 2020 onward. This is likely a compound effect of more digitized global news, improved LLM extraction, and genuinely increasing flood frequency.
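
The shares follow directly from the per-period counts. A short check, using only the numbers in the table:

```python
# Event counts per period, copied from the Temporal Growth table.
period_counts = {
    "2000-2009": 40_581,
    "2010-2019": 876_630,
    "2020-2026": 1_729_091,
}

total = sum(period_counts.values())
shares = {period: round(100 * n / total, 1) for period, n in period_counts.items()}

print(f"total: {total:,}")
for period, share in shares.items():
    print(f"{period}: {share}%")
```

The three periods sum to the full 2,646,302 events, so no event falls outside the table.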

## Claim Verification

### ✅ "2.6 million geo-tagged events"
**CONFIRMED.** 2,646,302 events, all with polygon geometry and dates. Zero nulls, zero duplicates.

### ⚠️ "GDACS had roughly 10,000 entries"
**Plausible, but needs context.** GDACS tracks *significant* disasters (affecting 100+ people). EM-DAT covers ~22K total natural disasters since 1900. The 260× scale increase is real, but GDACS events are curated expert assessments while Groundsource captures every reported flood, so the two operate at fundamentally different granularities.

### ⚠️ "5 million articles across 80+ languages"
**CANNOT VERIFY FROM DATASET.** No language column, no source article metadata. The paper needs to provide this evidence directly.

### ✅ Africa coverage gap
**CONFIRMED AND QUANTIFIED.** 4.2% of events vs ~17% of world population, roughly a 4× underrepresentation.
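
The two ratios quoted in this section are simple divisions. The sketch below reproduces them from the figures stated in the text; the 10,000 GDACS entries and the ~17% population share are taken as given, not independently checked.

```python
total_events = 2_646_302
gdacs_entries = 10_000  # "roughly 10,000", as stated in the claim

# Scale of Groundsource relative to GDACS.
scale_factor = total_events / gdacs_entries

africa_event_share = 4.2        # percent of Groundsource events
africa_population_share = 17.0  # approximate percent of world population

# How far Africa's event share falls below its population share.
underrepresentation = africa_population_share / africa_event_share

print(f"scale vs GDACS: {scale_factor:.0f}x")
print(f"Africa underrepresentation: {underrepresentation:.1f}x")
```

With the exact event count the scale factor is closer to 265×; the quoted 260× follows from the rounded 2.6 million.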

## The Real-Time Question

> If the dataset is a static archive of old news, how does it warn about a flood happening tomorrow?

**Groundsource is training data, not forecast input.** The model studied 2.6 million historical events alongside the weather conditions at each location at the time. It learned the patterns. For daily forecasting, it ingests live feeds from ECMWF, NASA, and NOAA and checks whether today's weather matches a learned pattern.

```
TRAINING: Groundsource labels + Historical weather → Train model
OPERATIONAL: Live weather feeds → Frozen model → "Flash flood likely here tomorrow"
```

The dataset doesn't need updating for real-time forecasting, just as ImageNet doesn't need daily updates for an image classifier. Periodic retraining would improve performance, but the v1 snapshot is sufficient for operational deployment.
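
The train-once, predict-daily split can be illustrated with a deliberately tiny toy. This is not Google's forecasting model: `ThresholdModel`, `train`, and the rainfall numbers are all hypothetical, and a single rainfall threshold stands in for whatever the real system learns. The point is only that labels are needed at training time, while the frozen model sees nothing but weather afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdModel:
    """Toy stand-in for a trained model: flood if rainfall reaches a threshold."""

    threshold_mm: float

    def predict(self, rain_mm: float) -> bool:
        return rain_mm >= self.threshold_mm


def train(history: list[tuple[float, bool]]) -> ThresholdModel:
    """Pick the rainfall threshold that classifies the labeled history best."""

    def correct(threshold: float) -> int:
        return sum((rain >= threshold) == flooded for rain, flooded in history)

    candidates = sorted({rain for rain, _ in history})
    return ThresholdModel(max(candidates, key=correct))


# TRAINING: historical weather paired with flood labels (made-up values).
history = [(5, False), (12, False), (20, False), (35, True), (40, True), (55, True)]
model = train(history)

# OPERATIONAL: the frozen model only sees live weather, never new labels.
live_forecast_mm = 48
print(model.threshold_mm, model.predict(live_forecast_mm))
```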

## The Africa Gap

Africa represents **4.2% of events** despite **~17% of world population**. The causes are structural:

1. **Fewer digitized news sources:** African outlets are less indexed by Google News; local radio is invisible to text mining
2. **Language gap:** Africa has ~2,000 languages; Gemini handles ~80
3. **Urban reporting bias:** rural flash floods may never appear in any outlet
4. **The paradox:** the regions with the least monitoring infrastructure are where this methodology works worst

### Concrete Approaches to Fix It

1. **Multi-source fusion:** Combine satellite SAR (works everywhere) + community platforms + radio monitoring
2. **Satellite-only ground truth:** Use SAR flood maps ([Kuro Siwo](https://arxiv.org/abs/2311.12056), [AI4G-Flood](https://arxiv.org/abs/2411.01411)) as labels independent of news
3. **Synthetic augmentation:** Generate flood scenarios from physics models ([SAGDA](https://arxiv.org/abs/2506.13123) shows this works for African agriculture)
4. **Transfer learning:** [RiverMamba](https://arxiv.org/abs/2505.22535) already generalizes from data-rich to data-poor regions
5. **Low-resource language improvement:** Fine-tune extraction models for Hausa, Amharic, Swahili

## The Methodology Is The Story

The most important thing about Groundsource is not the flood data; it's the demonstration that **LLMs can convert unstructured text into structured scientific ground truth at global scale.**
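
What "structured ground truth" means in practice is that every model output has to pass through a strict schema before it counts as a record. The sketch below shows that validation step only. The `ExtractedEvent` schema, its field names, and the JSON shape are hypothetical, not the Groundsource pipeline, and the LLM call itself is left out.

```python
import json
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ExtractedEvent:
    """Hypothetical schema for one event extracted from a news article."""

    place: str
    start_date: date
    end_date: date


def parse_event(llm_output: str) -> Optional[ExtractedEvent]:
    """Validate one JSON object from an extraction prompt; None if unusable."""
    try:
        raw = json.loads(llm_output)
        event = ExtractedEvent(
            place=str(raw["place"]).strip(),
            start_date=date.fromisoformat(raw["start_date"]),
            end_date=date.fromisoformat(raw["end_date"]),
        )
    except (KeyError, TypeError, ValueError):
        # Covers malformed JSON, missing fields, and invalid dates.
        return None
    if not event.place or event.end_date < event.start_date:
        return None
    return event


good = parse_event('{"place": "Accra", "start_date": "2024-06-01", "end_date": "2024-06-03"}')
reversed_dates = parse_event('{"place": "Accra", "start_date": "2024-06-03", "end_date": "2024-06-01"}')
not_json = parse_event("Flooding was reported in Accra.")
print(good, reversed_dates, not_json)
```

Rejecting anything that fails the schema, rather than repairing it, keeps free-text noise out of the resulting table.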

### Where Else Can This Go?

| Domain | Feasibility | Why |
|--------|------------|-----|
| **Disease outbreaks** | 🟢 Very high | Already working (ProMED/WHO), F1 up to 0.954 |
| **Conflict/displacement** | 🟢 High | ACLED exists, news coverage very high |
| **Pollution events** | 🟡 Medium | Binary events work; continuous levels hard |
| **Wildfires** | 🟡 Medium | Satellite already strong; text adds context |
| **Mining hazards** | 🟡 Medium | Rare events, chronic vs acute exposure |
| **Drought/agriculture** | 🔴 Lower | Slow onset, not event-based |

The methodology works best for **binary, acute, widely-reported events** paired with **continuously-available physical observations.**

## Tutorial: The Enriched Dataset

We've published an enriched version with decoded coordinates:

```python
from datasets import load_dataset

# Download the enriched dataset from the Hugging Face Hub and convert it to pandas
ds = load_dataset("rdjarbeng/groundsource-enriched")
df = ds["train"].to_pandas()

# Columns: uuid, area_km2, start_date, end_date,
#          longitude, latitude, year, month, duration_days, region

# Analyze the Africa gap: share of events whose region is "Africa"
africa = df[df["region"] == "Africa"]
print(f"Africa: {len(africa):,} events ({100 * len(africa) / len(df):.1f}%)")

# Yearly event counts per region (rows: year, columns: region)
yearly = df.groupby(["year", "region"]).size().unstack(fill_value=0)
print(yearly.tail(5))
```
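
Because `longitude` and `latitude` are plain columns, a simple radius query needs no GIS library. The following is a minimal sketch using the haversine formula. The helper names and the two sample points are illustrative, and events are represented as plain dicts (for example, rows obtained via `df.to_dict("records")`).

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1: float, lat1: float, lon2: float, lat2: float) -> float:
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def events_within(events: list[dict], lon: float, lat: float, radius_km: float) -> list[dict]:
    """Return the events whose centroid lies within radius_km of (lon, lat)."""
    return [
        e for e in events
        if haversine_km(e["longitude"], e["latitude"], lon, lat) <= radius_km
    ]


# Illustrative centroids (approximate city coordinates, not dataset rows).
events = [
    {"uuid": "near-accra", "longitude": -0.19, "latitude": 5.60},
    {"uuid": "near-kumasi", "longitude": -1.62, "latitude": 6.69},
]
nearby = events_within(events, lon=-0.19, lat=5.60, radius_km=50)
print([e["uuid"] for e in nearby])
```

A row-by-row scan like this is fine for exploration; for all 2.6 million events a vectorized or spatially indexed approach would be the better choice.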

## Resources

- 📊 [Enriched Dataset](https://huggingface.co/datasets/rdjarbeng/groundsource-enriched): Decoded coordinates, region classification
- 🌐 [Full Interactive Article](https://huggingface.co/spaces/rdjarbeng/groundsource-analysis): Complete analysis with diagrams
- 💾 [Zenodo Original](https://zenodo.org/records/18647054): CC-BY 4.0
- 📄 [EarthArxiv Preprint](https://eartharxiv.org/repository/view/12082/)
- 📰 [Google Blog](https://blog.google/technology/ai/gemini-communities-predict-crises/)
- 🔬 [Google Research Blog](https://research.google/blog/protecting-cities-with-ai-driven-flash-flood-forecasting/)

### Key Papers

- [RiverMamba](https://arxiv.org/abs/2505.22535): Global flood forecasting with Mamba SSM
- [Epidemic IE](https://arxiv.org/abs/2408.14277): LLMs for epidemic surveillance from ProMED/WHO
- [eKG from WHO DONs](https://arxiv.org/abs/2509.02258): Knowledge graph from disease outbreak news
- [DengueNet](https://arxiv.org/abs/2401.11114): Satellite-based disease prediction for resource-limited countries
- [SAGDA](https://arxiv.org/abs/2506.13123): Synthetic data for Africa's agricultural data gap
- [AirPhyNet](https://arxiv.org/abs/2402.03784): Physics-guided air quality prediction

---

*The original Groundsource dataset is by Google Research, licensed CC-BY 4.0. This analysis and the enriched dataset are by [rdjarbeng](https://huggingface.co/rdjarbeng).*