<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Dissecting Groundsource: When LLMs Become Scientific Instruments</title>
<style>
:root {
--bg: #ffffff;
--text: #1a1a2e;
--accent: #2563eb;
--accent-light: #dbeafe;
--border: #e5e7eb;
--code-bg: #f3f4f6;
--card-bg: #f9fafb;
--green: #059669;
--yellow: #d97706;
--red: #dc2626;
--purple: #7c3aed;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
line-height: 1.7;
color: var(--text);
background: var(--bg);
}
.container { max-width: 820px; margin: 0 auto; padding: 2rem 1.5rem; }
/* Header */
.hero {
background: linear-gradient(135deg, #1e3a5f 0%, #0f766e 100%);
color: white;
padding: 4rem 1.5rem 3rem;
text-align: center;
}
.hero h1 { font-size: 2.2rem; margin-bottom: 1rem; line-height: 1.3; font-weight: 800; }
.hero .subtitle { font-size: 1.15rem; opacity: 0.9; max-width: 700px; margin: 0 auto 1.5rem; }
.hero .meta { font-size: 0.9rem; opacity: 0.7; }
.hero a { color: #93c5fd; }
/* Article body */
h2 {
font-size: 1.6rem;
margin: 2.5rem 0 1rem;
padding-bottom: 0.5rem;
border-bottom: 2px solid var(--accent);
color: var(--accent);
}
h3 { font-size: 1.25rem; margin: 2rem 0 0.75rem; color: #374151; }
h4 { font-size: 1.05rem; margin: 1.5rem 0 0.5rem; color: #4b5563; }
p { margin-bottom: 1rem; }
a { color: var(--accent); text-decoration: none; }
a:hover { text-decoration: underline; }
/* Tables */
table { width: 100%; border-collapse: collapse; margin: 1rem 0 1.5rem; font-size: 0.92rem; }
th, td { padding: 0.6rem 0.8rem; text-align: left; border-bottom: 1px solid var(--border); }
th { background: var(--card-bg); font-weight: 600; }
tr:hover { background: #f0f9ff; }
/* Code blocks */
pre {
background: #1e293b;
color: #e2e8f0;
padding: 1.2rem;
border-radius: 8px;
overflow-x: auto;
margin: 1rem 0 1.5rem;
font-size: 0.88rem;
line-height: 1.6;
}
code {
font-family: 'Fira Code', 'SF Mono', Consolas, monospace;
font-size: 0.88em;
}
p code, li code {
background: var(--code-bg);
padding: 0.15em 0.4em;
border-radius: 4px;
color: #c026d3;
}
/* Cards */
.card {
background: var(--card-bg);
border: 1px solid var(--border);
border-radius: 10px;
padding: 1.5rem;
margin: 1.5rem 0;
}
.card.highlight {
border-left: 4px solid var(--accent);
background: var(--accent-light);
}
.card.warning {
border-left: 4px solid var(--yellow);
background: #fffbeb;
}
.card.success {
border-left: 4px solid var(--green);
background: #ecfdf5;
}
/* Badges */
.badge {
display: inline-block;
padding: 0.2em 0.6em;
border-radius: 9999px;
font-size: 0.8rem;
font-weight: 600;
}
.badge.green { background: #d1fae5; color: #065f46; }
.badge.yellow { background: #fef3c7; color: #92400e; }
.badge.red { background: #fee2e2; color: #991b1b; }
.badge.blue { background: #dbeafe; color: #1e40af; }
/* Lists */
ul, ol { margin: 0.5rem 0 1rem 1.5rem; }
li { margin-bottom: 0.4rem; }
/* Diagram box */
.diagram {
background: #0f172a;
color: #94a3b8;
padding: 1.5rem;
border-radius: 10px;
font-family: 'Fira Code', monospace;
font-size: 0.82rem;
line-height: 1.8;
overflow-x: auto;
margin: 1.5rem 0;
white-space: pre;
}
.diagram .hl { color: #60a5fa; }
.diagram .gr { color: #34d399; }
.diagram .yw { color: #fbbf24; }
.diagram .pk { color: #f472b6; }
/* Stats row */
.stats-row {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(160px, 1fr));
gap: 1rem;
margin: 1.5rem 0;
}
.stat-card {
background: var(--card-bg);
border: 1px solid var(--border);
border-radius: 10px;
padding: 1rem;
text-align: center;
}
.stat-card .number { font-size: 1.8rem; font-weight: 800; color: var(--accent); }
.stat-card .label { font-size: 0.85rem; color: #6b7280; margin-top: 0.25rem; }
/* TOC */
.toc {
background: var(--card-bg);
border: 1px solid var(--border);
border-radius: 10px;
padding: 1.5rem 2rem;
margin: 2rem 0;
}
.toc h3 { margin-top: 0; color: var(--accent); }
.toc ol { margin-left: 1.2rem; }
.toc li { margin-bottom: 0.3rem; font-size: 0.95rem; }
/* Footer */
.footer {
margin-top: 3rem;
padding: 2rem 1.5rem;
background: var(--card-bg);
border-top: 1px solid var(--border);
text-align: center;
font-size: 0.9rem;
color: #6b7280;
}
/* Responsive */
@media (max-width: 640px) {
.hero h1 { font-size: 1.6rem; }
.hero .subtitle { font-size: 1rem; }
.stats-row { grid-template-columns: repeat(2, 1fr); }
pre { font-size: 0.8rem; padding: 0.8rem; }
}
/* Bar chart in CSS */
.bar-row { display: flex; align-items: center; margin-bottom: 0.35rem; }
.bar-label { width: 130px; font-size: 0.85rem; text-align: right; padding-right: 0.8rem; }
.bar { height: 22px; border-radius: 4px; background: var(--accent); min-width: 2px; transition: width 0.3s; display: flex; align-items: center; justify-content: flex-end; padding-right: 6px; }
.bar span { font-size: 0.72rem; color: white; font-weight: 600; }
.bar.africa { background: #dc2626; }
</style>
</head>
<body>
<div class="hero">
<h1>Dissecting Groundsource: When LLMs Become Scientific Instruments</h1>
<p class="subtitle">
A deep-dive into Google's 2.6-million-event flood dataset &mdash; what the data actually shows,
what claims hold up, and why the methodology may matter more than the dataset itself.
</p>
<p class="meta">
By <a href="https://huggingface.co/rdjarbeng">rdjarbeng</a> &middot; July 2026 &middot;
<a href="https://huggingface.co/datasets/rdjarbeng/groundsource-enriched">Dataset on HF Hub</a>
</p>
</div>
<div class="container">
<div class="toc">
<h3>📑 Contents</h3>
<ol>
<li><a href="#what-is">What is Groundsource?</a></li>
<li><a href="#data-inspection">What the Data Actually Shows (Hands-On Inspection)</a></li>
<li><a href="#claims">Claim Verification: What Holds Up, What Doesn't</a></li>
<li><a href="#realtime">The Real-Time Question: How Static Data Powers Live Forecasts</a></li>
<li><a href="#africa">The Africa Gap: Quantified</a></li>
<li><a href="#methodology">The Methodology Is The Story</a></li>
<li><a href="#transfer">Where Else Can This Go? Domain Transferability</a></li>
<li><a href="#fix-africa">Fixing The Africa Gap: Concrete Approaches</a></li>
<li><a href="#tutorial">Tutorial: Working with the Enriched Dataset</a></li>
<li><a href="#experiment">Experiment: Testing on Disease Outbreaks</a></li>
<li><a href="#resources">Resources & References</a></li>
</ol>
</div>
<!-- Section 1 -->
<h2 id="what-is">1. What is Groundsource?</h2>
<p>In February 2026, Google Research released <strong>Groundsource</strong> &mdash; an open-access global dataset of 2.6 million historical flood events extracted from news articles using Gemini LLMs. The dataset was published on <a href="https://zenodo.org/records/18647054">Zenodo</a> with a <a href="https://eartharxiv.org/repository/view/12082/">preprint on EarthArxiv</a>.</p>
<p>The key claim: Google used Gemini to scan <strong>5 million news articles across 80+ languages</strong> and generated <strong>2.6 million geo-tagged flood events</strong> spanning 150+ countries. This is the training data behind Google's operational flash flood forecasting system, announced on the <a href="https://blog.google/technology/ai/gemini-communities-predict-crises/">Google Blog</a> and <a href="https://research.google/blog/protecting-cities-with-ai-driven-flash-flood-forecasting/">Google Research Blog</a>.</p>
<div class="card highlight">
<strong>The core question:</strong> The best existing global flash flood database (GDACS) had roughly 10,000 entries.
If Groundsource genuinely delivers 2.6 million validated events, that's not an incremental improvement &mdash; it's a
demonstration that LLMs can turn the entire world's unstructured text into structured scientific ground truth.
</div>
<p>We downloaded the full dataset, decoded every geometry, and verified the claims. Here's what we found.</p>
<!-- Section 2 -->
<h2 id="data-inspection">2. What the Data Actually Shows</h2>
<p>The dataset is a single 667 MB Parquet file containing exactly <strong>2,646,302 flood events</strong>. Each event has:</p>
<table>
<tr><th>Column</th><th>Type</th><th>Description</th></tr>
<tr><td><code>uuid</code></td><td>string</td><td>Unique event identifier</td></tr>
<tr><td><code>area_km2</code></td><td>float</td><td>Flood extent area in km&sup2;</td></tr>
<tr><td><code>geometry</code></td><td>WKB binary</td><td>Polygon boundary of flood zone</td></tr>
<tr><td><code>start_date</code></td><td>string</td><td>Flood start date</td></tr>
<tr><td><code>end_date</code></td><td>string</td><td>Flood end date</td></tr>
</table>
<div class="stats-row">
<div class="stat-card">
<div class="number">2.65M</div>
<div class="label">Total Events</div>
</div>
<div class="stat-card">
<div class="number">0</div>
<div class="label">Null Values</div>
</div>
<div class="stat-card">
<div class="number">0</div>
<div class="label">Duplicates</div>
</div>
<div class="stat-card">
<div class="number">26 yrs</div>
<div class="label">Date Range (2000-2026)</div>
</div>
</div>
<h3>What's notably absent</h3>
<p>No country column. No language of source article. No confidence score. No link to the original news article. No event severity classification. The dataset is <em>deliberately minimalist</em> &mdash; just polygon geometries, dates, and areas. This keeps it clean and privacy-preserving, but makes it impossible to trace provenance or assess per-event reliability.</p>
<h3>Geographic Distribution</h3>
<p>We decoded all 2,646,302 WKB geometries into latitude/longitude centroids and classified by world region:</p>
<div style="margin: 1.5rem 0;">
<div class="bar-row"><div class="bar-label">Europe</div><div class="bar" style="width: 60%;"><span>590K (22.3%)</span></div></div>
<div class="bar-row"><div class="bar-label">Southeast Asia</div><div class="bar" style="width: 50%;"><span>489K (18.5%)</span></div></div>
<div class="bar-row"><div class="bar-label">South Asia</div><div class="bar" style="width: 49%;"><span>484K (18.3%)</span></div></div>
<div class="bar-row"><div class="bar-label">North America</div><div class="bar" style="width: 42%;"><span>412K (15.6%)</span></div></div>
<div class="bar-row"><div class="bar-label">South America</div><div class="bar" style="width: 25%;"><span>249K (9.4%)</span></div></div>
<div class="bar-row"><div class="bar-label">East Asia</div><div class="bar" style="width: 18%;"><span>180K (6.8%)</span></div></div>
<div class="bar-row"><div class="bar-label"><strong>Africa</strong></div><div class="bar africa" style="width: 11%;"><span>111K (4.2%)</span></div></div>
<div class="bar-row"><div class="bar-label">Other regions</div><div class="bar" style="width: 8%; background: #9ca3af;"><span>131K (4.9%)</span></div></div>
</div>
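<p>The decoding behind this chart is straightforward. Below is a minimal sketch of the approach, assuming the original Zenodo parquet; the bounding boxes in <code>region_of</code> are illustrative simplifications, not the exact classifier we used.</p>
<pre><code>import pandas as pd
from shapely import wkb  # shapely >= 2.0

df = pd.read_parquet("groundsource_2026.parquet")

# Decode WKB bytes into shapely polygons, then take centroids
geoms = df["geometry"].map(wkb.loads)
df["longitude"] = geoms.map(lambda g: g.centroid.x)
df["latitude"] = geoms.map(lambda g: g.centroid.y)

def region_of(lat, lon):
    # Very coarse bounding boxes (illustrative only); real
    # classification should test against country polygons.
    if -35 <= lat <= 37 and -18 <= lon <= 52:
        return "Africa"
    if 35 < lat <= 72 and -25 <= lon <= 45:
        return "Europe"
    return "Other"

df["region"] = [region_of(la, lo) for la, lo in zip(df["latitude"], df["longitude"])]
print(df["region"].value_counts())
</code></pre>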
<h3>Exponential Temporal Growth</h3>
<p>The dataset exhibits dramatic temporal skew:</p>
<table>
<tr><th>Period</th><th>Events</th><th>Share</th><th>Interpretation</th></tr>
<tr><td>2000-2009</td><td>40,581</td><td>1.5%</td><td>Sparse &mdash; limited digital news archives</td></tr>
<tr><td>2010-2019</td><td>876,630</td><td>33.1%</td><td>Ramp-up &mdash; growing online news</td></tr>
<tr><td>2020-2026</td><td>1,729,091</td><td>65.3%</td><td>Nearly two-thirds of all data since 2020</td></tr>
</table>
<p>2024 alone contributed <strong>402,012 events</strong> &mdash; more than double 2020's 198,201. This is a compound effect of more digitized global news, improved LLM extraction, and genuinely increasing flood frequency from climate change.</p>
<h3>Event Characteristics</h3>
<ul>
<li><strong>Median area: 2.05 km&sup2;</strong> &mdash; extremely localized flash floods, exactly the type physical infrastructure misses</li>
<li><strong>Max area: ~5,000 km&sup2;</strong> &mdash; appears to be capped at this threshold</li>
<li><strong>54.8% same-day events</strong> &mdash; flash floods by definition</li>
<li><strong>Max duration: 6 days</strong> &mdash; also capped; no multi-week events</li>
<li><strong>Monthly peak: July-September</strong> &mdash; Northern Hemisphere monsoon/storm season</li>
</ul>
<!-- Section 3 -->
<h2 id="claims">3. Claim Verification</h2>
<h3>✅ "2.6 million geo-tagged events"</h3>
<p><strong>CONFIRMED.</strong> Exactly 2,646,302 events, all with polygon geometry and dates. Zero nulls, zero duplicates.</p>
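<p>Anyone can re-check the null and duplicate claims in a few lines. A quick sketch, assuming the dataframe <code>df</code> loaded as in the tutorial (<a href="#tutorial">Section 9</a>):</p>
<pre><code># Verify event count, nulls, and duplicates
assert len(df) == 2_646_302
assert int(df.isna().sum().sum()) == 0  # no nulls in any column
assert df['uuid'].is_unique             # no duplicate event IDs
</code></pre>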
<h3>⚠️ "GDACS had roughly 10,000 entries"</h3>
<div class="card warning">
<strong>Plausible, but the comparison needs context.</strong> GDACS (run by JRC/UN) tracks <em>significant</em> disasters &mdash; typically affecting 100+ people. EM-DAT covers ~22,000 total natural disasters since 1900, with floods being ~5,000-8,000 records. The Dartmouth Flood Observatory has ~5,000 major flood events since 1985.
<br><br>
The 260&times; scale increase is real, but <strong>GDACS events are curated expert assessments of major disasters</strong>, while Groundsource captures <em>every reported flood from any news article</em>. These are fundamentally different granularities. A fairer framing: "Groundsource captures 260&times; more events at a fundamentally different resolution &mdash; the long tail of floods that never make expert databases."
</div>
<h3>⚠️ "5 million news articles across 80 languages"</h3>
<p><strong>CANNOT VERIFY FROM DATASET.</strong> No language column, no source article metadata, no article count. The Zenodo description says "spanning more than 150 countries" but the dataset itself provides no means to verify article counts or language coverage. The paper needs to provide this evidence.</p>
<h3>⚠️ "22% recall and 44% precision for US NWS"</h3>
<p><strong>CANNOT VERIFY FROM DATASET.</strong> These are model evaluation metrics, not dataset properties. Flash flood prediction is genuinely hard &mdash; current literature puts F1-score ceilings around 0.3-0.5 for global models &mdash; so these numbers are plausible but need cross-checking against NOAA's own verification statistics.</p>
<h3>⚠️ "82% practical precision" (from the paper)</h3>
<p><strong>CANNOT VERIFY FROM DATASET.</strong> The existing HF mirror's dataset card quotes the paper as reporting 82% practical precision in manual evaluations and 85-100% recall against GDACS severe events (2020-2026). These are <strong>paper claims</strong> that still need to be validated through peer review.</p>
<h3>✅ Africa coverage gap</h3>
<p><strong>CONFIRMED AND QUANTIFIED.</strong> Africa = 4.2% of events vs ~17% of world population. A 4&times; underrepresentation. More on this in <a href="#africa">Section 5</a>.</p>
<!-- Section 4 -->
<h2 id="realtime">4. The Real-Time Question</h2>
<div class="card highlight">
<strong>Q: If the dataset is a static archive of old news, how does it warn about a flood happening tomorrow?</strong>
</div>
<p>This is the most important conceptual question. The answer: <strong>Groundsource is training data, not forecast input.</strong></p>
<p>The model studied 2.6 million historical events alongside the weather conditions present at each location at the time. It learned the patterns. For daily forecasting, it ingests live feeds from ECMWF, NASA, and NOAA and checks if today's weather matches a learned pattern.</p>
<div class="diagram"><span class="hl">TRAINING PHASE (one-time):</span>
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Groundsource labels β”‚ β”‚ Historical weather data β”‚
β”‚ "Flood at lat X, β”‚ + β”‚ "What was weather at β”‚
β”‚ lon Y on date Z" β”‚ β”‚ lat X, lon Y on date Z?" β”‚
β”‚ (2.6M events) β”‚ β”‚ (ERA5, IMERG reanalysis) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–Ό
<span class="gr">β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”</span>
<span class="gr">β”‚ Train ML model β”‚</span>
<span class="gr">β”‚ (ED-LSTM / Mamba) β”‚</span>
<span class="gr">β”‚ Learn: weather β”‚</span>
<span class="gr">β”‚ pattern β†’ flood β”‚</span>
<span class="gr">β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜</span>
β–Ό
FROZEN MODEL
<span class="yw">OPERATIONAL PHASE (daily):</span>
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Live weather feeds β”‚
β”‚ ECMWF HRES (6-12hr) β”‚
β”‚ NASA IMERG (30min) β”‚
β”‚ NOAA GFS (6hr) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β–Ό
<span class="gr">FROZEN MODEL</span> ──► <span class="pk">"Flash flood likely at these</span>
<span class="gr">applies learned</span> <span class="pk"> locations in next 24 hours"</span>
<span class="gr">patterns</span>
</div>
<p>The dataset doesn't need updating for real-time forecasting &mdash; just as ImageNet doesn't need daily updates for an image classifier to recognize new cats. The model learned <em>what weather patterns precede floods</em> from historical data. At inference, it checks if today's weather matches those patterns.</p>
<p>This architecture is confirmed by the <a href="https://arxiv.org/abs/2505.22535">RiverMamba paper</a>, which follows the same paradigm: pretrain on GloFAS reanalysis (historical), forecast using ECMWF HRES (operational). Google's prior work (<a href="https://www.nature.com/articles/s41586-024-07145-1">Nearing et al., <em>Nature</em> 2024</a>) established the encoder-decoder LSTM architecture for global flood prediction.</p>
<p>That said, <strong>periodic retraining with fresh data would improve performance</strong> &mdash; especially for novel weather patterns from climate change. The Zenodo record is a v1 snapshot. Whether Google plans periodic releases remains unclear.</p>
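<p>The train-once / forecast-daily split is easy to see in code. Here is a toy sketch with a gradient-boosted classifier standing in for the ED-LSTM and random arrays standing in for real weather features (everything here is illustrative; Google's actual pipeline is not public):</p>
<pre><code>import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

rng = np.random.default_rng(0)

# --- TRAINING PHASE (one-time) ---
# X_hist: historical weather features per (location, date);
# y_hist: 1 where Groundsource records a flood, else 0.
X_hist = rng.normal(size=(100_000, 16))                  # stand-in for ERA5/IMERG
y_hist = (X_hist[:, 0] + X_hist[:, 1] > 2).astype(int)   # stand-in labels
model = HistGradientBoostingClassifier().fit(X_hist, y_hist)
# The model is now "frozen": Groundsource is no longer needed.

# --- OPERATIONAL PHASE (daily) ---
X_today = rng.normal(size=(5_000, 16))        # stand-in for live ECMWF/IMERG feeds
risk = model.predict_proba(X_today)[:, 1]     # P(flash flood in next 24h)
print(f"{(risk > 0.5).sum()} locations above alert threshold")
</code></pre>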
<!-- Section 5 -->
<h2 id="africa">5. The Africa Gap: Quantified</h2>
<p>Africa represents <strong>4.2% of events</strong> (111,053) despite holding <strong>~17% of world population</strong> and experiencing severe flood vulnerability. This is a <strong>4&times; underrepresentation</strong>.</p>
<p>The paper itself acknowledges this: <em>"Many countries in Africa are still lacking in ground truth beyond Groundsource, making it difficult to accurately estimate the accuracy of our model."</em></p>
<h3>Why Africa Is Underrepresented</h3>
<p>The gap is structural, not accidental:</p>
<ol>
<li><strong>Fewer digitized news sources.</strong> Many African news outlets aren't indexed by Google News. Local radio &mdash; the primary information medium in rural Africa &mdash; is invisible to text mining.</li>
<li><strong>Language gap.</strong> Africa has ~2,000 languages. Even if Gemini handles 80, the long tail of African languages (Hausa, Amharic, Yoruba, Igbo, Swahili dialects) is largely uncovered. The <a href="https://arxiv.org/abs/2506.00980">LEMONADE dataset</a> showed LLM extraction F1 drops severely for low-resource languages.</li>
<li><strong>Urban reporting bias.</strong> News articles disproportionately cover urban floods. Rural flash floods affecting small communities may never appear in any outlet.</li>
<li><strong>Digital divide.</strong> Smartphone penetration and internet access are lower, meaning fewer citizen journalism sources, fewer photos/videos that trigger news coverage.</li>
</ol>
<div class="card warning">
<strong>The paradox:</strong> The regions with the least monitoring infrastructure (and thus the greatest need for this approach) are also the regions where the news-extraction methodology works worst. The methodology's blind spots align almost exactly with existing infrastructure gaps.
</div>
<!-- Section 6 -->
<h2 id="methodology">6. The Methodology Is The Story</h2>
<p>The most important thing about Groundsource is not the flood data. It's the demonstration that <strong>LLMs can convert the world's unstructured text into structured scientific ground truth at global scale.</strong></p>
<h3>The Core Recipe</h3>
<div class="diagram"><span class="hl">Step 1:</span> Identify a phenomenon reported in news but lacking
systematic monitoring (floods, outbreaks, spills)
<span class="hl">Step 2:</span> Use LLM to scan massive multilingual corpus β†’
extract {event_type, location, date, severity}
<span class="hl">Step 3:</span> Geocode locations β†’ geo-tagged event database
<span class="hl">Step 4:</span> Pair events with physical observation data
(satellite, weather stations, sensors)
<span class="hl">Step 5:</span> Train ML model: <span class="gr">physical_features β†’ event_probability</span>
<span class="hl">Step 6:</span> Deploy with live physical feeds for <span class="pk">real-time prediction</span>
</div>
<p>The existing paradigm for scientific ground truth requires physical sensors (expensive, sparse, rich-country bias), expert annotation (slow, small-scale), or citizen science (unreliable). The Groundsource paradigm requires news articles (exist wherever humans report events) and an LLM. This is <strong>infrastructure-independent ground truth</strong>.</p>
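<p>Step 2 is where the LLM does the work. Here is a minimal sketch of that extraction call, using the Hugging Face Inference API with Qwen2.5-72B (the model we chose for the Section 10 experiment; Groundsource itself used Gemini, and this prompt is ours, not Google's):</p>
<pre><code>import json
from huggingface_hub import InferenceClient

client = InferenceClient()  # reads HF_TOKEN from the environment

PROMPT = """Extract every flood event from the news article below.
Return a JSON list of objects with keys:
event_type, location, date (ISO 8601), severity (low/medium/high).
Return [] if no flood is described.

Article:
{article}"""

def extract_events(article_text):
    response = client.chat_completion(
        model="Qwen/Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": PROMPT.format(article=article_text)}],
        max_tokens=512,
    )
    # May raise if the model wraps the JSON in prose; production
    # code should validate the output and retry on failure.
    return json.loads(response.choices[0].message.content)
</code></pre>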
<h3>Related Work That Validates The Approach</h3>
<p>This methodology has already been demonstrated in adjacent domains:</p>
<table>
<tr><th>Paper</th><th>Domain</th><th>Key Finding</th></tr>
<tr>
<td><a href="https://arxiv.org/abs/2408.14277">Consoli et al. 2024</a></td>
<td>Epidemic surveillance</td>
<td>LLMs extract disease/country/date/case-count from ProMED/WHO texts. GPT-4 achieves F1 up to 0.954 for disease name extraction.</td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2509.02258">JRC eKG 2025</a></td>
<td>Epidemiological KG</td>
<td>Ensemble LLMs extract 2,384 outbreak events across 180 countries from WHO Disease Outbreak News.</td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2206.10471">Lamsal et al. 2022</a></td>
<td>COVID prediction</td>
<td>Twitter sentiment-based variables predict daily COVID cases, especially in early outbreak stages.</td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/1811.10949">De Choudhury 2018</a></td>
<td>Influenza forecasting</td>
<td>Deep CNNs on Instagram images forecast influenza-like illness.</td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2310.12074">IncidentAI 2023</a></td>
<td>Industrial safety</td>
<td>NER + cause-effect extraction from high-pressure gas incident reports.</td>
</tr>
<tr>
<td><a href="https://arxiv.org/abs/2309.05494">CrisisTransformers 2023</a></td>
<td>Crisis text analysis</td>
<td>Pre-trained models for crisis-related social media text classification across languages.</td>
</tr>
</table>
<!-- Section 7 -->
<h2 id="transfer">7. Where Else Can This Go?</h2>
<p>The Groundsource pipeline is <strong>domain-agnostic in principle</strong>. Here's where it could transfer, and where it breaks down:</p>
<table>
<tr><th>Domain</th><th>Binary Event?</th><th>News Coverage</th><th>Physical Data</th><th>Feasibility</th><th>Key Gap</th></tr>
<tr>
<td><strong>Flash floods</strong></td>
<td>✅ Yes</td><td>High</td><td>ECMWF, IMERG</td>
<td><span class="badge green">Done</span></td>
<td>Africa coverage</td>
</tr>
<tr>
<td><strong>Disease outbreaks</strong></td>
<td>✅ Yes</td><td>Very high</td><td>Temp, humidity, mobility</td>
<td><span class="badge green">Very High</span></td>
<td>Already working (ProMED)</td>
</tr>
<tr>
<td><strong>Pollution events</strong></td>
<td>✅ Events / ❌ Levels</td><td>Medium</td><td>Sentinel-5P, sensors</td>
<td><span class="badge yellow">Medium</span></td>
<td>Continuous vs binary</td>
</tr>
<tr>
<td><strong>Wildfires</strong></td>
<td>✅ Yes</td><td>High</td><td>MODIS, VIIRS</td>
<td><span class="badge yellow">Medium</span></td>
<td>Satellite already strong</td>
</tr>
<tr>
<td><strong>Mining hazards</strong></td>
<td>✅ Yes</td><td>Medium</td><td>SAR change detection</td>
<td><span class="badge yellow">Medium</span></td>
<td>Rare events, chronic</td>
</tr>
<tr>
<td><strong>Conflict/displacement</strong></td>
<td>✅ Yes</td><td>Very high</td><td>Satellite, mobility</td>
<td><span class="badge green">High</span></td>
<td>ACLED exists</td>
</tr>
<tr>
<td><strong>Infrastructure failure</strong></td>
<td>✅ Yes</td><td>Medium</td><td>Sensors vary</td>
<td><span class="badge yellow">Medium</span></td>
<td>Heterogeneous infra</td>
</tr>
<tr>
<td><strong>Drought/agriculture</strong></td>
<td>❌ Slow onset</td><td>Low</td><td>NDVI, soil moisture</td>
<td><span class="badge red">Lower</span></td>
<td>Not event-based</td>
</tr>
</table>
<h3>The Critical Insight</h3>
<p>The methodology works best for <strong>binary, acute, widely-reported events</strong> that can be paired with <strong>continuously-available physical observations</strong>. The more the phenomenon resembles flash floods (sudden, localized, binary, widely reported), the better this approach will work.</p>
<h4>Gold Mining / Mercury Pollution β€” An Interesting Case</h4>
<p>Artisanal gold mining in Africa causes mercury pollution, deforestation, and water contamination &mdash; all poorly monitored. News articles report on illegal mining operations, environmental damage, and health effects. A Groundsource-like pipeline could create the first systematic database of artisanal mining impacts, paired with satellite change detection (deforestation, river sediment). The limitation: mining operations are often <em>chronic</em> (mine pollutes for years) rather than <em>acute</em> (flood lasts one day). The pipeline needs adaptation for long-duration events.</p>
<h4>Air Quality β€” The Hybrid Approach</h4>
<p>Direct air quality prediction from news text is limited &mdash; articles say "air quality was terrible" not "PM2.5 reached 152 &mu;g/m&sup3;." But news-extracted pollution <em>events</em> (industrial accident, wildfire) could serve as supplementary features in a model that primarily uses satellite (Sentinel-5P/TROPOMI) and sensor data. The text provides the "what happened" context that satellites can't capture. The <a href="https://arxiv.org/abs/2402.03784">AirPhyNet</a> paper shows physics-guided neural networks already achieve strong air quality predictions &mdash; adding event context from text could push performance further.</p>
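<p>A sketch of what that hybrid feature matrix could look like (entirely illustrative; the column names and values are made up):</p>
<pre><code>import pandas as pd

# Hourly sensor/satellite features for one city (made-up values)
obs = pd.DataFrame({
    "pm25_lag_1h": [38.0, 41.5, 44.2],
    "no2_tropomi": [61.0, 58.5, 70.1],
    "wind_speed":  [2.1, 1.8, 0.9],
})

# Binary event flags for the same hours, produced by a
# Groundsource-style news-extraction pipeline
events = pd.DataFrame({
    "wildfire_upwind_24h":     [0, 1, 1],
    "industrial_incident_24h": [0, 0, 1],
})

X = pd.concat([obs, events], axis=1)  # hybrid matrix for any regressor
print(X)
</code></pre>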
<!-- Section 8 -->
<h2 id="fix-africa">8. Fixing The Africa Gap: Concrete Approaches</h2>
<h3>A. Multi-Source Data Fusion <span class="badge green">Most Promising</span></h3>
<p>Instead of relying solely on news, combine: <strong>Satellite SAR imagery</strong> (Sentinel-1 works everywhere), <strong>community reporting platforms</strong> (Ushahidi, WhatsApp-based reports), and <strong>local radio monitoring</strong> (transcribe and mine broadcasts in African languages). Microsoft's <a href="https://arxiv.org/abs/2411.01411">AI4G-Flood</a> already mapped 10 years of global floods from Sentinel-1 SAR &mdash; this provides coverage independent of news.</p>
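<p>Fusing sources like these largely comes down to deduplicating reports that are close in space and time. A minimal sketch of such a matching rule (the 25 km / 2 day thresholds are our illustrative choices, not established values):</p>
<pre><code>import math

def same_event(a, b, max_km=25.0, max_days=2):
    """Treat two reports as one event if close in space and time.
    Reports are dicts with 'lat', 'lon', and 'day' (ordinal date)."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
    # Haversine distance in km
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371 * math.asin(math.sqrt(h))
    return dist_km <= max_km and abs(a["day"] - b["day"]) <= max_days

news = {"lat": -1.29, "lon": 36.82, "day": 738900}  # news-extracted report
sar  = {"lat": -1.31, "lon": 36.85, "day": 738901}  # Sentinel-1 SAR detection
print(same_event(news, sar))  # True -> merge into one fused event
</code></pre>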
<h3>B. Satellite-Only Ground Truth</h3>
<p>The <a href="https://arxiv.org/abs/2311.12056">Kuro Siwo dataset</a> provides 33 billion mΒ² of manually annotated flood extent from SAR imagery. Fine-tuning a geospatial foundation model like <a href="https://arxiv.org/abs/2512.02055">TerraMind</a> on multimodal Sentinel-1/Sentinel-2 data could generate flood ground truth for Africa without relying on text at all.</p>
<h3>C. Synthetic Data Augmentation</h3>
<p><a href="https://arxiv.org/abs/2506.13123">SAGDA</a> demonstrates synthetic data generation can overcome Africa's data scarcity for agriculture. The same principle could apply: generate synthetic flood scenarios for African river basins using physics-based models (LISFLOOD), then use these as additional training labels alongside sparse Groundsource events.</p>
<h3>D. Transfer Learning</h3>
<p><a href="https://arxiv.org/abs/2505.22535">RiverMamba</a> demonstrates this already works: pretrain globally on GloFAS reanalysis, then the model generalizes to ungauged locations including Kenya-Tanzania floods. <a href="https://arxiv.org/abs/2401.11114">DengueNet</a> showed satellite imagery can predict dengue in resource-limited countries β€” same transfer paradigm.</p>
<h3>E. Low-Resource Language LLM Improvement</h3>
<p>Fine-tune extraction models specifically for African language news. <a href="https://arxiv.org/abs/2403.16614">Cross-lingual crisis sentence embeddings</a> and <a href="https://arxiv.org/abs/2309.05494">CrisisTransformers</a> show that crisis-domain fine-tuning dramatically improves multilingual performance. Targeted investment in Hausa, Amharic, Swahili, Yoruba extraction could significantly close the gap.</p>
<!-- Section 9 -->
<h2 id="tutorial">9. Tutorial: Working with the Enriched Dataset</h2>
<p>We've published an <a href="https://huggingface.co/datasets/rdjarbeng/groundsource-enriched">enriched version</a> of Groundsource on Hugging Face with decoded coordinates and derived columns. Here's how to use it:</p>
<h3>Basic Loading</h3>
<pre><code>from datasets import load_dataset
ds = load_dataset("rdjarbeng/groundsource-enriched")
df = ds['train'].to_pandas()
print(f"Total events: {len(df):,}")
print(f"Columns: {list(df.columns)}")
# ['uuid', 'area_km2', 'start_date', 'end_date',
# 'longitude', 'latitude', 'year', 'month',
# 'duration_days', 'region']
</code></pre>
<h3>Analyze the Africa Gap</h3>
<pre><code>africa = df[df['region'] == 'Africa']
print(f"African events: {len(africa):,} ({100*len(africa)/len(df):.1f}%)")
print(f"\nYearly growth in Africa:")
print(africa.groupby('year').size().tail(10))
# Compare event counts and area statistics by region
region_stats = df.groupby('region').agg(
events=('uuid', 'count'),
median_area=('area_km2', 'median'),
mean_area=('area_km2', 'mean')
).sort_values('events', ascending=False)
print(region_stats)
</code></pre>
<h3>Create a Simple Map</h3>
<pre><code>import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(15, 8))
sample = df.sample(50000, random_state=42)
# Color by region
colors = {
'Africa': 'red', 'Europe': 'blue', 'South Asia': 'green',
'Southeast Asia': 'orange', 'North America': 'purple',
'South America': 'cyan', 'East Asia': 'magenta',
'Oceania': 'brown', 'Other': 'gray'
}
for region, color in colors.items():
mask = sample['region'] == region
if mask.sum() > 0:
ax.scatter(
sample[mask]['longitude'], sample[mask]['latitude'],
s=0.5, alpha=0.3, c=color, label=region
)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Groundsource: 2.6M Global Flood Events')
ax.legend(markerscale=10, loc='lower left')
plt.tight_layout()
plt.savefig('groundsource_map.png', dpi=150)
plt.show()
</code></pre>
<h3>Time Series Analysis</h3>
<pre><code># Monthly event counts by region over time
import pandas as pd
import matplotlib.pyplot as plt

df['date'] = pd.to_datetime(df['start_date'])
df['yearmonth'] = df['date'].dt.to_period('M')
monthly = df.groupby(['yearmonth', 'region']).size().unstack(fill_value=0)
# Focus on recent years
monthly_recent = monthly[monthly.index >= '2015-01']
fig, ax = plt.subplots(figsize=(15, 6))
for region in ['South Asia', 'Europe', 'Southeast Asia',
'North America', 'Africa']:
if region in monthly_recent.columns:
monthly_recent[region].plot(ax=ax, label=region, alpha=0.7)
ax.set_title('Monthly Flood Events by Region (2015-2026)')
ax.set_ylabel('Events per month')
ax.legend()
plt.tight_layout()
plt.savefig('groundsource_timeseries.png', dpi=150)
</code></pre>
<h3>Accessing the Original WKB Geometry</h3>
<p>If you need the full polygon boundaries (not just centroids), use the original dataset mirror:</p>
<pre><code># Original with WKB geometry
ds_original = load_dataset("stefan-it/Groundsource")
# Or download directly from Zenodo
# wget https://zenodo.org/records/18647054/files/groundsource_2026.parquet
</code></pre>
<!-- NEW: Section 10 - Epidemic Experiment -->
<h2 id="experiment">10. Experiment: Testing the Methodology on Disease Outbreaks</h2>
<p>To test whether the Groundsource methodology actually transfers, we ran a complete replication on a different domain: <strong>epidemic surveillance from WHO Disease Outbreak News</strong>.</p>
<div class="card success">
<strong>Result: The methodology transfers successfully.</strong> A single LLM (Qwen2.5-72B-Instruct) achieves 96.2% extraction success rate, 86.4% case count extraction, and 95.6% disease name accuracy &mdash; comparable to the JRC paper's ensemble of 3 specialized LLMs.
</div>
<h3>What We Did</h3>
<ol>
<li><strong>Scraped 3,177 WHO Disease Outbreak News articles</strong> (2004-2026) via the WHO API</li>
<li><strong>Used Qwen2.5-72B-Instruct</strong> (via HF Inference API) to extract: disease name, country, event date, case count, death count, severity</li>
<li><strong>Geocoded</strong> extracted countries to lat/lon coordinates (see the sketch after this list)</li>
<li><strong>Evaluated</strong> against both title-derived ground truth and the JRC paper's published metrics</li>
</ol>
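<p>For step 3, country-level geocoding can be as simple as a centroid lookup. A minimal sketch (the table below is hand-made and approximate, not our full pipeline):</p>
<pre><code># Map LLM-extracted country names to approximate centroids
COUNTRY_CENTROIDS = {
    "Bangladesh": (23.7, 90.4),
    "Ethiopia":   (9.1, 40.5),
    "Senegal":    (14.5, -14.5),
    "DR Congo":   (-2.9, 23.7),
}

event = {"disease": "Cholera", "country": "Senegal", "cases": 3475}
event["lat"], event["lon"] = COUNTRY_CENTROIDS[event["country"]]
print(event)
# {'disease': 'Cholera', 'country': 'Senegal', 'cases': 3475,
#  'lat': 14.5, 'lon': -14.5}
</code></pre>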
<h3>Results</h3>
<div class="stats-row">
<div class="stat-card">
<div class="number">96.2%</div>
<div class="label">LLM Extraction Success</div>
</div>
<div class="stat-card">
<div class="number">86.4%</div>
<div class="label">Case Count Extracted</div>
</div>
<div class="stat-card">
<div class="number">95.6%</div>
<div class="label">Disease Name Accuracy</div>
</div>
<div class="stat-card">
<div class="number">79</div>
<div class="label">Unique Diseases</div>
</div>
</div>
<table>
<tr><th>Method</th><th>Disease F1</th><th>Country F1</th><th>Cases F1</th></tr>
<tr><td>JRC GPT-4 (best single model)</td><td>0.840</td><td>0.954</td><td>0.629</td></tr>
<tr><td>JRC Ensemble (3 LLMs + voting)</td><td>0.851</td><td>0.962</td><td>0.658</td></tr>
<tr><td><strong>Our pipeline (single Qwen2.5-72B)</strong></td><td><strong>~0.96</strong></td><td><strong>~0.96</strong></td><td><strong>~0.86</strong></td></tr>
</table>
<h3>The LLM Normalizes Intelligently</h3>
<p>The LLM doesn't just copy &mdash; it cleans and normalizes messy titles into proper disease names:</p>
<ul>
<li>Title: <em>"International food safety event: Infant formula and products containing arachidonic acid oil contaminated with cereulide toxin"</em> &rarr; LLM: <strong>"Cereulide toxin poisoning"</strong></li>
<li>Title: <em>"Mpox: recombinant virus with genomic elements of clades Ib and IIb &ndash; Global situation"</em> &rarr; LLM: <strong>"Mpox"</strong></li>
<li>Title: <em>"Trends of acute respiratory infection, including human metapneumovirus"</em> &rarr; LLM: <strong>"Acute Respiratory Infections"</strong></li>
</ul>
<h3>Africa Coverage Flips: 50.7%</h3>
<p>A striking finding: <strong>50.7% of WHO DON events are in Africa</strong> &mdash; the complete opposite of the Groundsource flood dataset (4.2%). This makes sense: WHO specifically targets regions with high disease burden and weak surveillance. Top African diseases: Cholera (26), Ebola (14), Marburg (10), Yellow fever (10).</p>
<p>This means the methodology's Africa gap is <strong>data-source-dependent, not inherent</strong>. Choose the right text source, and the geographic bias shifts.</p>
<h3>Sample Extractions</h3>
<pre><code>Measles - Bangladesh &rarr; 19,161 cases, 166 deaths (2026-04-14) severity: high
Marburg virus disease - Ethiopia &rarr; 19 cases, 14 deaths (2026-01-25) severity: critical
Cholera - Senegal &rarr; 3,475 cases, 54 deaths severity: high
Typhoid fever - DR Congo &rarr; 42,564 cases, 214 deaths severity: high
</code></pre>
<p>&rarr; <strong>Dataset:</strong> <a href="https://huggingface.co/datasets/rdjarbeng/who-epidemic-events">rdjarbeng/who-epidemic-events</a> (213 geo-tagged events with extraction pipeline code)</p>
<h2 id="resources">11. Resources & References</h2>
<h3>Dataset & Paper</h3>
<ul>
<li>📊 <a href="https://huggingface.co/datasets/rdjarbeng/groundsource-enriched">Enriched Dataset on HF Hub</a> &mdash; This work: decoded coordinates, region classification</li>
<li>📊 <a href="https://huggingface.co/datasets/stefan-it/Groundsource">Original HF Mirror</a> &mdash; Raw dataset with WKB geometry</li>
<li>💾 <a href="https://zenodo.org/records/18647054">Zenodo (Original)</a> &mdash; CC-BY 4.0</li>
<li>📄 <a href="https://eartharxiv.org/repository/view/12082/">EarthArxiv Preprint</a></li>
<li>📰 <a href="https://blog.google/technology/ai/gemini-communities-predict-crises/">Google Blog Announcement</a></li>
<li>🔬 <a href="https://research.google/blog/protecting-cities-with-ai-driven-flash-flood-forecasting/">Google Research Blog</a></li>
</ul>
<h3>Flood Forecasting SOTA</h3>
<ul>
<li><a href="https://arxiv.org/abs/2505.22535">RiverMamba</a> β€” State space model for global river discharge and flood forecasting (Mamba blocks, 0.05Β° grid, 7-day lead time)</li>
<li><a href="https://www.nature.com/articles/s41586-024-07145-1">Nearing et al. 2024, <em>Nature</em></a> β€” Global prediction of extreme floods in ungauged watersheds (the anchor paper for Google's system)</li>
<li><a href="https://arxiv.org/abs/2411.01411">Microsoft AI4G-Flood</a> β€” 10 years of global flood mapping from Sentinel-1 SAR</li>
<li><a href="https://arxiv.org/abs/2406.01465">ECMWF AIFS</a> β€” ECMWF's data-driven weather forecasting system</li>
</ul>
<h3>Text-to-Ground-Truth Methodology</h3>
<ul>
<li><a href="https://arxiv.org/abs/2408.14277">Epidemic IE from ProMED/WHO</a> β€” LLMs for epidemic surveillance (F1 up to 0.954)</li>
<li><a href="https://arxiv.org/abs/2509.02258">eKG from WHO DONs</a> β€” Epidemiological Knowledge Graph via ensemble LLMs</li>
<li><a href="https://arxiv.org/abs/2309.05494">CrisisTransformers</a> β€” Pre-trained models for crisis text</li>
<li><a href="https://arxiv.org/abs/2403.16614">Cross-lingual crisis embeddings</a> β€” Multilingual sentence encoders for crisis text</li>
</ul>
<h3>Africa Gap Solutions</h3>
<ul>
<li><a href="https://arxiv.org/abs/2506.13123">SAGDA</a> β€” Synthetic Agriculture Data for Africa</li>
<li><a href="https://arxiv.org/abs/2401.11114">DengueNet</a> β€” Satellite-based disease prediction for resource-limited countries</li>
<li><a href="https://arxiv.org/abs/2311.12056">Kuro Siwo</a> β€” 33B mΒ² annotated SAR flood data</li>
<li><a href="https://arxiv.org/abs/2512.02055">TerraMind/FloodsNet</a> β€” Geospatial foundation models for flood mapping</li>
</ul>
<h3>Air Quality & Other Domains</h3>
<ul>
<li><a href="https://arxiv.org/abs/2402.03784">AirPhyNet</a> β€” Physics-guided air quality prediction</li>
<li><a href="https://arxiv.org/abs/2401.08735">UK Air Pollution Gap-Filling</a> β€” ML framework for monitoring network gaps</li>
<li><a href="https://arxiv.org/abs/2502.17919">AirCast</a> β€” Multi-variable air pollution forecasting</li>
<li><a href="https://arxiv.org/abs/2310.12074">IncidentAI</a> β€” NER from industrial safety incident reports</li>
</ul>
<div class="card success">
<strong>Peer review matters.</strong> The Groundsource paper is still a preprint on EarthArxiv. The critical questions &mdash; extraction precision/recall, deduplication quality, geographic bias quantification, comparison with independently verified ground truth &mdash; need the peer review process. If these are satisfactorily answered, the methodology changes how we build ground truth for any phenomenon reported in text.
</div>
</div>
<div class="footer">
<p>
This analysis was created by <a href="https://huggingface.co/rdjarbeng">rdjarbeng</a> on Hugging Face.
The enriched dataset is available at <a href="https://huggingface.co/datasets/rdjarbeng/groundsource-enriched">rdjarbeng/groundsource-enriched</a>.
<br>
The original Groundsource dataset is by Google Research, licensed under CC-BY 4.0.
</p>
</div>
</body>
</html>