EPI-Eval
A curated collection of large epidemiological datasets, normalized to a single schema so they can be searched, joined, and benchmarked against each other.
What we track
Time-series surveillance data on infectious disease, primarily respiratory
viruses (flu, COVID-19, RSV) and arboviral disease (dengue, Zika,
chikungunya), with smaller coverage of notifiable, mortality, wastewater, and
behavioural / search signals. Sources come from CDC, WHO, ECDC, PAHO, OWID,
and national public-health agencies; we re-publish them as Parquet with a
consistent set of row-level columns (date, location_id, location_level,
optional condition / case_status / as_of) and a metadata header
describing pathogens, geography, cadence, and per-column units.
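As a rough illustration of the row-level contract described above, the sketch below checks a list of column names against the shared columns. The function name and the `ed_visits_ili` measurement column are hypothetical, and the real validator in the upload pipeline may work differently.

```python
# Shared row-level columns named in this card.
REQUIRED_COLUMNS = {"date", "location_id", "location_level"}
OPTIONAL_COLUMNS = {"condition", "case_status", "as_of"}


def check_row_columns(columns):
    """Return (missing_required, value_columns) for a list of column names.

    Hypothetical helper: anything outside the shared row-level columns is
    treated here as a dataset-specific measurement column.
    """
    cols = set(columns)
    missing = sorted(REQUIRED_COLUMNS - cols)
    value_columns = sorted(cols - REQUIRED_COLUMNS - OPTIONAL_COLUMNS)
    return missing, value_columns


# "ed_visits_ili" is an invented column name used only for illustration.
example = ["date", "location_id", "location_level", "as_of", "ed_visits_ili"]
missing, value_columns = check_row_columns(example)
print("missing required:", missing)
print("measurement columns:", value_columns)
```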
Why
Forecasting and modeling work routinely stalls on data plumbing: finding the canonical version of a series, normalizing geography codes, reconciling reporting cadences, tracking when a source was last revised. The goal of this org is to do that work once, in the open.
Schema
Every dataset card on this org uses the same frontmatter format
(schema v0.1),
validated against a controlled vocabulary
(vocabularies.yaml).
Curated metadata (pathogens, license, units) lives alongside computed metadata
(time coverage, row count, observed cadence) generated at ingest.
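To make the computed side concrete, here is a minimal sketch of how time coverage, row count, and observed cadence could be derived from a dataset's `date` column. The output field names and the cadence thresholds are assumptions for illustration, not the pipeline's actual logic.

```python
from datetime import date
from statistics import median


def computed_metadata(dates):
    """Derive computed metadata from a non-empty list of datetime.date values.

    Field names and cadence rules are assumptions, not the real ingest code.
    """
    if not dates:
        raise ValueError("dataset has no rows")
    unique = sorted(set(dates))
    # Gaps in days between consecutive distinct observation dates.
    gaps = [(later - earlier).days for earlier, later in zip(unique, unique[1:])]
    typical = median(gaps) if gaps else None
    if typical == 1:
        cadence = "daily"
    elif typical == 7:
        cadence = "weekly"
    elif typical is not None and 365 <= typical <= 366:
        cadence = "annual"
    else:
        cadence = "irregular"
    return {
        "row_count": len(dates),
        "time_start": unique[0].isoformat(),
        "time_end": unique[-1].isoformat(),
        "observed_cadence": cadence,
    }


weekly = [date(2024, 1, 7), date(2024, 1, 14), date(2024, 1, 21), date(2024, 1, 28)]
meta = computed_metadata(weekly)
print(meta)
```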
Contributing a dataset
The ingest pipeline is in
apart-forecasting-tool/upload_pipeline.
A new dataset is one ingest.py + card.yaml under
upload_pipeline/sources/<source_id>/; the validator confirms schema fit
before upload. Each new truth dataset auto-creates an empty
<id>-predictions companion at upload time.
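The layout above can be spelled out as a small sketch. Both helper names are hypothetical, `my-source` is a placeholder id, and the companion name simply follows the `<id>-predictions` pattern described here under the `EPI-Eval` org.

```python
from pathlib import PurePosixPath


def expected_source_files(source_id):
    """Paths a new source is expected to provide (hypothetical helper)."""
    base = PurePosixPath("upload_pipeline/sources") / source_id
    return [base / "ingest.py", base / "card.yaml"]


def predictions_repo_id(dataset_id, org="EPI-Eval"):
    """Name of the companion predictions repo for a truth dataset."""
    return f"{org}/{dataset_id}-predictions"


# "my-source" is a placeholder, not a real dataset on this org.
paths = [str(p) for p in expected_source_files("my-source")]
companion = predictions_repo_id("my-source")
print(paths)
print(companion)
```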
Datasets (21)
Respiratory
Syndromic / ED
| Dataset | Pathogens | Geography | Cadence |
|---|---|---|---|
| CDC NSSP / ESSENCE - ED visits for ILI / COVID / RSV | influenza, sars-cov-2, rsv | US | weekly |
Arboviral
| Dataset | Pathogens | Geography | Cadence |
|---|---|---|---|
| OpenDengue - national dengue case counts (V1.3) | dengue | multiple | irregular |
Mobility & contact
| Dataset | Pathogens | Geography | Cadence |
|---|---|---|---|
| Google Community Mobility Reports - global daily | - | multiple | daily |
Search & behavioural
| Dataset | Pathogens | Geography | Cadence |
|---|---|---|---|
| Wikipedia pageviews - disease-article daily views | influenza, sars-cov-2, rsv +6 | multiple | daily |
Notifiable / other
| Dataset | Pathogens | Geography | Cadence |
|---|---|---|---|
| OWID Mpox - global daily compiled | mpox | multiple | daily |
| WHO Global TB - annual country estimates | tuberculosis | multiple | annual |
Predictions
Each truth dataset has a companion EPI-Eval/<id>-predictions repo that
accumulates community-submitted forecasts. Schema is long-format: one row per
(target_date, [dim values…], quantile, value), with quantile = NULL
reserved for the point estimate. Forecasters submit through the
EPI-Eval dashboard;
a maintainer reviews each PR before merging, and merged predictions show up
on the corresponding truth dataset's Show predictions toggle in the
dashboard, with a per-submitter leaderboard (MAE / WIS / rWIS / coverage).
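For readers new to the long format, the sketch below builds a tiny forecast for one target date and scores it. It uses `location_id` as the only dimension column and invented numbers, and it computes WIS in its quantile-loss form, which matches the interval-based definition when the quantile levels are symmetric around an included median. How the dashboard actually computes its leaderboard metrics is not specified here, so treat this as an assumption.

```python
# One forecast in long format: (target_date, location_id, quantile, value).
# quantile = None marks the point estimate. All values are made up.
rows = [
    ("2024-01-06", "US", None, 100.0),
    ("2024-01-06", "US", 0.25, 90.0),
    ("2024-01-06", "US", 0.5, 100.0),
    ("2024-01-06", "US", 0.75, 115.0),
]


def absolute_error(rows, observed):
    """Absolute error of the point estimate (the quantile = None row)."""
    point = [value for (_, _, q, value) in rows if q is None]
    return abs(point[0] - observed)


def weighted_interval_score(rows, observed):
    """WIS as the mean of twice the pinball loss over all quantile rows.

    Assumes symmetric quantile levels plus the median (e.g. 0.25, 0.5, 0.75).
    """
    scores = [
        2 * max(q * (observed - value), (1 - q) * (value - observed))
        for (_, _, q, value) in rows
        if q is not None
    ]
    return sum(scores) / len(scores)


observed = 110.0
ae = absolute_error(rows, observed)
wis = weighted_interval_score(rows, observed)
print(ae, wis)
```

Averaging `absolute_error` over many target dates gives MAE; averaging `weighted_interval_score` gives a submitter's mean WIS.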
Status
Active. Coverage and dataset list grow through PRs to the upload pipeline.