Machine learning for alien climates: Introducing the ThousandWorlds benchmark

Community Article
Published June 23, 2026

Dataset · Code · Paper

There's a good chance we find alien life in the next 20 years, possibly much sooner...

An entire subfield of astronomy, exoplanetary science, has arisen with this goal as its north star. And it is one of the fastest-growing.

We (and many exoplaneteers!) think our best hope for achieving it runs roughly like so: 1) scan the galaxy for as many potentially habitable planets as possible, 2) detect the gases in their atmospheres with powerful telescopes like JWST, and 3) infer from these gases whether life is present.

We've already detected some potentially habitable planets, with many more on the way, but those last two steps (the harder two...) depend on knowledge of the planet's climate: its temperature, its winds, where its clouds sit, its heat transport. This is because the spectral signatures that gases leave in a planet's atmosphere depend on the environment in which they occur (crucially, the temperature and pressure) and are confounded by clouds and other atmospheric phenomena. Their interpretation also depends on what you think the climate is: does an O₂ detection mean life, or just photodissociation of water?

The most sophisticated models of exoplanet climates we have are global climate models (GCMs). These exoplanet-adapted GCMs are built from the same underlying models used to predict climate change on Earth, and they are expensive: a single exoplanet run can cost millions of core-hours, and can take weeks of expert time to set up and babysit to a steady state (GCMs crash a lot!). So the field is limited to studying a handful of hand-picked planet configurations at a time.

An emulator that produces fast climate predictions would remove this bottleneck, opening the door to large parameter sweeps, principled uncertainties, and integration with the pipelines that interpret telescope data.

So why has no one built one, beyond some small proofs of concept? Well, because there was no dataset! The raw simulations exist, produced by different groups running different GCMs, but they are scattered across studies in incompatible formats, on different grids, with different output variables. That is not a situation very amenable to ML... This led us to create ThousandWorlds, a unified collection of simulated alien climates, readily usable to train and benchmark ML surrogates.

What is ThousandWorlds?

ThousandWorlds comprises 1760 simulations from 5 GCMs, covering everything from frozen snowball worlds to steamy sauna-like worlds. The simulations are drawn from the exoplanet science community and supplemented with ~400 bespoke runs of our own to fill gaps in coverage.

The task is parameter-to-field regression: from 8 planet parameters, predict the planet's steady-state 3D climate (temperature, humidity, winds, cloud fraction, radiation), which we provide as a stack of 53 2D fields, each on a 32 × 64 latitude-longitude grid.
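To make the shapes concrete, here is a minimal sketch of what one batch of this regression problem looks like as arrays. The array names and the random values are placeholders we made up for illustration; this is not the dataset's loading API, only the dimensions quoted above (8 parameters in, 53 fields on a 32 × 64 grid out).

```python
import numpy as np

# Dimensions taken from the task description; everything else is hypothetical.
N_SIMS = 16                          # arbitrary toy batch size
N_PARAMS = 8                         # planet parameters (inputs)
N_FIELDS, N_LAT, N_LON = 53, 32, 64  # stacked 2D fields (outputs)

rng = np.random.default_rng(0)

# Random stand-ins for real data: one row of parameters per simulated planet,
# and one stack of latitude-longitude fields per simulated planet.
params = rng.normal(size=(N_SIMS, N_PARAMS))
climate = rng.normal(size=(N_SIMS, N_FIELDS, N_LAT, N_LON))

# Many regressors expect a flat target vector per sample.
targets_flat = climate.reshape(N_SIMS, -1)
output_dim = targets_flat.shape[1]

print(f"inputs:  {params.shape}")
print(f"targets: {climate.shape} -> flattened to {targets_flat.shape}")
print(f"output dimension per planet: {output_dim}")
```

The flattened output has 53 × 32 × 64 = 108,544 entries per planet, so a model maps an 8-vector to a vector of roughly 10⁵ values.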

[Figure: overview]

Opportunities for the ML community

Beyond advancing the search for life, ThousandWorlds offers a challenging but accessible example problem for an underserved regime in ML for science.

Most scientific ML benchmarks focus on the field-to-field prediction regime where deep learning is already starting to dominate. ThousandWorlds sits in the complementary regime of parameter-to-field prediction, which is common to many scientific problems. And deep learning does not dominate here – our two strongest baselines are both based on Gaussian processes (GPs). Perhaps GPs will ultimately win out, or perhaps there's some low-hanging deep learning fruit that hasn't been picked yet.

Concretely, the challenges are:

  • Limited data. 1760 simulations, only a few hundred of which are at the highest fidelity, which is the one we ultimately target.
  • Parameter-to-field. The input is eight bulk properties describing the planet. The output is a ~10⁵-dimensional climate.
  • Spherical geometry. All our fields are on the sphere. Some of our baselines use spherical harmonics to good effect.
  • Structured missingness (optional). Not all simulations include the exact same set of fields (due to differing grid and summary-variable choices), which introduces patterns of missingness in the full dataset. We also ship subsets with complete outputs only for testing methods that don't handle missing data.
  • Multi-simulator transfer (optional). We have five climate models with different physics and different fidelity. Better multi-fidelity/multi-simulator transfer methods are a promising direction for improving on our baselines. We also have a single-model subset for testing methods that don't do multi-simulator learning.
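Two of the challenges above, spherical geometry and structured missingness, already matter when you score a prediction. The sketch below shows one common way to handle both: weight grid cells by the cosine of latitude (cells near the poles cover less area) and skip missing targets. It assumes an equally spaced latitude grid and NaN as the missing-value marker; both are our assumptions for illustration, and this is not necessarily the metric the benchmark uses.

```python
import numpy as np

def latitude_weights(n_lat):
    """Area weights for an equally spaced latitude grid (an assumption here)."""
    half_step = 90.0 / n_lat
    lat = np.linspace(-90.0 + half_step, 90.0 - half_step, n_lat)  # cell centres
    w = np.cos(np.deg2rad(lat))
    return w / w.mean()  # normalise so the weights average to 1

def masked_weighted_rmse(pred, target):
    """RMSE over (n_sims, n_fields, n_lat, n_lon) arrays.

    NaN entries in `target` are treated as missing and ignored.
    """
    w = latitude_weights(target.shape[-2])[:, None]  # broadcast over longitude
    w = np.broadcast_to(w, target.shape)
    valid = ~np.isnan(target)
    # Replace errors at missing cells with 0 so they add nothing to the sum.
    sq_err = np.where(valid, (pred - target) ** 2, 0.0)
    return float(np.sqrt((w * sq_err).sum() / (w * valid).sum()))

# Toy usage: a constant error of 1 everywhere, with one field missing entirely.
target = np.zeros((2, 3, 32, 64))
target[0, 1] = np.nan          # simulate a simulation that lacks one field
pred = np.ones_like(target)
score = masked_weighted_rmse(pred, target)
print(score)
```

Because the weights are normalised by the valid cells only, a simulation that lacks a field neither penalises nor rewards the model for that field.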

We suggest three rungs to aim at: "PCA-MLP" as the deep learning baseline to beat, "PPCA-ICM" as the stronger GP-based target, and extra kudos if you can beat our custom-built model GPLFR (a GP-based model we came up with specifically targeting problems in this regime).
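For readers new to this regime, the general shape of a "compress the output, then regress the coefficients" model can be sketched in a few lines. To be clear, this is not our PCA-MLP baseline: it runs on synthetic data, uses a much smaller output dimension, and swaps the MLP for plain ridge regression to stay dependency-free. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_params, out_dim, n_comp = 80, 20, 8, 500, 5

# Synthetic data: outputs live near a 5-dimensional linear subspace.
X = rng.normal(size=(n_train + n_test, n_params))
mixing = rng.normal(size=(n_params, n_comp))
modes = rng.normal(size=(n_comp, out_dim))
Y = X @ mixing @ modes + 0.01 * rng.normal(size=(n_train + n_test, out_dim))

X_train, X_test = X[:n_train], X[n_train:]
Y_train, Y_test = Y[:n_train], Y[n_train:]

# Step 1: PCA on the training outputs via SVD of the centred data.
y_mean = Y_train.mean(axis=0)
_, _, vt = np.linalg.svd(Y_train - y_mean, full_matrices=False)
components = vt[:n_comp]                       # (n_comp, out_dim)
coeffs_train = (Y_train - y_mean) @ components.T

# Step 2: regress PCA coefficients on parameters.
# Ridge regression stands in for the MLP here.
def add_bias(a):
    return np.hstack([a, np.ones((a.shape[0], 1))])

A = add_bias(X_train)
lam = 1e-3
weights = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ coeffs_train)

# Step 3: predict coefficients, then map back to the full output space.
Y_pred = add_bias(X_test) @ weights @ components + y_mean

rmse_model = float(np.sqrt(np.mean((Y_pred - Y_test) ** 2)))
rmse_mean = float(np.sqrt(np.mean((y_mean - Y_test) ** 2)))
print(f"PCA + ridge RMSE: {rmse_model:.4f}")
print(f"predict-the-mean RMSE: {rmse_mean:.4f}")
```

The regressor only has to predict a handful of coefficients rather than every grid cell, which is what makes this family of models workable with a few hundred to a couple of thousand training simulations.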

Come see our worlds!

ThousandWorlds is for the ML community to benchmark methods on a problem at the frontier of astronomy, and for the exoplanet community to see which methods earn their place.

Dataset · Code · Paper
