Dataset Card for FAMA Downstream Evaluation Datasets

This repository contains four benchmark datasets used to evaluate the FAMA (Foundational Astronomical Masked Autoencoder) model across heterogeneous astronomical tasks: two galaxy morphology classification datasets, one photometric redshift regression dataset, and one gravitational lens detection dataset.

1. `galaxy-desi`: Galaxy Morphology Classification (In-Distribution)

Description

The galaxy-desi dataset is a curated collection of galaxy images extracted from the DESI Legacy Imaging Surveys Data Release 9 (DR9). It serves as the primary in-distribution benchmark for fine-grained galaxy morphology classification.

Each image is a 3-band (g, r, z) cutout of size 256×256 pixels with a pixel scale of 0.262 arcseconds per pixel, centered on the galaxy’s celestial coordinates. Labels are derived from Galaxy Zoo 2 crowdsourced classifications, filtered using confidence thresholds and quality protocols.
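As a quick worked example, the angular size of each cutout follows directly from the image size and pixel scale quoted above. The function name below is illustrative and not part of any FAMA code.

```python
# Angular side length of a square cutout = pixels per side * pixel scale.
def cutout_fov_arcsec(n_pixels: int, pixel_scale_arcsec: float) -> float:
    return n_pixels * pixel_scale_arcsec


# galaxy-desi cutouts: 256 pixels per side at 0.262 arcsec/pixel.
desi_fov = cutout_fov_arcsec(256, 0.262)
print(f"{desi_fov:.2f} arcsec = {desi_fov / 60:.2f} arcmin")
# prints "67.07 arcsec = 1.12 arcmin"
```

Each galaxy-desi image therefore covers a field of roughly 1.1 arcminutes on a side.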

Task Type

Image classification into 8 morphological classes.

Total Samples

70,132 labeled galaxy images.

Class Distribution

Round elliptical: 12,321 samples
In-between elliptical: 12,193 samples
Cigar-shaped elliptical: 12,130 samples
Edge-on: 6,282 samples
Barred spirals: 4,090 samples
Unbarred spirals: 12,060 samples
Irregular (without merger): 6,738 samples
Merger: 4,318 samples

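The dataset is moderately imbalanced: the three elliptical classes and unbarred spirals each have about 12,000 samples, edge-on and irregular galaxies have 6,000 to 7,000 each, and barred spirals and mergers have roughly 4,000 each.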
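One common way to account for this kind of imbalance during fine-tuning is inverse-frequency class weighting. The sketch below computes such weights from the counts listed above. This card does not state whether the FAMA evaluation uses class weights, so treat this purely as an illustration.

```python
# Per-class sample counts copied from the class distribution above.
counts = {
    "Round elliptical": 12321,
    "In-between elliptical": 12193,
    "Cigar-shaped elliptical": 12130,
    "Edge-on": 6282,
    "Barred spirals": 4090,
    "Unbarred spirals": 12060,
    "Irregular (without merger): ": 6738,
    "Merger": 4318,
}

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency ("balanced") weights: a class with exactly the
# average number of samples gets weight 1.0, rarer classes get more.
weights = {name: total / (n_classes * n) for name, n in counts.items()}

# Ratio between the largest and smallest class sizes.
imbalance_ratio = max(counts.values()) / min(counts.values())

for name, weight in weights.items():
    print(f"{name:30s} {weight:.3f}")
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

The counts sum to the stated total of 70,132, and the largest class is about three times the size of the smallest.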

2. `galaxy-sdss`: Galaxy Morphology Classification (Out-of-Distribution)

Description

The galaxy-sdss dataset comprises galaxy images from the Sloan Digital Sky Survey (SDSS) and is used as an out-of-distribution (OOD) testbed to evaluate model generalization across different survey instruments and data distributions.

Images are labeled into five simplified morphological categories based on Galaxy Zoo–derived thresholds. The dataset follows the protocol of [Cheng et al., 2020] and uses the same train/test split as the original study.

Task Type

Image classification into 5 coarse morphological classes.

Total Samples

28,791 images (23,037 training + 5,754 test).

Class Distribution (Training Set)

Completely round smooth: 6,749 samples
In-between smooth: 6,456 samples
Cigar-shaped smooth: 464 samples
Spiral: 6,245 samples
Edge-on: 3,123 samples

Class Distribution (Test Set)

Completely round smooth: 1,687 samples
In-between smooth: 1,612 samples
Cigar-shaped smooth: 115 samples
Spiral: 1,560 samples
Edge-on: 780 samples

Note the rarity of cigar-shaped galaxies in this dataset: they make up about 2% of both splits, which leaves only 115 examples in the test set.
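The per-class fractions in the two splits can be checked with a few lines of Python. This is a small sanity-check sketch using only the counts listed above.

```python
# Per-class counts copied from the two class distributions above.
train = {
    "Completely round smooth": 6749,
    "In-between smooth": 6456,
    "Cigar-shaped smooth": 464,
    "Spiral": 6245,
    "Edge-on": 3123,
}
test = {
    "Completely round smooth": 1687,
    "In-between smooth": 1612,
    "Cigar-shaped smooth": 115,
    "Spiral": 1560,
    "Edge-on": 780,
}


def fractions(counts):
    """Return each class's share of the split as a fraction of 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


train_frac = fractions(train)
test_frac = fractions(test)

for name in train:
    print(f"{name:25s} train {train_frac[name]:6.2%}  test {test_frac[name]:6.2%}")
```

The class proportions are nearly identical in the two splits, so per-class test metrics for the cigar-shaped class rest on a small absolute number of examples rather than on a shifted class balance.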

3. `photo-z-sdss`: Photometric Redshift Estimation

Description

This dataset is constructed from the SDSS Data Release 12 Main Galaxy Sample via the CasJobs interface. It is designed for photometric redshift regression, a critical task in cosmology.

For each galaxy, multi-band (u, g, r, i, z) cutouts of size 300×300 pixels (0.396 arcsec/pixel) are retrieved from the SDSS Science Archive Server using astroquery. Only galaxies with reliable spectroscopic redshifts (ZWARNING = 0), r-band dereddened Petrosian magnitude < 17.77, and redshift in the range 0.01 < z < 0.3 are included.
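The selection cuts above can be sketched as a simple filter over catalog rows. The field names (`zwarning`, `petro_mag_r`, `extinction_r`, `z`) and the toy rows are hypothetical, and the sketch assumes the usual convention that the dereddened magnitude is the observed magnitude minus the Galactic extinction. The actual CasJobs query is not given in this card.

```python
# Hypothetical catalog rows; real rows would come from a CasJobs query.
catalog = [
    {"id": 1, "zwarning": 0, "petro_mag_r": 17.50, "extinction_r": 0.10, "z": 0.08},
    {"id": 2, "zwarning": 4, "petro_mag_r": 16.90, "extinction_r": 0.05, "z": 0.12},
    {"id": 3, "zwarning": 0, "petro_mag_r": 18.20, "extinction_r": 0.08, "z": 0.15},
    {"id": 4, "zwarning": 0, "petro_mag_r": 17.00, "extinction_r": 0.06, "z": 0.35},
]

R_MAG_LIMIT = 17.77       # dereddened r-band Petrosian magnitude limit
Z_MIN, Z_MAX = 0.01, 0.3  # allowed spectroscopic redshift range


def passes_cuts(row):
    """Apply the three selection cuts described in the text."""
    # Assumption: dereddened magnitude = observed magnitude - extinction.
    dereddened_r = row["petro_mag_r"] - row["extinction_r"]
    return (
        row["zwarning"] == 0
        and dereddened_r < R_MAG_LIMIT
        and Z_MIN < row["z"] < Z_MAX
    )


selected = [row for row in catalog if passes_cuts(row)]
print([row["id"] for row in selected])  # prints [1]
```

In this toy catalog, row 2 fails the ZWARNING cut, row 3 is too faint, and row 4 lies outside the redshift range.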

Task Type

Regression (predicting continuous redshift value from multi-band images).

Total Samples

50,896 galaxies.

Split

Training set: 10,100 samples
Test set: 40,796 samples

The split is performed using redshift-stratified sampling to ensure consistent redshift distributions between training and test sets.
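A redshift-stratified split can be sketched as follows: bin the galaxies by redshift, then draw the same fraction of each bin for training. The bin count, equal-width binning, random seed, and synthetic redshifts are all assumptions made for illustration. The card does not specify how the stratification was implemented.

```python
import random
from collections import defaultdict


def stratified_split(redshifts, train_fraction, n_bins=10,
                     z_min=0.01, z_max=0.3, seed=0):
    """Split sample indices into train/test with matching redshift distributions."""
    rng = random.Random(seed)
    width = (z_max - z_min) / n_bins

    # Group sample indices by equal-width redshift bin.
    bins = defaultdict(list)
    for idx, z in enumerate(redshifts):
        b = min(int((z - z_min) / width), n_bins - 1)  # clamp the upper edge
        bins[b].append(idx)

    # Within each bin, shuffle and send the same fraction to training.
    train, test = [], []
    for b in sorted(bins):
        members = bins[b]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Synthetic redshifts standing in for the real catalog.
data_rng = random.Random(42)
redshifts = [data_rng.uniform(0.01, 0.3) for _ in range(5000)]

# Training fraction implied by the split sizes above.
train_fraction = 10100 / 50896

train_idx, test_idx = stratified_split(redshifts, train_fraction)
print(len(train_idx), len(test_idx))
```

Because the same fraction is drawn from every bin, the training and test redshift histograms agree up to rounding within each bin.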

4. `lens-detection`: Gravitational Lens Detection (Object Detection)

Description

This dataset is designed for strong gravitational lens detection in wide-field survey images. It likely consists of image cutouts from surveys such as DESI, LSST, or the Kilo-Degree Survey (KiDS), annotated with bounding boxes around candidate lens systems (e.g., Einstein rings, arcs).

The dataset supports the object detection downstream task evaluated in the FAMA paper, where the model demonstrated significant gains over supervised baselines.

Task Type

Object detection (localization + binary classification: lens vs. non-lens).
