# Dataset Card for FAMA Downstream Evaluation Datasets

This repository contains four benchmark datasets used to evaluate the **FAMA (Foundational Astronomical Masked Autoencoder)** model across heterogeneous astronomical tasks: two galaxy morphology classification datasets, one photometric redshift regression dataset, and one gravitational lens detection dataset.

## 1. `galaxy-desi`: Galaxy Morphology Classification (In-Distribution)

### Description
The `galaxy-desi` dataset is a curated collection of galaxy images extracted from the **DESI Legacy Imaging Surveys Data Release 9 (DR9)**. It serves as the primary in-distribution benchmark for fine-grained galaxy morphology classification.

Each image is a 3-band (g, r, z) cutout of size 256×256 pixels with a pixel scale of 0.262 arcseconds per pixel, centered on the galaxy’s celestial coordinates. Labels are derived from **Galaxy Zoo 2** crowdsourced classifications, filtered using confidence thresholds and quality protocols.
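
The cutout size and pixel scale quoted above fix the angular footprint of each image. The short calculation below is only a worked example of that arithmetic; the variable names are illustrative and not part of the dataset.

```python
# Angular size of a galaxy-desi cutout, using the figures quoted in the description.
CUTOUT_SIZE_PX = 256        # pixels per side
PIXEL_SCALE_ARCSEC = 0.262  # arcseconds per pixel

fov_arcsec = CUTOUT_SIZE_PX * PIXEL_SCALE_ARCSEC  # side length in arcseconds
fov_arcmin = fov_arcsec / 60.0                    # side length in arcminutes

print(f"Each cutout spans {fov_arcsec:.2f} arcsec ({fov_arcmin:.2f} arcmin) per side")
```

Each cutout therefore covers roughly 67 arcseconds, or a little over one arcminute, on a side.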

### Task Type
Image classification into **8 morphological classes**.

### Total Samples
70,132 labeled galaxy images.

### Class Distribution
- Round elliptical: 12,321 samples  
- In-between elliptical: 12,193 samples  
- Cigar-shaped elliptical: 12,130 samples  
- Edge-on: 6,282 samples  
- Barred spirals: 4,090 samples  
- Unbarred spirals: 12,060 samples  
- Irregular (without merger): 6,738 samples  
- Merger: 4,318 samples  

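The dataset is moderately imbalanced: the three elliptical classes and unbarred spirals each contain roughly 12,000 samples, edge-on and irregular galaxies between 6,000 and 7,000 each, and barred spirals and mergers only about 4,000 each.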
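
To make the imbalance concrete, the sketch below turns the counts listed above into class fractions and inverse-frequency weights. The weighting scheme is an assumption chosen for illustration (it is a common way to compensate for imbalance in a loss function); this card does not state whether FAMA's fine-tuning used any class weighting.

```python
# Class counts copied from the galaxy-desi class distribution above.
GALAXY_DESI_COUNTS = {
    "Round elliptical": 12_321,
    "In-between elliptical": 12_193,
    "Cigar-shaped elliptical": 12_130,
    "Edge-on": 6_282,
    "Barred spirals": 4_090,
    "Unbarred spirals": 12_060,
    "Irregular (without merger)": 6_738,
    "Merger": 4_318,
}


def inverse_frequency_weights(counts):
    """Return weights proportional to 1/count, scaled so the
    average weight over all samples equals 1."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


total = sum(GALAXY_DESI_COUNTS.values())
weights = inverse_frequency_weights(GALAXY_DESI_COUNTS)

for name, n in GALAXY_DESI_COUNTS.items():
    print(f"{name:28s} {n:6d}  {n / total:6.1%}  weight={weights[name]:.2f}")
```

Under this scheme the rarest classes (barred spirals and mergers) receive about three times the weight of the most common ones.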

---

## 2. `galaxy-sdss`: Galaxy Morphology Classification (Out-of-Distribution)

### Description
The `galaxy-sdss` dataset comprises galaxy images from the **Sloan Digital Sky Survey (SDSS)** and is used as an **out-of-distribution (OOD)** testbed to evaluate model generalization across different survey instruments and data distributions.

Images are labeled into five simplified morphological categories based on Galaxy Zoo–derived thresholds. The dataset follows the protocol of [Cheng et al., 2020] and uses the same train/test split as the original study.

### Task Type
Image classification into **5 coarse morphological classes**.

### Total Samples
28,791 images (23,037 training + 5,754 test).

### Class Distribution (Training Set)
- Completely round smooth: 6,749 samples  
- In-between smooth: 6,456 samples  
- Cigar-shaped smooth: 464 samples  
- Spiral: 6,245 samples  
- Edge-on: 3,123 samples  

### Class Distribution (Test Set)
- Completely round smooth: 1,687 samples  
- In-between smooth: 1,612 samples  
- Cigar-shaped smooth: 115 samples  
- Spiral: 1,560 samples  
- Edge-on: 780 samples  

Cigar-shaped smooth galaxies are rare in this dataset: they account for about 2% of each split (464 training and 115 test samples).
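
The per-class counts listed above can be converted into fractions to compare the two splits directly. The following sketch uses only the numbers in this card; the helper name `class_fractions` is illustrative.

```python
# Per-class counts copied from the galaxy-sdss training and test distributions above.
TRAIN_COUNTS = {
    "Completely round smooth": 6_749,
    "In-between smooth": 6_456,
    "Cigar-shaped smooth": 464,
    "Spiral": 6_245,
    "Edge-on": 3_123,
}
TEST_COUNTS = {
    "Completely round smooth": 1_687,
    "In-between smooth": 1_612,
    "Cigar-shaped smooth": 115,
    "Spiral": 1_560,
    "Edge-on": 780,
}


def class_fractions(counts):
    """Map each class name to its share of the split."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


train_frac = class_fractions(TRAIN_COUNTS)
test_frac = class_fractions(TEST_COUNTS)

for name in TRAIN_COUNTS:
    print(f"{name:26s} train={train_frac[name]:6.2%}  test={test_frac[name]:6.2%}")
```

The fractions agree to within a fraction of a percentage point for every class, so the two splits have nearly identical class proportions.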

---

## 3. `photo-z-sdss`: Photometric Redshift Estimation

### Description
This dataset is constructed from the **SDSS Data Release 12 Main Galaxy Sample** via the CasJobs interface. It is designed for **photometric redshift regression**, a critical task in cosmology.

For each galaxy, multi-band (u, g, r, i, z) cutouts of size 300×300 pixels (0.396 arcsec/pixel) are retrieved from the SDSS Science Archive Server using `astroquery`. Only galaxies with reliable spectroscopic redshifts (`ZWARNING = 0`), r-band dereddened Petrosian magnitude < 17.77, and redshift in the range 0.01 < z < 0.3 are included.
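
The three selection cuts can be expressed as a single predicate. The sketch below is a minimal illustration on toy records: the field names (`zwarning`, `petro_mag_r_dered`, `redshift`) are hypothetical stand-ins, since this card does not list the actual catalogue column names, and the original selection may have been done in the CasJobs query itself rather than in Python.

```python
from typing import NamedTuple


class GalaxyRecord(NamedTuple):
    # Hypothetical field names; the real catalogue columns are not given in this card.
    zwarning: int             # spectroscopic redshift warning flag
    petro_mag_r_dered: float  # dereddened r-band Petrosian magnitude
    redshift: float           # spectroscopic redshift


def passes_selection(galaxy):
    """Apply the cuts described above: ZWARNING = 0, r < 17.77, 0.01 < z < 0.3."""
    return (
        galaxy.zwarning == 0
        and galaxy.petro_mag_r_dered < 17.77
        and 0.01 < galaxy.redshift < 0.3
    )


# Toy records, invented purely to exercise each cut.
catalogue = [
    GalaxyRecord(0, 16.5, 0.05),   # passes all cuts
    GalaxyRecord(0, 18.2, 0.10),   # too faint
    GalaxyRecord(4, 16.0, 0.12),   # warning flag set
    GalaxyRecord(0, 17.0, 0.35),   # redshift above the range
    GalaxyRecord(0, 17.5, 0.005),  # redshift below the range
]

selected = [g for g in catalogue if passes_selection(g)]
print(f"{len(selected)} of {len(catalogue)} toy records pass the cuts")
```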

### Task Type
Regression (predicting continuous redshift value from multi-band images).

### Total Samples
50,896 galaxies.

### Split
- Training set: 10,100 samples  
- Test set: 40,796 samples  

The split is performed using **redshift-stratified sampling** to ensure consistent redshift distributions between training and test sets.
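
One way such a split could be implemented is sketched below. It assumes equal-width redshift bins, a fixed random seed, and a per-bin training fraction equal to the overall 10,100 / 50,896 ratio; none of these details are specified in this card, so the function is an illustration of the general technique rather than the procedure actually used.

```python
import random
from collections import defaultdict


def redshift_stratified_split(redshifts, train_fraction, n_bins=10,
                              z_min=0.01, z_max=0.3, seed=0):
    """Split sample indices so that every redshift bin contributes
    approximately `train_fraction` of its members to the training set."""
    rng = random.Random(seed)
    bin_width = (z_max - z_min) / n_bins

    # Group sample indices by redshift bin, clamping edge values into range.
    bins = defaultdict(list)
    for idx, z in enumerate(redshifts):
        b = int((z - z_min) / bin_width)
        b = max(0, min(b, n_bins - 1))
        bins[b].append(idx)

    train, test = [], []
    for b in sorted(bins):
        members = bins[b]
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


# Toy usage on synthetic redshifts (not real data).
toy_rng = random.Random(42)
toy_z = [toy_rng.uniform(0.01, 0.3) for _ in range(5000)]
train_idx, test_idx = redshift_stratified_split(toy_z, train_fraction=10_100 / 50_896)
print(len(train_idx), len(test_idx))
```

Because sampling happens within each bin, the training and test sets end up with closely matching redshift histograms even though the training set is much smaller.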

---

## 4. `lens-detection`: Gravitational Lens Detection (Object Detection)

### Description
This dataset is designed for **strong gravitational lens detection** in wide-field survey images. It likely consists of image cutouts from surveys such as DESI, LSST, or the Kilo-Degree Survey (KiDS), annotated with bounding boxes around candidate lens systems (e.g., Einstein rings, arcs).

The dataset supports the **object detection** downstream task evaluated in the FAMA paper, where the model demonstrated significant gains over supervised baselines.

### Task Type
Object detection (localization + binary classification: lens vs. non-lens).