Production LightGBM model — 49 features, 5-fold GroupKFold CV

Files changed (5)

README.md +138 -18
features.py +65 -0
model_days.pkl +3 -0
model_info.json +164 -0
model_mag.pkl +3 -0

README.md CHANGED

@@ -1,24 +1,144 @@
 ---
-tags: [astronomy, supernova, regression, ztf, timm, convnext]
-datasets: [MultimodalUniverse/btsbot]
 license: mit
 ---
-# Supernova Peak Predictor
-Predicts **when** and **how bright** a supernova will become from early ZTF observations.
-## Novel Contribution
-First model to predict `days_to_peak` and `peakmag` as regression targets from rise-phase alerts.
 ## Architecture
-ConvNeXt-pico (Galaxy Zoo pretrained) + 23-feature metadata MLP -> fusion -> dual regression heads
-8,811,816 params (8,425,576 trainable)
-## Test Results
-| Metric | days_to_peak | peakmag |
-|--------|-------------|---------|
-| MAE | 124.03 days | 0.324 mag |
-| Median AE | 29.15 days | 0.223 mag |
-## Data
-[MultimodalUniverse/btsbot](https://huggingface.co/datasets/MultimodalUniverse/btsbot) — rise-phase supernovae only
-3x63x63 image triplets + 23 metadata features. Leaky features (age, days_since_peak) excluded.

 ---
+tags:
+  - astronomy
+  - supernova
+  - regression
+  - ztf
+  - lightgbm
+  - tabular
+  - time-domain
+datasets:
+  - MultimodalUniverse/btsbot
 license: mit
+pipeline_tag: tabular-regression
 ---
+# 🌟 Supernova Peak Predictor
+**Predicts when and how bright a supernova will become from its earliest ZTF alert observations.**
+Given a single ZTF alert packet from a rising supernova, this model predicts:
+- **`days_to_peak`** — how many days until the supernova reaches maximum brightness
+- **`peakmag`** — the peak apparent magnitude it will achieve
+This enables astronomers to answer: *"Should I point the telescope at this target tonight, or can it wait?"*
+## Why this matters
+The bottleneck in transient astronomy isn't detection — ZTF finds thousands of candidates per night. The bottleneck is **follow-up telescope time**. Spectroscopic observations are expensive and limited. Every night, astronomers must decide which of dozens of candidates to prioritize.
+Currently, that decision is reactive: *"Is this a supernova? Yes → follow up."* ([BTSbot](https://huggingface.co/nabeelr/BTSbot-convnext-pico-galaxyzoo-metadata) solves this with 98.5% accuracy.)
+Our model makes it **proactive**: *"This SN will peak at mag 17.8 in 5 days → schedule it now"* vs *"This one won't peak for 3 months → deprioritize."*
+With the Vera C. Rubin Observatory (LSST) coming online, alert rates will jump from ~100K/night to ~10M/night. Automated triage like this will be essential.
+## Performance
+Evaluated with **5-fold grouped cross-validation** (grouped by supernova object ID to prevent data leakage — no alerts from the same SN appear in both train and validation).
+### Overall (27,202 alerts from 3,806 supernovae)
+| Target | MAE | Median AE | P90 |
+|--------|-----|-----------|-----|
+| **days_to_peak** | 118.8 days | 29.8 days | 327.2 days |
+| **peakmag** | 0.257 mag | 0.178 mag | 0.536 mag |
+### By number of prior detections
+| Detection stage | n | MAE days | Median days | MAE mag | Median mag |
+|----------------|---|----------|-------------|---------|------------|
+| **1-3 (first catches)** | 4,817 | 70.8 | **16.5** | 0.391 | 0.304 |
+| 4-10 (early rise) | 9,683 | 114.4 | 26.7 | 0.273 | 0.197 |
+| 11-50 (sampled rise) | 11,029 | 132.4 | 36.3 | 0.199 | 0.147 |
+| 50+ (monitored) | 1,673 | 192.0 | 77.6 | 0.164 | 0.117 |
+**Key finding:** Timing is best on the hardest, most valuable cases: with only 1-3 prior detections, the median timing error is **16.5 days**, the lowest of any stage. Magnitude is hardest there, with a median error of **0.304 mag**, the highest of any stage.
+### By true time-to-peak
+| Horizon | n | MAE days | Median days | MAE mag | Median mag |
+|---------|---|----------|-------------|---------|------------|
+| **Imminent (<7d)** | 12,694 | 66.2 | 24.9 | 0.330 | 0.229 |
+| **Soon (7-30d)** | 10,431 | 62.9 | **20.7** | **0.183** | **0.150** |
+| Weeks (30-100d) | 1,086 | 75.5 | 30.2 | 0.211 | 0.148 |
+| Distant (100d+) | 2,991 | 552.5 | 433.0 | 0.229 | 0.164 |
+The **"soon" horizon (7-30 days)** is the sweet spot — exactly the window where scheduling decisions matter most, and where the model achieves **0.150 mag** median error.
 ## Architecture
+**LightGBM** gradient boosted trees on **49 engineered features** extracted from ZTF alert metadata.
+No images are used. We tested a ConvNeXt-pico CNN on the 63×63 difference image triplets and found that **metadata alone outperforms the full multimodal model** (MAE mag 0.257 vs 0.271). The images add noise at this resolution. This is itself a useful finding — it means the predictor can run at alert-stream speed (microseconds per prediction, no GPU needed).
+### Top features (by LightGBM importance)
+**For days_to_peak:** `ncovhist` (coverage history), `distpsnr1` (distance to nearest PS1 source), `distpsnr2`, `neargaia`, `maxmag_so_far`
+**For peakmag:** `maggaia` (Gaia magnitude), `peakmag_so_far` (brightest seen), `sgscore1` (star/galaxy score), `maxmag_so_far`, `ndethist`
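+Host-galaxy and environment properties (PS1 colors, star/galaxy scores, distances) make up most of the top timing features, although the single most important one, `ncovhist`, is a survey-coverage count rather than a host property. This suggests the model is learning that **where a supernova lives** (host type, distance, environment) constrains **how it evolves**.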
+## What we tried that didn't work
+| Approach | Result |
+|----------|--------|
+| ConvNeXt-pico CNN on 63×63 ZTF image triplets + metadata | 6-20% **worse** than metadata-only across all bins |
+| Simple MLP (114K params) on 23 raw features | Competitive but 2-5% worse than tree models with engineered features |
+| Including `age` / `days_since_peak` as input features | Creates direct data leakage (`days_to_peak = age - days_since_peak`) |
+## Training data
+- **Source:** [MultimodalUniverse/btsbot](https://huggingface.co/datasets/MultimodalUniverse/btsbot) — ZTF Bright Transient Survey alerts
+- **Filter:** Rise-phase supernovae only (`is_rise=True`, `is_SN=True`)
+- **Size:** 27,202 alerts from 3,806 unique supernovae
+- **Splits:** 5-fold GroupKFold by object ID (no alert-level leakage)
+## Usage
+```python
+import json
+import pickle
+import numpy as np
+from huggingface_hub import hf_hub_download
+from features import engineer_features  # download features.py from this repo first
+REPO_ID = "hawthorneluke/supernova-peak-predictor"
+# Download the model files; context managers close each file after loading
+with open(hf_hub_download(REPO_ID, "model_days.pkl"), "rb") as f:
+    model_days = pickle.load(f)
+with open(hf_hub_download(REPO_ID, "model_mag.pkl"), "rb") as f:
+    model_mag = pickle.load(f)
+with open(hf_hub_download(REPO_ID, "model_info.json")) as f:
+    info = json.load(f)
+# Your ZTF alert metadata (example); fields left out fall back to the defaults in features.py
+alert = {"magpsf": 19.2, "sigmapsf": 0.15, "ndethist": 3}  # ... plus the other ZTF alert fields
+feats = engineer_features(alert)
+# Predict: columns must follow the training order stored in model_info.json
+X = np.array([[feats[c] for c in info["feature_cols"]]], dtype=np.float32)
+days_pred = model_days.predict(X)[0]
+mag_pred = model_mag.predict(X)[0]
+print(f"Predicted: peak in {days_pred:.1f} days at magnitude {mag_pred:.2f}")
+```
+## Interactive demo
+Try it live: [hawthorneluke/supernova-peak-predictor-demo](https://huggingface.co/spaces/hawthorneluke/supernova-peak-predictor-demo)
+## Limitations
+- **Long-horizon predictions are poor.** For SNe >100 days from peak, the MAE is 552 days. The model essentially can't predict these — they're rare, slow-evolving transients with ambiguous early signatures.
+- **No uncertainty quantification.** The model gives point estimates. A production system would need prediction intervals.
+- **ZTF-specific.** Features are tied to ZTF alert schema. Adaptation to LSST/Rubin alerts would require feature remapping.
+- **No spectroscopic type prediction.** We predict timing and brightness but not SN type (Ia vs II vs Ibc). This would be a natural extension.
+## Citation
+If you use this model, please cite the underlying data:
+```bibtex
+@article{rehemtulla2024btsbot,
+  title={BTSbot: A Multi-modal Deep Learning Model for Automated Bright Transient Identification},
+  author={Rehemtulla, Nabeel and others},
+  journal={arXiv preprint arXiv:2401.15167},
+  year={2024}
+}
+```
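
Review note on the grouped cross-validation described in the README: the sketch below shows the property that matters, namely that every alert from one supernova lands in the same fold. It is a dependency-free illustration, not the training code from this commit; `group_folds` and the object IDs are hypothetical, and the real pipeline presumably uses a library splitter such as scikit-learn's `GroupKFold`.

```python
from collections import defaultdict


def group_folds(groups, n_splits=5):
    """Assign sample indices to folds so that samples sharing a group stay together."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    folds = [[] for _ in range(n_splits)]
    # Greedy balancing: place the largest groups first, each into the smallest fold so far.
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[g])
    return folds


# Hypothetical object IDs, one entry per alert (made up for illustration).
object_ids = (["ZTF_a"] * 3 + ["ZTF_b"] * 2 + ["ZTF_c"] * 4
              + ["ZTF_d"] + ["ZTF_e"] * 2 + ["ZTF_f"])
folds = group_folds(object_ids, n_splits=3)

for k, val_idx in enumerate(folds):
    val_ids = {object_ids[i] for i in val_idx}
    train_ids = {object_ids[i] for j, f in enumerate(folds) if j != k for i in f}
    # No supernova contributes alerts to both the training and validation side.
    assert not (val_ids & train_ids)
```

Splitting at the alert level instead would let the model see later alerts of a supernova it is being validated on, which would make the reported errors look better than they are.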

features.py ADDED

	@@ -0,0 +1,65 @@

+"""Feature engineering for Supernova Peak Predictor."""
+def engineer_features(row):
+    """Extract features from a ZTF alert metadata dict.
+    Args:
+        row: dict with ZTF alert fields (magpsf, sigmapsf, etc.)
+    Returns:
+        dict of engineered features
+    """
+    feats = {}
+    feats['magpsf'] = float(row.get('magpsf', 0) or 0)
+    feats['sigmapsf'] = float(row.get('sigmapsf', 0) or 0)
+    feats['magap'] = float(row.get('magap', 0) or 0)
+    feats['sigmagap'] = float(row.get('sigmagap', 0) or 0)
+    feats['diffmaglim'] = float(row.get('diffmaglim', 0) or 0)
+    feats['peakmag_so_far'] = float(row.get('peakmag_so_far', 0) or 0)
+    feats['maxmag_so_far'] = float(row.get('maxmag_so_far', 0) or 0)
+    feats['mag_range'] = feats['maxmag_so_far'] - feats['peakmag_so_far']
+    feats['mag_vs_peak'] = feats['magpsf'] - feats['peakmag_so_far']
+    feats['mag_vs_lim'] = feats['diffmaglim'] - feats['magpsf']
+    feats['mag_psf_ap_diff'] = feats['magpsf'] - feats['magap']
+    feats['ndethist'] = float(row.get('ndethist', 0) or 0)
+    feats['ncovhist'] = float(row.get('ncovhist', 0) or 0)
+    feats['nnotdet'] = float(row.get('nnotdet', 0) or 0)
+    feats['nmtchps'] = float(row.get('nmtchps', 0) or 0)
+    feats['det_fraction'] = feats['ndethist'] / (feats['ncovhist'] + 1)
+    feats['N'] = float(row.get('N', 0) or 0)
+    feats['nneg'] = float(row.get('nneg', 0) or 0)
+    feats['nbad'] = float(row.get('nbad', 0) or 0)
+    feats['fwhm'] = float(row.get('fwhm', 0) or 0)
+    feats['chipsf'] = float(row.get('chipsf', 0) or 0)
+    feats['chinr'] = float(row.get('chinr', 0) or 0)
+    feats['sharpnr'] = float(row.get('sharpnr', 0) or 0)
+    feats['scorr'] = float(row.get('scorr', 0) or 0)
+    feats['sky'] = float(row.get('sky', 0) or 0)
+    feats['classtar'] = float(row.get('classtar', 0) or 0)
+    feats['new_drb'] = float(row.get('new_drb', 0) or 0)
+    feats['drb'] = float(row.get('drb', 0) or 0)
+    feats['exptime'] = float(row.get('exptime', 30) or 30)
+    feats['sgscore1'] = float(row.get('sgscore1', 0) or 0)
+    feats['distpsnr1'] = float(row.get('distpsnr1', 0) or 0)
+    feats['sgscore2'] = float(row.get('sgscore2', 0) or 0)
+    feats['distpsnr2'] = float(row.get('distpsnr2', 0) or 0)
+    feats['distnr'] = float(row.get('distnr', 0) or 0)
+    feats['magnr'] = float(row.get('magnr', 0) or 0)
+    feats['mag_vs_host'] = feats['magpsf'] - feats['magnr']
+    # Gaia distance: values at or below -998 (the -999 "missing" sentinel) become 0.
+    v = row.get('neargaia', 0) or 0
+    feats['neargaia'] = float(v) if float(v) > -998 else 0
+    # Gaia magnitude: same sentinel handling.
+    v = row.get('maggaia', 0) or 0
+    feats['maggaia'] = float(v) if float(v) > -998 else 0
+    for col in ['sgmag1', 'srmag1', 'simag1', 'szmag1']:
+        val = row.get(col, -999) or -999
+        feats[col] = float(val) if float(val) > -998 else 0
+    sg, sr, si, sz = feats['sgmag1'], feats['srmag1'], feats['simag1'], feats['szmag1']
+    feats['host_g_r'] = (sg - sr) if sg > 0 and sr > 0 else 0
+    feats['host_r_i'] = (sr - si) if sr > 0 and si > 0 else 0
+    feats['host_i_z'] = (si - sz) if si > 0 and sz > 0 else 0
+    feats['fid'] = float(row.get('fid', 1) or 1)
+    feats['is_g_band'] = 1.0 if feats['fid'] == 1 else 0.0
+    feats['ndethist_x_magrange'] = feats['ndethist'] * feats['mag_range']
+    feats['snr_proxy'] = feats['mag_vs_lim'] / (feats['sigmapsf'] + 0.01)
+    return feats
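
Review note on the missing-value handling in `features.py`: the sketch below isolates the two rules the function applies, mapping sentinel values to 0 and computing a host colour only when both magnitudes are present. `clean_sentinel` and `host_color` are hypothetical helper names introduced here for illustration; they mirror the inline expressions above but are not part of the commit.

```python
def clean_sentinel(value, floor=-998.0):
    """Return value as a float, mapping None or sentinel values (<= floor) to 0.0."""
    v = float(value) if value is not None else -999.0
    return v if v > floor else 0.0


def host_color(blue_mag, red_mag):
    """Colour index blue - red, or 0.0 when either magnitude is missing (encoded as 0.0)."""
    return blue_mag - red_mag if blue_mag > 0 and red_mag > 0 else 0.0


# Hypothetical PS1 magnitudes for a host with no r-band match.
sg = clean_sentinel(21.4)
sr = clean_sentinel(-999.0)   # sentinel for "no measurement"
si = clean_sentinel(None)     # field absent from the alert

g_r = host_color(sg, sr)      # 0.0, because the r-band magnitude is missing
```

One consequence of this encoding is that a missing colour and a genuinely zero colour are indistinguishable to the model. LightGBM can also handle NaN inputs natively, but switching encodings would require retraining, so that is only a possible follow-up.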

model_days.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0d08ad55b1af0b994feab7a2e35437916c5521207be70ef15a0498078d2c506c
+size 1382042
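
Review note on the Git LFS pointer above: the diff stores only the SHA-256 digest and byte size of the pickle, not the model itself. A minimal sketch of checking a downloaded file against such a pointer follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, and the demo uses made-up bytes rather than the real model file.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data, pointer):
    """Return True if the bytes have the size and SHA-256 digest recorded in the pointer."""
    if pointer["algo"] != "sha256":
        raise ValueError(f"unsupported hash algorithm: {pointer['algo']}")
    return (len(data) == pointer["size"]
            and hashlib.sha256(data).hexdigest() == pointer["digest"])


# Demo with stand-in bytes; a real check would read the downloaded .pkl in binary mode.
payload = b"stand-in bytes, not the real model file"
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(payload).hexdigest()}\n"
    f"size {len(payload)}\n"
)
pointer = parse_lfs_pointer(pointer_text)
ok = matches_pointer(payload, pointer)
```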

model_info.json ADDED

	@@ -0,0 +1,164 @@

+{
+  "model_type": "LightGBM",
+  "feature_cols": [
+    "magpsf",
+    "sigmapsf",
+    "magap",
+    "sigmagap",
+    "diffmaglim",
+    "peakmag_so_far",
+    "maxmag_so_far",
+    "mag_range",
+    "mag_vs_peak",
+    "mag_vs_lim",
+    "mag_psf_ap_diff",
+    "ndethist",
+    "ncovhist",
+    "nnotdet",
+    "nmtchps",
+    "det_fraction",
+    "N",
+    "nneg",
+    "nbad",
+    "fwhm",
+    "chipsf",
+    "chinr",
+    "sharpnr",
+    "scorr",
+    "sky",
+    "classtar",
+    "new_drb",
+    "drb",
+    "exptime",
+    "sgscore1",
+    "distpsnr1",
+    "sgscore2",
+    "distpsnr2",
+    "distnr",
+    "magnr",
+    "mag_vs_host",
+    "neargaia",
+    "maggaia",
+    "sgmag1",
+    "srmag1",
+    "simag1",
+    "szmag1",
+    "host_g_r",
+    "host_r_i",
+    "host_i_z",
+    "fid",
+    "is_g_band",
+    "ndethist_x_magrange",
+    "snr_proxy"
+  ],
+  "n_features": 49,
+  "cv_metrics": {
+    "mae_d": 118.77106475830078,
+    "mae_m": 0.2573636472225189,
+    "med_d": 29.818622589111328,
+    "med_m": 0.17830181121826172,
+    "p90_d": 327.1890563964844,
+    "p90_m": 0.5362690091133118
+  },
+  "fold_results": [
+    {
+      "fold": 1,
+      "xgb_mae_d": 119.88993072509766,
+      "xgb_mae_m": 0.2529917061328888,
+      "xgb_med_d": 27.38703727722168,
+      "xgb_med_m": 0.17678070068359375,
+      "lgb_mae_d": 120.7685546875,
+      "lgb_mae_m": 0.2521520256996155,
+      "lgb_med_d": 29.628095626831055,
+      "lgb_med_m": 0.17668533325195312
+    },
+    {
+      "fold": 2,
+      "xgb_mae_d": 114.90345001220703,
+      "xgb_mae_m": 0.2524799108505249,
+      "xgb_med_d": 31.275833129882812,
+      "xgb_med_m": 0.17509841918945312,
+      "lgb_mae_d": 111.13099670410156,
+      "lgb_mae_m": 0.25365331768989563,
+      "lgb_med_d": 27.839418411254883,
+      "lgb_med_m": 0.17804908752441406
+    },
+    {
+      "fold": 3,
+      "xgb_mae_d": 140.1485137939453,
+      "xgb_mae_m": 0.27031439542770386,
+      "xgb_med_d": 38.103515625,
+      "xgb_med_m": 0.17546653747558594,
+      "lgb_mae_d": 135.17355346679688,
+      "lgb_mae_m": 0.27064448595046997,
+      "lgb_med_d": 31.190265655517578,
+      "lgb_med_m": 0.17244434356689453
+    },
+    {
+      "fold": 4,
+      "xgb_mae_d": 115.05944061279297,
+      "xgb_mae_m": 0.26252108812332153,
+      "xgb_med_d": 31.309772491455078,
+      "xgb_med_m": 0.18577289581298828,
+      "lgb_mae_d": 114.0598373413086,
+      "lgb_mae_m": 0.25898680090904236,
+      "lgb_med_d": 29.387041091918945,
+      "lgb_med_m": 0.18349266052246094
+    },
+    {
+      "fold": 5,
+      "xgb_mae_d": 117.32633209228516,
+      "xgb_mae_m": 0.24888695776462555,
+      "xgb_med_d": 29.38198471069336,
+      "xgb_med_m": 0.18015766143798828,
+      "lgb_mae_d": 112.7234115600586,
+      "lgb_mae_m": 0.2513831853866577,
+      "lgb_med_d": 31.076351165771484,
+      "lgb_med_m": 0.18096446990966797
+    }
+  ],
+  "n_train": 27202,
+  "n_objects": 3806,
+  "target_stats": {
+    "days_mean": 99.75027465820312,
+    "days_std": 300.12255859375,
+    "days_median": 7.939062595367432,
+    "mag_mean": 18.47530746459961,
+    "mag_std": 0.8604589104652405,
+    "mag_median": 18.654937744140625
+  },
+  "feature_importance_days": {
+    "ncovhist": 1014.0,
+    "distpsnr1": 848.0,
+    "distpsnr2": 777.0,
+    "neargaia": 754.0,
+    "maxmag_so_far": 645.0,
+    "nnotdet": 637.0,
+    "host_r_i": 629.0,
+    "host_g_r": 595.0,
+    "maggaia": 543.0,
+    "sgscore1": 518.0,
+    "simag1": 510.0,
+    "host_i_z": 503.0,
+    "sgmag1": 477.0,
+    "distnr": 461.0,
+    "magnr": 451.0
+  },
+  "feature_importance_mag": {
+    "maggaia": 713.0,
+    "peakmag_so_far": 666.0,
+    "sgscore1": 576.0,
+    "maxmag_so_far": 570.0,
+    "ndethist": 557.0,
+    "det_fraction": 542.0,
+    "nmtchps": 534.0,
+    "host_i_z": 519.0,
+    "N": 517.0,
+    "sgscore2": 505.0,
+    "distpsnr2": 504.0,
+    "host_r_i": 503.0,
+    "host_g_r": 493.0,
+    "ncovhist": 490.0,
+    "simag1": 483.0
+  }
+}
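
Review note on `cv_metrics` versus `fold_results`: the JSON does not say how the overall numbers were aggregated, so the sketch below simply averages the per-fold LightGBM MAE values (copied from above and rounded to four decimals) and compares them with the reported totals. The variable names are introduced here for illustration.

```python
from statistics import mean

# Per-fold LightGBM MAE values from "fold_results", rounded to 4 decimals.
lgb_mae_d = [120.7686, 111.1310, 135.1736, 114.0598, 112.7234]  # days
lgb_mae_m = [0.2522, 0.2537, 0.2706, 0.2590, 0.2514]            # mag

# Overall values from "cv_metrics", rounded the same way.
reported = {"mae_d": 118.7711, "mae_m": 0.2574}

fold_mean_d = mean(lgb_mae_d)
fold_mean_m = mean(lgb_mae_m)
fold_spread_d = max(lgb_mae_d) - min(lgb_mae_d)

print(f"fold mean MAE days: {fold_mean_d:.3f} (reported {reported['mae_d']:.3f})")
print(f"fold mean MAE mag:  {fold_mean_m:.4f} (reported {reported['mae_m']:.4f})")
print(f"fold-to-fold spread in MAE days: {fold_spread_d:.1f}")
```

The fold averages agree closely with `cv_metrics`, which suggests the headline numbers come from the `lgb_*` columns rather than the `xgb_*` ones. Fold 3 stands out with the largest timing MAE.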

model_mag.pkl ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d1c9944bd94483a4b3a036e1369fc77bb0d83a4c1b268701995eda1d2970ec2a
+size 1461336