---
license: mit
---

# Exoplanet Classification using NASA Datasets (Kepler, K2, TESS) — README

This README explains **how we trained** a **mission-agnostic** classifier to predict whether a detected object is **CONFIRMED**, **CANDIDATE**, or **FALSE POSITIVE** using public catalogs from **Kepler**, **K2**, and **TESS**. It documents data ingestion, harmonization, feature engineering (including SNR logic), preprocessing, model selection, evaluation, class-imbalance handling, explainability, and inference artifacts.

---

## Project Structure (key files)

```
.
├─ Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb  # main notebook
└─ artifacts/
   ├─ exoplanet_best_model.joblib       # full sklearn Pipeline (preprocessing + estimator)
   ├─ exoplanet_feature_columns.json    # exact feature order used during training
   ├─ exoplanet_class_labels.json      # label names in prediction index order
   └─ exoplanet_metadata.json          # summary (best model name, n_features, timestamp)
```

---

## Datasets

- **Kepler Cumulative:** `cumulative_YYYY.MM.DD_HH.MM.SS.csv`
- **K2 Planets & Candidates:** `k2pandc_YYYY.MM.DD_HH.MM.SS.csv`
- **TESS Objects of Interest (TOI):** `TOI_YYYY.MM.DD_HH.MM.SS.csv`

Each table contains per-candidate features and a mission-specific disposition field. These missions are **complementary** (different cadences/systems), so merging them increases coverage and diversity.

---

## Environment

- Python ≥ 3.9
- Libraries: `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `imbalanced-learn` (`imblearn`), `joblib`; optional: `xgboost`, `shap`.

> If `xgboost` or `shap` is not installed, the notebook automatically skips the corresponding steps.

---

## Reproducibility

- Random seed fixed to **42** for model splits and estimators.
- Stratified splits keep class composition consistent.
- The exact **feature order** and **class label order** are saved to JSON and reused at inference.

---

## Data Loading (robust, no `low_memory`)

CSV parsing is **robust**:

- Try multiple separators and engines (preferring the **Python engine** with `on_bad_lines="skip"`).
- Normalize column names to `snake_case`.
- Re-detect the header if the first read looks suspicious (e.g., numeric column names).
- Drop columns that are entirely empty.

> We deliberately **do not** pass `low_memory` (it is unsupported by the Python engine and can degrade type inference).
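
The loading strategy above can be sketched as follows. This is a minimal illustration, not the notebook's exact implementation; `load_catalog` is an illustrative name, and the `comment="#"` argument assumes the archive-style commented header lines some of these CSVs carry:

```python
import re

import pandas as pd

def load_catalog(path: str) -> pd.DataFrame:
    """Robustly read a catalog CSV (sketch): try separators, skip bad lines."""
    df = None
    for sep in (",", ";", "\t"):
        try:
            # Python engine supports on_bad_lines="skip"; low_memory is deliberately not passed.
            df = pd.read_csv(path, sep=sep, engine="python", on_bad_lines="skip", comment="#")
            if df.shape[1] > 1:  # a single-column result suggests the wrong separator
                break
        except Exception:
            continue
    if df is None:
        raise IOError(f"Could not parse {path}")
    # Normalize column names to snake_case.
    df.columns = [re.sub(r"[^0-9a-zA-Z]+", "_", c).strip("_").lower() for c in df.columns]
    # Drop columns that are entirely empty.
    return df.dropna(axis=1, how="all")
```

The header re-detection step is omitted here for brevity; the notebook additionally retries the read when the parsed header looks suspicious.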

---

## Label Harmonization

Different catalogs use different disposition labels. We map them all to a **unified set**:

- **CONFIRMED**: `CONFIRMED`, `CP`, `KP`
- **CANDIDATE**: `CANDIDATE`, `PC`, `CAND`, `APC`
- **FALSE POSITIVE**: `FALSE POSITIVE`, `FP`, `FA`, `REFUTED`

Rows without a recognized disposition are dropped before modeling.
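
The mapping above can be expressed as a simple dictionary lookup; this is a sketch, with `harmonize_labels` as an illustrative helper name:

```python
import pandas as pd

# Mission-specific disposition values mapped to the unified label set.
LABEL_MAP = {
    "CONFIRMED": "CONFIRMED", "CP": "CONFIRMED", "KP": "CONFIRMED",
    "CANDIDATE": "CANDIDATE", "PC": "CANDIDATE", "CAND": "CANDIDATE", "APC": "CANDIDATE",
    "FALSE POSITIVE": "FALSE POSITIVE", "FP": "FALSE POSITIVE",
    "FA": "FALSE POSITIVE", "REFUTED": "FALSE POSITIVE",
}

def harmonize_labels(s: pd.Series) -> pd.Series:
    """Map raw dispositions to the unified set; unrecognized values become NaN,
    and those rows are dropped before modeling."""
    return s.astype(str).str.strip().str.upper().map(LABEL_MAP)
```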

---

## Feature Harmonization (mission-agnostic)

We build a **canonical numeric feature set** by scanning a list of **alias groups** and, per mission, picking the **first numeric-coercible** column present, e.g.:

- Period: `['koi_period','orbital_period','period','pl_orbper','per']`
- Duration: `['koi_duration','duration','tran_dur']`
- Depth: `['koi_depth','depth','tran_depth']`
- Stellar context: `['koi_steff','st_teff','teff']`, `['koi_slogg','st_logg','logg']`, `['koi_smet','st_metfe','feh']`, `['koi_srad','st_rad','star_radius']`
- Transit geometry: `['koi_impact','impact','b']`, `['koi_sma','sma','a_rs','semi_major_axis']`
- **SNR/MES (broad):** `['koi_snr','koi_model_snr','koi_mes','mes','max_mes','snr','detection_snr','transit_snr','signal_to_noise','koi_max_snr']`

We **coerce to numeric** (`errors='coerce'`) and accept a column if any non-null values remain.
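
The selection rule can be sketched in a few lines; `pick_alias` is an illustrative name for the notebook's internal helper:

```python
from typing import List, Optional

import pandas as pd

def pick_alias(df: pd.DataFrame, aliases: List[str]) -> Optional[pd.Series]:
    """Return the first alias column that coerces to numeric with any
    non-null values remaining; None if no alias qualifies."""
    for name in aliases:
        if name in df.columns:
            s = pd.to_numeric(df[name], errors="coerce")
            if s.notna().any():
                return s
    return None
```

For example, `pick_alias(df, ['koi_period','orbital_period','period','pl_orbper','per'])` yields the canonical period column for whichever mission's table is loaded.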

---

## Feature Engineering

Physics-motivated features improve class separability:

- **Duty cycle**:
  `duty_cycle = koi_duration / (koi_period * 24.0)`
  (transit duration relative to the orbital period; duration is in hours, period in days)
- **Log transforms**:
  `log_koi_period = log10(koi_period)`
  `log_koi_depth = log10(koi_depth)`
- **Equilibrium-temperature proxy**:
  - Simple: `teq_proxy = koi_steff`
  - Refined (if `a/R*` is available or derivable): `teq_proxy = koi_steff / sqrt(2 * a_rs)`
- **SNR logic (guaranteed SNR-like feature)**:
  - If any mission exposes a usable SNR/MES, we use **`koi_snr`** and also compute **`log_koi_snr`**.
  - Otherwise we compute a **mission-agnostic proxy**:

    ```
    snr_proxy = koi_depth * sqrt( koi_duration / (koi_period * 24.0) )
    log_snr_proxy = log10(snr_proxy)
    ```

> The exact features used in training are whatever appear in `artifacts/exoplanet_feature_columns.json` (the list is built dynamically from the files you actually load).
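
Putting the rules above together, the engineered columns can be derived from a harmonized frame roughly like this. A sketch only: `engineer_features` is an illustrative name, and it assumes the canonical column names with period in days and duration in hours:

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the engineered columns described above (sketch)."""
    out = df.copy()
    out["duty_cycle"] = out["koi_duration"] / (out["koi_period"] * 24.0)
    # log10 of non-positive values would be -inf; mask them to NaN for the imputer.
    out["log_koi_period"] = np.log10(out["koi_period"].where(out["koi_period"] > 0))
    out["log_koi_depth"] = np.log10(out["koi_depth"].where(out["koi_depth"] > 0))
    if "koi_snr" in out.columns:
        out["log_koi_snr"] = np.log10(out["koi_snr"].where(out["koi_snr"] > 0))
    else:
        # Mission-agnostic proxy: depth scaled by the square root of the duty cycle.
        out["snr_proxy"] = out["koi_depth"] * np.sqrt(out["duty_cycle"])
        out["log_snr_proxy"] = np.log10(out["snr_proxy"].where(out["snr_proxy"] > 0))
    # Equilibrium-temperature proxy: refined form if a/R* is present.
    if "a_rs" in out.columns:
        out["teq_proxy"] = out["koi_steff"] / np.sqrt(2.0 * out["a_rs"])
    else:
        out["teq_proxy"] = out["koi_steff"]
    return out
```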

---

## Mission-Agnostic Policy

- We keep a temporary `mission` column only for auditing, then **drop it** before training.
- Features are derived from **physical quantities**, not from mission identity.

---

## Preprocessing & Split

- Keep **numeric** columns only.
- **Imputation**: median (`SimpleImputer(strategy="median")`).
- **Scaling**: `RobustScaler()` (robust to outliers).
- **Label encoding**: `LabelEncoder` → integer target (needed by XGBoost).
- **Split**: `train_test_split(test_size=0.2, stratify=y, random_state=42)`.
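
A minimal sketch of the split step, assuming a harmonized frame with a unified `label` column (`make_split` is an illustrative name; imputation and scaling live inside the model Pipeline described in the next section):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

def make_split(df: pd.DataFrame, label_col: str = "label"):
    # Keep numeric columns only.
    X = df.drop(columns=[label_col]).select_dtypes(include=[np.number])
    # Integer-encode the target (needed by XGBoost).
    le = LabelEncoder()
    y = le.fit_transform(df[label_col])
    # Stratified 80/20 split with the fixed seed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    return X_train, X_test, y_train, y_test, le
```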

---

## Model Selection & Training

We wrap preprocessing and the classifier in a **single Pipeline**, and optimize **macro-F1** with `RandomizedSearchCV` using `StratifiedKFold`:

- **Random Forest** (`class_weight="balanced"`)
  - `n_estimators`: 150–600 (fast mode: 150/300)
  - `max_depth`: None or small integers
  - `min_samples_split`, `min_samples_leaf`
- **SVM (RBF)** (`class_weight="balanced"`)
  - `C`: logspace grid
  - `gamma`: `["scale", "auto"]`
- **XGBoost** (optional; only if installed)
  - `objective="multi:softprob"`, `eval_metric="mlogloss"`, `tree_method="hist"`
  - Grid over `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `n_estimators`

> We use a `FAST` flag (defaults to **True**) to keep grids small for iteration speed. Disabling FAST expands the search.
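
The Random Forest branch of the search can be sketched as follows; the grid values shown are illustrative FAST-style choices, not the notebook's exact grids:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Preprocessing and the classifier live in a single Pipeline, so the CV
# search refits imputation/scaling on each fold (no leakage).
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "clf__n_estimators": [150, 300],          # fast mode
        "clf__max_depth": [None, 8, 16],
        "clf__min_samples_split": randint(2, 11),
        "clf__min_samples_leaf": randint(1, 5),
    },
    n_iter=10,
    scoring="f1_macro",                           # optimize macro-F1
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
)
# search.fit(X_train, y_train); best = search.best_estimator_
```

The SVM and XGBoost branches follow the same pattern with their own parameter distributions.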

---

## Evaluation

- **Primary metric**: **macro-F1** (treats all classes equally).
- We also report **accuracy** and per-class **precision/recall/F1**.
- **Confusion matrices** (shared color scale across models).
- **ROC AUC** (one-vs-rest) when probability estimates are available.
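
The reporting step might look like this (a sketch; `evaluate` and its argument names are illustrative, and plotting of the confusion matrices is omitted):

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, roc_auc_score)

def evaluate(model, X_test, y_test, class_labels):
    y_pred = model.predict(X_test)
    print("Macro-F1 :", f1_score(y_test, y_pred, average="macro"))  # primary metric
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=class_labels))
    print(confusion_matrix(y_test, y_pred))
    if hasattr(model, "predict_proba"):  # one-vs-rest AUC when probabilities exist
        proba = model.predict_proba(X_test)
        print("ROC AUC  :", roc_auc_score(y_test, proba, multi_class="ovr"))
```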

---

## Handling Class Imbalance

We provide an **optional** SMOTE section using `imblearn.pipeline.Pipeline`:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

ImbPipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),  # or SVC(probability=True)
])
```

- We **avoid nesting sklearn Pipelines** inside the imblearn pipeline, which would raise:
  > `TypeError: All intermediate steps of the chain should not be Pipelines`

The SMOTE models are tuned on macro-F1 and compared against the non-SMOTE runs.

---

## Explainability

- **Permutation importance** computed on the **full pipeline** (so it reflects preprocessing).
- **SHAP** (if installed) for tree-based models:
  - Global bar chart of mean |SHAP| across classes
  - Per-class bar chart (e.g., CONFIRMED)
  - SHAP summary bar (optional)

If the best model isn't tree-based, SHAP is skipped automatically.
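
The permutation-importance step can be sketched as below; `top_permutation_importances` is an illustrative helper name. Because the fitted pipeline (not the bare estimator) is passed, shuffling happens in raw feature space and the scores reflect preprocessing:

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance

def top_permutation_importances(pipeline, X_test, y_test, n_top=10):
    """Rank features by mean macro-F1 drop when each column is shuffled."""
    result = permutation_importance(
        pipeline, X_test, y_test, scoring="f1_macro", n_repeats=5, random_state=42
    )
    order = np.argsort(result.importances_mean)[::-1][:n_top]
    # Feature names come from the DataFrame, not from the transformed array.
    return pd.Series(result.importances_mean[order], index=X_test.columns[order])
```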

---

## Saved Artifacts

After selecting the **best model** (highest macro-F1 on the held-out test set), we serialize:

- `artifacts/exoplanet_best_model.joblib` – the full sklearn **Pipeline**
- `artifacts/exoplanet_feature_columns.json` – the **exact feature order** expected at inference
- `artifacts/exoplanet_class_labels.json` – class names in output index order
- `artifacts/exoplanet_metadata.json` – best model name, n_features, labels, timestamp

These provide **stable inference** even if the notebook or environment changes later.
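
The serialization step can be sketched as follows (variable and function names are illustrative):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import joblib

def save_artifacts(best_model, best_name, feature_columns, class_labels, outdir="artifacts"):
    """Persist the pipeline plus the metadata needed for stable inference."""
    out = Path(outdir)
    out.mkdir(exist_ok=True)
    joblib.dump(best_model, out / "exoplanet_best_model.joblib")
    # Exact feature order and label order are frozen alongside the model.
    (out / "exoplanet_feature_columns.json").write_text(json.dumps(list(feature_columns)))
    (out / "exoplanet_class_labels.json").write_text(json.dumps(list(class_labels)))
    (out / "exoplanet_metadata.json").write_text(json.dumps({
        "best_model": best_name,
        "n_features": len(feature_columns),
        "labels": list(class_labels),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }))
```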

---

## Inference Usage

We include a helper that builds an input row in the **exact training feature order**, warns about **unknown/missing** keys, and prints the prediction and class probabilities:

```python
from pathlib import Path
import json

import joblib
import pandas as pd

ARTIFACTS_DIR = Path("artifacts")
model = joblib.load(ARTIFACTS_DIR / "exoplanet_best_model.joblib")
feature_columns = json.loads((ARTIFACTS_DIR / "exoplanet_feature_columns.json").read_text())
class_labels = json.loads((ARTIFACTS_DIR / "exoplanet_class_labels.json").read_text())

def predict_with_debug(model, feature_columns, class_labels, params: dict):
    # Warn about keys the model does not use and keys it expects but did not get.
    unknown = sorted(set(params) - set(feature_columns))
    missing = sorted(set(feature_columns) - set(params))
    if unknown:
        print("Warning - unknown keys (ignored):", unknown)
    if missing:
        print("Warning - missing keys (will be imputed):", missing)
    # reindex enforces the exact training feature order; absent keys become NaN.
    X = pd.DataFrame([params], dtype=float).reindex(columns=feature_columns)
    y_idx = int(model.predict(X)[0])
    label = class_labels[y_idx]
    print("Prediction:", label)
    try:
        proba = model.predict_proba(X)[0]
        for lbl, p in sorted(zip(class_labels, proba), key=lambda t: t[1], reverse=True):
            print(f"  {lbl:>15s}: {p:.3f}")
    except AttributeError:
        pass  # model has no predict_proba (e.g., SVC without probability=True)
    return label
```

### Engineered features to compute in your inputs

- `duty_cycle = koi_duration / (koi_period * 24.0)`
- `log_koi_period = log10(koi_period)`
- `log_koi_depth = log10(koi_depth)`
- `teq_proxy = koi_steff` (or `koi_steff / sqrt(2 * a_rs)` if the refined variant was used)
- If the trained feature list includes **`koi_snr`**: also send `log_koi_snr = log10(koi_snr)`
- Else, if it includes **`snr_proxy`**: send `snr_proxy = koi_depth * sqrt(koi_duration / (koi_period * 24.0))` and `log_snr_proxy`

> You can check which SNR flavor was used by inspecting `feature_columns.json`.
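
A quick way to perform that check (a sketch; `snr_flavor` is an illustrative helper):

```python
import json
from pathlib import Path

def snr_flavor(path="artifacts/exoplanet_feature_columns.json"):
    """Report which SNR feature the saved artifacts expect."""
    cols = json.loads(Path(path).read_text())
    if "koi_snr" in cols:
        return "koi_snr"      # a real catalog SNR/MES column was available
    if "snr_proxy" in cols:
        return "snr_proxy"    # the mission-agnostic proxy was injected
    return "none"
```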

---

## Design Choices & Rationale

- **Mission-agnostic**: We drop mission identity and rely on physical features so the model generalizes.
- **Macro-F1**: A more balanced assessment when class counts differ (especially for FALSE POSITIVE).
- **RobustScaler**: Less sensitive to outliers than `StandardScaler`.
- **LabelEncoder**: Ensures XGBoost receives integer targets (fixes "Invalid classes inferred" errors).
- **SNR strategy**: If an SNR/MES column exists we use it; otherwise we inject a **physics-motivated proxy** to preserve detection-strength information across missions.

---

## Troubleshooting

- **CSV `ParserError` / "python engine" warnings**
  We avoid `low_memory` and use the Python engine with `on_bad_lines="skip"`. The loader tries alternate separators and retries with a detected header line.

- **SMOTE "intermediate steps should not be Pipelines"**
  We use a single `imblearn.pipeline.Pipeline` with inline steps (impute → scale → SMOTE → clf).

- **`PicklingError: QuantileCapper`**
  We removed custom transformers and rely on standard sklearn components to ensure picklability.

- **`AttributeError: 'numpy.ndarray' object has no attribute 'columns'`**
  We compute permutation importance on the **pipeline** (not the bare estimator) and take feature names from `X_train.columns`.

- **XGBoost "Invalid classes inferred"**
  We label-encode `y` to integers (`LabelEncoder`) before fitting.

- **Different inputs return the "same" class**
  Often caused by (a) too few features provided (most end up imputed), (b) features outside the training ranges (scaled to similar values), or (c) providing keys the model does not use. Use `predict_with_debug` to see **recognized/unknown/missing** features and adjust your input dict.

---

## How to Re-train

1. Open the notebook `Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb`.
2. Ensure the three CSV files are available at the paths set in the **Data Loading** cell.
3. (Optional) Set `FAST = False` in the training cell to run a broader hyperparameter search.
4. Run all cells. Artifacts are written to `./artifacts/`.

---

## Extending the Work

- Add more SNR/MES name variants as future catalogs introduce new headers.
- Experiment with **calibrated probabilities** (e.g., `CalibratedClassifierCV`) for better-behaved ROC curves.
- Try additional models (LightGBM, CatBoost) if available.
- Incorporate vetting-pipeline shape features (e.g., V-shape metrics) when they can be harmonized across missions.
- Add **temporal cross-validation** per mission or per data release to stress-test generalization.