---
license: mit
---
# Exoplanet Classification using NASA Datasets (Kepler, K2, TESS) — README
This README explains **how we trained** a **mission-agnostic** classifier to predict whether a detected object is **CONFIRMED**, **CANDIDATE**, or **FALSE POSITIVE** using public catalogs from **Kepler**, **K2**, and **TESS**. It documents the data ingestion, harmonization, feature engineering (including SNR logic), preprocessing, model selection, evaluation, class-imbalance handling, explainability, and inference artifacts.
---
## Project Structure (key files)
```
.
├─ Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb # main notebook
└─ artifacts/
   ├─ exoplanet_best_model.joblib      # full sklearn Pipeline (preprocessing + estimator)
   ├─ exoplanet_feature_columns.json   # exact feature order used during training
   ├─ exoplanet_class_labels.json      # label names in prediction index order
   └─ exoplanet_metadata.json          # summary (best model name, n_features, timestamp)
```
---
## Datasets
- **Kepler Cumulative:** `cumulative_YYYY.MM.DD_HH.MM.SS.csv`
- **K2 Planets & Candidates:** `k2pandc_YYYY.MM.DD_HH.MM.SS.csv`
- **TESS Objects of Interest (TOI):** `TOI_YYYY.MM.DD_HH.MM.SS.csv`
Each table contains per-candidate features and a mission-specific disposition field. These missions are **complementary** (different cadences/systems), so merging increases coverage and diversity.
---
## Environment
- Python ≥ 3.9
- Libraries: `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `imblearn`, optional `xgboost`, optional `shap`, `joblib`.
> If `xgboost` or `shap` aren’t installed, the notebook automatically skips those steps.
---
## Reproducibility
- Random seeds set to **42** in model splits and estimators.
- Stratified splits used for consistent class composition.
- The exact **feature order** and **class label order** are saved to JSON and re-used at inference.
---
## Data Loading (robust, no `low_memory`)
CSV parsing is **robust**:
- Try multiple separators and engines (prefers **Python engine** with `on_bad_lines="skip"`).
- Normalize column names to `snake_case`.
- Re-detect header if the first read looks suspicious (e.g., numeric column names).
- Drop columns that are entirely empty.
> We deliberately **do not** pass `low_memory` (it’s unsupported by the Python engine and can degrade type inference).
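The loading strategy above can be sketched as follows. This is an illustrative minimal version, not the notebook's exact code; `load_catalog` and `to_snake_case` are assumed names, and the function expects a re-readable source (a file path, or a fresh buffer per call):

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Normalize a column name to snake_case."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip())
    return re.sub(r"_+", "_", cleaned).strip("_").lower()

def load_catalog(path, separators=(",", ";", "\t")):
    """Try several separators with the Python engine, skipping bad lines.
    NASA archive exports often begin with '#' comment lines."""
    last_err = None
    for sep in separators:
        try:
            df = pd.read_csv(path, sep=sep, engine="python",
                             on_bad_lines="skip", comment="#")
            if df.shape[1] > 1:  # a single column usually means a wrong separator
                break
        except Exception as err:
            last_err = err
    else:
        raise ValueError(f"Could not parse {path}") from last_err
    df.columns = [to_snake_case(c) for c in df.columns]
    return df.dropna(axis=1, how="all")  # drop entirely empty columns
```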
---
## Label Harmonization
Different catalogs use different labels. We map all to a **unified set**:
- **CONFIRMED**: `CONFIRMED`, `CP`, `KP`
- **CANDIDATE**: `CANDIDATE`, `PC`, `CAND`, `APC`
- **FALSE POSITIVE**: `FALSE POSITIVE`, `FP`, `FA`, `REFUTED`
Rows without a recognized disposition are dropped before modeling.
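A minimal sketch of this mapping; `harmonize_labels` and the `label` output column are illustrative names:

```python
import pandas as pd

# Mission-specific disposition codes -> unified label set (from the table above)
LABEL_MAP = {
    "CONFIRMED": "CONFIRMED", "CP": "CONFIRMED", "KP": "CONFIRMED",
    "CANDIDATE": "CANDIDATE", "PC": "CANDIDATE", "CAND": "CANDIDATE", "APC": "CANDIDATE",
    "FALSE POSITIVE": "FALSE POSITIVE", "FP": "FALSE POSITIVE",
    "FA": "FALSE POSITIVE", "REFUTED": "FALSE POSITIVE",
}

def harmonize_labels(df: pd.DataFrame, disposition_col: str) -> pd.DataFrame:
    """Map dispositions to the unified set; drop unrecognized rows."""
    labels = df[disposition_col].astype(str).str.strip().str.upper().map(LABEL_MAP)
    out = df.assign(label=labels)
    return out[out["label"].notna()].reset_index(drop=True)
```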
---
## Feature Harmonization (mission-agnostic)
We build a **canonical numeric feature set** by scanning a list of **alias groups** and picking the **first numeric-coercible** column present per mission, e.g.:
- Period: `['koi_period','orbital_period','period','pl_orbper','per']`
- Duration: `['koi_duration','duration','tran_dur']`
- Depth: `['koi_depth','depth','tran_depth']`
- Stellar context: `['koi_steff','st_teff','teff']`, `['koi_slogg','st_logg','logg']`, `['koi_smet','st_metfe','feh']`, `['koi_srad','st_rad','star_radius']`
- Transit geometry: `['koi_impact','impact','b']`, `['koi_sma','sma','a_rs','semi_major_axis']`
- **SNR/MES (broad):** `['koi_snr','koi_model_snr','koi_mes','mes','max_mes','snr','detection_snr','transit_snr','signal_to_noise','koi_max_snr']`
We **coerce to numeric** (`errors='coerce'`) and accept if any non-null values remain.
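The alias-scanning logic can be sketched as below (`pick_alias` is an illustrative name):

```python
import pandas as pd

def pick_alias(df: pd.DataFrame, aliases: list[str]):
    """Return the first alias column that coerces to numeric with at least
    one non-null value, as a numeric Series; otherwise None."""
    for name in aliases:
        if name in df.columns:
            values = pd.to_numeric(df[name], errors="coerce")
            if values.notna().any():
                return values
    return None
```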
---
## Feature Engineering
Physics-motivated features improve separability:
- **Duty cycle**: `duty_cycle = koi_duration / (koi_period * 24.0)`
  (transit duration relative to orbital period)
- **Log transforms**: `log_koi_period = log10(koi_period)`, `log_koi_depth = log10(koi_depth)`
- **Equilibrium-temperature proxy**:
  - Simple: `teq_proxy = koi_steff`
  - Refined (if `a/R*` is available or derivable): `teq_proxy = koi_steff / sqrt(2 * a_rs)`
- **SNR logic (guaranteed SNR-like feature)**:
  - If any mission exposes a usable SNR/MES, we use **`koi_snr`** and also compute **`log_koi_snr`**.
  - Otherwise we compute a **mission-agnostic proxy**:
    ```
    snr_proxy = koi_depth * sqrt( koi_duration / (koi_period * 24.0) )
    log_snr_proxy = log10(snr_proxy)
    ```
> The exact features used in training are whatever appear in `artifacts/exoplanet_feature_columns.json` (this list is created dynamically from the actual files you load).
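The engineered columns above can be computed roughly as follows; `add_engineered_features` is an illustrative name, and the unit assumptions (period in days, duration in hours, depth in ppm) follow the formulas in this section:

```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the engineered columns; assumes harmonized names with
    koi_period in days, koi_duration in hours, koi_depth in ppm."""
    out = df.copy()
    out["duty_cycle"] = out["koi_duration"] / (out["koi_period"] * 24.0)
    out["log_koi_period"] = np.log10(out["koi_period"])
    out["log_koi_depth"] = np.log10(out["koi_depth"])
    if "koi_snr" in out.columns:
        out["log_koi_snr"] = np.log10(out["koi_snr"])
    else:
        # mission-agnostic detection-strength proxy
        out["snr_proxy"] = out["koi_depth"] * np.sqrt(out["duty_cycle"])
        out["log_snr_proxy"] = np.log10(out["snr_proxy"])
    return out
```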
---
## Mission-Agnostic Policy
- We keep a temporary `mission` column only for auditing, then **drop it** before training.
- Features are derived from **physical quantities**, not the mission identity.
---
## Preprocessing & Split
- Keep **numeric** columns only.
- **Imputation**: median (`SimpleImputer(strategy="median")`).
- **Scaling**: `RobustScaler()` (robust to outliers).
- **Label encoding**: `LabelEncoder` → integer target (needed by XGBoost).
- **Split**: `train_test_split(test_size=0.2, stratify=y, random_state=42)`.
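Put together, the preprocessing and split look roughly like this (synthetic data stands in for the harmonized feature table; variable names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Synthetic stand-in for the harmonized numeric feature table
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(60, 3)),
                 columns=["koi_period", "koi_depth", "koi_duration"])
y_raw = np.array(["CONFIRMED", "CANDIDATE", "FALSE POSITIVE"] * 20)

le = LabelEncoder()            # integer targets (needed by XGBoost)
y = le.fit_transform(y_raw)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
]).fit(X_train, y_train)
```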
---
## Model Selection & Training
We wrap preprocessing and the classifier in a **single Pipeline**, and optimize **macro-F1** with `RandomizedSearchCV` using `StratifiedKFold`:
- **Random Forest** (`class_weight="balanced"`)
  - `n_estimators`: 150–600 (fast mode: 150/300)
  - `max_depth`: None or small integers
  - `min_samples_split`, `min_samples_leaf`
- **SVM (RBF)** (`class_weight="balanced"`)
  - `C`: logspace grid
  - `gamma`: `["scale","auto"]`
- **XGBoost** (optional; only if installed)
  - `objective="multi:softprob"`, `eval_metric="mlogloss"`, `tree_method="hist"`
  - Grid over `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `n_estimators`
> We use a `FAST` flag (defaults to **True**) to keep grids small for iteration speed. Disabling FAST expands the search.
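A FAST-style search for the Random Forest branch might look like this (synthetic data; the exact grids differ in the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(90, 4)))
y = np.repeat([0, 1, 2], 30)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
param_distributions = {      # FAST-style small grid
    "clf__n_estimators": [150, 300],
    "clf__max_depth": [None, 8, 16],
    "clf__min_samples_leaf": [1, 2, 4],
}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42, n_jobs=-1)
search.fit(X, y)
```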
---
## Evaluation
- **Primary metric**: **Macro-F1** (treats classes evenly).
- Also report **accuracy** and per-class **precision/recall/F1**.
- **Confusion matrices** (shared color scale across models).
- **ROC AUC** (one-vs-rest) when probability estimates are available.
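The metrics above are standard scikit-learn calls; a minimal sketch with toy predictions (the `target_names` here are illustrative and must follow the encoded class order):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

macro_f1 = f1_score(y_true, y_pred, average="macro")  # primary metric
print(classification_report(
    y_true, y_pred, target_names=["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]))
print(confusion_matrix(y_true, y_pred))
```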
---
## Handling Class Imbalance
We provide an **optional** SMOTE section using `imblearn.Pipeline`:
```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

smote_pipe = ImbPipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),  # or SVC(probability=True)
])
```
- We **avoid nested sklearn Pipelines** inside the imblearn pipeline to prevent:
> `TypeError: All intermediate steps of the chain should not be Pipelines`.
The SMOTE models are tuned on macro-F1 and compared against the non-SMOTE runs.
---
## Explainability
- **Permutation importance** computed on the **full pipeline** (so it reflects preprocessing).
- **SHAP** (if installed) for tree-based models:
  - Global bar chart of mean |SHAP| across classes
  - Per-class bar chart (e.g., CONFIRMED)
  - SHAP summary bar (optional)
If the best model isn’t tree-based, SHAP is skipped automatically.
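Scoring the full pipeline (rather than the bare estimator) keeps imputation and scaling inside the permutation loop. A minimal sketch on synthetic data, where the label depends only on `koi_depth` so its importance should dominate:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the harmonized training table
rng = np.random.default_rng(42)
X_train = pd.DataFrame({"koi_period": rng.normal(size=80),
                        "koi_depth": rng.normal(size=80)})
y_train = (X_train["koi_depth"] > 0).astype(int).to_numpy()

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
]).fit(X_train, y_train)

result = permutation_importance(pipe, X_train, y_train,
                                scoring="f1_macro", n_repeats=5, random_state=42)
importances = pd.Series(result.importances_mean, index=X_train.columns)
```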
---
## Saved Artifacts
After selecting the **best model** (highest macro-F1 on the held-out test set), we serialize:
- `artifacts/exoplanet_best_model.joblib` – the full sklearn **Pipeline**
- `artifacts/exoplanet_feature_columns.json` – **exact feature order** expected at inference
- `artifacts/exoplanet_class_labels.json` – class names in output index order
- `artifacts/exoplanet_metadata.json` – name, n_features, labels, timestamp
These provide **stable inference** even if the notebook or environment changes later.
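The serialization step looks roughly like this; the fitted estimator, feature list, and labels below are illustrative stand-ins for the notebook's real objects:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
import joblib
from sklearn.dummy import DummyClassifier

# Illustrative stand-ins; the notebook serializes its best fitted Pipeline.
best_pipeline = DummyClassifier(strategy="most_frequent").fit([[0.0], [1.0]], [0, 1])
feature_columns = ["koi_period"]
class_labels = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]

ARTIFACTS_DIR = Path("artifacts")
ARTIFACTS_DIR.mkdir(exist_ok=True)
joblib.dump(best_pipeline, ARTIFACTS_DIR / "exoplanet_best_model.joblib")
(ARTIFACTS_DIR / "exoplanet_feature_columns.json").write_text(json.dumps(feature_columns))
(ARTIFACTS_DIR / "exoplanet_class_labels.json").write_text(json.dumps(class_labels))
(ARTIFACTS_DIR / "exoplanet_metadata.json").write_text(json.dumps({
    "best_model": "DummyClassifier", "n_features": len(feature_columns),
    "labels": class_labels,
    "timestamp": datetime.now(timezone.utc).isoformat()}))
```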
---
## Inference Usage
We include a helper that builds an input row with the **exact training feature order**, warns about **unknown/missing** keys, and prints predictions and class probabilities:
```python
from pathlib import Path
import json, joblib, numpy as np, pandas as pd
ARTIFACTS_DIR = Path("artifacts")
model = joblib.load(ARTIFACTS_DIR / "exoplanet_best_model.joblib")
feature_columns = json.loads((ARTIFACTS_DIR / "exoplanet_feature_columns.json").read_text())
class_labels = json.loads((ARTIFACTS_DIR / "exoplanet_class_labels.json").read_text())
def predict_with_debug(model, feature_columns, class_labels, params: dict):
    # Warn about keys the model doesn't know and features it will impute
    unknown = sorted(set(params) - set(feature_columns))
    missing = sorted(set(feature_columns) - set(params))
    if unknown:
        print("Unknown keys (ignored):", unknown)
    if missing:
        print("Missing keys (imputed):", missing)
    # reindex enforces the exact training feature order; absent features become NaN
    X = pd.DataFrame([params], dtype=float).reindex(columns=feature_columns)
    y_idx = int(model.predict(X)[0])
    label = class_labels[y_idx]
    print("Prediction:", label)
    try:
        proba = model.predict_proba(X)[0]
        for lbl, p in sorted(zip(class_labels, proba), key=lambda t: t[1], reverse=True):
            print(f"  {lbl:>15s}: {p:.3f}")
    except Exception:
        pass  # model has no predict_proba (e.g., SVC without probability=True)
    return label
```
### Engineered features to compute in your inputs
- `duty_cycle = koi_duration / (koi_period * 24.0)`
- `log_koi_period = log10(koi_period)`
- `log_koi_depth = log10(koi_depth)`
- `teq_proxy = koi_steff` (or `koi_steff / sqrt(2 * a_rs)` if using refined variant)
- If the trained feature list includes **`koi_snr`**: also send `log_koi_snr = log10(koi_snr)`
- Else, if it includes **`snr_proxy`**: send `snr_proxy = koi_depth * sqrt(koi_duration / (koi_period*24.0))` and `log_snr_proxy`
> You can check which SNR flavor was used by inspecting `feature_columns.json`.
---
## Design Choices & Rationale
- **Mission-agnostic**: We drop mission identity; rely on physical features to generalize.
- **Macro-F1**: More balanced assessment when class counts differ (especially for FALSE POSITIVE).
- **RobustScaler**: Less sensitive to outliers than StandardScaler.
- **LabelEncoder**: Ensures XGBoost gets integers (fixes “Invalid classes inferred” errors).
- **SNR strategy**: If SNR/MES exists we use it; otherwise we inject a **physics-motivated proxy** to preserve detection strength information across missions.
---
## Troubleshooting
- **CSV ParserError / “python engine” warnings**
We avoid `low_memory` and use Python engine with `on_bad_lines="skip"`. The loader will try alternate separators and re-try with a detected header line.
- **SMOTE “intermediate steps should not be Pipelines”**
We use a single `imblearn.Pipeline` with inline steps (impute → scale → SMOTE → clf).
- **`PicklingError: QuantileCapper`**
We removed custom transformers and rely on standard sklearn components to ensure picklability.
- **`AttributeError: 'numpy.ndarray' has no attribute 'columns'`**
We compute permutation importance on the **pipeline** (not the bare estimator), and we take feature names from `X_train.columns`.
- **XGBoost “Invalid classes inferred”**
We label-encode `y` to integers (`LabelEncoder`) before fitting.
- **Different inputs return the “same” class**
Often caused by: (a) too few features provided (most imputed), (b) features outside training ranges (scaled to similar z-scores), or (c) providing keys not used by the model. Use `predict_with_debug` to see **recognized/unknown/missing** features and adjust your dict.
---
## How to Re-train
1. Open the notebook `Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb`.
2. Ensure the three CSV files are available at the paths set in the **Data Loading** cell.
3. (Optional) Set `FAST = False` in the training cell to run a broader hyper-parameter search.
4. Run all cells. Artifacts are written to `./artifacts/`.
---
## Extending the Work
- Add more SNR/MES name variants if future catalogs introduce new headers.
- Experiment with **calibrated probabilities** (e.g., `CalibratedClassifierCV`) for better ROC behavior.
- Try additional models (LightGBM/CatBoost) if available.
- Incorporate vetting-pipeline shape features (e.g., V-shape metrics) when harmonizable across missions.
- Add **temporal cross-validation** per mission or per release to stress test generalization.
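For the calibration idea, a minimal sketch with `CalibratedClassifierCV` on toy data (the notebook does not currently include this step):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data standing in for the harmonized feature table
X, y = make_classification(n_samples=150, n_classes=3, n_informative=5,
                           random_state=42)
# Wraps an uncalibrated SVC; sigmoid (Platt) calibration via 3-fold CV
calibrated = CalibratedClassifierCV(SVC(class_weight="balanced"),
                                    method="sigmoid", cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X[:5])   # rows sum to 1
```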