---
license: mit
---

# Exoplanet Classification using NASA Datasets (Kepler, K2, TESS) — README

This README explains **how we trained** a **mission-agnostic** classifier to predict whether a detected object is **CONFIRMED**, **CANDIDATE**, or **FALSE POSITIVE** using public catalogs from **Kepler**, **K2**, and **TESS**. It documents the data ingestion, harmonization, feature engineering (including SNR logic), preprocessing, model selection, evaluation, class-imbalance handling, explainability, and inference artifacts.

---

## Project Structure (key files)

```
.
├─ Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb   # main notebook
└─ artifacts/
   ├─ exoplanet_best_model.joblib            # full sklearn Pipeline (preprocessing + estimator)
   ├─ exoplanet_feature_columns.json         # exact feature order used during training
   ├─ exoplanet_class_labels.json            # label names in prediction index order
   └─ exoplanet_metadata.json                # summary (best model name, n_features, timestamp)
```

---

## Datasets

- **Kepler Cumulative:** `cumulative_YYYY.MM.DD_HH.MM.SS.csv`
- **K2 Planets & Candidates:** `k2pandc_YYYY.MM.DD_HH.MM.SS.csv`
- **TESS Objects of Interest (TOI):** `TOI_YYYY.MM.DD_HH.MM.SS.csv`

Each table contains per-candidate features and a mission-specific disposition field. These missions are **complementary** (different cadences/systems), so merging increases coverage and diversity.

---

## Environment

- Python ≥ 3.9  
- Libraries: `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `imblearn`, optional `xgboost`, optional `shap`, `joblib`.

> If `xgboost` or `shap` aren’t installed, the notebook automatically skips those steps.
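
The graceful-skip behavior can be sketched with the usual guarded-import pattern (flag names here are illustrative, not necessarily the notebook's exact variables):

```python
# Optional dependencies: record availability instead of failing at import time.
try:
    import xgboost  # noqa: F401
    HAS_XGBOOST = True
except ImportError:
    HAS_XGBOOST = False

try:
    import shap  # noqa: F401
    HAS_SHAP = True
except ImportError:
    HAS_SHAP = False

# Later cells gate on the flags, e.g. only add XGBoost to the model search
# when HAS_XGBOOST is True, and only run SHAP plots when HAS_SHAP is True.
```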

---

## Reproducibility

- Random seeds set to **42** in model splits and estimators.
- Stratified splits used for consistent class composition.
- The exact **feature order** and **class label order** are saved to JSON and re-used at inference.

---

## Data Loading (robust, no `low_memory`)

CSV parsing is **robust**:

- Try multiple separators and engines (prefers **Python engine** with `on_bad_lines="skip"`).
- Normalize column names to `snake_case`.
- Re-detect header if the first read looks suspicious (e.g., numeric column names).
- Drop columns that are entirely empty.

> We deliberately **do not** pass `low_memory` (it’s unsupported by the Python engine and can degrade type inference).
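
A minimal sketch of such a loader, assuming the separator candidates and snake_case rule described above (the `comment="#"` guard is an assumption for catalogs that ship commented headers):

```python
import pandas as pd

def robust_read_csv(path):
    """Try several separators with the Python engine, skipping malformed rows.
    Illustrative sketch, not the notebook's exact loader."""
    best = None
    for sep in (",", ";", "\t", "|"):
        try:
            df = pd.read_csv(path, sep=sep, engine="python",
                             on_bad_lines="skip", comment="#")
        except Exception:
            continue
        # Prefer the parse that recovers the most columns; a single wide
        # column usually means the separator guess was wrong.
        if best is None or df.shape[1] > best.shape[1]:
            best = df
    if best is None:
        raise ValueError(f"could not parse {path}")
    # Normalize column names to snake_case and drop fully empty columns.
    best.columns = (best.columns.str.strip().str.lower()
                    .str.replace(r"[^\w]+", "_", regex=True).str.strip("_"))
    return best.dropna(axis=1, how="all")
```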

---

## Label Harmonization

Different catalogs use different labels. We map all to a **unified set**:

- **CONFIRMED**: `CONFIRMED`, `CP`, `KP`
- **CANDIDATE**: `CANDIDATE`, `PC`, `CAND`, `APC`
- **FALSE POSITIVE**: `FALSE POSITIVE`, `FP`, `FA`, `REFUTED`

Rows without a recognized disposition are dropped before modeling.
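
The mapping above reduces to a simple lookup table; a sketch:

```python
# Unified disposition mapping (mission-specific codes as listed above).
LABEL_MAP = {
    "CONFIRMED": "CONFIRMED", "CP": "CONFIRMED", "KP": "CONFIRMED",
    "CANDIDATE": "CANDIDATE", "PC": "CANDIDATE",
    "CAND": "CANDIDATE", "APC": "CANDIDATE",
    "FALSE POSITIVE": "FALSE POSITIVE", "FP": "FALSE POSITIVE",
    "FA": "FALSE POSITIVE", "REFUTED": "FALSE POSITIVE",
}

def harmonize_label(raw):
    """Map a mission disposition to the unified set; None means 'drop row'."""
    if raw is None:
        return None
    return LABEL_MAP.get(str(raw).strip().upper())
```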

---

## Feature Harmonization (mission-agnostic)

We build a **canonical numeric feature set** by scanning a list of **alias groups** and picking the **first numeric-coercible** column present per mission, e.g.:

- Period: `['koi_period','orbital_period','period','pl_orbper','per']`
- Duration: `['koi_duration','duration','tran_dur']`
- Depth: `['koi_depth','depth','tran_depth']`
- Stellar context: `['koi_steff','st_teff','teff']`, `['koi_slogg','st_logg','logg']`, `['koi_smet','st_metfe','feh']`, `['koi_srad','st_rad','star_radius']`
- Transit geometry: `['koi_impact','impact','b']`, `['koi_sma','sma','a_rs','semi_major_axis']`
- **SNR/MES (broad):** `['koi_snr','koi_model_snr','koi_mes','mes','max_mes','snr','detection_snr','transit_snr','signal_to_noise','koi_max_snr']`

We **coerce to numeric** (`errors='coerce'`) and accept if any non-null values remain.
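
The alias-group resolution amounts to a short helper; a sketch under the rules above:

```python
import pandas as pd

def resolve_alias(df, aliases):
    """Return the first alias column that yields any numeric values
    after coercion, or None if no usable column exists in this catalog."""
    for name in aliases:
        if name in df.columns:
            col = pd.to_numeric(df[name], errors="coerce")
            if col.notna().any():  # accept if anything survived coercion
                return col
    return None
```

Usage, e.g. for the period group: `period = resolve_alias(df, ['koi_period', 'orbital_period', 'period', 'pl_orbper', 'per'])`.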

---

## Feature Engineering

Physics-motivated features improve separability:

- **Duty cycle**:  
  `duty_cycle = koi_duration / (koi_period * 24.0)`  
  (transit duration relative to orbital period)
- **Log transforms**:  
  `log_koi_period = log10(koi_period)`  
  `log_koi_depth  = log10(koi_depth)`
- **Equilibrium-temperature proxy**:  
  - Simple: `teq_proxy = koi_steff`  
  - Refined (if `a/R*` available or derivable): `teq_proxy = koi_steff / sqrt(2 * a_rs)`
- **SNR logic (guaranteed SNR-like feature)**:  
  - If any mission exposes a usable SNR/MES, we use **`koi_snr`** and also compute **`log_koi_snr`**.  
  - Otherwise we compute a **mission-agnostic proxy**:  
    ```
    snr_proxy = koi_depth * sqrt( koi_duration / (koi_period * 24.0) )
    log_snr_proxy = log10(snr_proxy)
    ```

> The exact features used in training are whatever appear in `artifacts/exoplanet_feature_columns.json` (this list is created dynamically from the actual files you load).
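
The engineered columns above can be computed in one pass; a sketch assuming the harmonized column names (logs are guarded against non-positive values, which become NaN and are later imputed):

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    """Add duty cycle, log transforms, teq proxy, and an SNR-like feature.
    Column names assumed to follow the harmonized schema above."""
    out = df.copy()
    # Duration is in hours, period in days, hence the factor of 24.
    out["duty_cycle"] = out["koi_duration"] / (out["koi_period"] * 24.0)
    out["log_koi_period"] = np.log10(out["koi_period"].where(out["koi_period"] > 0))
    out["log_koi_depth"] = np.log10(out["koi_depth"].where(out["koi_depth"] > 0))
    if "koi_steff" in out.columns:
        out["teq_proxy"] = out["koi_steff"]  # simple variant
    if "koi_snr" in out.columns:
        out["log_koi_snr"] = np.log10(out["koi_snr"].where(out["koi_snr"] > 0))
    else:
        # Mission-agnostic proxy: depth scaled by sqrt of the duty cycle.
        out["snr_proxy"] = out["koi_depth"] * np.sqrt(out["duty_cycle"])
        out["log_snr_proxy"] = np.log10(out["snr_proxy"].where(out["snr_proxy"] > 0))
    return out
```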

---

## Mission-Agnostic Policy

- We keep a temporary `mission` column only for auditing, then **drop it** before training.
- Features are derived from **physical quantities**, not the mission identity.

---

## Preprocessing & Split

- Keep **numeric** columns only.
- **Imputation**: median (`SimpleImputer(strategy="median")`).
- **Scaling**: `RobustScaler()` (robust to outliers).
- **Label encoding**: `LabelEncoder` → integer target (needed by XGBoost).
- **Split**: `train_test_split(test_size=0.2, stratify=y, random_state=42)`.
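
Put together, the preprocessing and split look roughly like this (shown on a tiny synthetic table standing in for the harmonized features — an illustration, not the notebook cell verbatim):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Synthetic stand-in for the harmonized numeric feature table.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(120, 4)),
                 columns=["koi_period", "koi_duration", "koi_depth", "koi_steff"])
y = np.where(X["koi_depth"] > 0, "CONFIRMED", "FALSE POSITIVE")

le = LabelEncoder()              # integer targets (required by XGBoost)
y_enc = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_enc, test_size=0.2, stratify=y_enc, random_state=42)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median imputation
    ("scale", RobustScaler()),                     # outlier-robust scaling
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
pipe.fit(X_train, y_train)
```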

---

## Model Selection & Training

We wrap preprocessing and the classifier in a **single Pipeline**, and optimize **macro-F1** with `RandomizedSearchCV` using `StratifiedKFold`:

- **Random Forest** (class_weight="balanced")
  - `n_estimators`: 150–600 (fast mode: 150/300)
  - `max_depth`: None or small integers
  - `min_samples_split`, `min_samples_leaf`
- **SVM (RBF)** (class_weight="balanced")
  - `C`: logspace grid
  - `gamma`: `["scale","auto"]`
- **XGBoost** (optional; only if installed)
  - `objective="multi:softprob"`, `eval_metric="mlogloss"`, `tree_method="hist"`
  - Grid over `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `n_estimators`

> We use a `FAST` flag (defaults to **True**) to keep grids small for iteration speed. Disabling FAST expands the search.
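
The search setup for the Random Forest arm can be sketched as follows (synthetic data and a FAST-sized grid; the `clf__` prefix targets the estimator step inside the Pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Tiny synthetic table standing in for the harmonized training data.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
y_train = (X_train["a"] > 0).astype(int)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])

param_distributions = {            # FAST-style grid over the "clf" step
    "clf__n_estimators": [150, 300],
    "clf__max_depth": [None, 8],
    "clf__min_samples_split": [2, 5],
    "clf__min_samples_leaf": [1, 2],
}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5,
    scoring="f1_macro",            # optimize macro-F1
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42, n_jobs=-1,
)
search.fit(X_train, y_train)
best_model = search.best_estimator_
```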

---

## Evaluation

- **Primary metric**: **Macro-F1** (treats classes evenly).
- Also report **accuracy** and per-class **precision/recall/F1**.
- **Confusion matrices** (shared color scale across models).
- **ROC AUC** (one-vs-rest) when probability estimates are available.
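
A toy computation of these metrics with made-up predictions (values are illustrative only):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, roc_auc_score)

# Hypothetical labels for a 3-class problem (0=CANDIDATE, 1=CONFIRMED, 2=FALSE POSITIVE).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(
    y_true, y_pred, target_names=["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]))

# One-vs-rest ROC AUC requires per-class probabilities (rows sum to 1).
proba = np.array([[0.8, 0.1, 0.1], [0.4, 0.5, 0.1], [0.2, 0.7, 0.1],
                  [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.5, 0.2, 0.3]])
print("ROC AUC (ovr):", roc_auc_score(y_true, proba, multi_class="ovr"))
```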

---

## Handling Class Imbalance

We provide an **optional** SMOTE section using `imblearn.Pipeline`:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

smote_pipe = ImbPipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("smote", SMOTE(random_state=42)),   # resamples during fit only, never at predict
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
    # alternatively: ("clf", SVC(probability=True, class_weight="balanced"))
])
```

- We **avoid nested sklearn Pipelines** inside the imblearn pipeline to prevent:
  > `TypeError: All intermediate steps of the chain should not be Pipelines`.

The SMOTE models are tuned on macro-F1 and compared against the non-SMOTE runs.

---

## Explainability

- **Permutation importance** computed on the **full pipeline** (so it reflects preprocessing).
- **SHAP** (if installed) for tree-based models:
  - Global bar chart of mean |SHAP| across classes
  - Per-class bar chart (e.g., CONFIRMED)  
  - SHAP summary bar (optional)

If the best model isn’t tree-based, SHAP is skipped automatically.
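
A minimal permutation-importance sketch on a pipeline, using synthetic data where only one feature carries signal (so its importance should dominate):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data: only "signal" is related to the label.
rng = np.random.default_rng(1)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = (X["signal"] > 0).astype(int)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(random_state=42)),
]).fit(X, y)

# Importance is computed on the full pipeline, so it reflects preprocessing;
# feature names come from the DataFrame columns, not the fitted estimator.
result = permutation_importance(pipe, X, y, scoring="f1_macro",
                                n_repeats=5, random_state=42)
for name, imp in sorted(zip(X.columns, result.importances_mean),
                        key=lambda t: t[1], reverse=True):
    print(f"{name:>10s}: {imp:.3f}")
```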

---

## Saved Artifacts

After selecting the **best model** (highest macro-F1 on the held-out test set), we serialize:

- `artifacts/exoplanet_best_model.joblib` – the full sklearn **Pipeline**
- `artifacts/exoplanet_feature_columns.json` – **exact feature order** expected at inference
- `artifacts/exoplanet_class_labels.json` – class names in output index order
- `artifacts/exoplanet_metadata.json` – name, n_features, labels, timestamp

These provide **stable inference** even if the notebook or environment changes later.
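
The serialization step is roughly the following (shown with stand-in objects; in the notebook, `best_model`, the feature list, and the fitted `LabelEncoder` come from the training cells):

```python
import json
import time
from pathlib import Path

import joblib
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import LabelEncoder

# Stand-ins for the selected pipeline and training metadata (illustrative).
le = LabelEncoder().fit(["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"])
feature_columns = ["koi_period", "koi_depth", "duty_cycle"]
best_model = DummyClassifier(strategy="most_frequent").fit([[0.0] * 3], [0])

ARTIFACTS_DIR = Path("artifacts")
ARTIFACTS_DIR.mkdir(exist_ok=True)
joblib.dump(best_model, ARTIFACTS_DIR / "exoplanet_best_model.joblib")
(ARTIFACTS_DIR / "exoplanet_feature_columns.json").write_text(
    json.dumps(feature_columns))
(ARTIFACTS_DIR / "exoplanet_class_labels.json").write_text(
    json.dumps(list(le.classes_)))
(ARTIFACTS_DIR / "exoplanet_metadata.json").write_text(json.dumps({
    "best_model": type(best_model).__name__,
    "n_features": len(feature_columns),
    "labels": list(le.classes_),
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
}))
```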

---

## Inference Usage

We include a helper that builds an input row with the **exact training feature order**, warns about **unknown/missing** keys, and prints predictions and class probabilities:

```python
from pathlib import Path
import json, joblib, numpy as np, pandas as pd

ARTIFACTS_DIR = Path("artifacts")
model = joblib.load(ARTIFACTS_DIR / "exoplanet_best_model.joblib")
feature_columns = json.loads((ARTIFACTS_DIR / "exoplanet_feature_columns.json").read_text())
class_labels    = json.loads((ARTIFACTS_DIR / "exoplanet_class_labels.json").read_text())

def predict_with_debug(model, feature_columns, class_labels, params: dict):
    # Report keys the model does not use, and features it will impute.
    unknown = sorted(set(params) - set(feature_columns))
    missing = sorted(set(feature_columns) - set(params))
    if unknown:
        print("Unknown keys (ignored):", unknown)
    if missing:
        print("Missing features (will be imputed):", missing)
    X = pd.DataFrame([params], dtype=float).reindex(columns=feature_columns)
    y_idx = int(model.predict(X)[0])
    label = class_labels[y_idx]
    print("Prediction:", label)
    try:
        proba = model.predict_proba(X)[0]
        for lbl, p in sorted(zip(class_labels, proba), key=lambda t: t[1], reverse=True):
            print(f"  {lbl:>15s}: {p:.3f}")
    except AttributeError:
        pass  # model exposes no predict_proba (e.g. SVC without probability=True)
    return label
```

### Engineered features to compute in your inputs

- `duty_cycle = koi_duration / (koi_period * 24.0)`
- `log_koi_period = log10(koi_period)`
- `log_koi_depth = log10(koi_depth)`
- `teq_proxy = koi_steff` (or `koi_steff / sqrt(2 * a_rs)` if using refined variant)
- If the trained feature list includes **`koi_snr`**: also send `log_koi_snr = log10(koi_snr)`  
- Else, if it includes **`snr_proxy`**: send `snr_proxy = koi_depth * sqrt(koi_duration / (koi_period*24.0))` and `log_snr_proxy`

> You can check which SNR flavor was used by inspecting `feature_columns.json`.

---

## Design Choices & Rationale

- **Mission-agnostic**: We drop mission identity; rely on physical features to generalize.
- **Macro-F1**: More balanced assessment when class counts differ (especially for FALSE POSITIVE).
- **RobustScaler**: Less sensitive to outliers than StandardScaler.
- **LabelEncoder**: Ensures XGBoost gets integers (fixes “Invalid classes inferred” errors).
- **SNR strategy**: If SNR/MES exists we use it; otherwise we inject a **physics-motivated proxy** to preserve detection strength information across missions.

---

## Troubleshooting

- **CSV ParserError / “python engine” warnings**  
  We avoid `low_memory` and use Python engine with `on_bad_lines="skip"`. The loader will try alternate separators and re-try with a detected header line.

- **SMOTE “intermediate steps should not be Pipelines”**  
  We use a single `imblearn.Pipeline` with inline steps (impute → scale → SMOTE → clf).

- **`PicklingError: QuantileCapper`**  
  We removed custom transformers and rely on standard sklearn components to ensure picklability.

- **`AttributeError: 'numpy.ndarray' has no attribute 'columns'`**  
  We compute permutation importance on the **pipeline** (not the bare estimator), and we take feature names from `X_train.columns`.

- **XGBoost “Invalid classes inferred”**  
  We label-encode `y` to integers (`LabelEncoder`) before fitting.

- **Different inputs return the “same” class**  
  Often caused by: (a) too few features provided (most imputed), (b) features outside training ranges (scaled to similar z-scores), or (c) providing keys not used by the model. Use `predict_with_debug` to see **recognized/unknown/missing** features and adjust your dict.

---

## How to Re-train

1. Open the notebook `Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb`.
2. Ensure the three CSV files are available at the paths set in the **Data Loading** cell.
3. (Optional) Set `FAST = False` in the training cell to run a broader hyper-parameter search.
4. Run all cells. Artifacts are written to `./artifacts/`.

---

## Extending the Work

- Add more SNR/MES name variants if future catalogs introduce new headers.
- Experiment with **calibrated probabilities** (e.g., `CalibratedClassifierCV`) for better ROC behavior.
- Try additional models (LightGBM/CatBoost) if available.
- Incorporate vetting-pipeline shape features (e.g., V-shape metrics) when harmonizable across missions.
- Add **temporal cross-validation** per mission or per release to stress test generalization.