---
license: mit
---

# Exoplanet Classification using NASA Datasets (Kepler, K2, TESS)

This README explains **how we trained** a **mission-agnostic** classifier to predict whether a detected object is **CONFIRMED**, **CANDIDATE**, or **FALSE POSITIVE** using public catalogs from **Kepler**, **K2**, and **TESS**. It documents data ingestion, harmonization, feature engineering (including SNR logic), preprocessing, model selection, evaluation, class-imbalance handling, explainability, and inference artifacts.

---

## Project Structure (key files)

```
.
├─ Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb   # main notebook
└─ artifacts/
   ├─ exoplanet_best_model.joblib        # full sklearn Pipeline (preprocessing + estimator)
   ├─ exoplanet_feature_columns.json     # exact feature order used during training
   ├─ exoplanet_class_labels.json        # label names in prediction index order
   └─ exoplanet_metadata.json            # summary (best model name, n_features, timestamp)
```

---

## Datasets

- **Kepler Cumulative:** `cumulative_YYYY.MM.DD_HH.MM.SS.csv`
- **K2 Planets & Candidates:** `k2pandc_YYYY.MM.DD_HH.MM.SS.csv`
- **TESS Objects of Interest (TOI):** `TOI_YYYY.MM.DD_HH.MM.SS.csv`

Each table contains per-candidate features and a mission-specific disposition field. The missions are **complementary** (different cadences and observing strategies), so merging them increases coverage and diversity.

---

## Environment

- Python ≥ 3.9
- Libraries: `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `imblearn`, `joblib`; optional: `xgboost`, `shap`.

> If `xgboost` or `shap` isn’t installed, the notebook automatically skips those steps.

---

## Reproducibility

- Random seeds are set to **42** in model splits and estimators.
- Stratified splits keep class composition consistent across train and test.
- The exact **feature order** and **class label order** are saved to JSON and re-used at inference.

---

## Data Loading (robust, no `low_memory`)

CSV parsing is deliberately defensive:

- Try multiple separators and engines (preferring the **Python engine** with `on_bad_lines="skip"`).
- Normalize column names to `snake_case`.
- Re-detect the header if the first read looks suspicious (e.g., numeric column names).
- Drop columns that are entirely empty.

> We deliberately **do not** pass `low_memory`: the Python engine doesn’t support it, and it can degrade type inference.

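A loader following these rules can be sketched as below. This is a simplified stand-in for the notebook's actual loader (the function name, separator list, and fallback behavior are illustrative; header re-detection is omitted):

```python
import re
import pandas as pd

def load_catalog(path):
    """Try several separators with the Python engine, skip malformed rows,
    normalize column names to snake_case, and drop all-empty columns."""
    df = None
    for sep in (",", ";", "\t", "|"):
        try:
            cand = pd.read_csv(path, sep=sep, engine="python", on_bad_lines="skip")
            if cand.shape[1] > 1:  # more than one column => separator looks right
                df = cand
                break
        except Exception:
            continue
    if df is None:
        raise ValueError(f"Could not parse {path}")
    # snake_case: lowercase, collapse non-alphanumeric runs to underscores
    df.columns = [re.sub(r"[^0-9a-z]+", "_", c.strip().lower()).strip("_")
                  for c in df.columns]
    return df.dropna(axis=1, how="all")
```
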
---

## Label Harmonization

Different catalogs use different disposition labels. We map them all to a **unified set**:

- **CONFIRMED**: `CONFIRMED`, `CP`, `KP`
- **CANDIDATE**: `CANDIDATE`, `PC`, `CAND`, `APC`
- **FALSE POSITIVE**: `FALSE POSITIVE`, `FP`, `FA`, `REFUTED`

Rows without a recognized disposition are dropped before modeling.

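The mapping above can be expressed directly. A minimal sketch (the disposition column name varies by catalog and is passed in; `harmonize_labels` is an illustrative helper, not the notebook's exact function):

```python
import pandas as pd

# Unified disposition mapping (raw values are stripped/uppercased before lookup)
DISPOSITION_MAP = {
    **{k: "CONFIRMED" for k in ["CONFIRMED", "CP", "KP"]},
    **{k: "CANDIDATE" for k in ["CANDIDATE", "PC", "CAND", "APC"]},
    **{k: "FALSE POSITIVE" for k in ["FALSE POSITIVE", "FP", "FA", "REFUTED"]},
}

def harmonize_labels(df, disposition_col):
    out = df.copy()
    out["label"] = (out[disposition_col].astype(str).str.strip()
                    .str.upper().map(DISPOSITION_MAP))
    # unrecognized dispositions map to NaN and are dropped
    return out.dropna(subset=["label"])
```
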
---

## Feature Harmonization (mission-agnostic)

We build a **canonical numeric feature set** by scanning a list of **alias groups** and picking the **first numeric-coercible** column present per mission, e.g.:

- Period: `['koi_period','orbital_period','period','pl_orbper','per']`
- Duration: `['koi_duration','duration','tran_dur']`
- Depth: `['koi_depth','depth','tran_depth']`
- Stellar context: `['koi_steff','st_teff','teff']`, `['koi_slogg','st_logg','logg']`, `['koi_smet','st_metfe','feh']`, `['koi_srad','st_rad','star_radius']`
- Transit geometry: `['koi_impact','impact','b']`, `['koi_sma','sma','a_rs','semi_major_axis']`
- **SNR/MES (broad):** `['koi_snr','koi_model_snr','koi_mes','mes','max_mes','snr','detection_snr','transit_snr','signal_to_noise','koi_max_snr']`

We **coerce to numeric** (`errors='coerce'`) and accept a column if any non-null values remain.

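The alias-group scan described above boils down to a small helper. A sketch (the function name is illustrative; the notebook builds the canonical frame from many such groups):

```python
import pandas as pd

def pick_first_numeric(df, aliases):
    """Return (name, coerced_series) for the first alias present in df that
    is numeric-coercible (coercion leaves at least one non-null value)."""
    for name in aliases:
        if name in df.columns:
            series = pd.to_numeric(df[name], errors="coerce")
            if series.notna().any():
                return name, series
    return None, None

# e.g. the canonical period column:
# name, values = pick_first_numeric(df, ["koi_period", "orbital_period", "period", "pl_orbper", "per"])
```
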
---

## Feature Engineering

Physics-motivated features improve separability:

- **Duty cycle**:
  `duty_cycle = koi_duration / (koi_period * 24.0)`
  (transit duration in hours relative to orbital period in days)
- **Log transforms**:
  `log_koi_period = log10(koi_period)`
  `log_koi_depth = log10(koi_depth)`
- **Equilibrium-temperature proxy**:
  - Simple: `teq_proxy = koi_steff`
  - Refined (if `a/R*` is available or derivable): `teq_proxy = koi_steff / sqrt(2 * a_rs)`
- **SNR logic (guaranteed SNR-like feature)**:
  - If any mission exposes a usable SNR/MES, we use **`koi_snr`** and also compute **`log_koi_snr`**.
  - Otherwise we compute a **mission-agnostic proxy**:

    ```
    snr_proxy     = koi_depth * sqrt( koi_duration / (koi_period * 24.0) )
    log_snr_proxy = log10(snr_proxy)
    ```

> The exact features used in training are whatever appear in `artifacts/exoplanet_feature_columns.json` (the list is created dynamically from the actual files you load).

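The formulas above translate directly to pandas. A sketch of the engineered columns (assuming the canonical names `koi_period` in days, `koi_duration` in hours, `koi_depth`, and `koi_steff` are present; the simple `teq_proxy` variant is shown):

```python
import numpy as np
import pandas as pd

def add_engineered_features(df):
    out = df.copy()
    # transit duration (hours) relative to orbital period (days -> hours)
    out["duty_cycle"] = out["koi_duration"] / (out["koi_period"] * 24.0)
    out["log_koi_period"] = np.log10(out["koi_period"])
    out["log_koi_depth"] = np.log10(out["koi_depth"])
    out["teq_proxy"] = out["koi_steff"]  # simple variant
    if "koi_snr" in out.columns:
        out["log_koi_snr"] = np.log10(out["koi_snr"])
    else:
        # mission-agnostic proxy: depth scaled by sqrt of the duty cycle
        out["snr_proxy"] = out["koi_depth"] * np.sqrt(out["duty_cycle"])
        out["log_snr_proxy"] = np.log10(out["snr_proxy"])
    return out
```
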
---

## Mission-Agnostic Policy

- We keep a temporary `mission` column only for auditing, then **drop it** before training.
- Features are derived from **physical quantities**, not mission identity.

---

## Preprocessing & Split

- Keep **numeric** columns only.
- **Imputation**: median (`SimpleImputer(strategy="median")`).
- **Scaling**: `RobustScaler()` (robust to outliers).
- **Label encoding**: `LabelEncoder` → integer target (needed by XGBoost).
- **Split**: `train_test_split(test_size=0.2, stratify=y, random_state=42)`.

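The preprocessing and split steps above can be sketched as follows (`prepare` is an illustrative helper; the imputer/scaler are returned as a Pipeline because they are fit inside the model pipeline, not on the raw frame):

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, RobustScaler

def prepare(df, label_col="label"):
    # numeric feature columns only
    X = df.drop(columns=[label_col]).select_dtypes("number")
    # integer targets, as XGBoost expects
    le = LabelEncoder()
    y = le.fit_transform(df[label_col])
    # stratified 80/20 split with the fixed seed
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    preprocess = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ])
    return X_tr, X_te, y_tr, y_te, preprocess, le
```
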
---

## Model Selection & Training

We wrap preprocessing and the classifier in a **single Pipeline**, and optimize **macro-F1** with `RandomizedSearchCV` using `StratifiedKFold`:

- **Random Forest** (`class_weight="balanced"`)
  - `n_estimators`: 150–600 (fast mode: 150/300)
  - `max_depth`: None or small integers
  - `min_samples_split`, `min_samples_leaf`
- **SVM (RBF)** (`class_weight="balanced"`)
  - `C`: logspace grid
  - `gamma`: `["scale","auto"]`
- **XGBoost** (optional; only if installed)
  - `objective="multi:softprob"`, `eval_metric="mlogloss"`, `tree_method="hist"`
  - Grid over `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `n_estimators`

> We use a `FAST` flag (defaulting to **True**) to keep grids small for iteration speed. Disabling `FAST` expands the search.

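The Random Forest branch of the search can be sketched like this (a FAST-style small grid; `X_train`/`y_train` are placeholders, and the exact grids here are illustrative, not the notebook's):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# preprocessing + estimator in a single Pipeline, tuned end-to-end
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "clf__n_estimators": [150, 300],
        "clf__max_depth": [None, 8, 16],
        "clf__min_samples_split": [2, 5, 10],
        "clf__min_samples_leaf": [1, 2, 4],
    },
    n_iter=8,
    scoring="f1_macro",  # optimize macro-F1
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
)
# search.fit(X_train, y_train); best = search.best_estimator_
```
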
---

## Evaluation

- **Primary metric**: **macro-F1** (treats all classes evenly).
- We also report **accuracy** and per-class **precision/recall/F1**.
- **Confusion matrices** (shared color scale across models).
- **ROC AUC** (one-vs-rest), when probability estimates are available.

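In scikit-learn terms, the reporting above amounts to a few calls (`evaluate` is an illustrative helper):

```python
from sklearn.metrics import classification_report, f1_score, roc_auc_score

def evaluate(model, X_test, y_test, class_labels):
    y_pred = model.predict(X_test)
    # primary metric: macro-F1 treats all classes evenly
    print("Macro-F1:", f1_score(y_test, y_pred, average="macro"))
    # accuracy plus per-class precision/recall/F1
    print(classification_report(y_test, y_pred, target_names=class_labels))
    # one-vs-rest ROC AUC, only when probabilities are available
    if hasattr(model, "predict_proba"):
        proba = model.predict_proba(X_test)
        print("ROC AUC (OvR):", roc_auc_score(y_test, proba, multi_class="ovr"))
```
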
---

## Handling Class Imbalance

We provide an **optional** SMOTE section using `imblearn.Pipeline`:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler

# SVC(probability=True) is the alternative classifier in the other variant
smote_pipe = ImbPipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(class_weight="balanced", random_state=42)),
])
```

- We **avoid nested sklearn Pipelines** inside the imblearn pipeline to prevent:
  > `TypeError: All intermediate steps of the chain should not be Pipelines`

The SMOTE models are tuned on macro-F1 and compared against the non-SMOTE runs.

---

## Explainability

- **Permutation importance** is computed on the **full pipeline**, so it reflects preprocessing.
- **SHAP** (if installed) for tree-based models:
  - Global bar chart of mean |SHAP| across classes
  - Per-class bar chart (e.g., CONFIRMED)
  - SHAP summary bar (optional)

If the best model isn’t tree-based, SHAP is skipped automatically.

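Running permutation importance on the full pipeline (rather than the bare estimator) looks like this; `top_permutation_importances` is an illustrative helper, and feature names come from the DataFrame, which is why `X_test` must stay a DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance

def top_permutation_importances(pipeline, X_test, y_test, k=10):
    # scoring matches the tuning metric; the pipeline re-applies preprocessing
    # to each shuffled copy of X_test
    result = permutation_importance(
        pipeline, X_test, y_test, scoring="f1_macro", n_repeats=5, random_state=42
    )
    order = np.argsort(result.importances_mean)[::-1][:k]
    return pd.Series(result.importances_mean[order], index=X_test.columns[order])
```
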
---

## Saved Artifacts

After selecting the **best model** (highest macro-F1 on the held-out test set), we serialize:

- `artifacts/exoplanet_best_model.joblib` – the full sklearn **Pipeline**
- `artifacts/exoplanet_feature_columns.json` – the **exact feature order** expected at inference
- `artifacts/exoplanet_class_labels.json` – class names in output index order
- `artifacts/exoplanet_metadata.json` – model name, n_features, labels, timestamp

These provide **stable inference** even if the notebook or environment changes later.

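Serialization of these artifacts can be sketched as follows (`save_artifacts` and the metadata keys are illustrative; the file names match the list above):

```python
import json
from datetime import datetime, timezone
from pathlib import Path
import joblib

def save_artifacts(best_model, best_name, feature_columns, class_labels,
                   out_dir="artifacts"):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # the full Pipeline, so preprocessing travels with the estimator
    joblib.dump(best_model, out / "exoplanet_best_model.joblib")
    (out / "exoplanet_feature_columns.json").write_text(json.dumps(feature_columns))
    (out / "exoplanet_class_labels.json").write_text(json.dumps(class_labels))
    (out / "exoplanet_metadata.json").write_text(json.dumps({
        "best_model": best_name,
        "n_features": len(feature_columns),
        "labels": class_labels,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }))
```
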
---

## Inference Usage

We include a helper that builds an input row in the **exact training feature order**, warns about **unknown/missing** keys, and prints the prediction and class probabilities:

```python
from pathlib import Path
import json, joblib, numpy as np, pandas as pd

ARTIFACTS_DIR = Path("artifacts")
model = joblib.load(ARTIFACTS_DIR / "exoplanet_best_model.joblib")
feature_columns = json.loads((ARTIFACTS_DIR / "exoplanet_feature_columns.json").read_text())
class_labels = json.loads((ARTIFACTS_DIR / "exoplanet_class_labels.json").read_text())

def predict_with_debug(model, feature_columns, class_labels, params: dict):
    # surface input problems before predicting
    unknown = sorted(set(params) - set(feature_columns))
    missing = [c for c in feature_columns if c not in params]
    if unknown:
        print("Unknown keys (ignored by the model):", unknown)
    if missing:
        print(f"Missing features (will be imputed): {len(missing)} of {len(feature_columns)}")
    # reindex guarantees the exact training feature order
    X = pd.DataFrame([params], dtype=float).reindex(columns=feature_columns)
    y_idx = int(model.predict(X)[0])
    label = class_labels[y_idx]
    print("Prediction:", label)
    try:
        proba = model.predict_proba(X)[0]
        for lbl, p in sorted(zip(class_labels, proba), key=lambda t: t[1], reverse=True):
            print(f"  {lbl:>15s}: {p:.3f}")
    except Exception:
        pass  # model has no usable predict_proba
    return label
```

### Engineered features to compute in your inputs

- `duty_cycle = koi_duration / (koi_period * 24.0)`
- `log_koi_period = log10(koi_period)`
- `log_koi_depth = log10(koi_depth)`
- `teq_proxy = koi_steff` (or `koi_steff / sqrt(2 * a_rs)` for the refined variant)
- If the trained feature list includes **`koi_snr`**: also send `log_koi_snr = log10(koi_snr)`
- Else, if it includes **`snr_proxy`**: send `snr_proxy = koi_depth * sqrt(koi_duration / (koi_period*24.0))` and `log_snr_proxy`

> You can check which SNR flavor was used by inspecting `exoplanet_feature_columns.json`.

---

## Design Choices & Rationale

- **Mission-agnostic**: We drop mission identity and rely on physical features to generalize.
- **Macro-F1**: A more balanced assessment when class counts differ (especially for FALSE POSITIVE).
- **RobustScaler**: Less sensitive to outliers than StandardScaler.
- **LabelEncoder**: Ensures XGBoost receives integer targets (fixes “Invalid classes inferred” errors).
- **SNR strategy**: If SNR/MES exists we use it; otherwise we inject a **physics-motivated proxy** to preserve detection-strength information across missions.

---

## Troubleshooting

- **CSV ParserError / “python engine” warnings**
  We avoid `low_memory` and use the Python engine with `on_bad_lines="skip"`. The loader tries alternate separators and retries with a detected header line.

- **SMOTE “intermediate steps should not be Pipelines”**
  We use a single `imblearn.Pipeline` with inline steps (impute → scale → SMOTE → clf).

- **`PicklingError: QuantileCapper`**
  We removed custom transformers and rely on standard sklearn components to ensure picklability.

- **`AttributeError: 'numpy.ndarray' has no attribute 'columns'`**
  We compute permutation importance on the **pipeline** (not the bare estimator) and take feature names from `X_train.columns`.

- **XGBoost “Invalid classes inferred”**
  We label-encode `y` to integers (`LabelEncoder`) before fitting.

- **Different inputs return the “same” class**
  Often caused by: (a) too few features provided (most are imputed), (b) features outside training ranges (scaled to similar values), or (c) providing keys not used by the model. Use `predict_with_debug` to see **recognized/unknown/missing** features and adjust your dict.

---

## How to Re-train

1. Open the notebook `Exoplanet_Classification_NASA_Kepler_K2_TESS_withSNR_ROBUST.ipynb`.
2. Ensure the three CSV files are available at the paths set in the **Data Loading** cell.
3. (Optional) Set `FAST = False` in the training cell to run a broader hyper-parameter search.
4. Run all cells. Artifacts are written to `./artifacts/`.

---

## Extending the Work

- Add more SNR/MES name variants if future catalogs introduce new headers.
- Experiment with **calibrated probabilities** (e.g., `CalibratedClassifierCV`) for better ROC behavior.
- Try additional models (LightGBM/CatBoost) if available.
- Incorporate vetting-pipeline shape features (e.g., V-shape metrics) when they can be harmonized across missions.
- Add **temporal cross-validation** per mission or per release to stress-test generalization.

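As one concrete example, the calibrated-probabilities idea can be sketched by wrapping the chosen estimator (the base model and its parameters here are illustrative; `X_train`/`y_train` are placeholders):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# cross-validated calibration of predict_proba on the training set
base = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
# calibrated.fit(X_train, y_train)
# calibrated.predict_proba(X_test)  # calibrated class probabilities
```
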