---
library_name: sklearn
tags:
- exoplanets
- astronomy
- tabular-classification
- stacking
- lightgbm
- xgboost
- catboost
- physics
license: mit
metrics:
- accuracy
- f1
model-index:
- name: Exoplanet Candidate Classifier
  results: []
---
# Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)
## Model Details
### Model Description
This is a machine learning pipeline that classifies **Kepler Objects of Interest (KOIs)**. For each detected signal it predicts whether the object is a planet candidate, a confirmed exoplanet, or a false positive.
The model uses a **Stacking Ensemble** architecture: three gradient boosting frameworks (LightGBM, XGBoost, and CatBoost) act as base learners, and a final LightGBM meta-learner aggregates their predicted probabilities. Missing values (NaNs), which are common in scientific datasets, are handled by a dual-strategy imputation pipeline.
- **Developed by:** Darwin Danish
- **Model Type:** Scikit-learn Pipeline (StackingClassifier)
- **Input:** Tabular data (16 astrophysical features)
- **Output:** Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)
### Model Sources
- **Repository:** https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking
- **Dataset Source:** NASA Kepler Object of Interest (KOI) Table
---
## Uses
### Direct Use
This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a celestial object based on its physical properties.
### Supported Features (Input)
To use this model, your input DataFrame must contain the following columns:
**Critical Features:**
* `koi_period`: Orbital period
* `koi_depth`: Transit depth
* `koi_prad`: Planetary radius
* `koi_sma`: Semi-major axis
* `koi_teq`: Equilibrium temperature
* `koi_insol`: Insolation flux
* `koi_model_snr`: Signal-to-Noise Ratio
**Auxiliary Features:**
* `koi_time0bk`, `koi_duration`, `koi_incl`, `koi_srho`, `koi_srad`, `koi_smass`, `koi_steff`, `koi_slogg`, `koi_smet`
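
The column requirement above can be checked before calling the model. The following is a minimal sketch, not part of the published pipeline: the helper name `prepare_features` and its `strict` flag are hypothetical, and filling absent columns with NaN assumes the pipeline's imputers will handle them as described in this card.

```python
import pandas as pd

CRITICAL_FEATURES = [
    "koi_period", "koi_depth", "koi_prad", "koi_sma",
    "koi_teq", "koi_insol", "koi_model_snr",
]
AUXILIARY_FEATURES = [
    "koi_time0bk", "koi_duration", "koi_incl", "koi_srho", "koi_srad",
    "koi_smass", "koi_steff", "koi_slogg", "koi_smet",
]
ALL_FEATURES = CRITICAL_FEATURES + AUXILIARY_FEATURES


def prepare_features(df: pd.DataFrame, strict: bool = True) -> pd.DataFrame:
    """Return a copy of df restricted to the 16 expected columns.

    Hypothetical helper: strict=True raises if a column is absent,
    strict=False adds absent columns filled with NaN.
    """
    missing = [col for col in ALL_FEATURES if col not in df.columns]
    if missing and strict:
        raise ValueError(f"Missing required columns: {missing}")
    # reindex drops extra columns and adds absent ones as NaN
    return df.reindex(columns=ALL_FEATURES).astype(float)


# Example: a frame with only two of the expected columns plus an extra one
raw = pd.DataFrame({"koi_period": [365.25], "koi_depth": [1000.5], "extra": [1]})
features = prepare_features(raw, strict=False)
print(features.shape)
```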
---
## How to Get Started with the Model
You can load this model directly from the Hugging Face Hub using `joblib` and `huggingface_hub`.
### 1. Installation
```bash
pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost
```
### 2. Python Inference Code
```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download
# 1. Download the model and label encoder
repo_id = "DarwinDanish/exoplanet-classifier-stacking"
model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")
# 2. Load the artifacts
pipeline = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)
# 3. Create sample data (Example: A likely planet candidate)
# Note: The model handles NaNs, so missing values are allowed.
data = {
    'koi_period': [365.25],
    'koi_depth': [1000.5],
    'koi_prad': [1.02],  # Earth radii
    'koi_sma': [1.0],  # AU
    'koi_teq': [255.0],  # Kelvin
    'koi_insol': [1.0],
    'koi_model_snr': [35.5],
    # Aux features (can be mostly defaults or NaNs)
    'koi_time0bk': [135.0],
    'koi_duration': [4.5],
    'koi_incl': [89.9],
    'koi_srho': [1.0],
    'koi_srad': [1.0],
    'koi_smass': [1.0],
    'koi_steff': [5700],
    'koi_slogg': [4.5],
    'koi_smet': [0.0]
}
df_new = pd.DataFrame(data)
# 4. Predict
prediction_index = pipeline.predict(df_new)
prediction_label = label_encoder.inverse_transform(prediction_index)
probabilities = pipeline.predict_proba(df_new)
print(f"Prediction: {prediction_label[0]}")
print(f"Confidence: {max(probabilities[0]):.4f}")
```
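To inspect all class probabilities rather than only the top one, the output of `predict_proba` can be labeled with the class names. This is a sketch under one assumption: that the probability columns follow the order of `label_encoder.classes_`, which holds if the pipeline was trained on the encoder's integer labels. The helper name `probability_table` is hypothetical, and the numbers below are made up for illustration. With the real artifacts, the call would be `probability_table(pipeline.predict_proba(df_new), label_encoder.classes_)`.

```python
import numpy as np
import pandas as pd


def probability_table(probabilities, class_names) -> pd.DataFrame:
    """Label each probability column and add the top class with its probability."""
    proba = np.asarray(probabilities, dtype=float)
    table = pd.DataFrame(proba, columns=list(class_names))
    table["predicted"] = [class_names[i] for i in proba.argmax(axis=1)]
    table["confidence"] = proba.max(axis=1)
    return table


# Made-up probabilities standing in for pipeline.predict_proba(df_new)
example_proba = [[0.15, 0.80, 0.05], [0.30, 0.10, 0.60]]
example_classes = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]
print(probability_table(example_proba, example_classes))
```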
-----
## Training Details
### Training Procedure
The model was trained using a robust preprocessing pipeline followed by a Stacking Classifier.
#### 1. Preprocessing
The pipeline splits features into two groups with different imputation strategies:
* **Critical Features:** Missing values filled with constant `-999`. Scaled via `StandardScaler`.
* **Auxiliary Features:** Missing values filled with the `median`. Scaled via `StandardScaler`.
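
The two imputation strategies can be expressed with scikit-learn's `ColumnTransformer`. The sketch below is one plausible way to build such a preprocessor, not necessarily the exact code used for training; the step names and the synthetic demo data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

critical = ["koi_period", "koi_depth", "koi_prad", "koi_sma",
            "koi_teq", "koi_insol", "koi_model_snr"]
auxiliary = ["koi_time0bk", "koi_duration", "koi_incl", "koi_srho", "koi_srad",
             "koi_smass", "koi_steff", "koi_slogg", "koi_smet"]

# Critical features: constant -999 marks "missing", then standard scaling
critical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=-999)),
    ("scale", StandardScaler()),
])
# Auxiliary features: median imputation, then standard scaling
auxiliary_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("critical", critical_steps, critical),
    ("auxiliary", auxiliary_steps, auxiliary),
])

# Tiny synthetic frame with a few missing values (not real KOI data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(6, 16)), columns=critical + auxiliary)
demo.loc[0, "koi_prad"] = np.nan
demo.loc[1, "koi_srad"] = np.nan

transformed = preprocessor.fit_transform(demo)
print(transformed.shape)
```

Because the constant `-999` lies far outside the physical range of the critical features, tree-based learners can split on it and effectively treat "missing" as its own category.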
#### 2. Architecture
* **Level 0 (Base Learners):**
    * **LightGBM:** (500 estimators, GPU accelerated)
    * **XGBoost:** (500 estimators, Histogram tree method, GPU accelerated)
    * **CatBoost:** (500 estimators, Depth 8, GPU accelerated)
* **Level 1 (Meta Learner):**
    * **LightGBM:** (200 estimators) - Aggregates the probabilities from Level 0 to make the final decision.
### Feature Importance
Based on the base learners, the most critical features for classification were identified as:
1. `koi_model_snr` (Signal-to-Noise Ratio)
2. `koi_prad` (Planetary Radius)
3. `koi_depth` (Transit Depth)
4. `koi_period` (Orbital Period)
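
This card does not state how importances from the three base learners were combined. One possible approach, sketched below, is to normalize each learner's importance vector so it sums to 1 (the libraries report importances on different scales) and then average across learners. The helper name, the generic feature names, and all numbers are placeholders for illustration, not the model's actual importances.

```python
import numpy as np


def average_importance(importances_by_model, feature_names):
    """Normalize each model's importances to sum to 1, then average across models."""
    normalized = []
    for values in importances_by_model.values():
        values = np.asarray(values, dtype=float)
        normalized.append(values / values.sum())
    mean_importance = np.mean(normalized, axis=0)
    order = np.argsort(mean_importance)[::-1]  # highest importance first
    return [(feature_names[i], float(mean_importance[i])) for i in order]


# Placeholder values on three different scales (illustration only)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
importances = {
    "learner_1": [400, 300, 200, 100],
    "learner_2": [0.40, 0.25, 0.25, 0.10],
    "learner_3": [35.0, 30.0, 20.0, 15.0],
}
ranking = average_importance(importances, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```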
-----
## Evaluation Results
The model was evaluated on a held-out test set (20% of data) using stratified splitting.
* **Accuracy:** Approximately 90% or higher; the exact figure depends on the specific test split.
* **Precision/Recall:** High precision in distinguishing False Positives from Candidates.
-----
## Bias, Risks, and Limitations
* **Data Specificity:** This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without retraining, as the instrumentation and noise profiles differ.
* **Class Imbalance:** Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false positive predictions in low-SNR ranges.
## Environmental Impact
* **Compute:** Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
* **Training Time:** Less than 5 minutes, owing to GPU acceleration and efficient gradient boosting implementations.
-----