---
library_name: sklearn
tags:
- exoplanets
- astronomy
- tabular-classification
- stacking
- lightgbm
- xgboost
- catboost
- physics
license: mit
metrics:
- accuracy
- f1
model-index:
- name: Exoplanet Candidate Classifier
  results: []
---

# Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)

## Model Details

### Model Description

This is a robust machine learning pipeline designed to classify **Kepler Objects of Interest (KOIs)**. It determines whether a detected signal represents a real exoplanet or a false positive.

The model uses a **Stacking Ensemble** architecture, combining the predictions of three gradient boosting frameworks (LightGBM, XGBoost, and CatBoost) and aggregating them with a final LightGBM meta-learner. It is specifically engineered to handle missing data (NaNs) in scientific datasets through a dual-strategy imputation pipeline.

- **Developed by:** Darwin Danish
- **Model Type:** Scikit-learn Pipeline (StackingClassifier)
- **Input:** Tabular data (16 astrophysical features)
- **Output:** Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)

### Model Sources

- **Repository:** https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking
- **Dataset Source:** NASA Kepler Object of Interest (KOI) Table

---

## Uses

### Direct Use

This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a celestial object based on its physical properties.
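Before calling the model, it can help to check that an input frame carries all 16 expected columns (listed in the next section). A minimal sketch, assuming a plain pandas DataFrame as input; the helper name `prepare_koi_input` is illustrative and not part of the released artifacts:

```python
import pandas as pd

# The 16 astrophysical features the pipeline expects (from this model card).
FEATURES = [
    "koi_period", "koi_depth", "koi_prad", "koi_sma", "koi_teq",
    "koi_insol", "koi_model_snr",
    "koi_time0bk", "koi_duration", "koi_incl", "koi_srho",
    "koi_srad", "koi_smass", "koi_steff", "koi_slogg", "koi_smet",
]

def prepare_koi_input(df: pd.DataFrame) -> pd.DataFrame:
    """Select the model's input columns, raising if any are missing.

    NaNs are deliberately left in place: the pipeline imputes them itself.
    """
    missing = [c for c in FEATURES if c not in df.columns]
    if missing:
        raise KeyError(f"Input is missing required columns: {missing}")
    return df[FEATURES]
```

Passing the result of this helper to `pipeline.predict` guarantees the column order matches what the pipeline saw at training time.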
### Supported Features (Input)

To use this model, your input DataFrame must contain the following columns:

**Critical Features:**

* `koi_period`: Orbital period
* `koi_depth`: Transit depth
* `koi_prad`: Planetary radius
* `koi_sma`: Semi-major axis
* `koi_teq`: Equilibrium temperature
* `koi_insol`: Insolation flux
* `koi_model_snr`: Signal-to-noise ratio

**Auxiliary Features:**

* `koi_time0bk`, `koi_duration`, `koi_incl`, `koi_srho`, `koi_srad`, `koi_smass`, `koi_steff`, `koi_slogg`, `koi_smet`

---

## How to Get Started with the Model

You can load this model directly from the Hugging Face Hub using `joblib` and `huggingface_hub`.

### 1. Installation

```bash
pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost
```

### 2. Python Inference Code

```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# 1. Download the model and label encoder
repo_id = "DarwinDanish/exoplanet-classifier-stacking"
model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")

# 2. Load the artifacts
pipeline = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)

# 3. Create sample data (example: a likely planet candidate)
# Note: the model handles NaNs, so missing values are allowed.
data = {
    'koi_period': [365.25],
    'koi_depth': [1000.5],
    'koi_prad': [1.02],      # Earth radii
    'koi_sma': [1.0],        # AU
    'koi_teq': [255.0],      # Kelvin
    'koi_insol': [1.0],
    'koi_model_snr': [35.5],
    # Aux features (can be mostly defaults or NaNs)
    'koi_time0bk': [135.0],
    'koi_duration': [4.5],
    'koi_incl': [89.9],
    'koi_srho': [1.0],
    'koi_srad': [1.0],
    'koi_smass': [1.0],
    'koi_steff': [5700],
    'koi_slogg': [4.5],
    'koi_smet': [0.0]
}
df_new = pd.DataFrame(data)

# 4. Predict
prediction_index = pipeline.predict(df_new)
prediction_label = label_encoder.inverse_transform(prediction_index)
probabilities = pipeline.predict_proba(df_new)

print(f"Prediction: {prediction_label[0]}")
print(f"Confidence: {max(probabilities[0]):.4f}")
```

---

## Training Details

### Training Procedure

The model was trained using a robust preprocessing pipeline followed by a Stacking Classifier.

#### 1. Preprocessing

The pipeline splits features into two groups with different imputation strategies:

* **Critical Features:** Missing values filled with the constant `-999`, then scaled via `StandardScaler`.
* **Auxiliary Features:** Missing values filled with the `median`, then scaled via `StandardScaler`.

#### 2. Architecture

* **Level 0 (Base Learners):**
  * **LightGBM:** 500 estimators, GPU accelerated
  * **XGBoost:** 500 estimators, histogram tree method, GPU accelerated
  * **CatBoost:** 500 estimators, depth 8, GPU accelerated
* **Level 1 (Meta-Learner):**
  * **LightGBM:** 200 estimators; aggregates the Level 0 probabilities to make the final decision.

### Feature Importance

Based on the base learners, the most critical features for classification were:

1. `koi_model_snr` (Signal-to-Noise Ratio)
2. `koi_prad` (Planetary Radius)
3. `koi_depth` (Transit Depth)
4. `koi_period` (Orbital Period)

---

## Evaluation Results

The model was evaluated on a held-out test set (20% of the data) using stratified splitting.

* **Accuracy:** ~90%+ (dependent on the specific test split)
* **Precision/Recall:** High precision in distinguishing False Positives from Candidates.

---

## Bias, Risks, and Limitations

* **Data Specificity:** This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without fine-tuning, as the instrumentation and noise profiles differ.
* **Class Imbalance:** Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false-positive predictions in low-SNR ranges.

## Environmental Impact

* **Compute:** Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
* **Training Time:** Under 5 minutes, due to GPU acceleration and efficient gradient boosting implementations.

---