---
library_name: sklearn
tags:
- exoplanets
- astronomy
- tabular-classification
- stacking
- lightgbm
- xgboost
- catboost
- physics
license: mit
metrics:
- accuracy
- f1
model-index:
- name: Exoplanet Candidate Classifier
  results: []
---

# Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)

## Model Details

### Model Description

This is a robust machine learning pipeline designed to classify **Kepler Objects of Interest (KOIs)**. It determines whether a detected signal represents a real exoplanet or a false positive.

The model uses a **Stacking Ensemble** architecture, combining the predictions of three gradient boosting frameworks (LightGBM, XGBoost, and CatBoost) and aggregating them with a final LightGBM meta-learner. It is specifically engineered to handle missing data (NaNs) in scientific datasets through a dual-strategy imputation pipeline.

- **Developed by:** Darwin Danish
- **Model Type:** Scikit-learn Pipeline (StackingClassifier)
- **Input:** Tabular data (16 astrophysical features)
- **Output:** Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)

### Model Sources

- **Repository:** https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking
- **Dataset Source:** NASA Kepler Object of Interest (KOI) Table

---

## Uses

### Direct Use

This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a celestial object based on its physical properties.
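Before calling the model, it can help to check that an input frame carries all 16 expected columns (listed in the next section). A minimal sketch, assuming a plain pandas DataFrame as input; the helper name `prepare_koi_input` is illustrative and not part of the released artifacts:

```python
import pandas as pd

# The 16 astrophysical features the pipeline expects (from this model card).
FEATURES = [
    "koi_period", "koi_depth", "koi_prad", "koi_sma", "koi_teq",
    "koi_insol", "koi_model_snr",
    "koi_time0bk", "koi_duration", "koi_incl", "koi_srho",
    "koi_srad", "koi_smass", "koi_steff", "koi_slogg", "koi_smet",
]

def prepare_koi_input(df: pd.DataFrame) -> pd.DataFrame:
    """Select the model's input columns, raising if any are missing.

    NaNs are deliberately left in place: the pipeline imputes them itself.
    """
    missing = [c for c in FEATURES if c not in df.columns]
    if missing:
        raise KeyError(f"Input is missing required columns: {missing}")
    return df[FEATURES]
```

Passing the result of this helper to `pipeline.predict` guarantees the column order matches what the pipeline saw at training time.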
### Supported Features (Input)

To use this model, your input DataFrame must contain the following columns:

**Critical Features:**

* `koi_period`: Orbital period
* `koi_depth`: Transit depth
* `koi_prad`: Planetary radius
* `koi_sma`: Semi-major axis
* `koi_teq`: Equilibrium temperature
* `koi_insol`: Insolation flux
* `koi_model_snr`: Signal-to-noise ratio

**Auxiliary Features:**

* `koi_time0bk`, `koi_duration`, `koi_incl`, `koi_srho`, `koi_srad`, `koi_smass`, `koi_steff`, `koi_slogg`, `koi_smet`

---

## How to Get Started with the Model

You can load this model directly from the Hugging Face Hub using `joblib` and `huggingface_hub`.

### 1. Installation

```bash
pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost
```

### 2. Python Inference Code

```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# 1. Download the model and label encoder
repo_id = "DarwinDanish/exoplanet-classifier-stacking"
model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")

# 2. Load the artifacts
pipeline = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)

# 3. Create sample data (example: a likely planet candidate)
# Note: the model handles NaNs, so missing values are allowed.
data = {
    'koi_period': [365.25],
    'koi_depth': [1000.5],
    'koi_prad': [1.02],      # Earth radii
    'koi_sma': [1.0],        # AU
    'koi_teq': [255.0],      # Kelvin
    'koi_insol': [1.0],
    'koi_model_snr': [35.5],
    # Aux features (can be mostly defaults or NaNs)
    'koi_time0bk': [135.0],
    'koi_duration': [4.5],
    'koi_incl': [89.9],
    'koi_srho': [1.0],
    'koi_srad': [1.0],
    'koi_smass': [1.0],
    'koi_steff': [5700],
    'koi_slogg': [4.5],
    'koi_smet': [0.0]
}
df_new = pd.DataFrame(data)

# 4. Predict
prediction_index = pipeline.predict(df_new)
prediction_label = label_encoder.inverse_transform(prediction_index)
probabilities = pipeline.predict_proba(df_new)

print(f"Prediction: {prediction_label[0]}")
print(f"Confidence: {max(probabilities[0]):.4f}")
```

---

## Training Details

### Training Procedure

The model was trained using a robust preprocessing pipeline followed by a Stacking Classifier.

#### 1. Preprocessing

The pipeline splits features into two groups with different imputation strategies:

* **Critical Features:** Missing values filled with the constant `-999`, then scaled via `StandardScaler`.
* **Auxiliary Features:** Missing values filled with the `median`, then scaled via `StandardScaler`.

#### 2. Architecture

* **Level 0 (Base Learners):**
  * **LightGBM:** 500 estimators, GPU accelerated
  * **XGBoost:** 500 estimators, histogram tree method, GPU accelerated
  * **CatBoost:** 500 estimators, depth 8, GPU accelerated
* **Level 1 (Meta-Learner):**
  * **LightGBM:** 200 estimators; aggregates the Level 0 probabilities to make the final decision.

### Feature Importance

Based on the base learners, the most critical features for classification were:

1. `koi_model_snr` (Signal-to-Noise Ratio)
2. `koi_prad` (Planetary Radius)
3. `koi_depth` (Transit Depth)
4. `koi_period` (Orbital Period)

---

## Evaluation Results

The model was evaluated on a held-out test set (20% of the data) using stratified splitting.

* **Accuracy:** ~90%+ (dependent on the specific test split)
* **Precision/Recall:** High precision in distinguishing False Positives from Candidates.

---

## Bias, Risks, and Limitations

* **Data Specificity:** This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without fine-tuning, as the instrumentation and noise profiles differ.
* **Class Imbalance:** Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false-positive predictions in low-SNR ranges.

## Environmental Impact

* **Compute:** Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
* **Training Time:** Under 5 minutes, due to GPU acceleration and efficient gradient boosting implementations.

---