---
library_name: sklearn
tags:
- exoplanets
- astronomy
- tabular-classification
- stacking
- lightgbm
- xgboost
- catboost
- physics
license: mit
metrics:
- accuracy
- f1
model-index:
- name: Exoplanet Candidate Classifier
  results: []
---

# Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)

## Model Details

### Model Description

This is a machine learning pipeline that classifies **Kepler Objects of Interest (KOIs)**. For each detected transit signal, it predicts one of three dispositions: CANDIDATE, CONFIRMED, or FALSE POSITIVE.

The model uses a **Stacking Ensemble** architecture: three gradient boosting models (LightGBM, XGBoost, and CatBoost) produce class probabilities, and a final LightGBM meta-learner combines them. Missing values (NaNs), which are common in scientific datasets, are handled by an imputation pipeline that applies a different strategy to each of two feature groups.

- **Developed by:** Darwin Danish
- **Model Type:** Scikit-learn Pipeline (StackingClassifier)
- **Input:** Tabular data (16 astrophysical features)
- **Output:** Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)

### Model Sources

- **Repository:** https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking
- **Dataset Source:** NASA Kepler Object of Interest (KOI) Table

---

## Uses

### Direct Use

This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a KOI from the measured properties of its transit signal and host star.

### Supported Features (Input)

To use this model, your input DataFrame must contain the following columns:

**Critical Features:**
* `koi_period`: Orbital period (days)
* `koi_depth`: Transit depth (ppm)
* `koi_prad`: Planetary radius (Earth radii)
* `koi_sma`: Semi-major axis (AU)
* `koi_teq`: Equilibrium temperature (K)
* `koi_insol`: Insolation flux (relative to Earth)
* `koi_model_snr`: Transit signal-to-noise ratio

**Auxiliary Features:**
* `koi_time0bk`, `koi_duration`, `koi_incl`, `koi_srho`, `koi_srad`, `koi_smass`, `koi_steff`, `koi_slogg`, `koi_smet`

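
The column requirement above can be checked before the pipeline is called. The following is a minimal sketch, assuming only pandas is available. The helper name `prepare_features` and its behavior (adding absent columns as NaN, dropping extra columns, coercing values to numbers) are illustrative choices and not part of the published model. It reports which columns were absent so the caller can decide whether a prediction is still meaningful.

```python
import pandas as pd

CRITICAL_FEATURES = [
    "koi_period", "koi_depth", "koi_prad", "koi_sma",
    "koi_teq", "koi_insol", "koi_model_snr",
]
AUXILIARY_FEATURES = [
    "koi_time0bk", "koi_duration", "koi_incl", "koi_srho", "koi_srad",
    "koi_smass", "koi_steff", "koi_slogg", "koi_smet",
]
ALL_FEATURES = CRITICAL_FEATURES + AUXILIARY_FEATURES


def prepare_features(df):
    """Return (frame with exactly the 16 expected columns, list of absent columns).

    Hypothetical helper: absent columns are created as NaN, extra columns
    are dropped, and non-numeric entries are coerced to NaN.
    """
    missing = [name for name in ALL_FEATURES if name not in df.columns]
    prepared = df.reindex(columns=ALL_FEATURES)
    prepared = prepared.apply(pd.to_numeric, errors="coerce")
    return prepared, missing


# Example input with two known columns and one unrelated column.
raw = pd.DataFrame({"koi_period": [10.5], "koi_depth": [250.0], "kepid": [123]})
features, missing = prepare_features(raw)

print(features.shape)
print("absent columns:", missing)
```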
---

## How to Get Started with the Model

You can load this model directly from the Hugging Face Hub using `joblib` and `huggingface_hub`.

### 1. Installation

```bash
pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost
```

### 2. Python Inference Code

```python
import joblib
import pandas as pd
from huggingface_hub import hf_hub_download

# 1. Download the model and label encoder
repo_id = "DarwinDanish/exoplanet-classifier-stacking"

model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")

# 2. Load the artifacts
pipeline = joblib.load(model_path)
label_encoder = joblib.load(encoder_path)

# 3. Create sample data (Example: A likely planet candidate)
# Note: The model handles NaNs, so missing values are allowed.
data = {
    'koi_period': [365.25],    # days
    'koi_depth': [1000.5],     # ppm
    'koi_prad': [1.02],        # Earth radii
    'koi_sma': [1.0],          # AU
    'koi_teq': [255.0],        # Kelvin
    'koi_insol': [1.0],        # relative to Earth
    'koi_model_snr': [35.5],
    # Aux features (can be mostly defaults or NaNs)
    'koi_time0bk': [135.0],
    'koi_duration': [4.5],
    'koi_incl': [89.9],
    'koi_srho': [1.0],
    'koi_srad': [1.0],
    'koi_smass': [1.0],
    'koi_steff': [5700],
    'koi_slogg': [4.5],
    'koi_smet': [0.0]
}

df_new = pd.DataFrame(data)

# 4. Predict
prediction_index = pipeline.predict(df_new)
prediction_label = label_encoder.inverse_transform(prediction_index)
probabilities = pipeline.predict_proba(df_new)

print(f"Prediction: {prediction_label[0]}")
print(f"Confidence: {max(probabilities[0]):.4f}")
```

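
The inference code reports only the top probability. To see all three class probabilities with their names, each column of `predict_proba` can be paired with a label. The sketch below assumes that column `i` corresponds to `label_encoder.classes_[i]`, which holds when the pipeline was trained on the encoder's integer labels (the `inverse_transform` call above suggests this, but the card does not state it). The helper name `rank_classes` is hypothetical, and the arrays are stand-ins so the example runs without downloading the model.

```python
import numpy as np


def rank_classes(probabilities, class_names):
    """For each row, return (class name, probability) pairs sorted high to low."""
    probabilities = np.asarray(probabilities, dtype=float)
    ranked_rows = []
    for row in probabilities:
        order = np.argsort(row)[::-1]  # indices from largest to smallest probability
        ranked_rows.append([(class_names[i], float(row[i])) for i in order])
    return ranked_rows


# Stand-ins for label_encoder.classes_ and pipeline.predict_proba(df_new).
class_names = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]
example_probabilities = np.array([[0.20, 0.70, 0.10]])

ranked = rank_classes(example_probabilities, class_names)
for label, probability in ranked[0]:
    print(f"{label}: {probability:.4f}")
```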
---

## Training Details

### Training Procedure

The model was trained as a single scikit-learn pipeline: a preprocessing stage that imputes and scales the features, followed by a Stacking Classifier.

#### 1. Preprocessing

The pipeline splits features into two groups with different imputation strategies:

* **Critical Features:** Missing values filled with the constant `-999`. Scaled via `StandardScaler`.
* **Auxiliary Features:** Missing values filled with the column `median`. Scaled via `StandardScaler`.

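
One way to express the two imputation strategies is a scikit-learn `ColumnTransformer` with one sub-pipeline per feature group. This is a minimal sketch under that assumption. The published pipeline may be assembled differently, and the step names (`impute`, `scale`, `critical`, `auxiliary`) are hypothetical. The demo data is random and only exercises the code path.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

critical = ["koi_period", "koi_depth", "koi_prad", "koi_sma",
            "koi_teq", "koi_insol", "koi_model_snr"]
auxiliary = ["koi_time0bk", "koi_duration", "koi_incl", "koi_srho", "koi_srad",
             "koi_smass", "koi_steff", "koi_slogg", "koi_smet"]

# Critical features: constant sentinel for missing values, then scaling.
critical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=-999)),
    ("scale", StandardScaler()),
])
# Auxiliary features: per-column median for missing values, then scaling.
auxiliary_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("critical", critical_steps, critical),
    ("auxiliary", auxiliary_steps, auxiliary),
])

# Synthetic demo frame with one missing value in each group.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(6, 16)), columns=critical + auxiliary)
demo.loc[0, "koi_prad"] = np.nan
demo.loc[1, "koi_srad"] = np.nan

transformed = preprocessor.fit_transform(demo)
print(transformed.shape)
```

In this sketch the scaler is fitted after imputation, so the `-999` sentinel values enter the mean and variance that `StandardScaler` computes for the critical features. The sentinel still ends up far below every observed value, which lets tree-based learners isolate "missing" with a single split.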
#### 2. Architecture

* **Level 0 (Base Learners):**
  * **LightGBM:** (500 estimators, GPU accelerated)
  * **XGBoost:** (500 estimators, Histogram tree method, GPU accelerated)
  * **CatBoost:** (500 estimators, Depth 8, GPU accelerated)
* **Level 1 (Meta Learner):**
  * **LightGBM:** (200 estimators) - Aggregates the probabilities from Level 0 to make the final decision.

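
The two-level layout can be sketched with scikit-learn's `StackingClassifier`. To keep the example runnable with scikit-learn alone, the base learners and meta-learner below are stand-ins from `sklearn.ensemble`, not the LightGBM, XGBoost, and CatBoost models the card describes. The slot names, the `cv=3` setting, the small estimator counts, and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)


def build_stack(base_estimators, meta_learner):
    """Level 0 learners emit class probabilities; the meta-learner combines them."""
    return StackingClassifier(
        estimators=base_estimators,
        final_estimator=meta_learner,
        stack_method="predict_proba",  # pass probabilities, not hard labels, to Level 1
        cv=3,                          # out-of-fold predictions train the meta-learner
    )


# Stand-in learners; the card's model uses gradient boosting libraries in these slots.
base = [
    ("lgbm_slot", GradientBoostingClassifier(n_estimators=20, random_state=0)),
    ("xgb_slot", RandomForestClassifier(n_estimators=20, random_state=0)),
    ("cat_slot", ExtraTreesClassifier(n_estimators=20, random_state=0)),
]
meta = GradientBoostingClassifier(n_estimators=10, random_state=0)

# Synthetic 3-class problem with 16 features, mirroring the input width only.
X, y = make_classification(
    n_samples=150, n_features=16, n_informative=8, n_classes=3, random_state=0
)
stack = build_stack(base, meta).fit(X, y)

meta_inputs = stack.transform(X[:5])  # what the Level 1 learner receives
probabilities = stack.predict_proba(X[:5])
print(meta_inputs.shape, probabilities.shape)
```

With three base learners and three classes, the meta-learner sees nine probability columns per sample. To mirror the card's configuration, the three slots would hold `lightgbm.LGBMClassifier`, `xgboost.XGBClassifier`, and `catboost.CatBoostClassifier` with 500 estimators each, and the meta-learner would be an `LGBMClassifier` with 200 estimators.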
### Feature Importance

Based on the base learners, the most important features for classification were:

1. `koi_model_snr` (Signal-to-Noise Ratio)
2. `koi_prad` (Planetary Radius)
3. `koi_depth` (Transit Depth)
4. `koi_period` (Orbital Period)

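
The card does not say how the ranking above was computed. One plausible approach, sketched here as an assumption, is to normalize each fitted base learner's `feature_importances_` so it sums to 1 and then average across learners; normalizing first matters because different libraries report importances on different scales. The helper name `mean_base_importance`, the stand-in learners, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression


def mean_base_importance(stack, feature_names):
    """Average the normalized importances of the fitted base learners."""
    totals = np.zeros(len(feature_names))
    used = 0
    for estimator in stack.estimators_:
        importances = getattr(estimator, "feature_importances_", None)
        if importances is None:
            continue  # this learner does not expose importances
        importances = np.asarray(importances, dtype=float)
        totals += importances / importances.sum()
        used += 1
    if used == 0:
        raise ValueError("no base learner exposes feature_importances_")
    mean = totals / used
    order = np.argsort(mean)[::-1]  # most important first
    return [(feature_names[i], float(mean[i])) for i in order]


# Synthetic stand-in for a fitted stacking model.
feature_names = [f"feature_{i}" for i in range(6)]
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
stack = StackingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
).fit(X, y)

ranking = mean_base_importance(stack, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```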
---

## Evaluation Results

The model was evaluated on a held-out test set (20% of the data) created with a stratified split.

* **Accuracy:** ~90% or higher (the exact value depends on the specific test split)
* **Precision/Recall:** High precision in distinguishing False Positives from Candidates.

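
The evaluation protocol (a stratified 80/20 split scored with accuracy and F1) can be sketched as follows. The data is synthetic, the class counts are invented for illustration, and `DummyClassifier` is a stand-in baseline rather than the stacking model, so the printed numbers say nothing about this model's performance. Macro averaging for F1 is an assumption, since the card does not specify the averaging mode.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class labels (50 / 30 / 20 samples).
y = np.array([0] * 50 + [1] * 30 + [2] * 20)
X = np.arange(len(y), dtype=float).reshape(-1, 1)

# Stratification keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro", zero_division=0)

print("test class counts:", np.bincount(y_test))
print(f"accuracy: {accuracy:.3f}, macro F1: {macro_f1:.3f}")
```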
---

## Bias, Risks, and Limitations

* **Data Specificity:** This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without fine-tuning, as the instrumentation and noise profiles differ.
* **Class Imbalance:** Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false positive predictions in low-SNR ranges.

## Environmental Impact

* **Compute:** Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
* **Training Time:** Less than 5 minutes due to GPU acceleration and efficient gradient boosting implementations.

---