	---
	library_name: sklearn
	tags:
	- exoplanets
	- astronomy
	- tabular-classification
	- stacking
	- lightgbm
	- xgboost
	- catboost
	- physics
	license: mit
	metrics:
	- accuracy
	- f1
	model-index:
	- name: Exoplanet Candidate Classifier
	results: []
	---

	# Model Card: Exoplanet Candidate Classifier (Stacking Ensemble)

	## Model Details

	### Model Description

	This is a machine learning pipeline that classifies Kepler Objects of Interest (KOIs). It assigns each detected signal one of three dispositions: CANDIDATE, CONFIRMED, or FALSE POSITIVE.

	The model uses a stacking ensemble: three gradient boosting base learners (LightGBM, XGBoost, and CatBoost) produce class probabilities, and a final LightGBM meta-learner combines them. Missing values (NaNs), which are common in scientific datasets, are handled by a two-strategy imputation pipeline.

	- Developed by: Darwin Danish
	- Model Type: Scikit-learn Pipeline (StackingClassifier)
	- Input: Tabular data (16 astrophysical features)
	- Output: Multi-class classification (CANDIDATE, CONFIRMED, FALSE POSITIVE)

	### Model Sources

	- Repository: https://huggingface.co/DarwinDanish/exoplanet-classifier-stacking
	- Dataset Source: NASA Kepler Object of Interest (KOI) Table

	---

	## Uses

	### Direct Use

	This model is intended for astronomers, data scientists, and space enthusiasts who want to analyze Kepler mission data or similar photometric datasets. It predicts the "disposition" of a celestial object based on its physical properties.

	### Supported Features (Input)
	To use this model, your input DataFrame must contain the following columns:

	Critical Features:
	* `koi_period`: Orbital period (days)
	* `koi_depth`: Transit depth (parts per million)
	* `koi_prad`: Planetary radius (Earth radii)
	* `koi_sma`: Semi-major axis (AU)
	* `koi_teq`: Equilibrium temperature (Kelvin)
	* `koi_insol`: Insolation flux (relative to Earth)
	* `koi_model_snr`: Transit signal-to-noise ratio

	Auxiliary Features:
	* `koi_time0bk`, `koi_duration`, `koi_incl`, `koi_srho`, `koi_srad`, `koi_smass`, `koi_steff`, `koi_slogg`, `koi_smet`
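
The column requirement above can be checked before calling the model. The following is a minimal sketch, not part of the published pipeline: `prepare_features` is a hypothetical helper, and it assumes that filling an entirely absent column with NaN is acceptable because the pipeline imputes missing values.

```python
import pandas as pd

# Column names copied from the feature lists above.
CRITICAL_FEATURES = [
    "koi_period", "koi_depth", "koi_prad", "koi_sma",
    "koi_teq", "koi_insol", "koi_model_snr",
]
AUXILIARY_FEATURES = [
    "koi_time0bk", "koi_duration", "koi_incl", "koi_srho", "koi_srad",
    "koi_smass", "koi_steff", "koi_slogg", "koi_smet",
]
ALL_FEATURES = CRITICAL_FEATURES + AUXILIARY_FEATURES


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return the 16 expected columns in a fixed order.

    Columns absent from `df` are created and filled with NaN;
    columns the model does not use are dropped.
    """
    return df.reindex(columns=ALL_FEATURES).astype(float)


# Example input with two known features and one unrelated column.
raw = pd.DataFrame({"koi_period": [365.25], "koi_depth": [1000.5], "kepid": [123]})
missing = [c for c in ALL_FEATURES if c not in raw.columns]
prepared = prepare_features(raw)
print(f"{len(missing)} expected columns were absent and filled with NaN")
```

Reporting the absent columns rather than silently filling them makes it easier to spot a misnamed column, which would otherwise be imputed as if the measurement were missing.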

	---

	## How to Get Started with the Model

	You can load this model directly from the Hugging Face Hub using `joblib` and `huggingface_hub`.

	### 1. Installation

	```bash
	pip install huggingface_hub joblib pandas scikit-learn lightgbm xgboost catboost
	```

	### 2. Python Inference Code

	```python
	import joblib
	import pandas as pd
	from huggingface_hub import hf_hub_download

	# 1. Download the model and label encoder
	repo_id = "DarwinDanish/exoplanet-classifier-stacking"

	model_path = hf_hub_download(repo_id=repo_id, filename="exo_stacking_pipeline.pkl")
	encoder_path = hf_hub_download(repo_id=repo_id, filename="exo_label_encoder.pkl")

	# 2. Load the artifacts
	pipeline = joblib.load(model_path)
	label_encoder = joblib.load(encoder_path)

	# 3. Create sample data (Example: A likely planet candidate)
	# Note: The model handles NaNs, so missing values are allowed.
	data = {
	    'koi_period': [365.25],
	    'koi_depth': [1000.5],
	    'koi_prad': [1.02],      # Earth radii
	    'koi_sma': [1.0],        # AU
	    'koi_teq': [255.0],      # Kelvin
	    'koi_insol': [1.0],
	    'koi_model_snr': [35.5],
	    # Aux features (can be mostly defaults or NaNs)
	    'koi_time0bk': [135.0],
	    'koi_duration': [4.5],
	    'koi_incl': [89.9],
	    'koi_srho': [1.0],
	    'koi_srad': [1.0],
	    'koi_smass': [1.0],
	    'koi_steff': [5700],
	    'koi_slogg': [4.5],
	    'koi_smet': [0.0],
	}

	df_new = pd.DataFrame(data)

	# 4. Predict
	prediction_index = pipeline.predict(df_new)
	prediction_label = label_encoder.inverse_transform(prediction_index)
	probabilities = pipeline.predict_proba(df_new)

	print(f"Prediction: {prediction_label[0]}")
	print(f"Confidence: {max(probabilities[0]):.4f}")
	```
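
The inference code prints only the top probability. To see the full distribution over the three dispositions, the probability columns can be paired with the encoder's class names. This is a small sketch under two assumptions: the encoder is a scikit-learn `LabelEncoder`, and column `i` of `predict_proba` corresponds to encoded class `i`. The encoder and the probability row below are stand-ins with made-up values so the snippet runs without downloading the model.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-ins for the loaded artifacts (assumed shapes, made-up values).
label_encoder = LabelEncoder().fit(["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"])
probabilities = np.array([[0.20, 0.70, 0.10]])  # one row per input sample

# Pair each class name with its probability for the first sample.
per_class = dict(zip(label_encoder.classes_, probabilities[0]))
for name, p in sorted(per_class.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {p:.4f}")

# Recover the most probable label from its column index.
top_index = int(np.argmax(probabilities[0]))
top_label = label_encoder.inverse_transform([top_index])[0]
```

With the real artifacts, replace the two stand-ins with `label_encoder` and `probabilities` from the inference code.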

	-----

	## Training Details

	### Training Procedure

	The model was trained using a robust preprocessing pipeline followed by a Stacking Classifier.

	#### 1. Preprocessing

	The pipeline splits features into two groups with different imputation strategies:

	* Critical Features: Missing values filled with constant `-999`. Scaled via `StandardScaler`.
	* Auxiliary Features: Missing values filled with the `median`. Scaled via `StandardScaler`.
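
The two imputation branches described above could be expressed with scikit-learn's `ColumnTransformer`. The sketch below is an illustration of that idea and not the exact training code: the step names are hypothetical, and the input is random data with two NaNs injected.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

critical = ["koi_period", "koi_depth", "koi_prad", "koi_sma",
            "koi_teq", "koi_insol", "koi_model_snr"]
auxiliary = ["koi_time0bk", "koi_duration", "koi_incl", "koi_srho", "koi_srad",
             "koi_smass", "koi_steff", "koi_slogg", "koi_smet"]

# Critical features: a sentinel value marks "missing" before scaling.
critical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=-999.0)),
    ("scale", StandardScaler()),
])
# Auxiliary features: the per-column median replaces missing values.
auxiliary_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocessor = ColumnTransformer([
    ("critical", critical_steps, critical),
    ("auxiliary", auxiliary_steps, auxiliary),
])

# Random placeholder data with one NaN in each feature group.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 16)), columns=critical + auxiliary)
X.loc[0, "koi_prad"] = np.nan
X.loc[1, "koi_srad"] = np.nan

X_ready = preprocessor.fit_transform(X)
print(X_ready.shape, "NaNs left:", int(np.isnan(X_ready).sum()))
```

A constant sentinel keeps "this critical value was missing" visible to tree-based learners as a distinct value, whereas median imputation hides the gap; that is one plausible reading of why the two groups are treated differently.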

	#### 2. Architecture

	* Level 0 (Base Learners):
	  * LightGBM: 500 estimators, GPU accelerated
	  * XGBoost: 500 estimators, histogram tree method, GPU accelerated
	  * CatBoost: 500 estimators, depth 8, GPU accelerated
	* Level 1 (Meta Learner):
	  * LightGBM: 200 estimators. Aggregates the probabilities from Level 0 to make the final decision.

	### Feature Importance

	Based on the base learners, the most critical features for classification were identified as:

	1. `koi_model_snr` (Signal-to-Noise Ratio)
	2. `koi_prad` (Planetary Radius)
	3. `koi_depth` (Transit Depth)
	4. `koi_period` (Orbital Period)
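
One way such a ranking could be derived is by averaging the `feature_importances_` of the fitted base learners. The sketch below assumes that approach; the card does not state how the ranking was actually computed. It uses scikit-learn tree ensembles as stand-ins and synthetic data, so the resulting order carries no astrophysical meaning. Each vector is normalized first because importance scores from different libraries are not necessarily on the same scale.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

feature_names = [
    "koi_period", "koi_depth", "koi_prad", "koi_sma", "koi_teq", "koi_insol",
    "koi_model_snr", "koi_time0bk", "koi_duration", "koi_incl", "koi_srho",
    "koi_srad", "koi_smass", "koi_steff", "koi_slogg", "koi_smet",
]

# Synthetic placeholder data: the columns only borrow the KOI names.
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6, n_classes=3, random_state=0
)

# Stand-in base learners that expose `feature_importances_`.
base_learners = {
    "gb_standin": GradientBoostingClassifier(n_estimators=20, random_state=0),
    "rf_standin": RandomForestClassifier(n_estimators=50, random_state=0),
}

normalized = []
for name, model in base_learners.items():
    model.fit(X, y)
    importance = np.asarray(model.feature_importances_, dtype=float)
    normalized.append(importance / importance.sum())  # put on a common scale

mean_importance = np.mean(normalized, axis=0)
ranking = sorted(zip(feature_names, mean_importance), key=lambda kv: kv[1], reverse=True)
for feature, score in ranking[:4]:
    print(f"{feature:15s} {score:.3f}")
```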

	-----

	## Evaluation Results

	The model was evaluated on a held-out test set (20% of data) using stratified splitting.

	* Accuracy: approximately 90% or higher, depending on the specific test split.
	* Precision/Recall: High precision in distinguishing False Positives from Candidates.

	-----

	## Bias, Risks, and Limitations

	* Data Specificity: This model is trained specifically on Kepler mission data. It may not generalize well to data from TESS or JWST without fine-tuning, as the instrumentation and noise profiles differ.
	* Class Imbalance: Depending on the dataset version, "False Positives" are often more numerous than "Confirmed" planets, which can bias the model slightly toward false positive predictions in low-SNR ranges.

	## Environmental Impact

	* Compute: Trained on GPU (NVIDIA T4/P100 class) via Kaggle Kernels.
	* Training Time: Under 5 minutes, owing to GPU acceleration and efficient gradient boosting implementations.

	-----