---
language: en
pipeline_tag: tabular-regression
library_name: autogluon
tags:
- autogluon
- tabular-regression
- regression
- automl
- aws-sagemaker
- udacity
- kaggle
- bike-sharing-demand
- time-series
- feature-engineering
metrics:
- rmse
- rmsle
model-index:
- name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
results:
- task:
type: tabular-regression
name: Tabular Regression
dataset:
name: Kaggle Bike Sharing Demand (train.csv / test.csv)
type: csv
metrics:
- name: Validation RMSE (best run, internal AutoGluon validation)
type: rmse
value: 39.953761
- name: Kaggle Public Score (RMSLE, best submission)
type: rmsle
value: 0.49145
---
# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)
This model predicts hourly bike rental demand (the target column `count`) from structured historical weather and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow is based on the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.
Repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon
## Model Details
- Developed by: brej-29
- Model type: AutoGluon `TabularPredictor` (tabular regression)
- Target label: `count`
- Problem type: regression
- Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for best validation performance.
- Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)
## Intended Use
- Educational / portfolio demonstration of:
- Kaggle-style regression workflow
- AutoML with AutoGluon
- Feature engineering from datetime fields
- Hyperparameter optimization (HPO) experiments
- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset
Out of scope:
- Production forecasting without monitoring, retraining strategy, and strong input validation
- High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis
## Training Data
Dataset: Kaggle “Bike Sharing Demand”
Typical columns include:
- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Leakage columns present in train but not in test: `casual`, `registered`
- Target: `count`
Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared log error). The project tracks Kaggle submission scores alongside offline validation metrics.
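For reference, RMSLE is the root mean squared error computed on log-transformed values, `log(1 + y)`. A minimal NumPy sketch (the metric definition is Kaggle's; the function name here is our own):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, as scored on the Kaggle leaderboard."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# The log transform makes the penalty asymmetric: under-predicting a count
# of 100 by 50 costs more than over-predicting it by 50.
print(rmsle([100], [50]))
print(rmsle([100], [150]))
```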
## Preprocessing and Feature Engineering
- `datetime` is parsed as a datetime type.
- Leakage prevention:
- The notebook sets `ignored_columns = ["casual", "registered"]` because they are not available in the Kaggle test set and would cause leakage if used.
- Feature engineering experiment:
- Additional time-derived features were created from `datetime`:
- `year`, `month`, `day`, `hour`
- These were used in a follow-up training run to measure impact on performance.
- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
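The time-derived features above can be built with pandas’ `.dt` accessor; a minimal sketch (the sample timestamps are illustrative, not from the dataset):

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-12-19 17:00:00"]})

# Parse the raw strings, then derive the four features used in the
# follow-up training run.
df["datetime"] = pd.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["day"] = df["datetime"].dt.day
df["hour"] = df["datetime"].dt.hour

print(df[["year", "month", "day", "hour"]])
```

The same transformation must be applied to the Kaggle test set (and to any inference input) so train and predict schemas stay identical.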
## Training Procedure
Base configuration used in the notebook:
- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
- Preset: `best_quality`
- Time limit: 600 seconds (10 minutes)
- Bagging: enabled in best-quality preset (notebook run shows bagging with 8 folds in the fit summary)
Hyperparameter optimization (HPO) run:
- Search controlled via `hyperparameter_tune_kwargs`:
- `num_trials = 20`
- `searcher = "auto"`
- `scheduler = "local"`
- Hyperparameters were provided for:
- GBM (including extra-trees style trials + a larger preset config)
- XT (ExtraTrees)
- XGB (XGBoost)
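Putting the pieces above together, the fit call looks roughly like the following sketch. It is a configuration outline, not the notebook verbatim: `train_df` is a placeholder for the prepared training DataFrame, and the per-model hyperparameter dictionaries for GBM/XT/XGB are elided.

```python
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
)

predictor.fit(
    train_data=train_df,          # Kaggle train.csv with leakage columns excluded
    presets="best_quality",       # enables bagging/stacking
    time_limit=600,               # 10 minutes
    hyperparameter_tune_kwargs={
        "num_trials": 20,
        "searcher": "auto",
        "scheduler": "local",
    },
)
```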
## Evaluation
Important note about AutoGluon leaderboard scores:
- AutoGluon’s leaderboard displays metrics in “higher is better” format.
- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so you can interpret:
- Validation RMSE ≈ absolute value of `score_val`
Offline validation (AutoGluon internal validation; best run from the notebook):
- Best validation `score_val`: -39.953761 (root_mean_squared_error)
- Interpreted validation RMSE: 39.953761
Kaggle public leaderboard (submissions generated from notebook):
- Initial submission RMSLE: 1.42139
- With added features submission RMSLE: 1.41560
- With HPO submission RMSLE: 0.49145
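Since `count` is non-negative and RMSLE takes a log of the predictions, Kaggle submissions for this competition typically clip negative regression outputs to zero before writing `submission.csv`. A sketch with hypothetical prediction values:

```python
import pandas as pd

# Hypothetical raw model outputs for three test rows; a valid submission
# requires non-negative counts.
preds = pd.Series([12.7, -3.2, 85.0])

submission = pd.DataFrame({
    "datetime": [
        "2011-01-20 00:00:00",
        "2011-01-20 01:00:00",
        "2011-01-20 02:00:00",
    ],
    "count": preds.clip(lower=0),  # floor negatives at zero
})
submission.to_csv("submission.csv", index=False)
print(submission)
```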
## How to Use
Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo.
Example inference pattern:
```python
import pandas as pd
from huggingface_hub import snapshot_download
from autogluon.tabular import TabularPredictor

repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

# Download the whole repo snapshot (works well for AutoGluon folders)
local_dir = snapshot_download(repo_id=repo_id)

# Point this to the directory that contains the AutoGluon predictor artifacts
predictor = TabularPredictor.load(local_dir)

# Example input (use correct values and columns)
X = pd.DataFrame([{
    "datetime": "2012-12-19 17:00:00",
    "season": 4,
    "holiday": 0,
    "workingday": 1,
    "weather": 1,
    "temp": 10.0,
    "atemp": 12.0,
    "humidity": 60,
    "windspeed": 15.0,
}])

preds = predictor.predict(X)
print(float(preds.iloc[0]))
```
If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.
## Input Requirements
- Input must be a tabular dataframe (pandas DataFrame recommended).
- Required columns should match the Kaggle test schema used for training:
- `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Do not include the ignored leakage columns at inference:
- `casual`, `registered`
- If using engineered datetime columns in your final training run, ensure consistent feature generation:
- `year`, `month`, `day`, `hour`
- Datatypes:
- numeric columns should be valid numeric types (int/float)
- missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended)
## Bias, Risks, and Limitations
- This model is trained on a specific city/time period dataset; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
- Kaggle data can contain seasonal/holiday patterns that may not generalize.
- RMSLE penalizes under-prediction more heavily than over-prediction of the same magnitude; depending on your application, you may need different objectives/metrics.
- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.
## Environmental Impact
AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). Compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.
## Technical Specifications
- Framework: AutoGluon Tabular (`TabularPredictor`)
- Task: Tabular regression
- Eval metric used in training: root mean squared error (RMSE)
- Ensembling: weighted ensemble over base learners may be used (AutoGluon best-quality preset)
## Model Card Authors
- BrejBala
## Contact
For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon