---
language: en
pipeline_tag: tabular-regression
library_name: autogluon
tags:
- autogluon
- tabular-regression
- regression
- automl
- aws-sagemaker
- udacity
- kaggle
- bike-sharing-demand
- time-series
- feature-engineering
metrics:
- rmse
- rmsle
model-index:
- name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
results:
- task:
type: tabular-regression
name: Tabular Regression
dataset:
name: Kaggle Bike Sharing Demand (train.csv / test.csv)
type: csv
metrics:
- name: Validation RMSE (best run, internal AutoGluon validation)
type: rmse
value: 39.953761
- name: Kaggle Public Score (RMSLE, best submission)
type: rmsle
value: 0.49145
---
# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)
This model predicts hourly bike rental demand (the target column `count`) from structured historical usage, weather, and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow follows the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.
Repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon
## Model Details
- Developed by: brej-29
- Model type: AutoGluon `TabularPredictor` (tabular regression)
- Target label: `count`
- Problem type: regression
- Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for best validation performance.
- Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)
## Intended Use
- Educational / portfolio demonstration of:
- Kaggle-style regression workflow
- AutoML with AutoGluon
- Feature engineering from datetime fields
- Hyperparameter optimization (HPO) experiments
- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset
Out of scope:
- Production forecasting without monitoring, retraining strategy, and strong input validation
- High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis
## Training Data
Dataset: Kaggle “Bike Sharing Demand”
Typical columns include:
- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Leakage columns present in train but not in test: `casual`, `registered`
- Target: `count`
Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared log error). The project tracks Kaggle submission scores alongside offline validation metrics.
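For reference, RMSLE can be computed from predictions and ground truth as follows (a minimal NumPy sketch; the clipping of negative predictions is a common convention for this competition, not something taken verbatim from the notebook):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, as used by the Kaggle leaderboard."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p handles zero counts; predictions are clipped at 0 because
    # negative rental counts are not meaningful
    diff = np.log1p(np.clip(y_pred, 0, None)) - np.log1p(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))
```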
## Preprocessing and Feature Engineering
- `datetime` is parsed as a datetime type.
- Leakage prevention:
- The notebook sets `ignored_columns = ["casual", "registered"]` because they are not available in the Kaggle test set and would cause leakage if used.
- Feature engineering experiment:
- Additional time-derived features were created from `datetime`:
- `year`, `month`, `day`, `hour`
- These were used in a follow-up training run to measure impact on performance.
- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
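The time-derived features above can be generated with pandas; a minimal sketch (the helper name is illustrative, column names match the notebook):

```python
import pandas as pd

def add_datetime_features(df):
    """Derive year/month/day/hour columns from the raw datetime column."""
    df = df.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["hour"] = df["datetime"].dt.hour
    return df
```

Apply the same transformation to both the training and test frames so the schemas stay consistent.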
## Training Procedure
Base configuration used in the notebook:
- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
- Preset: `best_quality`
- Time limit: 600 seconds (10 minutes)
- Bagging: enabled in best-quality preset (notebook run shows bagging with 8 folds in the fit summary)
Hyperparameter optimization (HPO) run:
- Search controlled via `hyperparameter_tune_kwargs`:
- `num_trials = 20`
- `searcher = "auto"`
- `scheduler = "local"`
- Hyperparameters were provided for:
- GBM (including extra-trees style trials + a larger preset config)
- XT (ExtraTrees)
- XGB (XGBoost)
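Assembled as Python dictionaries, the search configuration looks roughly like this (a sketch reconstructed from the summary above; the per-model search spaces are illustrative placeholders, not the notebook’s exact values):

```python
# Model-specific hyperparameters passed to predictor.fit(...)
hyperparameters = {
    "GBM": [
        {"extra_trees": True, "ag_args": {"name_suffix": "XT"}},  # extra-trees style trial
        {},           # default LightGBM
        "GBMLarge",   # larger preset config
    ],
    "XT": {},    # ExtraTrees with default search space
    "XGB": {},   # XGBoost with default search space
}

# Search controls for hyperparameter optimization
hyperparameter_tune_kwargs = {
    "num_trials": 20,
    "searcher": "auto",
    "scheduler": "local",
}

# The fit call would then look like (requires autogluon installed):
# predictor = TabularPredictor(label="count", problem_type="regression",
#                              eval_metric="root_mean_squared_error")
# predictor.fit(train_data, time_limit=600,
#               hyperparameters=hyperparameters,
#               hyperparameter_tune_kwargs=hyperparameter_tune_kwargs)
```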
## Evaluation
Important note about AutoGluon leaderboard scores:
- AutoGluon’s leaderboard displays metrics in “higher is better” format.
- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so you can interpret:
- Validation RMSE ≈ absolute value of `score_val`
Offline validation (AutoGluon internal validation; best run from the notebook):
- Best validation `score_val`: -39.953761 (root_mean_squared_error)
- Interpreted validation RMSE: 39.953761
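In code, converting the leaderboard score back to RMSE is just a sign flip:

```python
# AutoGluon reports score_val in higher-is-better form, so for RMSE the
# displayed value is negated; flip the sign to recover the actual RMSE
best_score_val = -39.953761  # e.g. from predictor.leaderboard()
validation_rmse = -best_score_val
```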
Kaggle public leaderboard (submissions generated from notebook):
- Initial submission RMSLE: 1.42139
- With added features submission RMSLE: 1.41560
- With HPO submission RMSLE: 0.49145
## How to Use
Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo.
Example inference pattern:
```python
import pandas as pd
from huggingface_hub import snapshot_download
from autogluon.tabular import TabularPredictor

repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

# Download the whole repo snapshot (works well for AutoGluon folders)
local_dir = snapshot_download(repo_id=repo_id)

# Point this to the directory that contains the AutoGluon predictor artifacts
predictor = TabularPredictor.load(local_dir)

# Example input (use correct values and columns)
X = pd.DataFrame([{
    "datetime": "2012-12-19 17:00:00",
    "season": 4,
    "holiday": 0,
    "workingday": 1,
    "weather": 1,
    "temp": 10.0,
    "atemp": 12.0,
    "humidity": 60,
    "windspeed": 15.0,
}])

preds = predictor.predict(X)
print(float(preds.iloc[0]))
```
If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.
## Input Requirements
- Input must be a tabular dataframe (pandas DataFrame recommended).
- Required columns should match the Kaggle test schema used for training:
- `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Do not include the ignored leakage columns at inference:
- `casual`, `registered`
- If using engineered datetime columns in your final training run, ensure consistent feature generation:
- `year`, `month`, `day`, `hour`
- Datatypes:
- numeric columns should be valid numeric types (int/float)
- missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended)
## Bias, Risks, and Limitations
- This model is trained on a specific city/time period dataset; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
- Kaggle data can contain seasonal/holiday patterns that may not generalize.
- RMSLE heavily penalizes under-prediction at higher counts; depending on your application, you may need different objectives/metrics.
- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.
## Environmental Impact
AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). Compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.
## Technical Specifications
- Framework: AutoGluon Tabular (`TabularPredictor`)
- Task: Tabular regression
- Eval metric used in training: root mean squared error (RMSE)
- Ensembling: weighted ensemble over base learners may be used (AutoGluon best-quality preset)
## Model Card Authors
- BrejBala
## Contact
For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon