---
language: en
pipeline_tag: tabular-regression
library_name: autogluon
tags:
- autogluon
- tabular-regression
- regression
- automl
- aws-sagemaker
- udacity
- kaggle
- bike-sharing-demand
- time-series
- feature-engineering
metrics:
- rmse
- rmsle
model-index:
- name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
  results:
  - task:
      type: tabular-regression
      name: Tabular Regression
    dataset:
      name: Kaggle Bike Sharing Demand (train.csv / test.csv)
      type: csv
    metrics:
    - name: Validation RMSE (best run, internal AutoGluon validation)
      type: rmse
      value: 39.953761
    - name: Kaggle Public Score (RMSLE, best submission)
      type: rmsle
      value: 0.49145
---
# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)

This model predicts hourly bike rental demand (the target column `count`) from structured historical, weather, and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow is based on the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.

Repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon

## Model Details

- Developed by: brej-29
- Model type: AutoGluon `TabularPredictor` (tabular regression)
- Target label: `count`
- Problem type: regression
- Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for the best validation performance.
- Training environment: notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)

## Intended Use

- Educational / portfolio demonstration of:
  - Kaggle-style regression workflow
  - AutoML with AutoGluon
  - Feature engineering from datetime fields
  - Hyperparameter optimization (HPO) experiments
- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset

Out of scope:

- Production forecasting without monitoring, a retraining strategy, and strong input validation
- High-stakes operational decision-making (e.g., staffing, pricing) without deeper evaluation and error analysis

## Training Data

Dataset: Kaggle “Bike Sharing Demand”

Typical columns include:

- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Leakage columns present in train but not in test: `casual`, `registered`
- Target: `count`

Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared logarithmic error). The project tracks Kaggle submission scores alongside offline validation metrics.

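
For reference, the competition’s RMSLE metric follows the standard definition

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}
$$

where predicted and actual counts are shifted by 1 before taking logarithms, so zero counts are handled and errors are measured on a relative (log) scale rather than an absolute one.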
## Preprocessing and Feature Engineering

- `datetime` is parsed as a datetime type.
- Leakage prevention:
  - The notebook sets `ignored_columns = ["casual", "registered"]` because they are not available in the Kaggle test set and would cause leakage if used.
- Feature engineering experiment (see the sketch after this list):
  - Additional time-derived features were created from `datetime`: `year`, `month`, `day`, `hour`.
  - These were used in a follow-up training run to measure the impact on performance.
- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).

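
Below is a minimal sketch of the feature-engineering step described above. The helper name `add_datetime_features` and the local CSV paths are illustrative assumptions, not the notebook’s exact code.

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive simple calendar features from the `datetime` column."""
    df = df.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["hour"] = df["datetime"].dt.hour
    return df

# Assumes the Kaggle train/test CSVs are in the working directory.
train = add_datetime_features(pd.read_csv("train.csv"))
test = add_datetime_features(pd.read_csv("test.csv"))

# `casual` and `registered` exist only in train.csv; dropping them here
# (or listing them in AutoGluon's ignored_columns) prevents leakage.
train = train.drop(columns=["casual", "registered"])
```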
## Training Procedure

Base configuration used in the notebook (a fit-call sketch follows this list):

- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
- Preset: `best_quality`
- Time limit: 600 seconds (10 minutes)
- Bagging: enabled by the `best_quality` preset (the notebook’s fit summary shows bagging with 8 folds)

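
A minimal sketch of the base training call under the configuration above, assuming a `train` DataFrame like the one in the feature-engineering sketch; the notebook’s exact code may differ.

```python
from autogluon.tabular import TabularPredictor

# `train` holds the Kaggle training data with leakage columns already removed.
predictor = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train,
    presets="best_quality",  # enables bagging/ensembling of multiple model families
    time_limit=600,          # 10-minute budget, as in the notebook
)
```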
Hyperparameter optimization (HPO) run (a code sketch follows this list):

- Search controlled via `hyperparameter_tune_kwargs`:
  - `num_trials = 20`
  - `searcher = "auto"`
  - `scheduler = "local"`
- Hyperparameters were provided for:
  - GBM (including extra-trees-style trials plus a larger preset config)
  - XT (ExtraTrees)
  - XGB (XGBoost)

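
A minimal sketch of an HPO run with these settings. The import path assumes a recent AutoGluon release, and the search spaces and the `predictor_hpo` name are illustrative placeholders rather than the notebook’s actual values.

```python
from autogluon.common import space
from autogluon.tabular import TabularPredictor

# Illustrative search spaces; the notebook's exact ranges are not reproduced here.
hyperparameters = {
    "GBM": {
        "num_leaves": space.Int(24, 66),
        "learning_rate": space.Real(0.01, 0.2, log=True),
    },
    "XT": {},  # ExtraTrees with default settings
    "XGB": {
        "max_depth": space.Int(4, 10),
        "learning_rate": space.Real(0.01, 0.2, log=True),
    },
}

predictor_hpo = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train,
    time_limit=600,
    hyperparameters=hyperparameters,
    hyperparameter_tune_kwargs={
        "num_trials": 20,
        "searcher": "auto",
        "scheduler": "local",
    },
)
```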
## Evaluation

Important note about AutoGluon leaderboard scores (a short sketch follows this list):

- AutoGluon’s leaderboard displays metrics in “higher is better” format.
- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so you can interpret:
  - Validation RMSE ≈ absolute value of `score_val`

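
A small sketch of reading the validation score back from a trained predictor, reusing the `predictor` from the training sketch above.

```python
# For RMSE, AutoGluon reports `score_val` as negative RMSE (higher is better).
lb = predictor.leaderboard()

best_score_val = lb["score_val"].max()  # e.g. -39.953761 for the best run
print("Best validation RMSE:", abs(best_score_val))
```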
Offline validation (AutoGluon internal validation; best run from the notebook):

- Best validation `score_val`: -39.953761 (root_mean_squared_error)
- Interpreted validation RMSE: 39.953761

Kaggle public leaderboard (submissions generated from the notebook; a submission sketch follows this list):

- Initial submission RMSLE: 1.42139
- With added features, RMSLE: 1.41560
- With HPO, RMSLE: 0.49145

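
A minimal sketch of how a submission file could be generated. Clipping negative predictions to zero and the `sampleSubmission.csv` template reflect common practice for this competition and are assumptions here, not code copied from the notebook.

```python
import pandas as pd

# Predict on the engineered Kaggle test set from the earlier sketch.
predictions = predictor.predict(test)

# RMSLE requires non-negative values, so clip negative regression outputs to zero.
predictions = predictions.clip(lower=0)

submission = pd.read_csv("sampleSubmission.csv", parse_dates=["datetime"])
submission["count"] = predictions.values
submission.to_csv("submission.csv", index=False)
```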
## How to Use

Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo; one upload route is sketched below.

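
One way to do this with `huggingface_hub` (the repo id is a placeholder, and the folder path should point at your actual run directory):

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are already logged in, e.g. via `huggingface-cli login`
api.upload_folder(
    folder_path="AutogluonModels/<run_name>",  # the predictor directory from training
    repo_id="YOUR_USERNAME/YOUR_MODEL_REPO",
    repo_type="model",
)
```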
Example inference pattern:

    import pandas as pd
    from huggingface_hub import snapshot_download
    from autogluon.tabular import TabularPredictor

    repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

    # Download the whole repo snapshot (works well for AutoGluon folders)
    local_dir = snapshot_download(repo_id=repo_id)

    # Point this to the directory that contains the AutoGluon predictor artifacts
    predictor = TabularPredictor.load(local_dir)

    # Example input (use correct values and columns)
    X = pd.DataFrame([{
        "datetime": "2012-12-19 17:00:00",
        "season": 4,
        "holiday": 0,
        "workingday": 1,
        "weather": 1,
        "temp": 10.0,
        "atemp": 12.0,
        "humidity": 60,
        "windspeed": 15.0
    }])

    preds = predictor.predict(X)
    print(float(preds.iloc[0]))

If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.

## Input Requirements

- Input must be a tabular DataFrame (pandas DataFrame recommended).
- Required columns should match the Kaggle test schema used for training:
  - `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Do not include the ignored leakage columns at inference:
  - `casual`, `registered`
- If engineered datetime columns were used in your final training run, generate them consistently:
  - `year`, `month`, `day`, `hour`
- Datatypes:
  - Numeric columns should be valid numeric types (int/float).
  - Missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended).

## Bias, Risks, and Limitations

- This model is trained on a dataset from a specific city and time period; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
- Kaggle data can contain seasonal/holiday patterns that may not generalize.
- RMSLE penalizes under-prediction more heavily than over-prediction of the same size; depending on your application, you may need different objectives/metrics.
- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.

## Environmental Impact

AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). The compute footprint is modest compared to deep learning workloads, but the `best_quality` preset can still train multiple models and ensembles.

## Technical Specifications

- Framework: AutoGluon Tabular (`TabularPredictor`)
- Task: Tabular regression
- Eval metric used in training: root mean squared error (RMSE)
- Ensembling: a weighted ensemble over base learners may be used (AutoGluon `best_quality` preset)

## Model Card Authors

- BrejBala

## Contact

For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon