---
language: en
pipeline_tag: tabular-regression
library_name: autogluon
tags:
  - autogluon
  - tabular-regression
  - regression
  - automl
  - aws-sagemaker
  - udacity
  - kaggle
  - bike-sharing-demand
  - time-series
  - feature-engineering
metrics:
  - rmse
  - rmsle
model-index:
  - name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
    results:
      - task:
          type: tabular-regression
          name: Tabular Regression
        dataset:
          name: Kaggle Bike Sharing Demand (train.csv / test.csv)
          type: csv
        metrics:
          - name: Validation RMSE (best run, internal AutoGluon validation)
            type: rmse
            value: 39.953761
          - name: Kaggle Public Score (RMSLE, best submission)
            type: rmsle
            value: 0.49145
---

# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)

This model predicts hourly bike rental demand (the target column `count`) from structured historical weather and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression).

The workflow is based on the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.

Repository: https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon

## Model Details

- Developed by: brej-29
- Model type: AutoGluon `TabularPredictor` (tabular regression)
- Target label: `count`
- Problem type: regression
- Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for best validation performance.
- Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)

## Intended Use

- Educational / portfolio demonstration of:
  - Kaggle-style regression workflow
  - AutoML with AutoGluon
  - Feature engineering from datetime fields
  - Hyperparameter optimization (HPO) experiments
- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset

Out of scope:

- Production forecasting without monitoring, a retraining strategy, and strong input validation
- High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis

## Training Data

Dataset: Kaggle “Bike Sharing Demand”

Typical columns include:

- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Leakage columns present in train but not in test: `casual`, `registered`
- Target: `count`

Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared logarithmic error). The project tracks Kaggle submission scores alongside offline validation metrics.

## Preprocessing and Feature Engineering

- `datetime` is parsed as a datetime type.
- Leakage prevention: the notebook sets `ignored_columns = ["casual", "registered"]` because these columns are not available in the Kaggle test set and would cause leakage if used.
- Feature engineering experiment:
  - Additional time-derived features were created from `datetime`: `year`, `month`, `day`, `hour`.
  - These were used in a follow-up training run to measure their impact on performance.
- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
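The time-derived columns above can be reproduced with a few lines of pandas. A minimal sketch (the column names match the notebook's experiment, but the helper itself is illustrative, not the notebook's exact code):

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive year/month/day/hour columns from the `datetime` column.

    Illustrative helper; apply the same transformation to both the train
    and test frames so their schemas stay consistent.
    """
    df = df.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["hour"] = df["datetime"].dt.hour
    return df
```

Applying one shared helper to both splits is the simplest way to avoid the train/inference schema drift warned about later in this card.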
## Training Procedure

Base configuration used in the notebook:

- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
- Preset: `best_quality`
- Time limit: 600 seconds (10 minutes)
- Bagging: enabled by the best-quality preset (the notebook's fit summary shows bagging with 8 folds)

Hyperparameter optimization (HPO) run:

- Search controlled via `hyperparameter_tune_kwargs`:
  - `num_trials = 20`
  - `searcher = "auto"`
  - `scheduler = "local"`
- Hyperparameters were provided for:
  - GBM (including extra-trees style trials + a larger preset config)
  - XT (ExtraTrees)
  - XGB (XGBoost)

## Evaluation

Important note about AutoGluon leaderboard scores:

- AutoGluon’s leaderboard displays metrics in “higher is better” format.
- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so validation RMSE ≈ absolute value of `score_val`.

Offline validation (AutoGluon internal validation; best run from the notebook):

- Best validation `score_val`: -39.953761 (root_mean_squared_error)
- Interpreted validation RMSE: 39.953761

Kaggle public leaderboard (submissions generated from the notebook):

- Initial submission RMSLE: 1.42139
- With added features, RMSLE: 1.41560
- With HPO, RMSLE: 0.49145

## How to Use

Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels//`) to your Hugging Face model repo.
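Because Kaggle scores submissions with RMSLE while training optimized RMSE, it can help to sanity-check predictions offline before submitting. A minimal sketch of the competition metric (not part of the original notebook):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, as used on the Kaggle leaderboard.

    Predictions are clipped at zero first: the competition expects
    non-negative counts, and log1p is undefined below -1.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))
```

Kaggle rejects submissions containing negative counts, so the same zero-clipping is commonly applied to the regressor's raw predictions before writing the submission CSV.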
Example inference pattern:

```python
import pandas as pd
from huggingface_hub import snapshot_download
from autogluon.tabular import TabularPredictor

repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

# Download the whole repo snapshot (works well for AutoGluon folders)
local_dir = snapshot_download(repo_id=repo_id)

# Point this to the directory that contains the AutoGluon predictor artifacts
predictor = TabularPredictor.load(local_dir)

# Example input (use correct values and columns)
X = pd.DataFrame([{
    "datetime": "2012-12-19 17:00:00",
    "season": 4,
    "holiday": 0,
    "workingday": 1,
    "weather": 1,
    "temp": 10.0,
    "atemp": 12.0,
    "humidity": 60,
    "windspeed": 15.0
}])

preds = predictor.predict(X)
print(float(preds.iloc[0]))
```

If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.

## Input Requirements

- Input must be a tabular dataframe (a pandas DataFrame is recommended).
- Required columns should match the Kaggle test schema used for training: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Do not include the ignored leakage columns at inference: `casual`, `registered`
- If your final training run used engineered datetime columns, ensure consistent feature generation: `year`, `month`, `day`, `hour`
- Datatypes:
  - Numeric columns should be valid numeric types (int/float).
  - Missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended).

## Bias, Risks, and Limitations

- This model is trained on a dataset from a specific city and time period; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
- Kaggle data can contain seasonal/holiday patterns that may not generalize.
- RMSLE heavily penalizes under-prediction at higher counts; depending on your application, you may need different objectives/metrics.
- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.

## Environmental Impact

AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). The compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.

## Technical Specifications

- Framework: AutoGluon Tabular (`TabularPredictor`)
- Task: Tabular regression
- Eval metric used in training: root mean squared error (RMSE)
- Ensembling: a weighted ensemble over base learners may be used (AutoGluon best-quality preset)

## Model Card Authors

- BrejBala

## Contact

For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon