---
language: en
pipeline_tag: tabular-regression
library_name: autogluon
tags:
- autogluon
- tabular-regression
- regression
- automl
- aws-sagemaker
- udacity
- kaggle
- bike-sharing-demand
- time-series
- feature-engineering
metrics:
- rmse
- rmsle
model-index:
- name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
results:
- task:
type: tabular-regression
name: Tabular Regression
dataset:
name: Kaggle Bike Sharing Demand (train.csv / test.csv)
type: csv
metrics:
- name: Validation RMSE (best run, internal AutoGluon validation)
type: rmse
value: 39.953761
- name: Kaggle Public Score (RMSLE, best submission)
type: rmsle
value: 0.49145
---
# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)
This model predicts hourly bike rental demand (the target column `count`) from structured historical weather and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow is based on the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.
Repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon
## Model Details
- Developed by: brej-29
- Model type: AutoGluon `TabularPredictor` (tabular regression)
- Target label: `count`
- Problem type: regression
- Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for best validation performance.
- Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)
## Intended Use
- Educational / portfolio demonstration of:
- Kaggle-style regression workflow
- AutoML with AutoGluon
- Feature engineering from datetime fields
- Hyperparameter optimization (HPO) experiments
- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset
Out of scope:
- Production forecasting without monitoring, retraining strategy, and strong input validation
- High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis
## Training Data
Dataset: Kaggle “Bike Sharing Demand”
Typical columns include:
- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Leakage columns present in train but not in test: `casual`, `registered`
- Target: `count`
Note: The Kaggle competition evaluates submissions using RMSLE (root mean squared log error). The project tracks Kaggle submission scores alongside offline validation metrics.
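For reference, RMSLE is the root mean squared error computed on log-transformed values, `log(1 + y)`. A minimal NumPy sketch (the metric definition is Kaggle's; the function name here is our own):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, as scored on the Kaggle leaderboard."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# The log transform makes the penalty asymmetric: under-predicting a count
# of 100 by 50 costs more than over-predicting it by 50.
print(rmsle([100], [50]))
print(rmsle([100], [150]))
```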
## Preprocessing and Feature Engineering
- `datetime` is parsed as a datetime type.
- Leakage prevention:
- The notebook sets `ignored_columns = ["casual", "registered"]` because they are not available in the Kaggle test set and would cause leakage if used.
- Feature engineering experiment:
- Additional time-derived features were created from `datetime`:
- `year`, `month`, `day`, `hour`
- These were used in a follow-up training run to measure impact on performance.
- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
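The time-derived features above can be built with pandas’ `.dt` accessor; a minimal sketch (the sample timestamps are illustrative, not from the dataset):

```python
import pandas as pd

df = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-12-19 17:00:00"]})

# Parse the raw strings, then derive the four features used in the
# follow-up training run.
df["datetime"] = pd.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year
df["month"] = df["datetime"].dt.month
df["day"] = df["datetime"].dt.day
df["hour"] = df["datetime"].dt.hour

print(df[["year", "month", "day", "hour"]])
```

The same transformation must be applied to the Kaggle test set (and to any inference input) so train and predict schemas stay identical.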
## Training Procedure
Base configuration used in the notebook:
- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
- Preset: `best_quality`
- Time limit: 600 seconds (10 minutes)
- Bagging: enabled in best-quality preset (notebook run shows bagging with 8 folds in the fit summary)
Hyperparameter optimization (HPO) run:
- Search controlled via `hyperparameter_tune_kwargs`:
- `num_trials = 20`
- `searcher = "auto"`
- `scheduler = "local"`
- Hyperparameters were provided for:
- GBM (including extra-trees style trials + a larger preset config)
- XT (ExtraTrees)
- XGB (XGBoost)
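Putting the pieces above together, the fit call looks roughly like the following sketch. It is a configuration outline, not the notebook verbatim: `train_df` is a placeholder for the prepared training DataFrame, and the per-model hyperparameter dictionaries for GBM/XT/XGB are elided.

```python
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
)

predictor.fit(
    train_data=train_df,          # Kaggle train.csv with leakage columns excluded
    presets="best_quality",       # enables bagging/stacking
    time_limit=600,               # 10 minutes
    hyperparameter_tune_kwargs={
        "num_trials": 20,
        "searcher": "auto",
        "scheduler": "local",
    },
)
```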
## Evaluation
Important note about AutoGluon leaderboard scores:
- AutoGluon’s leaderboard displays metrics in “higher is better” format.
- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so you can interpret:
- Validation RMSE ≈ absolute value of `score_val`
Offline validation (AutoGluon internal validation; best run from the notebook):
- Best validation `score_val`: -39.953761 (root_mean_squared_error)
- Interpreted validation RMSE: 39.953761
Kaggle public leaderboard (submissions generated from notebook):
- Initial submission RMSLE: 1.42139
- With added features submission RMSLE: 1.41560
- With HPO submission RMSLE: 0.49145
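Since `count` is non-negative and RMSLE takes a log of the predictions, Kaggle submissions for this competition typically clip negative regression outputs to zero before writing `submission.csv`. A sketch with hypothetical prediction values:

```python
import pandas as pd

# Hypothetical raw model outputs for three test rows; a valid submission
# requires non-negative counts.
preds = pd.Series([12.7, -3.2, 85.0])

submission = pd.DataFrame({
    "datetime": [
        "2011-01-20 00:00:00",
        "2011-01-20 01:00:00",
        "2011-01-20 02:00:00",
    ],
    "count": preds.clip(lower=0),  # floor negatives at zero
})
submission.to_csv("submission.csv", index=False)
print(submission)
```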
## How to Use
Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo.
Example inference pattern:
```python
import pandas as pd
from huggingface_hub import snapshot_download
from autogluon.tabular import TabularPredictor

repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

# Download the whole repo snapshot (works well for AutoGluon folders)
local_dir = snapshot_download(repo_id=repo_id)

# Point this to the directory that contains the AutoGluon predictor artifacts
predictor = TabularPredictor.load(local_dir)

# Example input (use correct values and columns)
X = pd.DataFrame([{
    "datetime": "2012-12-19 17:00:00",
    "season": 4,
    "holiday": 0,
    "workingday": 1,
    "weather": 1,
    "temp": 10.0,
    "atemp": 12.0,
    "humidity": 60,
    "windspeed": 15.0,
}])

preds = predictor.predict(X)
print(float(preds.iloc[0]))
```
If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.
## Input Requirements
- Input must be a tabular dataframe (pandas DataFrame recommended).
- Required columns should match the Kaggle test schema used for training:
- `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Do not include the ignored leakage columns at inference:
- `casual`, `registered`
- If using engineered datetime columns in your final training run, ensure consistent feature generation:
- `year`, `month`, `day`, `hour`
- Datatypes:
- numeric columns should be valid numeric types (int/float)
- missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended)
## Bias, Risks, and Limitations
- This model is trained on a specific city/time period dataset; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
- Kaggle data can contain seasonal/holiday patterns that may not generalize.
- RMSLE penalizes under-prediction more heavily than over-prediction of the same magnitude; depending on your application, you may need different objectives/metrics.
- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.
## Environmental Impact
AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). Compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.
## Technical Specifications
- Framework: AutoGluon Tabular (`TabularPredictor`)
- Task: Tabular regression
- Eval metric used in training: root mean squared error (RMSE)
- Ensembling: weighted ensemble over base learners may be used (AutoGluon best-quality preset)
## Model Card Authors
- BrejBala
## Contact
For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon