---
language: en
pipeline_tag: tabular-regression
library_name: autogluon
tags:
  - autogluon
  - tabular-regression
  - regression
  - automl
  - aws-sagemaker
  - udacity
  - kaggle
  - bike-sharing-demand
  - time-series
  - feature-engineering
metrics:
  - rmse
  - rmsle
model-index:
  - name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
    results:
      - task:
          type: tabular-regression
          name: Tabular Regression
        dataset:
          name: Kaggle Bike Sharing Demand (train.csv / test.csv)
          type: csv
        metrics:
          - name: Validation RMSE (best run, internal AutoGluon validation)
            type: rmse
            value: 39.953761
          - name: Kaggle Public Score (RMSLE, best submission)
            type: rmsle
            value: 0.49145
---

# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)

This model predicts hourly bike rental demand (the target column `count`) from structured historical usage, weather, and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow follows the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.

Repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon

## Model Details

- Developed by: brej-29
- Model type: AutoGluon `TabularPredictor` (tabular regression)
- Target label: `count`
- Problem type: regression
- Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for best validation performance.
- Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)

## Intended Use

- Educational / portfolio demonstration of:
  - Kaggle-style regression workflow
  - AutoML with AutoGluon
  - Feature engineering from datetime fields
  - Hyperparameter optimization (HPO) experiments
- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset

Out of scope:
- Production forecasting without monitoring, retraining strategy, and strong input validation
- High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis

## Training Data

Dataset: Kaggle “Bike Sharing Demand”

Typical columns include:
- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Leakage columns present in train but not in test: `casual`, `registered`
- Target: `count`

Note: The Kaggle competition evaluates submissions using RMSLE (Root Mean Squared Logarithmic Error). The project tracks Kaggle submission scores alongside offline validation metrics.
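For reference, RMSLE can be computed as below. This is a minimal sketch of the standard formula; the competition's official scoring implementation may differ in edge-case handling.

```python
import math

def rmsle(predictions, actuals):
    """Root Mean Squared Logarithmic Error: sqrt(mean((log(p+1) - log(a+1))^2))."""
    squared_log_errors = [
        (math.log(p + 1) - math.log(a + 1)) ** 2
        for p, a in zip(predictions, actuals)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Identical predictions give an error of zero:
print(rmsle([10, 20, 30], [10, 20, 30]))  # → 0.0
```

The `+ 1` inside the logs keeps zero counts valid, and the log transform is why RMSLE penalizes relative (rather than absolute) errors.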

## Preprocessing and Feature Engineering

- `datetime` is parsed as a datetime type.
- Leakage prevention:
  - The notebook sets `ignored_columns = ["casual", "registered"]` because they are not available in the Kaggle test set and would cause leakage if used.
- Feature engineering experiment:
  - Additional time-derived features were created from `datetime`:
    - `year`, `month`, `day`, `hour`
  - These were used in a follow-up training run to measure impact on performance.
- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
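The feature-engineering step above can be sketched as follows. Column names match the dataset schema; the exact code in the notebook may differ.

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive year/month/day/hour columns from the raw `datetime` column."""
    df = df.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["hour"] = df["datetime"].dt.hour
    return df

train = pd.DataFrame({"datetime": ["2012-12-19 17:00:00"]})
train = add_datetime_features(train)
```

Apply the same function to both train and test frames so the engineered columns stay consistent across training and inference.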

## Training Procedure

Base configuration used in the notebook:
- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
- Preset: `best_quality`
- Time limit: 600 seconds (10 minutes)
- Bagging: enabled in best-quality preset (notebook run shows bagging with 8 folds in the fit summary)

Hyperparameter optimization (HPO) run:
- Search controlled via `hyperparameter_tune_kwargs`:
  - `num_trials = 20`
  - `searcher = "auto"`
  - `scheduler = "local"`
- Hyperparameters were provided for:
  - GBM (including extra-trees-style trials and a larger preset configuration)
  - XT (ExtraTrees)
  - XGB (XGBoost)
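The two training runs described above might look roughly like this. This is a sketch reconstructed from the notebook description; the concrete hyperparameter search spaces (shown empty here) are assumptions, not the notebook's actual values.

```python
from autogluon.tabular import TabularPredictor

# Baseline run: best-quality preset, 10-minute budget
predictor = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train_df,   # pandas DataFrame with the training columns
    presets="best_quality",
    time_limit=600,
)

# HPO run: restrict model families and tune them with 20 trials
predictor_hpo = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train_df,
    time_limit=600,
    hyperparameters={
        "GBM": {},  # LightGBM (search space elided)
        "XT": {},   # ExtraTrees
        "XGB": {},  # XGBoost
    },
    hyperparameter_tune_kwargs={
        "num_trials": 20,
        "searcher": "auto",
        "scheduler": "local",
    },
)
```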

## Evaluation

Important note about AutoGluon leaderboard scores:
- AutoGluon’s leaderboard displays metrics in “higher is better” format.
- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so you can interpret:
  - Validation RMSE ≈ absolute value of `score_val`

Offline validation (AutoGluon internal validation; best run from the notebook):
- Best validation `score_val`: -39.953761 (root_mean_squared_error)
- Interpreted validation RMSE: 39.953761
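Concretely, recovering the validation RMSE from a leaderboard `score_val` is just a sign flip (the value here is hard-coded from the run above for illustration):

```python
# AutoGluon reports error metrics in "higher is better" form, i.e. negated.
score_val = -39.953761            # best model's score_val from predictor.leaderboard()
validation_rmse = abs(score_val)  # undo the sign flip
print(validation_rmse)            # → 39.953761
```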

Kaggle public leaderboard (submissions generated from notebook):
- Initial submission RMSLE: 1.42139
- With added features submission RMSLE: 1.41560
- With HPO submission RMSLE: 0.49145

## How to Use

Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo.

Example inference pattern:

    import pandas as pd
    from huggingface_hub import snapshot_download
    from autogluon.tabular import TabularPredictor

    repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

    # Download the whole repo snapshot (works well for AutoGluon folders)
    local_dir = snapshot_download(repo_id=repo_id)

    # Point this to the directory that contains the AutoGluon predictor artifacts
    predictor = TabularPredictor.load(local_dir)

    # Example input (use correct values and columns)
    X = pd.DataFrame([{
        "datetime": "2012-12-19 17:00:00",
        "season": 4,
        "holiday": 0,
        "workingday": 1,
        "weather": 1,
        "temp": 10.0,
        "atemp": 12.0,
        "humidity": 60,
        "windspeed": 15.0
    }])

    preds = predictor.predict(X)
    print(float(preds.iloc[0]))

If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.

## Input Requirements

- Input must be a tabular dataframe (pandas DataFrame recommended).
- Required columns should match the Kaggle test schema used for training:
  - `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Do not include the ignored leakage columns at inference:
  - `casual`, `registered`
- If using engineered datetime columns in your final training run, ensure consistent feature generation:
  - `year`, `month`, `day`, `hour`
- Datatypes:
  - numeric columns should be valid numeric types (int/float)
  - missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended)
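A lightweight pre-flight check along these lines can catch schema mismatches before calling `predict()`. This is a hypothetical helper, not part of the repository:

```python
REQUIRED_COLUMNS = {
    "datetime", "season", "holiday", "workingday",
    "weather", "temp", "atemp", "humidity", "windspeed",
}
LEAKAGE_COLUMNS = {"casual", "registered"}

def check_schema(columns) -> None:
    """Raise if required columns are missing or leakage columns are present."""
    cols = set(columns)
    missing = REQUIRED_COLUMNS - cols
    leaked = LEAKAGE_COLUMNS & cols
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    if leaked:
        raise ValueError(f"Leakage columns must be dropped: {sorted(leaked)}")

check_schema(REQUIRED_COLUMNS)  # the exact training schema passes
```

Call it with `df.columns` right before inference; extend `REQUIRED_COLUMNS` with `year`, `month`, `day`, `hour` if your final run trained on the engineered features.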

## Bias, Risks, and Limitations

- This model is trained on a specific city/time period dataset; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
- Kaggle data can contain seasonal/holiday patterns that may not generalize.
- RMSLE heavily penalizes under-prediction at higher counts; depending on your application, you may need different objectives/metrics.
- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.

## Environmental Impact

AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). Compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.

## Technical Specifications

- Framework: AutoGluon Tabular (`TabularPredictor`)
- Task: Tabular regression
- Eval metric used in training: root mean squared error (RMSE)
- Ensembling: weighted ensemble over base learners may be used (AutoGluon best-quality preset)

## Model Card Authors

- BrejBala

## Contact

For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon