---
language: en
pipeline_tag: tabular-regression
library_name: autogluon
tags:
  - autogluon
  - tabular-regression
  - regression
  - automl
  - aws-sagemaker
  - udacity
  - kaggle
  - bike-sharing-demand
  - time-series
  - feature-engineering
metrics:
  - rmse
  - rmsle
model-index:
  - name: Bike Sharing Demand Prediction (AutoGluon TabularPredictor)
    results:
      - task:
          type: tabular-regression
          name: Tabular Regression
        dataset:
          name: Kaggle Bike Sharing Demand (train.csv / test.csv)
          type: csv
        metrics:
          - name: Validation RMSE (best run, internal AutoGluon validation)
            type: rmse
            value: 39.953761
          - name: Kaggle Public Score (RMSLE, best submission)
            type: rmsle
            value: 0.49145
---

# 🚲 Bike Sharing Demand Prediction with AutoGluon (Udacity AWS MLE Nanodegree)

This model predicts hourly bike rental demand (the target column `count`) from structured historical usage, weather, and time features using AutoGluon’s `TabularPredictor` (AutoML for tabular regression). The workflow follows the Udacity “Predict Bike Sharing Demand with AutoGluon” project and targets the Kaggle Bike Sharing Demand competition dataset.

Repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon

## Model Details

- Developed by: brej-29
- Model type: AutoGluon `TabularPredictor` (tabular regression)
- Target label: `count`
- Problem type: regression
- Core approach: AutoGluon trains and ensembles multiple models (e.g., ExtraTrees, LightGBM, CatBoost, XGBoost) and may create a weighted ensemble for best validation performance.
- Training environment: Notebook-based workflow (commonly run on AWS SageMaker Studio in the Udacity project setup)

## Intended Use

- Educational / portfolio demonstration of:
  - Kaggle-style regression workflow
  - AutoML with AutoGluon
  - Feature engineering from datetime fields
  - Hyperparameter optimization (HPO) experiments
- Baseline demand forecasting experiments on the Kaggle Bike Sharing dataset

Out of scope:
- Production forecasting without monitoring, retraining strategy, and strong input validation
- High-stakes operational decisioning (e.g., staffing, pricing) without deeper evaluation and error analysis

## Training Data

Dataset: Kaggle “Bike Sharing Demand”

Typical columns include:
- Features: `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Leakage columns present in train but not in test: `casual`, `registered`
- Target: `count`

Note: The Kaggle competition evaluates submissions using RMSLE (Root Mean Squared Logarithmic Error). The project tracks Kaggle submission scores alongside offline validation metrics.
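For reference, RMSLE can be computed as below. This is a minimal sketch of the standard formula; the competition's official scoring implementation may differ in edge-case handling.

```python
import math

def rmsle(predictions, actuals):
    """Root Mean Squared Logarithmic Error: sqrt(mean((log(p+1) - log(a+1))^2))."""
    squared_log_errors = [
        (math.log(p + 1) - math.log(a + 1)) ** 2
        for p, a in zip(predictions, actuals)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Identical predictions give an error of zero:
print(rmsle([10, 20, 30], [10, 20, 30]))  # → 0.0
```

The `+ 1` inside the logs keeps zero counts valid, and the log transform is why RMSLE penalizes relative (rather than absolute) errors.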

## Preprocessing and Feature Engineering

- `datetime` is parsed as a datetime type.
- Leakage prevention:
  - The notebook sets `ignored_columns = ["casual", "registered"]` because they are not available in the Kaggle test set and would cause leakage if used.
- Feature engineering experiment:
  - Additional time-derived features were created from `datetime`:
    - `year`, `month`, `day`, `hour`
  - These were used in a follow-up training run to measure impact on performance.
- AutoGluon also handles datetime features internally (converting datetime into numeric/date parts as needed).
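The feature-engineering step above can be sketched as follows. Column names match the dataset schema; the exact code in the notebook may differ.

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive year/month/day/hour columns from the raw `datetime` column."""
    df = df.copy()
    df["datetime"] = pd.to_datetime(df["datetime"])
    df["year"] = df["datetime"].dt.year
    df["month"] = df["datetime"].dt.month
    df["day"] = df["datetime"].dt.day
    df["hour"] = df["datetime"].dt.hour
    return df

train = pd.DataFrame({"datetime": ["2012-12-19 17:00:00"]})
train = add_datetime_features(train)
```

Apply the same function to both train and test frames so the engineered columns stay consistent across training and inference.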

## Training Procedure

Base configuration used in the notebook:
- `TabularPredictor(label="count", problem_type="regression", eval_metric="root_mean_squared_error")`
- Preset: `best_quality`
- Time limit: 600 seconds (10 minutes)
- Bagging: enabled in best-quality preset (notebook run shows bagging with 8 folds in the fit summary)

Hyperparameter optimization (HPO) run:
- Search controlled via `hyperparameter_tune_kwargs`:
  - `num_trials = 20`
  - `searcher = "auto"`
  - `scheduler = "local"`
- Hyperparameters were provided for:
  - GBM (including extra-trees-style trials and a larger preset configuration)
  - XT (ExtraTrees)
  - XGB (XGBoost)
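The two training runs described above might look roughly like this. This is a sketch reconstructed from the notebook description; the concrete hyperparameter search spaces (shown empty here) are assumptions, not the notebook's actual values.

```python
from autogluon.tabular import TabularPredictor

# Baseline run: best-quality preset, 10-minute budget
predictor = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train_df,   # pandas DataFrame with the training columns
    presets="best_quality",
    time_limit=600,
)

# HPO run: restrict model families and tune them with 20 trials
predictor_hpo = TabularPredictor(
    label="count",
    problem_type="regression",
    eval_metric="root_mean_squared_error",
).fit(
    train_data=train_df,
    time_limit=600,
    hyperparameters={
        "GBM": {},  # LightGBM (search space elided)
        "XT": {},   # ExtraTrees
        "XGB": {},  # XGBoost
    },
    hyperparameter_tune_kwargs={
        "num_trials": 20,
        "searcher": "auto",
        "scheduler": "local",
    },
)
```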

## Evaluation

Important note about AutoGluon leaderboard scores:
- AutoGluon’s leaderboard displays metrics in “higher is better” format.
- For RMSE, the displayed `score_val` is the negative RMSE (sign-flipped), so you can interpret:
  - Validation RMSE ≈ absolute value of `score_val`

Offline validation (AutoGluon internal validation; best run from the notebook):
- Best validation `score_val`: -39.953761 (root_mean_squared_error)
- Interpreted validation RMSE: 39.953761
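Concretely, recovering the validation RMSE from a leaderboard `score_val` is just a sign flip (the value here is hard-coded from the run above for illustration):

```python
# AutoGluon reports error metrics in "higher is better" form, i.e. negated.
score_val = -39.953761            # best model's score_val from predictor.leaderboard()
validation_rmse = abs(score_val)  # undo the sign flip
print(validation_rmse)            # → 39.953761
```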

Kaggle public leaderboard (submissions generated from notebook):
- Initial submission RMSLE: 1.42139
- With added features submission RMSLE: 1.41560
- With HPO submission RMSLE: 0.49145

## How to Use

Recommendation: Upload the entire AutoGluon model directory produced by training (commonly something like `AutogluonModels/<run_name>/`) to your Hugging Face model repo.

Example inference pattern:

    import pandas as pd
    from huggingface_hub import snapshot_download
    from autogluon.tabular import TabularPredictor

    repo_id = "YOUR_USERNAME/YOUR_MODEL_REPO"

    # Download the whole repo snapshot (works well for AutoGluon folders)
    local_dir = snapshot_download(repo_id=repo_id)

    # Point this to the directory that contains the AutoGluon predictor artifacts
    predictor = TabularPredictor.load(local_dir)

    # Example input (use correct values and columns)
    X = pd.DataFrame([{
        "datetime": "2012-12-19 17:00:00",
        "season": 4,
        "holiday": 0,
        "workingday": 1,
        "weather": 1,
        "temp": 10.0,
        "atemp": 12.0,
        "humidity": 60,
        "windspeed": 15.0
    }])

    preds = predictor.predict(X)
    print(float(preds.iloc[0]))

If your trained model expects engineered columns (like `year`, `month`, `day`, `hour`), ensure you create them exactly the same way before calling `predict()`.

## Input Requirements

- Input must be a tabular dataframe (pandas DataFrame recommended).
- Required columns should match the Kaggle test schema used for training:
  - `datetime`, `season`, `holiday`, `workingday`, `weather`, `temp`, `atemp`, `humidity`, `windspeed`
- Do not include the ignored leakage columns at inference:
  - `casual`, `registered`
- If using engineered datetime columns in your final training run, ensure consistent feature generation:
  - `year`, `month`, `day`, `hour`
- Datatypes:
  - numeric columns should be valid numeric types (int/float)
  - missing values should be handled consistently (AutoGluon can handle many missing values, but consistent preprocessing is recommended)
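A lightweight pre-flight check along these lines can catch schema mismatches before calling `predict()`. This is a hypothetical helper, not part of the repository:

```python
REQUIRED_COLUMNS = {
    "datetime", "season", "holiday", "workingday",
    "weather", "temp", "atemp", "humidity", "windspeed",
}
LEAKAGE_COLUMNS = {"casual", "registered"}

def check_schema(columns) -> None:
    """Raise if required columns are missing or leakage columns are present."""
    cols = set(columns)
    missing = REQUIRED_COLUMNS - cols
    leaked = LEAKAGE_COLUMNS & cols
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")
    if leaked:
        raise ValueError(f"Leakage columns must be dropped: {sorted(leaked)}")

check_schema(REQUIRED_COLUMNS)  # the exact training schema passes
```

Call it with `df.columns` right before inference; extend `REQUIRED_COLUMNS` with `year`, `month`, `day`, `hour` if your final run trained on the engineered features.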

## Bias, Risks, and Limitations

- This model is trained on a specific city/time period dataset; performance may degrade when applied to other geographies or changed mobility patterns (distribution shift).
- Kaggle data can contain seasonal/holiday patterns that may not generalize.
- RMSLE heavily penalizes under-prediction at higher counts; depending on your application, you may need different objectives/metrics.
- If `datetime` parsing or feature generation differs from training, predictions may be unreliable.

## Environmental Impact

AutoGluon tabular training for this project is typically CPU-friendly and time-bounded (10 minutes in the notebook). Compute footprint is modest compared to deep learning workloads, but best-quality presets can still train multiple models and ensembles.

## Technical Specifications

- Framework: AutoGluon Tabular (`TabularPredictor`)
- Task: Tabular regression
- Eval metric used in training: root mean squared error (RMSE)
- Ensembling: weighted ensemble over base learners may be used (AutoGluon best-quality preset)

## Model Card Authors

- BrejBala

## Contact

For questions/feedback, please open an issue on the GitHub repository:
https://github.com/brej-29/udacity-AWS-ml-engineer-nanodegree/tree/main/Bike%20Sharing%20Demand%20with%20AutoGluon