Assignment #2 – Classification, Regression, Clustering & Evaluation

Video Presentation

https://www.youtube.com/watch?v=ASl7FbvRaSg

Author: Orian Rivlin
Course: Data Science – Assignment #2
Goal: Build a complete ML pipeline – EDA, regression, feature engineering, clustering, improved models, classification, HF models, and insights.


1. Project Overview

This project uses a cleaned subset of the Chicago Crimes dataset (~19,500 rows) to build:

  • A regression model predicting crime latitude.
  • An improved model using feature engineering and clustering.
  • A classification pipeline converting latitude into 3 geographic regions.
  • Full evaluation, insights, HF deployment, and video walkthrough.

Target variables:

  • Regression: Latitude
  • Classification: Region class (0=South, 1=Central, 2=North)

2. Dataset Description

The dataset includes:

  • Crime type
  • Location description
  • Police district, ward, community area
  • Coordinates (X, Y, Latitude, Longitude)
  • Time information (Date parsed to Month, DayOfWeek, Hour)
  • Flags (Domestic, Arrest)

Final shape after cleaning: 19,496 rows × 22 columns

EDA Summary

Key Steps

  • Removed rows with missing coordinates and duplicate rows.
  • Parsed date column into time features.
  • Removed extreme latitude outliers (IQR-based).
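The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny synthetic frame, not the notebook's actual code: the column names `Date`, `Latitude` and `Longitude`, the helper name `clean_crimes`, and the conventional 1.5 × IQR fence are all assumptions.

```python
import pandas as pd

def clean_crimes(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete and duplicate rows, parse dates, trim latitude outliers."""
    # Remove rows without coordinates, then exact duplicate rows.
    out = df.dropna(subset=["Latitude", "Longitude"]).drop_duplicates().copy()

    # Parse the date column into the time features used later.
    out["Date"] = pd.to_datetime(out["Date"], errors="coerce")
    out = out.dropna(subset=["Date"])
    out["Month"] = out["Date"].dt.month
    out["DayOfWeek"] = out["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["Hour"] = out["Date"].dt.hour

    # IQR rule (assumed 1.5x fence): keep latitudes inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = out["Latitude"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out["Latitude"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out.loc[keep].reset_index(drop=True)

# Synthetic rows: one duplicate, one missing latitude, one far-away outlier.
raw = pd.DataFrame({
    "Date": [
        "2015-03-14 22:30:00", "2016-07-04 09:00:00", "2017-01-01 00:15:00",
        "2017-01-01 00:15:00", "2018-11-23 13:45:00", "2019-05-05 18:00:00",
        "2019-06-01 12:00:00", "2020-02-02 02:00:00",
    ],
    "Latitude": [41.70, 41.75, 41.80, 41.80, 41.85, 41.90, None, 36.60],
    "Longitude": [-87.60, -87.62, -87.65, -87.65, -87.66, -87.68, -87.70, -91.70],
})
cleaned = clean_crimes(raw)
print(cleaned[["Latitude", "Month", "DayOfWeek", "Hour"]])
```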

Insights

  • Latitude by Police District
  • Domestic vs Non-Domestic Locations
  • Latitude Stability Over Years
  • Crime Type Frequency
  • Latitude by Crime Type


3. Baseline Regression Model

Model: Linear Regression with one‑hot encoded categorical features.

Performance

  • MAE: ~1.20e‑05
  • RMSE: ~1.76e‑05
  • R²: 0.99999996

Latitude is nearly a linear function of the Y coordinate, so a linear model is expected to reach near-perfect accuracy.
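A baseline of this kind could look like the sketch below. It runs on synthetic data, and the column names `Y` and `CrimeType`, the 80/20 split, and the coefficients used to generate the fake latitudes are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in: latitude is a linear function of Y plus tiny noise.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "Y": rng.uniform(1_815_000, 1_950_000, n),
    "CrimeType": rng.choice(["THEFT", "BATTERY", "ASSAULT"], n),
})
demo["Latitude"] = 41.65 + (demo["Y"] - 1_815_000) * 2.74e-6 + rng.normal(0, 1e-5, n)

X = demo[["Y", "CrimeType"]]
y = demo["Latitude"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical column and pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["CrimeType"])],
    remainder="passthrough",
)
baseline = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
baseline.fit(X_train, y_train)

pred = baseline.predict(X_test)
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2e}  RMSE={rmse:.2e}  R2={r2:.8f}")
```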


4. Feature Engineering

Added:

Temporal Features

  • Month
  • DayOfWeek
  • Hour
  • IsWeekend
  • YearsSince2012

Geographic Feature

  • Distance from city center (DistFromCenter)
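The temporal and geographic features listed above might be derived along these lines. The card does not say how DistFromCenter is computed, so this sketch assumes a haversine distance in kilometres from an assumed city-centre point (41.8781, -87.6298); the helper name `add_features` and the input column names are also assumptions. If the distance is computed from Latitude itself, it carries information about the regression target, so the original may well use the projected X/Y coordinates instead.

```python
import numpy as np
import pandas as pd

# Assumed city-centre coordinates; the card does not state the reference point.
CENTER_LAT, CENTER_LON = 41.8781, -87.6298
EARTH_RADIUS_KM = 6371.0

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the temporal features and a distance-from-centre feature."""
    out = df.copy()
    dates = pd.to_datetime(out["Date"])

    out["Month"] = dates.dt.month
    out["DayOfWeek"] = dates.dt.dayofweek            # Monday=0 ... Sunday=6
    out["Hour"] = dates.dt.hour
    out["IsWeekend"] = (out["DayOfWeek"] >= 5).astype(int)
    out["YearsSince2012"] = dates.dt.year - 2012

    # Haversine great-circle distance to the assumed centre, in km.
    lat1, lon1 = np.radians(out["Latitude"]), np.radians(out["Longitude"])
    lat0, lon0 = np.radians(CENTER_LAT), np.radians(CENTER_LON)
    a = (np.sin((lat1 - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat1) * np.sin((lon1 - lon0) / 2) ** 2)
    out["DistFromCenter"] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return out

demo = pd.DataFrame({
    "Date": ["2015-03-14 22:30:00", "2016-07-04 09:00:00"],
    "Latitude": [41.8781, 42.8781],
    "Longitude": [-87.6298, -87.6298],
})
featured = add_features(demo)
print(featured[["IsWeekend", "YearsSince2012", "DistFromCenter"]])
```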

Clustering Feature

Applied K-Means (k=6) on scaled spatial & time data → added ClusterID.
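A minimal sketch of this clustering step, on synthetic data, is shown here. The exact columns fed to K-Means are not listed in the card, so `cluster_cols` is an assumption, as are the `n_init` and `random_state` settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder values for spatial and time columns.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "X": rng.uniform(1_100_000, 1_200_000, n),
    "Y": rng.uniform(1_815_000, 1_950_000, n),
    "Hour": rng.integers(0, 24, n),
    "Month": rng.integers(1, 13, n),
})

cluster_cols = ["X", "Y", "Hour", "Month"]  # assumed feature set

# Scale first so that large coordinate values do not dominate the distances.
scaled = StandardScaler().fit_transform(demo[cluster_cols])

kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
demo["ClusterID"] = kmeans.fit_predict(scaled)
print(demo["ClusterID"].value_counts().sort_index())
```

ClusterID is a nominal label, so one-hot encoding it before a linear model is usually more appropriate than treating it as a number.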


5. Extended Regression Model

Includes all engineered features + clustering.

Performance

  • MAE: 9.5e‑06
  • RMSE: 1.31e‑05
  • R²: 0.99999997

An improvement over the baseline: MAE falls from ~1.20e‑05 to 9.5e‑06 (about 21%) and RMSE from ~1.76e‑05 to 1.31e‑05 (about 26%).


6. Model Comparison

| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline Linear Regression | 1.20e‑05 | 1.76e‑05 | 0.99999996 |
| Improved Linear Regression | 9.53e‑06 | 1.31e‑05 | 0.99999997 |
| Random Forest | 7.8e‑05 | 1.21e‑04 | 0.999998 |
| Gradient Boosting | 5.19e‑04 | 6.84e‑04 | 0.999938 |

Winner (Regression):

Improved Linear Regression: the simplest and fastest model, with the lowest MAE and RMSE and the highest R² of the four.
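One plausible reason for this ranking is that tree ensembles approximate a linear relationship with piecewise-constant steps, while a linear model can fit it directly. The sketch below illustrates that effect on synthetic data only; the single rescaled feature, the noise level, and the hyperparameters are assumptions and do not reproduce the numbers in the table.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic target: an almost perfectly linear function of one feature.
rng = np.random.default_rng(0)
y_coord = rng.uniform(0.0, 1.0, size=(600, 1))  # rescaled stand-in for the Y coordinate
latitude = 41.65 + 0.37 * y_coord[:, 0] + rng.normal(0, 1e-5, 600)

X_train, X_test, y_train, y_test = train_test_split(
    y_coord, latitude, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "R2": r2_score(y_test, pred),
    }
    print(f"{name:18s} MAE={results[name]['MAE']:.2e} "
          f"RMSE={results[name]['RMSE']:.2e} R2={results[name]['R2']:.6f}")
```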


8. Classification Pipeline

8.1 Creating Classes

Latitude is divided into 3 balanced regions, using its 33% and 66% quantiles as cut points.

Classes:

  • 0 – Southern region
  • 1 – Central region
  • 2 – Northern region

Because the classes are balanced, accuracy is a meaningful metric.
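The quantile split can be sketched as below, using synthetic latitudes. Whether the original assigns values that fall exactly on a cut point to the lower or upper class is not stated; here `np.digitize` places them in the upper class, which is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic latitudes roughly spanning the city's range.
rng = np.random.default_rng(1)
latitude = pd.Series(rng.uniform(41.65, 42.02, 1000), name="Latitude")

# Cut points at the 33% and 66% quantiles.
q33, q66 = latitude.quantile([0.33, 0.66])

# 0 = South (below q33), 1 = Central (q33 up to q66), 2 = North (q66 and above).
region = pd.Series(np.digitize(latitude, [q33, q66]), name="Region")

counts = region.value_counts().sort_index()
print(counts)
```

With cut points at 0.33 and 0.66, the third class ends up slightly larger (about 34% of rows) than the other two.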

8.2 Models Trained

  • Logistic Regression
  • Random Forest Classifier
  • Gradient Boosting Classifier

8.3 Evaluation

Logistic Regression

Accuracy: ~97%
Misclassifications fall between adjacent regions (South/Central or Central/North).

Random Forest

Accuracy: 100%

Gradient Boosting

Accuracy: 100%
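An evaluation of this kind can be reproduced in outline with the sketch below. It uses synthetic features (a noisy spatial proxy and an hour column) rather than the project's data, so its accuracies will differ from those reported above; the hyperparameters and the scaling step for Logistic Regression are assumptions. The confusion matrix makes it easy to check whether errors stay between adjacent regions: the corner cells (South predicted as North, and vice versa) should be zero.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: labels come from latitude tertiles, features are noisy proxies.
rng = np.random.default_rng(7)
n = 1200
latitude = rng.uniform(41.65, 42.02, n)
features = np.column_stack([
    latitude + rng.normal(0, 0.01, n),  # noisy spatial proxy
    rng.integers(0, 24, n),             # hour of day (uninformative here)
])
labels = np.digitize(latitude, np.quantile(latitude, [0.33, 0.66]))

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

classifiers = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

accuracies, matrices = {}, {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    accuracies[name] = accuracy_score(y_test, pred)
    # Rows are true classes, columns are predicted classes (0=South, 1=Central, 2=North).
    matrices[name] = confusion_matrix(y_test, pred, labels=[0, 1, 2])
    print(f"{name}: accuracy={accuracies[name]:.3f}")
    print(matrices[name])
```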

8.4 Classification Winner

Gradient Boosting Classifier – perfect accuracy, compact, efficient.

A pickle file was exported:
best_classification_model.pkl
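Loading the exported file would presumably follow the usual pickle round trip, sketched below. The model here is a small stand-in trained on synthetic data and written to a temporary directory; the feature columns the real model expects are not documented in this card, so they are not shown.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model with three synthetic classes (0, 1, 2).
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = np.digitize(X[:, 0], [-0.4, 0.4])
model = GradientBoostingClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_classification_model.pkl"

    # Export: serialize the fitted estimator.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Import: restore it and use it like the original object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

same = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions identical after reload:", same)
```

Pickle files can execute arbitrary code when loaded, so they should only be unpickled from a trusted source, and scikit-learn does not guarantee that a model pickled under one version loads correctly under another.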


9. HuggingFace Repository

HF Repo Link

Contains:

  • README
  • Notebook
  • Regression pickle model
  • Classification pickle model
  • materials/

Final Notes & Insights

  • Latitude is extremely predictable from spatial coordinates.
  • Feature engineering reduced MAE by about 21% (1.20e‑05 to 9.53e‑06) despite the already high baseline accuracy.
  • Clustering added interpretable structure for classification models.
  • Both regression and classification tasks achieved near-perfect performance.