Assignment #2 – Classification, Regression, Clustering & Evaluation

Video Presentation

https://www.youtube.com/watch?v=ASl7FbvRaSg

Author: Orian Rivlin
Course: Data Science – Assignment #2
Goal: Build a complete ML pipeline – EDA, regression, feature engineering, clustering, improved models, classification, HF models, and insights.

1. Project Overview

This project uses a cleaned subset of the Chicago Crimes dataset (~19,500 rows) to build:

A regression model predicting crime latitude.
An improved model using feature engineering and clustering.
A classification pipeline converting latitude into 3 geographic regions.
Full evaluation, insights, HF deployment, and video walkthrough.

Target variables:

Regression: Latitude
Classification: Region class (0=South, 1=Central, 2=North)

2. Dataset Description

The dataset includes:

Crime type
Location description
Police district, ward, community area
Coordinates (X, Y, Latitude, Longitude)
Time information (Date parsed to Month, DayOfWeek, Hour)
Flags (Domestic, Arrest)

Final shape after cleaning: 19,496 rows × 22 columns

EDA Summary

Key Steps

Removed rows with missing coordinates and duplicate rows.
Parsed date column into time features.
Removed extreme latitude outliers (IQR-based).
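
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny synthetic frame, not the notebook's actual code: the helper name clean_crimes, the ISO date strings, and the 1.5 × IQR multiplier are assumptions, since the source only says the outlier rule was IQR-based.

```python
import pandas as pd


def clean_crimes(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop bad rows, add time features, trim latitude outliers."""
    # Remove rows without coordinates, then exact duplicate rows.
    out = df.dropna(subset=["Latitude", "Longitude"]).drop_duplicates().copy()

    # Parse the date column and derive the time features.
    out["Date"] = pd.to_datetime(out["Date"], errors="coerce")
    out = out.dropna(subset=["Date"])
    out["Month"] = out["Date"].dt.month
    out["DayOfWeek"] = out["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["Hour"] = out["Date"].dt.hour

    # IQR rule (assumed multiplier 1.5): keep latitudes inside the fences.
    q1, q3 = out["Latitude"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out["Latitude"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out.loc[keep].reset_index(drop=True)


# Synthetic rows: one duplicate, one missing latitude, one extreme latitude.
raw = pd.DataFrame({
    "Date": ["2015-03-01 14:30:00", "2015-03-01 14:30:00", "2016-07-04 02:00:00",
             "2017-01-15 09:15:00", "2018-11-20 23:45:00", "2019-05-05 12:00:00",
             "2019-06-06 12:00:00"],
    "Latitude": [41.80, 41.80, 41.85, 41.90, None, 36.6, 41.95],
    "Longitude": [-87.60, -87.60, -87.65, -87.70, -87.62, -91.7, -87.68],
})
clean = clean_crimes(raw)
print(clean[["Latitude", "Month", "DayOfWeek", "Hour"]])
```

On the synthetic frame, the duplicate, the row with a missing latitude, and the extreme latitude are dropped, leaving four rows.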

Insights

Latitude by Police District

Domestic vs Non-Domestic Locations

Latitude Stability Over Years

Crime Type Frequency

Latitude by Crime Type

3. Baseline Regression Model

Model: Linear Regression with one‑hot encoded categorical features.

Performance

MAE: ~1.20e‑05
RMSE: ~1.76e‑05
R²: 0.99999996

Latitude is nearly a linear function of the Y coordinate, which is among the input features, so near-perfect accuracy is expected.
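
A baseline of this kind might look like the sketch below. It runs on synthetic stand-in data rather than the Chicago dataset; the column names y_coord and crime_type, the slope and noise level, and the 80/20 split are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (NOT the real dataset): latitude is almost linear
# in a projected north-south coordinate, plus a little noise.
rng = np.random.default_rng(0)
n = 500
y_coord = rng.uniform(1_815_000, 1_950_000, n)
demo = pd.DataFrame({
    "y_coord": y_coord,  # hypothetical column name
    "crime_type": rng.choice(["THEFT", "BATTERY", "ASSAULT"], n),  # hypothetical
})
latitude = 41.64 + (y_coord - 1_815_000) * 2.74e-06 + rng.normal(0, 1e-05, n)

X_train, X_test, y_train, y_test = train_test_split(
    demo, latitude, test_size=0.2, random_state=0
)

# One-hot encode the categorical column, pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["crime_type"])],
    remainder="passthrough",
    sparse_threshold=0.0,  # always return a dense matrix
)
baseline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
baseline.fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2e}  RMSE={rmse:.2e}  R2={r2:.8f}")
```

Because the target is close to an affine function of one input, the linear model recovers it almost exactly. That is why errors on the order of 1e-05 degrees are plausible in this setting.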

Plots

4. Feature Engineering

Added:

Temporal Features

Month
DayOfWeek
Hour
IsWeekend
YearsSince2012
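
One way these temporal columns could be derived is sketched below. The helper name add_temporal_features is hypothetical. The definitions of IsWeekend (Saturday or Sunday) and YearsSince2012 (calendar year minus 2012) are assumptions inferred from the feature names.

```python
import pandas as pd


def add_temporal_features(df: pd.DataFrame, date_col: str = "Date") -> pd.DataFrame:
    """Hypothetical helper: add the five temporal features listed above."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["Month"] = dates.dt.month
    out["DayOfWeek"] = dates.dt.dayofweek  # Monday=0 ... Sunday=6
    out["Hour"] = dates.dt.hour
    out["IsWeekend"] = (out["DayOfWeek"] >= 5).astype(int)  # assumed: Sat/Sun
    out["YearsSince2012"] = dates.dt.year - 2012            # assumed definition
    return out


events = pd.DataFrame({"Date": ["2015-03-01 14:30:00", "2018-11-20 23:45:00"]})
features = add_temporal_features(events)
print(features)
```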

Geographic Feature

Distance from city center (DistFromCenter)
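
The source does not say how DistFromCenter was computed or which point counts as the city center. The sketch below assumes a great-circle (haversine) distance in kilometres and an approximate downtown Chicago reference point; both are illustrative choices, and the function name is hypothetical.

```python
import numpy as np

# Assumed reference point (approximate downtown Chicago); not from the source.
CENTER_LAT, CENTER_LON = 41.8781, -87.6298
EARTH_RADIUS_KM = 6371.0


def dist_from_center_km(lat, lon, center_lat=CENTER_LAT, center_lon=CENTER_LON):
    """Haversine distance in km from each (lat, lon) point to the reference point."""
    lat, lon = np.radians(lat), np.radians(lon)
    clat, clon = np.radians(center_lat), np.radians(center_lon)
    a = (np.sin((lat - clat) / 2) ** 2
         + np.cos(lat) * np.cos(clat) * np.sin((lon - clon) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# The reference point itself, and a point 0.1 degrees of latitude further north.
d = dist_from_center_km(np.array([41.8781, 41.9781]),
                        np.array([-87.6298, -87.6298]))
print(d)
```

A plain Euclidean distance on the projected X/Y coordinates would be an equally reasonable reading of the feature name.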

Clustering Feature

Applied K-Means (k=6) on scaled spatial & time data → added ClusterID.
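
A minimal sketch of this step, assuming the clustering inputs were latitude, longitude, hour, and month (the source says only "spatial & time data") and using synthetic values in place of the real rows:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features; the real column selection is not stated.
rng = np.random.default_rng(42)
n = 600
X = np.column_stack([
    rng.uniform(41.64, 42.02, n),    # latitude
    rng.uniform(-87.94, -87.52, n),  # longitude
    rng.integers(0, 24, n),          # hour of day
    rng.integers(1, 13, n),          # month
])

# Scale first so no single feature dominates the Euclidean distances.
clusterer = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=6, n_init=10, random_state=42),
)
cluster_id = clusterer.fit_predict(X)  # one label in 0..5 per row
print(np.bincount(cluster_id))
```

The resulting labels can then be attached to the feature table as a ClusterID column.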

5. Extended Regression Model

Includes all engineered features + clustering.

Performance

MAE: ~9.53e‑06
RMSE: 1.31e‑05
R²: 0.99999997

An improvement over the baseline: MAE falls by roughly 21% and RMSE by roughly 26%, although the baseline R² was already 0.99999996.

Plot

6. Model Comparison

Model	MAE	RMSE	R²
Baseline Linear Regression	1.20e‑05	1.76e‑05	0.99999996
Improved Linear Regression	9.53e‑06	1.31e‑05	0.99999997
Random Forest	7.8e‑05	1.21e‑04	0.999998
Gradient Boosting	5.19e‑04	6.84e‑04	0.999938

Winner (Regression):

Improved Linear Regression – the simplest and fastest of the four models, with the lowest MAE and RMSE.
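
A comparison table like the one above could be produced with a loop of the following shape. The data here is synthetic (a linear target with small noise), the hyperparameters are illustrative defaults, and none of the numbers it prints correspond to the reported results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: the target is linear in the first feature plus noise.
rng = np.random.default_rng(1)
n = 400
X = rng.uniform(0.0, 1.0, size=(n, 3))
y = 41.64 + 0.38 * X[:, 0] + rng.normal(0, 1e-05, n)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=1),
    "Gradient Boosting": GradientBoostingRegressor(random_state=1),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rows.append({
        "Model": name,
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        "R2": r2_score(y_test, pred),
    })

# Best model (lowest MAE) first.
comparison = pd.DataFrame(rows).sort_values("MAE").reset_index(drop=True)
print(comparison)
```

On a target that is essentially linear, tree ensembles approximate the line with piecewise-constant steps. This is one plausible reason the linear model comes out ahead in the table.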

8. Classification Pipeline

8.1 Creating Classes

Latitude is divided into 3 balanced regions using its 33% and 66% quantiles.

Classes:

0 – Southern region
1 – Central region
2 – Northern region

The classes are roughly balanced (about a third of the rows each), so accuracy is a meaningful metric.
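
The quantile split can be sketched as below, using synthetic latitudes. Which side of each cut point is inclusive is an assumption; the source gives only the 33% and 66% quantiles.

```python
import numpy as np
import pandas as pd

# Synthetic latitudes standing in for the cleaned dataset.
rng = np.random.default_rng(7)
latitude = pd.Series(rng.uniform(41.64, 42.02, 3000), name="Latitude")

# Cut points at the 33% and 66% quantiles.
low, high = latitude.quantile([0.33, 0.66])

# 0 = South (<= low), 1 = Central (low, high], 2 = North (> high).
region = pd.cut(
    latitude, bins=[-np.inf, low, high, np.inf], labels=[0, 1, 2]
).astype(int)

shares = region.value_counts(normalize=True).sort_index()
print(shares)
```

With these cut points the split is approximately 33% / 33% / 34%, so the classes are close to balanced but not exactly equal in size.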

8.2 Models Trained

Logistic Regression
Random Forest Classifier
Gradient Boosting Classifier

8.3 Evaluation

Logistic Regression

Accuracy: ~97%
The errors fall between adjacent regions (South/Central, Central/North).

Random Forest

Accuracy: 100%

Gradient Boosting

Accuracy: 100%
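
An evaluation along these lines might be structured as in the sketch below. It uses synthetic data in which the class is a threshold on one north-south feature. The split ratio, the scaling step for logistic regression, and all hyperparameters are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: region is defined by quantile thresholds on one feature.
rng = np.random.default_rng(3)
n = 900
north_south = rng.uniform(0.0, 1.0, n)
X = np.column_stack([north_south, rng.normal(size=n)])  # second column is noise
region = np.digitize(north_south, np.quantile(north_south, [0.33, 0.66]))  # 0, 1, 2

X_train, X_test, y_train, y_test = train_test_split(
    X, region, test_size=0.25, random_state=3, stratify=region
)

classifiers = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=3),
    "Gradient Boosting": GradientBoostingClassifier(random_state=3),
}

results = {}
for name, clf in classifiers.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        # Rows are true classes, columns are predicted classes.
        "confusion": confusion_matrix(y_test, pred, labels=[0, 1, 2]),
    }
    print(name, round(results[name]["accuracy"], 3))
    print(results[name]["confusion"])
```

In the confusion matrix, errors between adjacent regions appear in the cells next to the diagonal, while the two corner cells (South predicted as North and the reverse) stay at or near zero.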

8.4 Classification Winner

Gradient Boosting Classifier – ties Random Forest at 100% accuracy and was chosen as the more compact, efficient model.

A pickle file was exported:
best_classification_model.pkl
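
Exporting and reloading a model with pickle generally follows the pattern below. The model here is a small stand-in trained on synthetic data, and writing to the system temp directory is an assumption made so the example does not touch the repository.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model trained on synthetic data (not the project's model).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = (X[:, 0] > 0.5).astype(int)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Save to a temporary location under the same file name as the exported model.
path = Path(tempfile.gettempdir()) / "best_classification_model.pkl"
with path.open("wb") as f:
    pickle.dump(model, f)

# Reload and confirm the restored model predicts identically.
with path.open("rb") as f:
    restored = pickle.load(f)
same = np.array_equal(model.predict(X), restored.predict(X))
print("round trip OK:", same)
```

Pickle files can execute arbitrary code when loaded, so they should be loaded only from trusted sources. It is also advisable to load them with the same scikit-learn version that produced them.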

9. HuggingFace Repository

HF Repo Link

Contains:

README
Notebook
Regression pickle model
Classification pickle model
materials/

Final Notes & Insights

Latitude is extremely predictable from spatial coordinates.
Feature engineering reduced MAE by roughly 21% (1.20e‑05 → 9.53e‑06), even though the baseline R² was already 0.99999996.
Clustering added interpretable structure for classification models.
Both regression and classification tasks achieved near-perfect performance.
