Assignment #2 – Classification, Regression, Clustering & Evaluation
Video Presentation
https://www.youtube.com/watch?v=ASl7FbvRaSg
Author: Orian Rivlin
Course: Data Science – Assignment #2
Goal: Build a complete ML pipeline – EDA, regression, feature engineering, clustering, improved models, classification, HF models, and insights.
1. Project Overview
This project uses a cleaned subset of the Chicago Crimes dataset (~19,500 rows) to build:
- A regression model predicting crime latitude.
- An improved model using feature engineering and clustering.
- A classification pipeline converting latitude into 3 geographic regions.
- Full evaluation, insights, HF deployment, and video walkthrough.
Target variables:
- Regression: Latitude
- Classification: Region class (0=South, 1=Central, 2=North)
2. Dataset Description
The dataset includes:
- Crime type
- Location description
- Police district, ward, community area
- Coordinates (X, Y, Latitude, Longitude)
- Time information (Date parsed to Month, DayOfWeek, Hour)
- Flags (Domestic, Arrest)
Final shape after cleaning: 19,496 rows × 22 columns
EDA Summary
Key Steps
- Removed rows with missing coordinates and duplicate rows.
- Parsed date column into time features.
- Removed extreme latitude outliers (IQR-based).
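The row-filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names `Latitude` and `Longitude`, the helper name `clean_crimes`, and the conventional fence multiplier `k = 1.5` are assumptions, and the toy rows are made up.

```python
import pandas as pd


def clean_crimes(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop incomplete and duplicate rows, then filter Latitude with the IQR rule.

    Hypothetical helper: column names and k=1.5 are assumptions.
    """
    # Rows without coordinates cannot contribute to a latitude target.
    df = df.dropna(subset=["Latitude", "Longitude"]).drop_duplicates()

    # IQR fences: keep values inside [Q1 - k*IQR, Q3 + k*IQR].
    q1 = df["Latitude"].quantile(0.25)
    q3 = df["Latitude"].quantile(0.75)
    iqr = q3 - q1
    inside = df["Latitude"].between(q1 - k * iqr, q3 + k * iqr)
    return df[inside].reset_index(drop=True)


# Toy rows (illustrative only): one outlier, one missing value, one duplicate.
raw = pd.DataFrame(
    {
        "Latitude": [41.70, 41.75, 41.80, 41.85, 41.90, 36.00, None, 41.80],
        "Longitude": [-87.60, -87.62, -87.65, -87.68, -87.70, -87.60, -87.60, -87.65],
    }
)
cleaned = clean_crimes(raw)
print(len(raw), "->", len(cleaned))
```

On the toy frame the missing row, the exact duplicate, and the 36.00 outlier are dropped, leaving five rows.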
Insights
- Latitude by Police District
- Domestic vs Non-Domestic Locations
- Latitude Stability Over Years
- Crime Type Frequency
- Latitude by Crime Type
3. Baseline Regression Model
Model: Linear Regression with one‑hot encoded categorical features.
Performance
- MAE: ~1.20e‑05
- RMSE: ~1.76e‑05
- R²: 0.99999996
Latitude is nearly a linear function of the Y coordinate, so a linear model that sees Y is expected to reach near-perfect accuracy.
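A baseline of this kind could look roughly like the sketch below. It runs on synthetic data, not the project's dataset: the column names `Primary Type` and `Y Coordinate`, the feet-to-degrees slope, and the noise level are all assumptions chosen only to mimic a near-linear relationship.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (assumption): latitude is a noisy linear function of Y.
rng = np.random.default_rng(0)
n = 500
y_coord = rng.uniform(1_810_000, 1_950_000, n)
data = pd.DataFrame(
    {
        "Primary Type": rng.choice(["THEFT", "BATTERY", "ASSAULT"], n),
        "Y Coordinate": y_coord,
    }
)
latitude = 41.64 + (y_coord - 1_810_000) * 2.74e-6 + rng.normal(0, 1e-5, n)

X_train, X_test, y_train, y_test = train_test_split(
    data, latitude, test_size=0.2, random_state=42
)

# One-hot encode the categorical column; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["Primary Type"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense matrix
)
baseline = Pipeline([("prep", preprocess), ("model", LinearRegression())])
baseline.fit(X_train, y_train)
pred = baseline.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.2e}  RMSE={rmse:.2e}  R2={r2:.8f}")
```

Keeping the encoder and the regressor in one `Pipeline` means the same preprocessing is applied at fit and predict time.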
Plots
4. Feature Engineering
Added:
Temporal Features
- Month
- DayOfWeek
- Hour
- IsWeekend
- YearsSince2012
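One way these temporal columns could be derived is sketched below. The `Date` column name, the timestamp format, and the definitions used for `IsWeekend` (Saturday or Sunday) and `YearsSince2012` (calendar year minus 2012) are assumptions; the helper name `add_time_features` is hypothetical.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the five temporal columns added (hypothetical helper)."""
    # Assumed timestamp format, e.g. "01/05/2015 10:30:00 PM".
    ts = pd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p")
    return df.assign(
        Month=ts.dt.month,
        DayOfWeek=ts.dt.dayofweek,  # Monday=0 ... Sunday=6
        Hour=ts.dt.hour,
        IsWeekend=(ts.dt.dayofweek >= 5).astype(int),  # assumed: Sat/Sun
        YearsSince2012=ts.dt.year - 2012,  # assumed: calendar-year difference
    )


# Two made-up timestamps: a Monday evening and a Saturday morning.
sample = pd.DataFrame({"Date": ["01/05/2015 10:30:00 PM", "07/18/2020 09:15:00 AM"]})
features = add_time_features(sample)
print(features)
```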
Geographic Feature
- Distance from city center (DistFromCenter)
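A distance feature like this could be computed with the haversine formula, as in the sketch below. The README does not say which distance or which center point was used, so the great-circle distance, the kilometre unit, and the downtown reference coordinates are assumptions for illustration.

```python
import math

# Assumed reference point (downtown Chicago); not stated in the project.
CITY_CENTER = (41.8781, -87.6298)
EARTH_RADIUS_KM = 6371.0


def dist_from_center(lat: float, lon: float, center=CITY_CENTER) -> float:
    """Great-circle (haversine) distance in km from (lat, lon) to center."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat, lon, center[0], center[1]))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of latitude north of the center is roughly 111 km away.
print(round(dist_from_center(42.8781, -87.6298), 2))
```

With pandas, the same function could be applied row-wise to produce the `DistFromCenter` column.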
Clustering Feature
Applied K-Means (k=6) on scaled spatial & time data → added ClusterID.
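The clustering step could be reproduced along these lines. Only `k=6`, the scaling, and the `ClusterID` name come from the description above; the exact input columns (`Latitude`, `Longitude`, `Hour` here), the random seeds, and the synthetic points are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic points standing in for the spatial and time columns (assumption).
rng = np.random.default_rng(0)
n = 300
points = pd.DataFrame(
    {
        "Latitude": rng.uniform(41.64, 42.02, n),
        "Longitude": rng.uniform(-87.94, -87.52, n),
        "Hour": rng.integers(0, 24, n),
    }
)

# Scale first so that Hour (0-23) does not dominate the distance metric.
scaled = StandardScaler().fit_transform(points)

kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
points["ClusterID"] = kmeans.fit_predict(scaled)
print(points["ClusterID"].value_counts().sort_index())
```

The resulting integer label can be used directly by tree models or one-hot encoded for linear ones.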
5. Extended Regression Model
Includes all engineered features + clustering.
Performance
- MAE: 9.53e‑06
- RMSE: 1.31e‑05
- R²: 0.99999997
An improvement over the baseline: MAE drops from ~1.20e‑05 to 9.53e‑06 (roughly 20%) and RMSE from ~1.76e‑05 to 1.31e‑05.
Plot
6. Model Comparison
| Model | MAE | RMSE | R² |
|---|---|---|---|
| Baseline Linear Regression | 1.20e‑05 | 1.76e‑05 | 0.99999996 |
| Improved Linear Regression | 9.53e‑06 | 1.31e‑05 | 0.99999997 |
| Random Forest | 7.8e‑05 | 1.21e‑04 | 0.999998 |
| Gradient Boosting | 5.19e‑04 | 6.84e‑04 | 0.999938 |
Winner (Regression):
Improved Linear Regression – simplest, fastest, most accurate.
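A comparison table like the one above could be produced with a loop of the following shape. The data here is synthetic (a single feature with a near-linear relationship to the target, chosen as an assumption), so the numbers will not match the table; the hyperparameters are also illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): target is a noisy linear function of one feature.
rng = np.random.default_rng(0)
n = 400
X = rng.uniform(0, 140_000, size=(n, 1))
y = 41.64 + X[:, 0] * 2.74e-6 + rng.normal(0, 1e-5, n)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rows.append(
        {
            "Model": name,
            "MAE": mean_absolute_error(y_test, pred),
            "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
            "R2": r2_score(y_test, pred),
        }
    )

# Best (lowest MAE) model first.
comparison = pd.DataFrame(rows).set_index("Model").sort_values("MAE")
print(comparison)
```

On data like this the linear model comes out ahead because tree ensembles approximate a straight line with piecewise-constant steps, which is consistent with the ordering in the table.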
8. Classification Pipeline
8.1 Creating Classes
Latitude is divided into 3 roughly balanced regions at its 33% and 66% quantiles.
Classes:
- 0 – Southern region
- 1 – Central region
- 2 – Northern region
The classes are balanced (about one third of the rows each), so plain accuracy is a meaningful metric.
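The quantile split can be sketched with `pandas.qcut`. The 33% and 66% cut points and the 0/1/2 labels follow the description above; the evenly spaced latitude values are a synthetic stand-in and the use of `qcut` itself is an assumption about how the split was done.

```python
import numpy as np
import pandas as pd

# Synthetic latitudes (assumption); the real column would come from the dataset.
latitude = pd.Series(np.linspace(41.64, 42.02, 300), name="Latitude")

# Bin edges at the 33% and 66% quantiles; labels 0=South, 1=Central, 2=North.
region = pd.qcut(latitude, q=[0, 0.33, 0.66, 1.0], labels=[0, 1, 2]).astype(int)

counts = region.value_counts().sort_index()
print(counts)
```

Because the cut points are 33% and 66% rather than exact thirds, the northern class ends up slightly larger than the other two.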
8.2 Models Trained
- Logistic Regression
- Random Forest Classifier
- Gradient Boosting Classifier
8.3 Evaluation
Logistic Regression
Accuracy: ~97%
Misclassifications fall between adjacent regions (South vs. Central, Central vs. North).
Random Forest
Accuracy: 100%
Gradient Boosting
Accuracy: 100%
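A confusion matrix is one way to check the adjacent-region observation made for Logistic Regression. The labels below are made up for illustration and are not the project's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted region labels (0=South, 1=Central, 2=North).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
accuracy = accuracy_score(y_true, y_pred)

# South<->North confusions sit in the two corner cells.
non_adjacent_errors = int(cm[0, 2] + cm[2, 0])
adjacent_errors = int(cm[0, 1] + cm[1, 0] + cm[1, 2] + cm[2, 1])

print(cm)
print(f"accuracy={accuracy:.2f}, adjacent={adjacent_errors}, non-adjacent={non_adjacent_errors}")
```

If the two corner cells are zero, every error is a confusion between neighbouring regions.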
8.4 Classification Winner
Gradient Boosting Classifier – perfect accuracy, compact, efficient.
A pickle file was exported: best_classification_model.pkl
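Exporting and reloading such a model could look like the sketch below. The classifier here is a small stand-in trained on synthetic data with a single feature; the real model's feature layout is not documented in this README, so only the file name is taken from the project.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model on synthetic data (assumption): one feature, three region labels.
rng = np.random.default_rng(0)
X = rng.uniform(41.64, 42.02, size=(150, 1))
y = np.digitize(X[:, 0], [41.77, 41.90])  # yields labels 0, 1, 2
model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_classification_model.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("round-trip predictions match:", same_predictions)
```

Pickle files should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.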
9. HuggingFace Repository
Contains:
- README
- Notebook
- Regression pickle model
- Classification pickle model
- materials/
Final Notes & Insights
- Latitude is extremely predictable from spatial coordinates.
- Feature engineering still lowered the error (MAE 1.20e‑05 → 9.53e‑06) despite the already high baseline accuracy.
- Clustering added interpretable structure for classification models.
- Both regression and classification tasks achieved near-perfect performance.