---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---
# 🎬 Movie Revenue Prediction — Full ML Pipeline
This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models — and full performance evaluation.
---
## 🧪 Part 0 — Initial Research Questions (EDA)
Before any modeling, I asked a few basic questions about the dataset:
1️⃣ **What is the relationship between budget and revenue?**
- Hypothesis: Higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.
2️⃣ **Is there a strong relationship between runtime and revenue?**
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a “normal” runtime range (around 90–150 minutes), but runtime alone does not explain revenue.
3️⃣ **What are the most common original languages in the dataset?**
- Result: English dominates by far as the main original_language, with a long tail of other languages (French, Spanish, Hindi, etc.).
These EDA steps helped build intuition before moving into modeling.
---
## 🧪 Main ML Research Questions
### **1️⃣ Can we accurately predict a movie’s revenue using metadata alone?**
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.
### **2️⃣ Which features have the strongest impact on movie revenue?**
We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (cluster_group, distance_to_centroid)
### **3️⃣ Can we classify movies into “high revenue” vs. “low revenue” groups effectively?**
We convert revenue into a balanced binary target and apply classification models.
### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.
---
# 🧱 Part 1 — Dataset & Basic Cleaning (Before Any Regression)
### 🔹 1. Loading the Data
- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)
### 🔹 2. Basic Cleaning
- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
- `budget == 0`
- `revenue == 0`
- `runtime == 0`
This produced a smaller but more reliable dataset.
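The cleaning steps above can be sketched as follows. This is a minimal illustration using a tiny in-memory frame in place of `movies_metadata.csv`, assuming the Kaggle column names:

```python
import pandas as pd

# Tiny stand-in for movies_metadata.csv (same column names).
df = pd.DataFrame({
    "budget": ["1000000", "0", "not_a_number"],
    "revenue": ["5000000", "2000000", "0"],
    "runtime": ["120", "0", "95"],
    "popularity": ["7.5", "1.2", "3.3"],
    "release_date": ["1999-03-31", "2005-07-15", "bad-date"],
})

# Coerce string columns to numeric; unparseable values become NaN.
for col in ["budget", "revenue", "runtime", "popularity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date; invalid dates become NaT.
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop clearly invalid rows (zero budget/revenue/runtime or missing values).
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)].dropna()
print(len(df))  # only the first row survives
```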
---
# 📊 Part 2 — Initial EDA (Before Any Model)
Key insights:
- **Budget vs Revenue**
- Positive trend: higher budgets *tend* to lead to higher revenue, but with big variability and outliers.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/BOkbMfLzBaHIxgj8nU7MF.png)
- **Runtime vs Revenue**
- No strong linear correlation. Being "very long" or "very short" does not guarantee success.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/NZQWe3X0kUNUXD3coeibM.png)
- **Original Language Distribution**
- English is by far the most common original language; the dataset consists mostly of English-language films.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/KCROsSBSS7zd9iQ2HIzjS.png)
These findings motivated the next steps: building a simple baseline model and then adding smarter features.
---
# 🧪 Part 3 — Baseline Regression (Before Feature Engineering)
### 🎯 Goal
Build a **simple baseline model** that predicts movie revenue using only a few basic features:
- `budget`
- `runtime`
- `vote_average`
- `vote_count`
### ⚙️ Model
- **Linear Regression** on the 4 basic features.
- Train/Test split: 80% train / 20% test.
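A sketch of the baseline setup, with synthetic data standing in for the cleaned dataset (the real features are `budget`, `runtime`, `vote_average`, `vote_count`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-in for the four basic features.
rng = np.random.default_rng(42)
n = 500
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),   # budget
    rng.uniform(80, 180, n),    # runtime
    rng.uniform(4, 9, n),       # vote_average
    rng.uniform(10, 1e4, n),    # vote_count
])
y = 2.5 * X[:, 0] + rng.normal(0, 2e7, n)  # revenue loosely tied to budget

# 80% train / 20% test split, as in the README.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R²={r2:.3f}")
```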
### 📊 Baseline Regression Results
Using only the basic features:
- **MAE ≈ 45,652,741**
- **RMSE ≈ 79,524,121**
- **R² ≈ 0.715**
📌 **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions) show there is still a lot of noise and missing information — which is expected in movie revenue prediction.
This baseline serves as a reference point before introducing engineered features.
---
# 🧱 Part 4 — Feature Engineering (Upgrading the Dataset)
To improve model performance, several new features were engineered:
### 🔹 New Numeric Features
- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = grouped release year by decade (e.g., 1980, 1990, 2000)
### 🔹 Categorical Encoding
- `adult` converted from `"True"/"False"` to `1/0`.
- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid dummy variable trap).
### 🔹 Scaling Numerical Features
Used `StandardScaler` to standardize numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`,
`popularity`, `profit`, `profit_ratio`, `overview_length`
Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
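The feature engineering and scaling steps above can be sketched like this, with a tiny frame standing in for the dataset (column names as used throughout this README):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "budget": [1e7, 5e7, 2e8],
    "revenue": [3e7, 1e8, 9e8],
    "overview": ["A short plot.", "Another plot here.", "Epic saga."],
    "release_date": pd.to_datetime(["1994-06-10", "2003-12-01", "2019-04-26"]),
    "adult": ["False", "True", "False"],
    "original_language": ["en", "fr", "en"],
})

# New numeric features
df["profit"] = df["revenue"] - df["budget"]
df["profit_ratio"] = df["profit"] / df["budget"]
df["overview_length"] = df["overview"].str.len()
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10

# Categorical encoding
df["adult"] = (df["adult"] == "True").astype(int)
df = pd.get_dummies(df, columns=["original_language"], drop_first=True)

# Standardize numeric columns to mean ≈ 0, std ≈ 1
num_cols = ["budget", "profit", "profit_ratio", "overview_length"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df["decade"].tolist())  # [1990, 2000, 2010]
```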
---
# 🧩 Part 5 — Clustering & PCA (Unsupervised Learning)
### 🔹 K-Means Clustering
- Features used:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature: `cluster_group` — each movie assigned to one of 4 clusters.
Rough interpretation of clusters:
- Cluster 0 — low-budget, low-revenue films
- Cluster 1 — mid-range films
- Cluster 2 — big-budget / blockbuster-style movies
- Cluster 3 — more unusual / outlier-like cases
### 🔹 PCA for Visualization
- Applied **PCA (n_components=2)** on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.
This allowed visual inspection of:
- Cluster separation
- Overlaps
- Global structure in the data
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/f7yf-UcFtEc-JSdSqtGKa.png)
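The clustering and PCA steps can be sketched as below, with synthetic stand-in features replacing the real six-column `cluster_features`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (budget, runtime, vote_average,
# vote_count, popularity, profit).
rng = np.random.default_rng(0)
cluster_features = rng.normal(size=(300, 6))
X = StandardScaler().fit_transform(cluster_features)

# K-Means with 4 clusters; labels become the cluster_group feature.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_group = kmeans.fit_predict(X)

# PCA down to 2 components for plotting (pca1, pca2).
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
pca1, pca2 = coords[:, 0], coords[:, 1]

print(sorted(set(cluster_group)))  # [0, 1, 2, 3]
print(coords.shape)                # (300, 2)
```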
### 🔹 Distance to Centroid (Outlier Feature)
Computed:
- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.
Interpretation:
- Small distance → movie is “typical” for its cluster.
- Large distance → movie is an outlier within its cluster.
This feature was later used as an additional signal for modeling.
![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/aFktxtXzdNarGtb5eDR2h.png)
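The centroid-distance feature can be sketched as follows (2-D synthetic data for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1)
labels = kmeans.fit_predict(X)

# Euclidean distance from each row to the centroid of its own cluster.
distance_to_centroid = np.linalg.norm(
    X - kmeans.cluster_centers_[labels], axis=1
)

# Larger distances flag within-cluster outliers.
print(distance_to_centroid.shape)  # (200,)
```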
---
# 🧱 Part 6 — Advanced Regression (With Engineered Features)
### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.
### 🔹 Final Feature Set
Included:
- Base numeric:
`budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered:
`profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering:
`cluster_group`, `distance_to_centroid`
- One-Hot columns:
All `original_language_...` and `status_...`
### 🔹 Models Trained
- **Linear Regression** (on the enriched feature set)
- **Random Forest Regressor**
- **Gradient Boosting Regressor**
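A sketch of training the two tree ensembles, with synthetic data standing in for the enriched feature matrix (in practice, leaky features like `profit = revenue - budget` should be excluded from the inputs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic features with a non-linear relationship to the target.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 8))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(0, 0.3, 400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

scores = {}
for model in (
    RandomForestRegressor(n_estimators=100, random_state=7),
    GradientBoostingRegressor(random_state=7),
):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = r2_score(y_test, model.predict(X_test))

print({name: round(r2, 3) for name, r2 in scores.items()})
```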
### 📊 Regression Results (With Engineered Features)
| Model | MAE | RMSE | R² |
|--------------------|------------|------------|----------|
| Linear Regression | ~0 (leakage) | ~0 | **1.00** |
| Random Forest | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting | **2,255,268** | **5,199,504** | **0.9988** |
📌 Note:
- The **Linear Regression** result is unrealistically perfect due to **data leakage** (features like `profit` are directly derived from `revenue`).
- The real, meaningful comparison is between **Random Forest** and **Gradient Boosting**.
### 🏆 Regression Winner
🔥 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships
---
# 🧱 Part 7 — Turning Regression into Classification
Instead of predicting the exact revenue, we converted the problem to a binary classification task:
- **Class 0:** revenue < median(revenue)
- **Class 1:** revenue ≥ median(revenue)
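Splitting at the median yields a (near-)balanced binary target; a minimal sketch:

```python
import pandas as pd

# Toy revenue values standing in for the real column.
revenue = pd.Series([1e6, 5e6, 2e7, 8e7, 3e8, 9e8])

median_rev = revenue.median()
high_revenue = (revenue >= median_rev).astype(int)  # 1 = high, 0 = low

print(high_revenue.tolist())  # [0, 0, 0, 1, 1, 1]
```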
### 📊 Class Balance
```text
Class 1 (high revenue): 2687
Class 0 (low revenue): 2682
```
### 📊 Classification Results
#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**
#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**
#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
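The four metrics above can be computed as sketched below, on a synthetic binary problem standing in for the high/low-revenue task:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic binary classification data.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)
clf = GradientBoostingClassifier(random_state=3).fit(X_train, y_train)
pred = clf.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
print({name: round(v, 3) for name, v in metrics.items()})
```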
---
## 🏆 Classification Winner
🔥 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance
---
## 📌 Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab
---
## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison
The strongest model in both the regression and classification tasks was **Gradient Boosting**, delivering the best performance of the models compared.
---
🎥 Watch the full project here:
https://www.loom.com/share/303dfe317514455db992438357cf8cb4