+---
+language:
+- en
+metrics:
+- mae
+- r_squared
+- accuracy
+- precision
+- recall
+- f1
+pipeline_tag: tabular-classification
+library_name: sklearn
+tags:
+- movies
+- regression
+- classification
+---
+# 🎬 Movie Revenue Prediction — Full ML Pipeline
+This project builds a complete machine learning workflow using real movie metadata.
+It covers data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression and classification models, and an evaluation of each model's performance.
+---
+## 🧪 Part 0 — Initial Research Questions (EDA)
+Before any modeling, I asked a few basic questions about the dataset:
+1️⃣ **What is the relationship between budget and revenue?**
+- Hypothesis: Higher budget → higher revenue.
+- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.
+2️⃣ **Is there a strong relationship between runtime and revenue?**
+- Hypothesis: Longer movies might earn more.
+- Result: No strong pattern. Most successful movies fall in a “normal” runtime range (around 90–150 minutes), but runtime alone does not explain revenue.
+3️⃣ **What are the most common original languages in the dataset?**
+- Result: English dominates by far as the main original_language, with a long tail of other languages (French, Spanish, Hindi, etc.).
+These EDA steps helped build intuition before moving into modeling.
+---
+## 🧪 Main ML Research Questions
+### **1️⃣ Can we accurately predict a movie’s revenue using metadata alone?**
+We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.
+### **2️⃣ Which features have the strongest impact on movie revenue?**
+We explore the importance of:
+- budget
+- vote counts & vote average
+- popularity
+- profit & profit ratio
+- release year & decade
+- cluster-based features (cluster_group, distance_to_centroid)
+### **3️⃣ Can we classify movies into “high revenue” vs. “low revenue” groups effectively?**
+We convert revenue into a balanced binary target and apply classification models.
+### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**
+We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.
+---
+# 🧱 Part 1 — Dataset & Basic Cleaning (Before Any Regression)
+### 🔹 1. Loading the Data
+- Dataset: `movies_metadata.csv` (from Kaggle)
+- Target variable: `revenue` (continuous)
+### 🔹 2. Basic Cleaning
+- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
+- Parsed `release_date` as a datetime.
+- Removed clearly invalid rows, such as:
+  - `budget == 0`
+  - `revenue == 0`
+  - `runtime == 0`
+This produced a smaller but more reliable dataset.
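The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a tiny hand-made frame, not the project's actual notebook code: the helper name `clean_movies` and the sample rows are assumptions, and the real data would come from `pd.read_csv("movies_metadata.csv")`.

```python
import pandas as pd

# Tiny stand-in for movies_metadata.csv (hypothetical rows for illustration).
raw = pd.DataFrame({
    "budget": ["30000000", "0", "not_a_number", "5000000"],
    "revenue": ["373554033", "1000", "500", "0"],
    "runtime": ["81.0", "95.0", "100.0", "90.0"],
    "popularity": ["21.9", "3.2", "1.1", "0.5"],
    "release_date": ["1995-10-30", "1995-12-15", "bad-date", "2001-05-01"],
})


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Convert string columns to numeric/datetime and drop invalid rows."""
    out = df.copy()
    for col in ["budget", "revenue", "runtime", "popularity"]:
        # Unparseable strings become NaN instead of raising an error.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    # Keep rows where budget, revenue and runtime are all positive.
    # Assumption: NaN and negative values are dropped too, since they
    # also fail the "> 0" test.
    valid = (out["budget"] > 0) & (out["revenue"] > 0) & (out["runtime"] > 0)
    return out[valid].reset_index(drop=True)


clean = clean_movies(raw)
print(clean)
```

Using `errors="coerce"` keeps the conversion from failing on malformed entries, at the cost of silently turning them into missing values that the filter then removes.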
+---
+# 📊 Part 2 — Initial EDA (Before Any Model)
+Key insights:
+- **Budget vs Revenue**
+  - Positive trend: higher budgets *tend* to lead to higher revenue, but with large variability and many outliers.
+  ![Budget vs revenue plot](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/BOkbMfLzBaHIxgj8nU7MF.png)
+- **Runtime vs Revenue**
+  - No strong linear correlation. Being "very long" or "very short" does not guarantee success.
+  ![Runtime vs revenue plot](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/NZQWe3X0kUNUXD3coeibM.png)
+- **Original Language Distribution**
+  - English is by far the most common original language in the dataset.
+  ![Original language distribution plot](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/KCROsSBSS7zd9iQ2HIzjS.png)
+These findings motivated the next steps: building a simple baseline model and then adding smarter features.
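The three EDA questions can also be checked numerically. The sketch below uses a small made-up frame (the values are assumptions, not the project's data) and shows one possible way to quantify the two relationships and the language distribution.

```python
import pandas as pd

# Hypothetical sample rows standing in for the cleaned dataset.
movies = pd.DataFrame({
    "budget": [1e6, 5e6, 2e7, 8e7, 1.5e8, 2e8],
    "revenue": [2e6, 4e6, 6e7, 1.5e8, 9e7, 7e8],
    "runtime": [95, 150, 88, 120, 101, 143],
    "original_language": ["en", "en", "fr", "en", "hi", "en"],
})

# Pearson correlation of each candidate predictor with revenue.
corr_budget = movies["budget"].corr(movies["revenue"])
corr_runtime = movies["runtime"].corr(movies["revenue"])

# Share of each original language, most common first.
language_share = movies["original_language"].value_counts(normalize=True)

print(f"budget vs revenue:  {corr_budget:.2f}")
print(f"runtime vs revenue: {corr_runtime:.2f}")
print(language_share)
```

Pearson correlation only captures linear association, so a scatter plot is still worth inspecting alongside these numbers.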
+---
+# 🧪 Part 3 — Baseline Regression (Before Feature Engineering)
+### 🎯 Goal
+Build a **simple baseline model** that predicts movie revenue using only a few basic features:
+- `budget`
+- `runtime`
+- `vote_average`
+- `vote_count`
+### ⚙️ Model
+- **Linear Regression** on the 4 basic features.
+- Train/Test split: 80% train / 20% test.
+### 📊 Baseline Regression Results
+Using only the basic features:
+- **MAE ≈ 45,652,741**
+- **RMSE ≈ 79,524,121**
+- **R² ≈ 0.715**
+📌 **Interpretation:**
+- The model explains about **71.5%** of the variance in revenue, which is strong for a linear model with only four features.
+- However, the typical error is still in the tens of millions (MAE ≈ 45.7M, RMSE ≈ 79.5M), so a large share of the variation remains unexplained, which is expected in movie revenue prediction.
+This baseline serves as a reference point before introducing engineered features.
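A baseline of this shape could look like the sketch below. It runs on simulated data, so the printed numbers will not match the results above; the `random_state` values and the synthetic revenue formula are assumptions made only so the example is runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Simulated stand-in for the cleaned dataset (not the real movie data).
rng = np.random.default_rng(0)
n = 500
movies = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "runtime": rng.uniform(80, 160, n),
    "vote_average": rng.uniform(3, 9, n),
    "vote_count": rng.uniform(10, 10_000, n),
})
movies["revenue"] = (
    2.5 * movies["budget"]
    + 20_000 * movies["vote_count"]
    + rng.normal(0, 5e7, n)
)

features = ["budget", "runtime", "vote_average", "vote_count"]

# 80% train / 20% test, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    movies[features], movies["revenue"], test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# All three metrics are computed on the held-out test split only.
mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:,.0f}  RMSE={rmse:,.0f}  R2={r2:.3f}")
```

RMSE is always at least as large as MAE; a wide gap between them, as in the baseline results, indicates that a few very large errors dominate.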
+---
+# 🧱 Part 4 — Feature Engineering (Upgrading the Dataset)
+To improve model performance, several new features were engineered:
+### 🔹 New Numeric Features
+- `profit = revenue - budget`
+- `profit_ratio = profit / budget`
+- `overview_length` = length of the movie overview text
+- `release_year` = year extracted from `release_date`
+- `decade` = grouped release year by decade (e.g., 1980, 1990, 2000)
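The numeric features listed above can be derived with a few pandas expressions. This is a minimal sketch on two hypothetical rows; the column names `overview` and `release_date` are assumed to match the Kaggle file.

```python
import pandas as pd

# Two hypothetical rows standing in for the cleaned dataset.
movies = pd.DataFrame({
    "budget": [30_000_000.0, 65_000_000.0],
    "revenue": [373_554_033.0, 262_797_249.0],
    "overview": ["A cowboy doll feels threatened by a new toy.", None],
    "release_date": pd.to_datetime(["1995-10-30", "1995-12-15"]),
})

# Note: profit and profit_ratio are computed from revenue, the prediction
# target, so they carry target information into any model that uses them.
movies["profit"] = movies["revenue"] - movies["budget"]
movies["profit_ratio"] = movies["profit"] / movies["budget"]

# Missing overviews count as length 0 (an assumption for this sketch).
movies["overview_length"] = movies["overview"].fillna("").str.len()

movies["release_year"] = movies["release_date"].dt.year
# Integer division floors the year to its decade, e.g. 1995 -> 1990.
movies["decade"] = (movies["release_year"] // 10) * 10

print(movies[["profit", "profit_ratio", "overview_length", "release_year", "decade"]])
```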
+### 🔹 Categorical Encoding
+- `adult` converted from `"True"/"False"` to `1/0`.
+- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid the dummy variable trap).
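The encoding step might be written as below, using `pandas.get_dummies`. The sample values are hypothetical, and `dtype=int` is an added choice so the dummy columns come out as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical rows for illustration.
movies = pd.DataFrame({
    "adult": ["False", "True", "False"],
    "original_language": ["en", "fr", "en"],
    "status": ["Released", "Released", "Rumored"],
})

# Map the string flags to integers.
movies["adult"] = movies["adult"].map({"True": 1, "False": 0})

# One-hot encode; drop_first=True drops the first category of each column
# ("en" and "Released" here), which then acts as the reference level.
encoded = pd.get_dummies(
    movies,
    columns=["original_language", "status"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

With `drop_first=True`, a row of all zeros in the `original_language_...` columns means the reference language, so no information is lost.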
+### 🔹 Scaling Numerical Features
+Used `StandardScaler` to standardize numeric columns:
+- `budget`, `runtime`, `vote_average`, `vote_count`,
+  `popularity`, `profit`, `profit_ratio`, `overview_length`
+Each feature was transformed to have:
+- mean ≈ 0
+- standard deviation ≈ 1
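A sketch of the scaling step follows. The text above does not say whether the scaler was fitted before or after the train/test split; this example fits it on the training split only, which is an assumption on my part and the usual way to keep test-set statistics out of training. The data is simulated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Simulated skewed numeric features (stand-ins for budget, profit, etc.).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=15, sigma=1.5, size=(200, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training columns now have mean ~0 and standard deviation ~1.
# The test columns will be close to that, but not exact.
print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```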
+---
+# 🧩 Part 5 — Clustering & PCA (Unsupervised Learning)
+### 🔹 K-Means Clustering
+- Features used:
+  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
+- Algorithm: **K-Means** with `n_clusters=4`.
+- New feature: `cluster_group` — each movie assigned to one of 4 clusters.
+Rough interpretation of clusters:
+- Cluster 0 — low-budget, low-revenue films
+- Cluster 1 — mid-range films
+- Cluster 2 — big-budget / blockbuster-style movies
+- Cluster 3 — more unusual / outlier-like cases
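The clustering step can be sketched as follows. The data here is random noise standing in for the scaled features, and `n_init=10` and `random_state=42` are assumptions. The per-cluster mean table at the end is one way to arrive at interpretations like the ones above, since K-Means cluster numbers are arbitrary labels.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

cluster_features = [
    "budget", "runtime", "vote_average", "vote_count", "popularity", "profit",
]

# Random stand-in for the standardized clustering features.
rng = np.random.default_rng(0)
movies = pd.DataFrame(rng.normal(size=(300, 6)), columns=cluster_features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
movies["cluster_group"] = kmeans.fit_predict(movies[cluster_features])

# Average feature values per cluster, used to describe each group.
profile = movies.groupby("cluster_group")[cluster_features].mean()
print(profile.round(2))
print(movies["cluster_group"].value_counts().sort_index())
```

Because `profit` is one of the clustering inputs, `cluster_group` indirectly depends on `revenue`.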
+### 🔹 PCA for Visualization
+- Applied **PCA (n_components=2)** to the six clustering features listed above (`cluster_features`) to reduce them to two dimensions.
+- Created `pca1` and `pca2` for each movie.
+- Plotted the movies in 2D using PCA, colored by `cluster_group`.
+This allowed visual inspection of:
+- Cluster separation
+- Overlaps
+- Global structure in the data
+![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/f7yf-UcFtEc-JSdSqtGKa.png)
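A minimal sketch of the PCA projection, again on random stand-in data rather than the real features:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cluster_features = [
    "budget", "runtime", "vote_average", "vote_count", "popularity", "profit",
]

# Random stand-in for the standardized clustering features.
rng = np.random.default_rng(0)
movies = pd.DataFrame(rng.normal(size=(300, 6)), columns=cluster_features)

# Project the six features onto the two directions of highest variance.
pca = PCA(n_components=2)
components = pca.fit_transform(movies[cluster_features])
movies["pca1"] = components[:, 0]
movies["pca2"] = components[:, 1]

# Fraction of total variance retained by each component.
explained = pca.explained_variance_ratio_
print(explained.round(3), explained.sum().round(3))
```

A scatter plot of `pca1` against `pca2`, colored by `cluster_group`, gives the 2D view shown above. The explained variance ratio indicates how much of the original structure that 2D view preserves.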
+### 🔹 Distance to Centroid (Outlier Feature)
+Computed:
+- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.
+Interpretation:
+- Small distance → movie is “typical” for its cluster.
+- Large distance → movie is an outlier within its cluster.
+This feature was later used as an additional signal for modeling.
+![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/aFktxtXzdNarGtb5eDR2h.png)
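The distance feature can be computed directly from a fitted K-Means model. The sketch below uses random stand-in data and assumed `n_init` and `random_state` values; it also cross-checks the result against `KMeans.transform`, which returns the distance from each row to every cluster center.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the standardized clustering features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

# Look up each row's own cluster center, then take the Euclidean norm
# of the difference.
own_centers = kmeans.cluster_centers_[labels]
distance_to_centroid = np.linalg.norm(X - own_centers, axis=1)

# Cross-check: column j of transform(X) is the distance to center j.
all_distances = kmeans.transform(X)
same = np.allclose(distance_to_centroid, all_distances[np.arange(len(X)), labels])
print(same, distance_to_centroid[:5].round(3))
```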
+---
+# 🧱 Part 6 — Advanced Regression (With Engineered Features)
+### 🎯 Goal
+Use the engineered features + clustering-based features to improve regression performance.
+### 🔹 Final Feature Set
+Included:
+- Base numeric:
+  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
+- Engineered:
+  `profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
+- Clustering:
+  `cluster_group`, `distance_to_centroid`
+- One-Hot columns:
+  All `original_language_...` and `status_...`
+### 🔹 Models Trained
+- **Linear Regression** (on the enriched feature set)
+- **Random Forest Regressor**
+- **Gradient Boosting Regressor**
+### 📊 Regression Results (With Engineered Features)
+| Model             | MAE          | RMSE      | R²     |
+|-------------------|--------------|-----------|--------|
+| Linear Regression | ~0 (leakage) | ~0        | 1.00   |
+| Random Forest     | 1,964,109    | 7,414,303 | 0.9975 |
+| Gradient Boosting | 2,255,268    | 5,199,504 | 0.9988 |
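+📌 Note:
+- The **Linear Regression** result is unrealistically perfect due to **data leakage**: `profit` is defined as `revenue - budget`, so `budget + profit` reproduces `revenue` exactly. `profit_ratio`, `cluster_group`, and `distance_to_centroid` also depend on `revenue` through `profit`.
+- **Random Forest** and **Gradient Boosting** were trained on the same feature set, so their scores are inflated by the same leakage. They can be compared with each other, but not with the Part 3 baseline, and they do not measure how well revenue can be predicted from metadata alone.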
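The effect of the leakage can be reproduced on simulated data. In the sketch below, the data-generating formula, the helper name `holdout_r2`, and the `random_state` are all assumptions; the point is only that adding `profit` next to `budget` lets a linear model reconstruct `revenue` exactly, while the same model without it scores much lower.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Simulated data: revenue is budget times a noisy multiplier.
rng = np.random.default_rng(0)
n = 400
movies = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "vote_count": rng.uniform(10, 10_000, n),
})
movies["revenue"] = movies["budget"] * rng.lognormal(0.5, 1.0, n)
movies["profit"] = movies["revenue"] - movies["budget"]  # derived from target


def holdout_r2(feature_names):
    """Fit a linear model on 80% of the rows and return R^2 on the rest."""
    X_train, X_test, y_train, y_test = train_test_split(
        movies[feature_names], movies["revenue"], test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


r2_leaky = holdout_r2(["budget", "vote_count", "profit"])
r2_clean = holdout_r2(["budget", "vote_count"])
print(f"with profit:    R2={r2_leaky:.4f}")
print(f"without profit: R2={r2_clean:.4f}")
```

Dropping every feature computed from `revenue` before training is the simplest way to get scores that reflect genuine predictive power.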
+### 🏆 Regression Winner
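+🔥 **Gradient Boosting Regressor**
+- Highest R² (0.9988 vs 0.9975)
+- Lowest RMSE (5,199,504 vs 7,414,303)
+- Random Forest has the lower MAE (1,964,109 vs 2,255,268), so Gradient Boosting's advantage lies in avoiding large errors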
+---
+# 🧱 Part 7 — Turning Regression into Classification
+Instead of predicting the exact revenue, we converted the problem to a binary classification task:
+- **Class 0:** revenue < median(revenue)
+- **Class 1:** revenue ≥ median(revenue)
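The median split and the four reported metrics can be sketched like this. The data is simulated, the feature list is deliberately limited to two non-leaky columns, and the pipeline, `stratify`, and `random_state` choices are assumptions rather than the project's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated stand-in for the cleaned dataset.
rng = np.random.default_rng(0)
n = 600
movies = pd.DataFrame({
    "budget": rng.uniform(1e6, 2e8, n),
    "vote_count": rng.uniform(10, 10_000, n),
})
movies["revenue"] = movies["budget"] * rng.lognormal(0.5, 1.0, n)

# Class 1: revenue >= median, Class 0: revenue < median.
median_revenue = movies["revenue"].median()
movies["high_revenue"] = (movies["revenue"] >= median_revenue).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    movies[["budget", "vote_count"]],
    movies["high_revenue"],
    test_size=0.2,
    stratify=movies["high_revenue"],  # keep the 50/50 balance in both splits
    random_state=42,
)

# Scaling inside the pipeline means it is fitted on training rows only.
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
pred = clf.predict(X_test)

scores = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
print(movies["high_revenue"].value_counts().to_dict())
print({name: round(value, 3) for name, value in scores.items()})
```

Splitting at the median gives near-equal class sizes by construction, which is why accuracy is a reasonable headline metric here.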
+### 📊 Class Balance
+```text
+Class 1 (high revenue): 2687
+Class 0 (low revenue):  2682
+```
+### 📊 Classification Results
+#### Logistic Regression
+- Accuracy: **0.977**
+- Precision: **0.984**
+- Recall: **0.968**
+- F1: **0.976**
+#### Random Forest
+- Accuracy: **0.986**
+- Precision: **0.988**
+- Recall: **0.982**
+- F1: **0.985**
+#### Gradient Boosting Classifier
+- Accuracy: **0.990**
+- Precision: **0.990**
+- Recall: **0.990**
+- F1: **0.990**
+---
+## 🏆 Classification Winner
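+🔥 **Gradient Boosting Classifier**
+- Highest accuracy (0.990)
+- Balanced precision and recall (both 0.990)
+- Best F1 of the three models (0.990)
+If the classifiers used the Part 6 feature set, `profit` leaks the target here as well, so these scores would be inflated in the same way as the regression results.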
+---
+## 📌 Tools Used
+- Python
+- pandas / numpy
+- scikit-learn
+- seaborn / matplotlib
+- Google Colab
+---
+## 🎯 Final Summary
+This project demonstrates a complete machine learning workflow:
+- Data preprocessing
+- Feature engineering
+- K-Means clustering
+- PCA visualization
+- Regression models
+- Classification models
+- Full evaluation and comparison
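+The strongest model in both the regression and classification tasks was **Gradient Boosting**. Because the engineered feature set includes `profit`, which is derived from `revenue`, the near-perfect scores in Part 6 should be read as a comparison between models rather than as an estimate of real-world predictive accuracy.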