---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---
# 🎬 Movie Revenue Prediction — Full ML Pipeline

This project builds a complete machine learning workflow using real movie metadata.
It includes data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models — and full performance evaluation.

---

## 🧪 Part 0 — Initial Research Questions (EDA)

Before any modeling, I asked a few basic questions about the dataset:

1️⃣ **What is the relationship between budget and revenue?**
- Hypothesis: Higher budget → higher revenue.
- Result: A clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.

2️⃣ **Is there a strong relationship between runtime and revenue?**
- Hypothesis: Longer movies might earn more.
- Result: No strong pattern. Most successful movies fall in a "normal" runtime range (around 90–150 minutes), but runtime alone does not explain revenue.

3️⃣ **What are the most common original languages in the dataset?**
- Result: English dominates as the `original_language` by far, with a long tail of other languages (French, Spanish, Hindi, etc.).

These EDA steps helped build intuition before moving into modeling.

---

## 🧪 Main ML Research Questions

### **1️⃣ Can we accurately predict a movie’s revenue using metadata alone?**
We test multiple regression models (Linear, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.

### **2️⃣ Which features have the strongest impact on movie revenue?**
We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (`cluster_group`, `distance_to_centroid`)

### **3️⃣ Can we classify movies into “high revenue” vs. “low revenue” groups effectively?**
We convert revenue into a balanced binary target and apply classification models.

### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**
We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.

---

# 🧱 Part 1 — Dataset & Basic Cleaning (Before Any Regression)

### 🔹 1. Loading the Data

- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)

### 🔹 2. Basic Cleaning

- Converted string columns like `budget`, `revenue`, `runtime`, `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
  - `budget == 0`
  - `revenue == 0`
  - `runtime == 0`

This produced a smaller but more reliable dataset.
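The cleaning steps above can be sketched with pandas. This is a minimal example on toy rows — the real `movies_metadata.csv` has many more columns:

```python
import pandas as pd

# Toy stand-in for a few rows of movies_metadata.csv.
df = pd.DataFrame({
    "budget": ["100000", "0", "5000000"],
    "revenue": ["250000", "1000", "0"],
    "runtime": [110, 95, 0],
    "release_date": ["1995-10-30", "bad-date", "2001-06-15"],
})

# Coerce string columns to numeric; unparseable values become NaN.
for col in ["budget", "revenue", "runtime"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse release_date; invalid dates become NaT.
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop clearly invalid rows (zero budget / revenue / runtime).
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)]
```

Only the first toy row survives here, mirroring how the real cleaning shrank the dataset.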
---

# 📊 Part 2 — Initial EDA (Before Any Model)

Key insights:

- **Budget vs Revenue**
  - Positive trend: higher budgets *tend* to lead to higher revenue, but with big variability and outliers.

 

- **Runtime vs Revenue**
  - No strong linear correlation. Being "very long" or "very short" does not guarantee success.

 

- **Original Language Distribution**
  - English is by far the most common language; most of the dataset consists of English-language films.

 

These findings motivated the next steps: building a simple baseline model and then adding smarter features.

---

# 🧪 Part 3 — Baseline Regression (Before Feature Engineering)

### 🎯 Goal
Build a **simple baseline model** that predicts movie revenue using only a few basic features:

- `budget`
- `runtime`
- `vote_average`
- `vote_count`

### ⚙️ Model

- **Linear Regression** on the 4 basic features.
- Train/Test split: 80% train / 20% test.
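A minimal sketch of this baseline setup. The data here is synthetic (the column order and the 80/20 split follow the description above; the toy revenue formula is invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# Synthetic stand-ins for the four baseline features.
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),   # budget
    rng.uniform(80, 180, n),    # runtime
    rng.uniform(4, 9, n),       # vote_average
    rng.uniform(10, 5000, n),   # vote_count
])
# Toy revenue: roughly 2.5x budget plus noise (not the real relationship).
y = 2.5 * X[:, 0] + rng.normal(0, 2e7, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
```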
### 📊 Baseline Regression Results

Using only the basic features:

- **MAE ≈ 45,652,741**
- **RMSE ≈ 79,524,121**
- **R² ≈ 0.715**

📌 **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions of dollars) show there is still a lot of noise and missing information — expected in movie revenue prediction.

This baseline serves as a reference point before introducing engineered features.

---

# 🧱 Part 4 — Feature Engineering (Upgrading the Dataset)

To improve model performance, several new features were engineered:

### 🔹 New Numeric Features

- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length` = length of the movie overview text
- `release_year` = year extracted from `release_date`
- `decade` = release year grouped by decade (e.g., 1980, 1990, 2000)
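A quick sketch of these derived columns on two toy rows:

```python
import pandas as pd

df = pd.DataFrame({
    "budget": [10_000_000, 50_000_000],
    "revenue": [30_000_000, 40_000_000],
    "overview": ["A heist goes wrong.", "Two rivals reunite."],
    "release_date": pd.to_datetime(["1994-07-06", "2008-11-21"]),
})

df["profit"] = df["revenue"] - df["budget"]              # revenue - budget
df["profit_ratio"] = df["profit"] / df["budget"]         # relative profitability
df["overview_length"] = df["overview"].str.len()         # text length in chars
df["release_year"] = df["release_date"].dt.year
df["decade"] = (df["release_year"] // 10) * 10           # 1994 -> 1990, 2008 -> 2000
```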
### 🔹 Categorical Encoding

- `adult` converted from `"True"/"False"` to `1/0`.
- `original_language` and `status` encoded using **One-Hot Encoding** (with `drop_first=True` to avoid the dummy variable trap).
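A sketch of the encoding step (toy values; the real data has many more language codes):

```python
import pandas as pd

df = pd.DataFrame({
    "adult": ["False", "True", "False"],
    "original_language": ["en", "fr", "en"],
    "status": ["Released", "Released", "Rumored"],
})

# Boolean string -> 0/1.
df["adult"] = (df["adult"] == "True").astype(int)

# One-hot encode; drop_first=True drops one category per column
# to avoid the dummy variable trap.
df = pd.get_dummies(df, columns=["original_language", "status"], drop_first=True)
```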
### 🔹 Scaling Numerical Features

Used `StandardScaler` to standardize the numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`,
  `popularity`, `profit`, `profit_ratio`, `overview_length`

Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1
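The scaling step, sketched on a toy matrix (in real code the scaler should be fit on the training split only, to avoid leaking test statistics):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two of the numeric columns (budget, runtime).
X = np.array([[1e6, 90.0],
              [5e7, 120.0],
              [2e8, 150.0]])

scaled = StandardScaler().fit_transform(X)
# Each column of `scaled` now has mean ~0 and std ~1.
```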
---

# 🧩 Part 5 — Clustering & PCA (Unsupervised Learning)

### 🔹 K-Means Clustering

- Features used:
  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature: `cluster_group` — each movie assigned to one of 4 clusters.

Rough interpretation of the clusters:
- Cluster 0 — low-budget, low-revenue films
- Cluster 1 — mid-range films
- Cluster 2 — big-budget / blockbuster-style movies
- Cluster 3 — more unusual / outlier-like cases
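A minimal K-Means sketch on synthetic data (six random columns standing in for the six features named above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # stand-in for the six scaled features

km = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_group = km.fit_predict(X)  # label in {0, 1, 2, 3} per movie
```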
### 🔹 PCA for Visualization

- Applied **PCA (`n_components=2`)** on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D, colored by `cluster_group`.

This allowed visual inspection of:
- Cluster separation
- Overlaps
- Global structure in the data

 
### 🔹 Distance to Centroid (Outlier Feature)

Computed:
- `distance_to_centroid` for each movie = Euclidean distance between the movie and its cluster center.

Interpretation:
- Small distance → movie is “typical” for its cluster.
- Large distance → movie is an outlier within its cluster.

This feature was later used as an additional signal for modeling.
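The distance feature falls directly out of the fitted K-Means model; a sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))  # stand-in for the clustering features

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Euclidean distance from each point to its own cluster's center.
distance_to_centroid = np.linalg.norm(
    X - km.cluster_centers_[km.labels_], axis=1
)
```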
 

---

# 🧱 Part 6 — Advanced Regression (With Engineered Features)

### 🎯 Goal
Use the engineered features + clustering-based features to improve regression performance.

### 🔹 Final Feature Set

Included:

- Base numeric:
  `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered:
  `profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering:
  `cluster_group`, `distance_to_centroid`
- One-Hot columns:
  All `original_language_...` and `status_...`

### 🔹 Models Trained

- **Linear Regression** (on the enriched feature set)
- **Random Forest Regressor**
- **Gradient Boosting Regressor**

### 📊 Regression Results (With Engineered Features)

| Model             | MAE           | RMSE          | R²         |
|-------------------|---------------|---------------|------------|
| Linear Regression | ~0 (leakage)  | ~0            | **1.00**   |
| Random Forest     | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting | **2,255,268** | **5,199,504** | **0.9988** |

📌 Note:
- The **Linear Regression** result is unrealistically perfect due to **data leakage**: features like `profit` are derived directly from `revenue`.
- The real, meaningful comparison is between **Random Forest** and **Gradient Boosting**.

### 🏆 Regression Winner

🔥 **Gradient Boosting Regressor**
- Highest R²
- Lowest RMSE
- Best at capturing non-linear relationships

---

# 🧱 Part 7 — Turning Regression into Classification

Instead of predicting the exact revenue, we converted the problem into a binary classification task:

- **Class 0:** revenue < median(revenue)
- **Class 1:** revenue ≥ median(revenue)
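The binarization is a one-liner against the median; a toy sketch:

```python
import numpy as np

revenue = np.array([1e6, 5e6, 2e7, 8e7, 3e8])
median = np.median(revenue)  # 2e7 for these toy values

# Class 1: revenue >= median; Class 0: otherwise.
target = (revenue >= median).astype(int)
```

Splitting at the median is what makes the two classes (almost) perfectly balanced.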
### 📊 Class Balance

```text
Class 1 (high revenue): 2687
Class 0 (low revenue):  2682
```

### 📊 Classification Results

#### Logistic Regression
- Accuracy: **0.977**
- Precision: **0.984**
- Recall: **0.968**
- F1: **0.976**

#### Random Forest
- Accuracy: **0.986**
- Precision: **0.988**
- Recall: **0.982**
- F1: **0.985**

#### Gradient Boosting Classifier
- Accuracy: **0.990**
- Precision: **0.990**
- Recall: **0.990**
- F1: **0.990**
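The four metrics reported above come straight from `sklearn.metrics`; a small sketch with made-up predictions:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]  # one false negative, one false positive

acc = accuracy_score(y_true, y_pred)    # fraction of correct labels
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision & recall
```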
---

## 🏆 Classification Winner
🔥 **Gradient Boosting Classifier**
- Highest accuracy
- Balanced precision & recall
- Best overall performance

---

## 📌 Tools Used
- Python
- pandas / numpy
- scikit-learn
- seaborn / matplotlib
- Google Colab

---

## 🎯 Final Summary
This project demonstrates a complete machine learning workflow:
- Data preprocessing
- Feature engineering
- K-Means clustering
- PCA visualization
- Regression models
- Classification models
- Full evaluation and comparison

The strongest model in both the regression and classification tasks was **Gradient Boosting**, delivering the best performance of all models tested.

---

🎥 Watch the full project here:

https://www.loom.com/share/303dfe317514455db992438357cf8cb4
|