---
language:
- en
metrics:
- mae
- r_squared
- accuracy
- precision
- recall
- f1
pipeline_tag: tabular-classification
library_name: sklearn
tags:
- movies
- regression
- classification
---

# 🎬 Movie Revenue Prediction: Full ML Pipeline

This project builds a complete machine learning workflow on real movie metadata. It covers data cleaning, exploratory data analysis (EDA), feature engineering, clustering, visualization, regression models, classification models, and full performance evaluation.

---

## 🧪 Part 0: Initial Research Questions (EDA)

Before any modeling, I asked a few basic questions about the dataset:

1️⃣ **What is the relationship between budget and revenue?**
- Hypothesis: higher budget → higher revenue.
- Result: a clear positive trend, but with many outliers. Big-budget movies *tend* to earn more, but not always.

2️⃣ **Is there a strong relationship between runtime and revenue?**
- Hypothesis: longer movies might earn more.
- Result: no strong pattern. Most successful movies fall in a "normal" runtime range (around 90–150 minutes), but runtime alone does not explain revenue.

3️⃣ **What are the most common original languages in the dataset?**
- Result: English dominates by far as the main `original_language`, with a long tail of other languages (French, Spanish, Hindi, etc.).

These EDA steps helped build intuition before moving into modeling.

---

## 🧪 Main ML Research Questions

### **1️⃣ Can we accurately predict a movie's revenue using metadata alone?**

We test multiple regression models (Linear Regression, Random Forest, Gradient Boosting) and evaluate how well different features explain revenue.

### **2️⃣ Which features have the strongest impact on movie revenue?**

We explore the importance of:
- budget
- vote counts & vote average
- popularity
- profit & profit ratio
- release year & decade
- cluster-based features (`cluster_group`, `distance_to_centroid`)

### **3️⃣ Can we classify movies into "high revenue" vs. "low revenue" groups effectively?**

We convert revenue into a balanced binary target and apply classification models.

### **4️⃣ Do clustering and unsupervised learning reveal meaningful structure in the dataset?**

We use K-Means + PCA to explore hidden groups, outliers, and natural segmentation of movies.

---

# 🧱 Part 1: Dataset & Basic Cleaning (Before Any Regression)

### 🔹 1. Loading the Data
- Dataset: `movies_metadata.csv` (from Kaggle)
- Target variable: `revenue` (continuous)

### 🔹 2. Basic Cleaning
- Converted string columns such as `budget`, `revenue`, `runtime`, and `popularity` to numeric.
- Parsed `release_date` as a datetime.
- Removed clearly invalid rows, such as:
  - `budget == 0`
  - `revenue == 0`
  - `runtime == 0`

This produced a smaller but more reliable dataset.

---

# 📊 Part 2: Initial EDA (Before Any Model)

Key insights:

- **Budget vs. Revenue**
  - Positive trend: higher budgets *tend* to lead to higher revenue, but with large variability and outliers.

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/BOkbMfLzBaHIxgj8nU7MF.png)

- **Runtime vs. Revenue**
  - No strong linear correlation. Being "very long" or "very short" does not guarantee success.

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/NZQWe3X0kUNUXD3coeibM.png)

- **Original Language Distribution**
  - English is by far the most common language; the dataset is dominated by English-language films.

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/KCROsSBSS7zd9iQ2HIzjS.png)

These findings motivated the next steps: building a simple baseline model and then adding smarter features.
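The cleaning steps from Part 1 can be sketched as follows. This is a minimal illustration, not the project's exact code: the column names match `movies_metadata.csv`, but the toy rows are invented.

```python
import pandas as pd

# Toy rows with the same column names as movies_metadata.csv (values invented)
df = pd.DataFrame({
    "budget": ["100000", "0", "5000000"],
    "revenue": ["250000", "1000", "0"],
    "runtime": ["90", "110", "0"],
    "popularity": ["3.5", "1.2", "8.9"],
    "release_date": ["1995-10-30", "2001-06-15", "2010-07-16"],
})

# Convert string columns to numeric; unparseable values become NaN
for col in ["budget", "revenue", "runtime", "popularity"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Parse the release date as a datetime
df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")

# Drop clearly invalid rows (zero budget, revenue, or runtime)
df = df[(df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)]
print(len(df))  # -> 1: only the first toy row survives the filter
```

Using `errors="coerce"` turns malformed strings into `NaN` instead of raising, which matters on this dataset because some `budget` cells contain non-numeric junk.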
---

# 🧪 Part 3: Baseline Regression (Before Feature Engineering)

### 🎯 Goal
Build a **simple baseline model** that predicts movie revenue using only a few basic features:
- `budget`
- `runtime`
- `vote_average`
- `vote_count`

### ⚙️ Model
- **Linear Regression** on the 4 basic features.
- Train/test split: 80% train / 20% test.

### 📊 Baseline Regression Results
Using only the basic features:
- **MAE ≈ 45,652,741**
- **RMSE ≈ 79,524,121**
- **R² ≈ 0.715**

📌 **Interpretation:**
- The model explains about **71.5%** of the variance in revenue, which is quite strong for a first, simple model.
- However, the errors (tens of millions) show there is still a lot of noise and missing information, which is expected in movie revenue prediction.

This baseline serves as a reference point before introducing engineered features.

---

# 🧱 Part 4: Feature Engineering (Upgrading the Dataset)

To improve model performance, several new features were engineered:

### 🔹 New Numeric Features
- `profit = revenue - budget`
- `profit_ratio = profit / budget`
- `overview_length`: length of the movie overview text
- `release_year`: year extracted from `release_date`
- `decade`: release year grouped by decade (e.g., 1980, 1990, 2000)

### 🔹 Categorical Encoding
- `adult` converted from `"True"`/`"False"` to `1`/`0`.
- `original_language` and `status` encoded using **one-hot encoding** (with `drop_first=True` to avoid the dummy variable trap).

### 🔹 Scaling Numerical Features
Used `StandardScaler` to standardize the numeric columns:
- `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`, `profit_ratio`, `overview_length`

Each feature was transformed to have:
- mean ≈ 0
- standard deviation ≈ 1

---

# 🧩 Part 5: Clustering & PCA (Unsupervised Learning)

### 🔹 K-Means Clustering
- Features used: `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`, `profit`
- Algorithm: **K-Means** with `n_clusters=4`.
- New feature `cluster_group`: each movie assigned to one of the 4 clusters.

Rough interpretation of the clusters:
- Cluster 0: low-budget, low-revenue films
- Cluster 1: mid-range films
- Cluster 2: big-budget / blockbuster-style movies
- Cluster 3: more unusual / outlier-like cases

### 🔹 PCA for Visualization
- Applied **PCA** (`n_components=2`) on `cluster_features` to reduce dimensionality.
- Created `pca1` and `pca2` for each movie.
- Plotted the movies in 2D using PCA, colored by `cluster_group`.

This allowed visual inspection of:
- cluster separation
- overlaps
- global structure in the data

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/f7yf-UcFtEc-JSdSqtGKa.png)

### 🔹 Distance to Centroid (Outlier Feature)
Computed `distance_to_centroid` for each movie: the Euclidean distance between the movie and its cluster center.

Interpretation:
- Small distance → the movie is "typical" for its cluster.
- Large distance → the movie is an outlier within its cluster.

This feature was later used as an additional signal for modeling.

![image](https://cdn-uploads.huggingface.co/production/uploads/6909b9db75ba88c195adac42/aFktxtXzdNarGtb5eDR2h.png)

---

# 🧱 Part 6: Advanced Regression (With Engineered Features)

### 🎯 Goal
Use the engineered features plus the clustering-based features to improve regression performance.
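For reference, the clustering-based features described in Part 5 (`cluster_group`, `distance_to_centroid`, `pca1`/`pca2`) can be computed roughly like this. A random matrix stands in for the real scaled feature matrix, so this is a sketch, not the project's exact code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # stand-in for the 6 scaled cluster features

# cluster_group: which of the 4 K-Means clusters each movie falls into
km = KMeans(n_clusters=4, n_init=10, random_state=42)
cluster_group = km.fit_predict(X)

# distance_to_centroid: Euclidean distance from each movie to its own center
distance_to_centroid = np.linalg.norm(
    X - km.cluster_centers_[cluster_group], axis=1
)

# pca1 / pca2: a 2-D projection for plotting, colored by cluster_group
coords = PCA(n_components=2).fit_transform(X)
pca1, pca2 = coords[:, 0], coords[:, 1]
```

Indexing `km.cluster_centers_` by the label array broadcasts each movie against its own centroid, so the distance is per-cluster rather than to a global mean.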
### 🔹 Final Feature Set
Included:
- Base numeric: `budget`, `runtime`, `vote_average`, `vote_count`, `popularity`
- Engineered: `profit`, `profit_ratio`, `overview_length`, `release_year`, `decade`
- Clustering: `cluster_group`, `distance_to_centroid`
- One-hot columns: all `original_language_...` and `status_...` columns

### 🔹 Models Trained
- **Linear Regression** (on the enriched feature set)
- **Random Forest Regressor**
- **Gradient Boosting Regressor**

### 📊 Regression Results (With Engineered Features)

| Model             | MAE           | RMSE          | R²         |
|-------------------|---------------|---------------|------------|
| Linear Regression | ~0 (leakage)  | ~0            | **1.00**   |
| Random Forest     | **1,964,109** | **7,414,303** | **0.9975** |
| Gradient Boosting | **2,255,268** | **5,199,504** | **0.9988** |

📌 Note:
- The **Linear Regression** result is unrealistically perfect due to **data leakage**: features like `profit` are derived directly from `revenue`, the target.
- Because the tree-based models also see the `profit`-derived features, their near-perfect scores should be read with the same caution.
- The real, meaningful comparison is between **Random Forest** and **Gradient Boosting**.
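A minimal sketch of the Random Forest vs. Gradient Boosting comparison above, using a synthetic target in place of the real (leaky) feature matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))              # stand-in for the engineered features
y = 3.0 * X[:, 0] + rng.normal(size=500)   # synthetic "revenue" target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("Gradient Boosting", GradientBoostingRegressor(random_state=42)),
]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mae = mean_absolute_error(y_te, pred)
    rmse = mean_squared_error(y_te, pred) ** 0.5  # RMSE via sqrt of MSE
    r2 = r2_score(y_te, pred)
    print(f"{name}: MAE={mae:,.2f}  RMSE={rmse:,.2f}  R2={r2:.4f}")
```

Taking the square root of `mean_squared_error` avoids the `squared=` keyword, which newer scikit-learn versions deprecate.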
### πŸ† Regression Winner πŸ”₯ **Gradient Boosting Regressor** - Highest RΒ² - Lowest RMSE - Best at capturing non-linear relationships --- # 🧱 Part 7 β€” Turning Regression into Classification Instead of predicting the exact revenue, we converted the problem to a binary classification task: - **Class 0:** revenue < median(revenue) - **Class 1:** revenue β‰₯ median(revenue) ### πŸ“Š Class Balance ```text Class 1 (high revenue): 2687 Class 0 (low revenue): 2682 ### πŸ“Š Classification Results #### Logistic Regression - Accuracy: **0.977** - Precision: **0.984** - Recall: **0.968** - F1: **0.976** #### Random Forest - Accuracy: **0.986** - Precision: **0.988** - Recall: **0.982** - F1: **0.985** #### Gradient Boosting Classifier - Accuracy: **0.990** - Precision: **0.990** - Recall: **0.990** - F1: **0.990** --- ## πŸ† Classification Winner πŸ”₯ **Gradient Boosting Classifier** - Highest accuracy - Balanced precision & recall - Best overall performance --- ## πŸ“Œ Tools Used - Python - pandas / numpy - scikit-learn - seaborn / matplotlib - Google Colab --- ## 🎯 Final Summary This project demonstrates a complete machine learning workflow: - Data preprocessing - Feature engineering - K-Means clustering - PCA visualization - Regression models - Classification models - Full evaluation and comparison The strongest model in both regression and classification tasks was **Gradient Boosting**, delivering state-of-the-art performance. --- ``` πŸŽ₯ Watch the full project here: https://www.loom.com/share/303dfe317514455db992438357cf8cb4