+🎬 Movie Revenue Prediction Project
+📈 Regression → Feature Engineering → Clustering → Classification → Model Deployment
+📦 Overview
+This project predicts movie revenue with regression models and, after reframing the target as a binary class, with classification models.
+It combines feature engineering, K-Means cluster features, and precision-focused evaluation.
+It was built as part of a Data Science assignment using the Movies Metadata dataset
+(Kaggle), processed and modeled in Google Colab.
+The final models are exported and published in a HuggingFace repository.
+🗂️ 1. Dataset
+Source: Kaggle’s Movies Metadata dataset
+Rows after cleaning: ~5,300
+Original target: revenue
+Classification target (later): revenue_class (high vs. low revenue)
+🔍 Main features used
+budget
+runtime
+vote_average
+vote_count
+popularity
+release_date → converted into release_year, decade
+overview → converted into a text-length feature (overview_length)
+🧹 2. Data Cleaning & Preprocessing
+✔ Converted numeric fields to proper types
+✔ Removed impossible values (zero budget/revenue/runtime)
+✔ Parsed release_date into datetime
+✔ Handled missing values
+✔ Selected only meaningful rows for modeling
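The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the helper name `clean_movies` is hypothetical, the sample rows are made up, and dropping rows with missing values is an assumption (the source only says missing values were "handled").

```python
import pandas as pd

def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps listed above to a raw metadata frame."""
    df = df.copy()
    # Coerce numeric fields; unparseable entries become NaN.
    for col in ["budget", "revenue", "runtime"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Parse release_date into datetime; invalid dates become NaT.
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    # Assumption: rows missing any required field are dropped.
    df = df.dropna(subset=["budget", "revenue", "runtime", "release_date"])
    # Remove impossible values (zero or negative budget, revenue, runtime).
    mask = (df["budget"] > 0) & (df["revenue"] > 0) & (df["runtime"] > 0)
    return df[mask].reset_index(drop=True)

# Tiny made-up sample (not real dataset rows).
raw = pd.DataFrame({
    "budget": ["1000000", "0", "bad", "5000000"],
    "revenue": [3000000, 100, 200, 0],
    "runtime": [95.0, 100.0, 90.0, 120.0],
    "release_date": ["2001-05-04", "1999-01-01", "2010-07-16", "not a date"],
})
cleaned = clean_movies(raw)
print(cleaned)
```

Only the first sample row survives: the others have a zero budget, an unparseable budget, or an invalid date.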
+📊 3. Exploratory Data Analysis
+📈 Budget vs Revenue
+Higher budget → generally higher revenue, though with big spread and outliers.
+⏱️ Runtime vs Revenue
+No strong linear trend, but most successful films fall within a typical runtime range (80–150 minutes).
+🌍 Top Original Languages
+English overwhelmingly dominates the dataset.
+Each insight was supported by Matplotlib/Seaborn visualizations.
+🧱 4. Baseline Regression Model
+🎯 Goal
+Predict movie revenue using simple numeric features.
+🧩 Features
+budget, runtime, vote_average, vote_count
+⚙️ Model
+Linear Regression
+📐 Metrics
+MAE, MSE, RMSE, R²
+📝 Insight
+Useful as a baseline, but its predictive power is limited, which motivates feature engineering.
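A baseline of this kind might look like the sketch below. The data is synthetic (random numbers standing in for budget, runtime, vote_average, and vote_count), and the 80/20 split and the random seeds are assumptions, so the printed metrics say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for the four baseline features.
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),   # budget
    rng.uniform(80, 150, n),    # runtime
    rng.uniform(1, 10, n),      # vote_average
    rng.integers(10, 5000, n),  # vote_count
])
# Fake revenue: mostly driven by budget and vote_count, plus noise.
y = 2.5 * X[:, 0] + 2e4 * X[:, 3] + rng.normal(0, 2e7, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# The four metrics named above.
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3g}  MSE={mse:.3g}  RMSE={rmse:.3g}  R2={r2:.3f}")
```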
+🛠️ 5. Feature Engineering
+Created new features:
+profit = revenue – budget
+profit_ratio = profit / budget
+overview_length (text length)
+release_year, decade
+Encoded categoricals (original_language, status)
+Standardized numeric features using StandardScaler
+Added cluster-based features from K-Means:
+cluster_group
+distance_to_centroid
+These features significantly improved the performance of the later models.
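The derived columns listed above can be computed with a few pandas operations. The sketch below uses a hypothetical helper `add_features` and two made-up rows; it assumes release_date has already been parsed to datetime. One caveat worth keeping in mind: profit and profit_ratio are computed from revenue, the prediction target, so using them as predictors of revenue would leak the target.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the derived columns described above (illustrative helper)."""
    df = df.copy()
    # Both profit columns depend on revenue, i.e. on the regression target.
    df["profit"] = df["revenue"] - df["budget"]
    df["profit_ratio"] = df["profit"] / df["budget"]
    # Text length of the overview; missing overviews count as length 0.
    df["overview_length"] = df["overview"].fillna("").str.len()
    # Calendar features from the parsed release date.
    df["release_year"] = df["release_date"].dt.year
    df["decade"] = (df["release_year"] // 10) * 10
    return df

# Two made-up rows (not real dataset entries).
movies = pd.DataFrame({
    "budget": [1_000_000, 50_000_000],
    "revenue": [3_000_000, 40_000_000],
    "overview": ["A short plot.", None],
    "release_date": pd.to_datetime(["1994-09-23", "2008-07-18"]),
})
features = add_features(movies)
print(features[["profit", "profit_ratio", "overview_length", "release_year", "decade"]])
```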
+🎯 6. Clustering (K-Means + PCA)
+🤖 Unsupervised Learning
+K-Means with k = 4
+Features: budget, runtime, vote_average, vote_count, popularity, profit
+🌀 PCA Visualization
+2D scatter plot revealing structured groups, including:
+Low-budget films
+Mid-tier films
+High-budget blockbusters
+The cluster assignments were later used as new predictive features.
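One way to obtain cluster_group and distance_to_centroid, along with 2D coordinates for the scatter plot, is sketched below. The input is synthetic (random values in place of the real clustering features), and scaling before K-Means, `n_init=10`, and the random seeds are assumptions rather than settings confirmed by the source.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 films x 5 skewed numeric features.
X = rng.lognormal(mean=0.0, sigma=1.0, size=(300, 5))

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# Cluster label per film.
cluster_group = kmeans.labels_
# transform() gives the distance to every centroid; keep the assigned one.
distances = kmeans.transform(X_scaled)
distance_to_centroid = distances[np.arange(len(X_scaled)), cluster_group]

# Project to two principal components for a 2D scatter plot.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)
print(coords_2d[:3], cluster_group[:10], distance_to_centroid[:3])
```

Plotting `coords_2d` colored by `cluster_group` (for example with Matplotlib's `scatter`) would give a view similar to the one described above.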
+🚀 7. Improved Regression Models
+Trained 3 regression models:
+Linear Regression (improved)
+Random Forest Regressor
+Gradient Boosting Regressor ← 🏆 Winner
+🏆 Winning Model
+Gradient Boosting Regressor
+Why?
+Best R²
+Lowest MAE & RMSE
+Handles non-linear relationships between the features and revenue
+Exported as:
+winning_model.pkl
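The comparison and export could follow a pattern like the one below. It is a sketch on synthetic data: the hyperparameters, the selection by R² alone, the use of `pickle`, and writing to a temporary directory are all assumptions, and which model wins here has no bearing on the project's reported result.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic data with a non-linear target.
X = rng.uniform(0, 1, size=(400, 4))
y = 3 * X[:, 0] ** 2 + np.sin(4 * X[:, 1]) + rng.normal(0, 0.1, 400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its test-set metrics.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }

# Pick the model with the highest R² and serialize it.
best_name = max(scores, key=lambda k: scores[k]["r2"])
path = os.path.join(tempfile.gettempdir(), "winning_model.pkl")
with open(path, "wb") as f:
    pickle.dump(models[best_name], f)
print(best_name, scores[best_name])
```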
+🔄 8. Regression → Classification
+The regression target was reframed into a binary classification problem:
+🎚️ Creating revenue_class
+Median split
+Class 0 → below median
+Class 1 → at or above median
+⚖️ Class Balance
+Balanced by construction (~50/50), since the split is made at the median.
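The median split itself is a one-line operation. The example below uses six made-up revenue values to show how the "at or above median" rule assigns classes.

```python
import pandas as pd

# Made-up revenue values (not real dataset rows).
revenue = pd.Series([1.0, 5.0, 10.0, 20.0, 50.0, 200.0], name="revenue")

# Class 1 → at or above the median, class 0 → below it.
median = revenue.median()
revenue_class = (revenue >= median).astype(int)

print(median)                                        # 15.0
print(revenue_class.tolist())                        # [0, 0, 0, 1, 1, 1]
print(revenue_class.value_counts(normalize=True))    # 0.5 for each class
```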
+🧠 Business Reasoning
+Precision is more important than recall
+False Positives are more dangerous than False Negatives
+Predicting a movie as high-revenue when it won’t be → wastes millions.
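To make the precision/recall trade-off concrete, the sketch below computes both from a confusion matrix on ten made-up labels. Precision is TP / (TP + FP), so it drops with every false positive, while recall is TP / (TP + FN) and only depends on false negatives.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = high revenue, 0 = low revenue.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted hits that were real hits
recall = tp / (tp + fn)     # share of real hits that were found

print(tn, fp, fn, tp)       # 5 1 2 2
print(precision, recall)
```

Here one false positive out of three positive predictions gives a precision of 2/3, and two missed hits out of four give a recall of 0.5.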
+🤖 9. Classification Models
+Trained 3 classifiers:
+Logistic Regression
+Random Forest Classifier
+Gradient Boosting Classifier ← 🏆 Winner
+🧪 Metrics Evaluated:
+Accuracy
+Precision
+Recall
+F1-score
+Classification report
+Confusion matrix
+🏆 Winning Model: Gradient Boosting Classifier
+Highest precision (0.990)
+Highest F1-score (0.990)
+Fewest false positives, the error type identified above as most costly
+Exported as:
+winning_classifier.pkl
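A precision-first comparison of the three classifiers might be structured as in the sketch below. The data is synthetic with a median-split label, and the hyperparameters, stratified split, and selection by precision alone are assumptions; the scores it prints are unrelated to the 0.990 reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic features and a median-split binary label.
X = rng.normal(size=(400, 4))
signal = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.3, 400)
y = (signal >= np.median(signal)).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each classifier and collect the metrics listed above.
results = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
    }

# Precision is the priority metric, so select on it.
best_name = max(results, key=lambda k: results[k]["precision"])
print(best_name, results[best_name])
```

The selected estimator could then be serialized in the same way as the regression model, for example with `pickle.dump` to winning_classifier.pkl.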