odedf2001 committed on
Commit 5286068 · verified · 1 Parent(s): fd6d2db

Create README.md

  1. README.md +207 -0
README.md ADDED
@@ -0,0 +1,207 @@
+ 🎬 Movie Revenue Prediction Project
+ 📈 Regression → Feature Engineering → Clustering → Classification → Model Deployment
+ 📦 Overview
+
+ This project predicts movie revenue with both regression and classification models,
+ supported by feature engineering, K-Means cluster features, and standard evaluation metrics.
+
+ It was built as part of a Data Science assignment using the Movies Metadata dataset
+ (Kaggle), processed and modeled in Google Colab.
+
+ The final models are exported and published in a Hugging Face repository.
+
+ 🗂️ 1. Dataset
+
+ Source: Kaggle’s Movies Metadata dataset
+
+ Rows after cleaning: ~5,300
+
+ Original target: revenue
+
+ Classification target (later): revenue_class (high vs. low revenue)
+
+ 🔍 Main features used
+
+ budget
+
+ runtime
+
+ vote_average
+
+ vote_count
+
+ popularity
+
+ release_date → converted into release_year and decade
+
+ overview → converted into a text-length feature (overview_length)
+
+ 🧹 2. Data Cleaning & Preprocessing
+
+ ✔ Converted numeric fields to proper types
+ ✔ Removed impossible values (zero budget/revenue/runtime)
+ ✔ Parsed release_date into datetime
+ ✔ Handled missing values
+ ✔ Selected only meaningful rows for modeling
+
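The cleaning steps listed above could be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's actual notebook code: the helper name `clean_movies` is hypothetical, and dropping rows with missing values is an assumption, since the README does not say how missing values were handled.

```python
import pandas as pd

# Tiny stand-in for raw Movies Metadata rows (all values are made up).
raw = pd.DataFrame({
    "budget": ["1000000", "0", "not_a_number", "5000000"],
    "revenue": [2500000, 100000, 300000, 0],
    "runtime": [95.0, 110.0, 88.0, 120.0],
    "release_date": ["1999-03-31", "2005-07-15", "bad-date", "2010-12-01"],
})

def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the checklist above."""
    out = df.copy()
    # Coerce numeric columns; unparseable entries become NaN.
    for col in ["budget", "revenue", "runtime"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Parse dates; unparseable entries become NaT.
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    # Assumption: rows with missing values are dropped rather than imputed.
    out = out.dropna(subset=["budget", "revenue", "runtime", "release_date"])
    # Remove impossible zero (or negative) budget/revenue/runtime values.
    positive = (out["budget"] > 0) & (out["revenue"] > 0) & (out["runtime"] > 0)
    return out[positive].reset_index(drop=True)

cleaned = clean_movies(raw)
print(cleaned)
```

Only the first made-up row survives here: the others have a zero budget, an unparseable budget and date, or zero revenue.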
+ 📊 3. Exploratory Data Analysis
+ 📈 Budget vs Revenue
+
+ Higher budgets generally go with higher revenue, though with a wide spread and outliers.
+
+ ⏱️ Runtime vs Revenue
+
+ No strong linear trend, but most successful films fall within a typical runtime (80–150 minutes).
+
+ 🌍 Top Original Languages
+
+ English overwhelmingly dominates the dataset.
+
+ Each insight was supported by Matplotlib/Seaborn visualizations.
+
+ 🧱 4. Baseline Regression Model
+ 🎯 Goal
+
+ Predict movie revenue using simple numeric features.
+
+ 🧩 Features
+
+ budget, runtime, vote_average, vote_count
+
+ ⚙️ Model
+
+ Linear Regression
+
+ 📏 Metrics
+
+ MAE, MSE, RMSE, R²
+
+ 📝 Insight
+
+ Useful as a baseline, but not accurate enough on its own, which motivates the feature engineering below.
+
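A baseline like the one described above might look like the sketch below. The data is synthetic (random stand-ins for the four features and a made-up revenue relationship), and the 80/20 split and random seeds are assumptions, so the printed numbers say nothing about the project's real results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for budget, runtime, vote_average, vote_count.
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),
    rng.uniform(80, 150, n),
    rng.uniform(1, 10, n),
    rng.integers(10, 5000, n),
])
# Made-up target: revenue is roughly 2.5x budget plus noise.
y = 2.5 * X[:, 0] + rng.normal(0, 2e7, n)

# Assumption: a simple 80/20 hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# The four metrics named in the README.
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3e}  MSE={mse:.3e}  RMSE={rmse:.3e}  R2={r2:.3f}")
```

RMSE is in the same units as revenue, which makes it easier to interpret than MSE; it is never smaller than MAE and grows faster when a few predictions are far off.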
+ 🛠️ 5. Feature Engineering
+
+ Created new features:
+
+ profit = revenue - budget
+
+ profit_ratio = profit / budget
+
+ overview_length (text length)
+
+ release_year, decade
+
+ Encoded categoricals (original_language, status)
+
+ Standardized numeric features using StandardScaler
+
+ Added cluster-based features from K-Means:
+
+ cluster_group
+
+ distance_to_centroid
+
+ These engineered features improved the performance of the downstream models.
+
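The derived columns above can be computed with a few pandas expressions, sketched here on two made-up rows. Measuring `overview_length` in characters (rather than words) is an assumption. Note also that `profit` and `profit_ratio` are computed from `revenue` itself, so feeding them to a model that predicts revenue would leak the target; the sketch only shows how the columns are derived.

```python
import pandas as pd

# Two made-up movies, already cleaned and with parsed dates.
movies = pd.DataFrame({
    "budget": [1_000_000, 50_000_000],
    "revenue": [3_000_000, 40_000_000],
    "overview": ["A short plot.", None],
    "release_date": pd.to_datetime(["1994-06-10", "2012-11-02"]),
})

features = movies.copy()
# Derived from the target: fine for analysis, leaky as model inputs.
features["profit"] = features["revenue"] - features["budget"]
features["profit_ratio"] = features["profit"] / features["budget"]
# Assumption: text length is counted in characters; missing text counts as 0.
features["overview_length"] = features["overview"].fillna("").str.len()
# Calendar features from the parsed release date.
features["release_year"] = features["release_date"].dt.year
features["decade"] = (features["release_year"] // 10) * 10

print(features[["profit", "profit_ratio", "overview_length", "release_year", "decade"]])
```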
+ 🎯 6. Clustering (K-Means + PCA)
+ 🤖 Unsupervised Learning
+
+ K-Means with k = 4
+
+ Features: budget, runtime, vote stats, popularity, profit
+
+ 🌀 PCA Visualization
+
+ A 2D scatter plot reveals structured groups:
+
+ Low-budget films
+
+ Mid-tier films
+
+ High-budget blockbusters
+
+ The clusters were later used as new predictive features.
+
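One way to obtain `cluster_group`, `distance_to_centroid`, and the 2D PCA coordinates is sketched below with scikit-learn. The input matrix is random; the six columns (reading "vote stats" as two columns), the scaling step before K-Means, and the `n_init` and `random_state` values are all assumptions rather than details confirmed by the README.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for six clustering inputs (assumed column count).
X = rng.lognormal(mean=0.0, sigma=1.0, size=(300, 6))

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# Cluster label assigned to each row.
cluster_group = kmeans.labels_
# transform() gives the distance from each row to every centroid;
# pick the column that matches the row's assigned cluster.
all_distances = kmeans.transform(X_scaled)
distance_to_centroid = all_distances[np.arange(len(X_scaled)), cluster_group]

# Project to two principal components for a scatter plot.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)
print(coords_2d[:3], cluster_group[:3], distance_to_centroid[:3])
```

The two arrays `cluster_group` and `distance_to_centroid` can then be appended as columns to the modeling table, and `coords_2d` can be plotted with points colored by `cluster_group`.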
+ 🚀 7. Improved Regression Models
+
+ Trained 3 regression models:
+
+ Linear Regression (improved)
+
+ Random Forest Regressor
+
+ Gradient Boosting Regressor ← 🏆 Winner
+
+ 🏆 Winning Model
+
+ Gradient Boosting Regressor
+
+ Why?
+
+ Best R²
+
+ Lowest MAE & RMSE
+
+ Captures non-linear relationships between the features and revenue
+
+ Exported as:
+ winning_model.pkl
+
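Training a gradient boosting regressor and exporting it to a `.pkl` file could look like the following sketch. The data is synthetic, default hyperparameters are used, and serializing with the standard `pickle` module is an assumption (the README names only the output file, not the library used to write it). The file is written to a temporary directory here to keep the example self-contained.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
# Made-up non-linear target, the kind of relationship boosting can capture.
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, 200)

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Assumption: the model is serialized with pickle under the README's file name.
path = os.path.join(tempfile.mkdtemp(), "winning_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Reload and confirm the restored model predicts identically.
with open(path, "rb") as f:
    restored = pickle.load(f)

same = bool(np.allclose(model.predict(X), restored.predict(X)))
print("round-trip predictions match:", same)
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that produced it, and only from a trusted source, since unpickling can execute arbitrary code.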
+ 🔄 8. Regression → Classification
+
+ The regression target was reframed as a binary classification problem:
+
+ 🎚️ Creating revenue_class
+
+ Median split
+
+ Class 0 → below median
+
+ Class 1 → at or above median
+
+ ⚖️ Class Balance
+
+ Approximately balanced (~50/50), as expected from a median split.
+
+ 🧠 Business Reasoning
+
+ Precision is more important than recall
+
+ False Positives are more dangerous than False Negatives
+ A movie predicted to be high-revenue that turns out not to be can waste millions.
+
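The median split described above amounts to a single comparison per row. A small sketch on made-up revenue values:

```python
import pandas as pd

# Made-up revenues, for illustration only.
revenue = pd.Series([1.0, 5.0, 2.0, 8.0, 3.0, 9.0], name="revenue")

median = revenue.median()
# Class 1 = at or above the median, class 0 = below it.
revenue_class = (revenue >= median).astype(int).rename("revenue_class")

# Share of each class; a median split lands near 50/50 by construction.
balance = revenue_class.value_counts(normalize=True)
print(median, revenue_class.tolist(), balance.to_dict())
```

With many rows exactly equal to the median the split can drift slightly away from 50/50, because ties all fall into class 1.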
+ 🤖 9. Classification Models
+
+ Trained 3 classifiers:
+
+ Logistic Regression
+
+ Random Forest Classifier
+
+ Gradient Boosting Classifier ← 🏆 Winner
+
+ 🧪 Metrics Evaluated:
+
+ Accuracy
+
+ Precision
+
+ Recall
+
+ F1-score
+
+ Classification report
+
+ Confusion matrix
+
+ 🏆 Winning Model: Gradient Boosting Classifier
+
+ Highest precision (0.990)
+
+ Highest F1-score (0.990)
+
+ Lowest rate of false positives, the costlier error
+
+ Exported as:
+ winning_classifier.pkl
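To make the precision-versus-recall reasoning concrete, the sketch below computes the listed metrics on a small hand-made set of labels (not the project's predictions). It shows how false positives lower precision while false negatives lower recall.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made labels: 1 = high revenue, 0 = low revenue.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Here the two false positives pull precision down to 0.6 while recall stays at 0.75, which is the kind of error the business reasoning in section 8 treats as most costly.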