odedf2001 committed on
Commit 482c824 · verified · 1 Parent(s): 5286068

Update README.md

Files changed (1)
  1. README.md +0 -207
README.md CHANGED
@@ -1,207 +0,0 @@
- 🎬 Movie Revenue Prediction Project
- 📈 Regression → Feature Engineering → Clustering → Classification → Model Deployment
- 📦 Overview
-
- This project predicts movie revenue using both regression and classification models,
- supported by feature engineering, K-Means clustering, and precision-focused evaluation.
-
- It was built as part of a Data Science assignment using the Movies Metadata dataset
- (Kaggle), processed and modeled in Google Colab.
-
- The final models are exported and published in a Hugging Face repository.
-
- 🗂️ 1. Dataset
-
- Source: Kaggle's Movies Metadata dataset
-
- Rows after cleaning: ~5,300
-
- Original target: revenue
-
- Classification target (later): revenue_class (high vs. low revenue)
-
- Main features used
-
- budget
-
- runtime
-
- vote_average
-
- vote_count
-
- popularity
-
- release_date → converted into release_year, decade
-
- overview → transformed into text length feature
-
- 🧹 2. Data Cleaning & Preprocessing
-
- ✔ Converted numeric fields to proper types
- ✔ Removed impossible values (zero budget/revenue/runtime)
- ✔ Parsed release_date into datetime
- ✔ Handled missing values
- ✔ Selected only meaningful rows for modeling
-
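The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the helper name `clean_movies` and the toy rows are hypothetical, and dropping rows with missing values is an assumption (the README does not say whether missing values were dropped or imputed).

```python
import pandas as pd


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the listed cleaning steps to a raw movies table (hypothetical helper)."""
    out = df.copy()
    # Convert numeric fields to proper types; unparseable entries become NaN.
    for col in ["budget", "revenue", "runtime"]:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Parse release_date into datetime; invalid dates become NaT.
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    # Assumption: missing values are handled by dropping the affected rows.
    out = out.dropna(subset=["budget", "revenue", "runtime", "release_date"])
    # Remove impossible values (zero or negative budget, revenue, or runtime).
    valid = (out["budget"] > 0) & (out["revenue"] > 0) & (out["runtime"] > 0)
    return out[valid].reset_index(drop=True)


# Toy data: only the first row survives every check.
raw = pd.DataFrame({
    "budget": ["1000000", "0", "abc", "5000000"],
    "revenue": [3000000, 500000, 100, 0],
    "runtime": [100, 90, 95, 120],
    "release_date": ["2001-05-04", "1999-01-01", "2005-07-07", "not a date"],
})
cleaned = clean_movies(raw)
print(cleaned)
```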
- 📊 3. Exploratory Data Analysis
- 📈 Budget vs Revenue
-
- Higher budgets generally come with higher revenue, though with a wide spread and outliers.
-
- ⏱️ Runtime vs Revenue
-
- No strong linear trend, but most successful films fall within a typical runtime (80–150 minutes).
-
- 🌍 Top Original Languages
-
- English overwhelmingly dominates the dataset.
-
- Each insight was supported by Matplotlib/Seaborn visualizations.
-
- 🧱 4. Baseline Regression Model
- 🎯 Goal
-
- Predict movie revenue using simple numeric features.
-
- 🧩 Features
-
- budget, runtime, vote_average, vote_count
-
- ⚙️ Model
-
- Linear Regression
-
- Metrics
-
- MAE, MSE, RMSE, R²
-
- Insight
-
- Useful as a baseline, but not accurate enough on its own → motivates feature engineering.
-
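A baseline of this kind might look like the sketch below. The data is synthetic (random stand-ins for budget, runtime, vote_average, and vote_count), and the 80/20 split and random seeds are assumptions, so the printed numbers say nothing about the real dataset. The point is only how the four metrics relate to one fitted `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-ins for budget, runtime, vote_average, vote_count.
X = np.column_stack([
    rng.uniform(1e6, 2e8, n),
    rng.uniform(80, 150, n),
    rng.uniform(3, 9, n),
    rng.uniform(10, 5000, n),
])
# Synthetic revenue: loosely tied to budget and vote_count, plus noise.
y = 2.0 * X[:, 0] + 1e4 * X[:, 3] + rng.normal(0, 2e7, n)

# Assumed 80/20 split; the README does not state the split used.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, pred)
print(f"MAE={mae:.3g}  MSE={mse:.3g}  RMSE={rmse:.3g}  R2={r2:.3f}")
```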
- 🛠️ 5. Feature Engineering
-
- Created new features:
-
- profit = revenue - budget
-
- profit_ratio = profit / budget
-
- overview_length (text length)
-
- release_year, decade
-
- Encoded categoricals (original_language, status)
-
- Standardized numeric features using StandardScaler
-
- Added cluster-based features from K-Means:
-
- cluster_group
-
- distance_to_centroid
-
- These features improved model performance over the baseline.
-
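As a rough illustration of the derived columns listed above, the sketch below computes them with pandas on two made-up rows. The helper name `add_engineered_features` is hypothetical, and measuring `overview_length` in characters is an assumption (the README only says "text length"). Note that `profit` and `profit_ratio` are computed from `revenue`, so they carry information about the regression target.

```python
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the derived columns named in the list above (hypothetical helper)."""
    out = df.copy()
    out["profit"] = out["revenue"] - out["budget"]
    out["profit_ratio"] = out["profit"] / out["budget"]
    # Assumption: text length is the character count; missing overviews count as 0.
    out["overview_length"] = out["overview"].fillna("").str.len()
    out["release_year"] = out["release_date"].dt.year
    # Floor the year to its decade, e.g. 1994 -> 1990.
    out["decade"] = (out["release_year"] // 10) * 10
    return out


movies = pd.DataFrame({
    "budget": [1_000_000, 50_000_000],
    "revenue": [3_000_000, 40_000_000],
    "overview": ["A short plot.", None],
    "release_date": pd.to_datetime(["1994-06-10", "2007-11-02"]),
})
features = add_engineered_features(movies)
print(features[["profit", "profit_ratio", "overview_length", "release_year", "decade"]])
```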
- 🎯 6. Clustering (K-Means + PCA)
- 🤖 Unsupervised Learning
-
- K-Means with k = 4
-
- Features: budget, runtime, vote statistics (vote_average, vote_count), popularity, profit
-
- 🌀 PCA Visualization
-
- 2D scatter plot revealing structured groups:
-
- Low-budget films
-
- Mid-tier films
-
- High-budget blockbusters
-
- The clusters were later used as new predictive features.
-
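One way the clustering step and its two derived features could be produced with scikit-learn is sketched below. The input is random noise standing in for the six scaled numeric columns, and defining `distance_to_centroid` as the Euclidean distance to the nearest (assigned) centroid is an assumption; the README does not spell out how it was computed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for six numeric columns (budget, runtime, votes, ...).
X = rng.normal(size=(300, 6))

# K-Means is distance based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)

# cluster_group: index of the nearest centroid for each row.
cluster_group = kmeans.predict(X_scaled)
# transform() returns the distance to every centroid; the minimum is the
# distance to the centroid the row is assigned to.
distance_to_centroid = kmeans.transform(X_scaled).min(axis=1)

# Project to 2D for a scatter plot colored by cluster_group.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)
print(coords_2d[:3], cluster_group[:3], distance_to_centroid[:3])
```

Both `cluster_group` and `distance_to_centroid` can then be appended as columns to the modeling table.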
- 🚀 7. Improved Regression Models
-
- Trained 3 regression models:
-
- Linear Regression (improved)
-
- Random Forest Regressor
-
- Gradient Boosting Regressor ← 🏆 Winner
-
- 🏆 Winning Model
-
- Gradient Boosting Regressor
-
- Why?
-
- Best R²
-
- Lowest MAE & RMSE
-
- Handles non-linear relationships well
-
- Exported as:
- winning_model.pkl
-
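The compare-then-export workflow described above might be sketched as follows. Everything here is illustrative: the data is synthetic, default hyperparameters are assumed, selecting by test-set R² alone is a simplification, and using `pickle` for the `.pkl` file is an assumption (the project may have used `joblib` instead). Which model wins on this toy data has no bearing on the reported result.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
# Synthetic non-linear target, only for demonstration.
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, 300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

# Pick the candidate with the highest test R².
best_name = max(scores, key=scores.get)
best_model = candidates[best_name]

# Write to a temporary directory so the sketch leaves no files behind.
model_path = Path(tempfile.mkdtemp()) / "winning_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(best_model, f)

# Reload to confirm the exported file reproduces the same predictions.
with open(model_path, "rb") as f:
    reloaded = pickle.load(f)
print(best_name, scores)
```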
- 🔄 8. Regression → Classification
-
- The regression target was reframed as a binary classification problem:
-
- 🎚️ Creating revenue_class
-
- Median split
-
- Class 0 → below median
-
- Class 1 → at or above median
-
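The median split above amounts to a single comparison. A minimal sketch on six made-up revenue values:

```python
import pandas as pd

# Made-up revenue values, for illustration only.
revenue = pd.Series([10, 20, 30, 40, 50, 60], name="revenue")

median = revenue.median()
# Class 1 when revenue is at or above the median, otherwise class 0.
revenue_class = (revenue >= median).astype(int)
print(median, revenue_class.tolist())
```

Because values equal to the median go to class 1, ties at the median or an odd number of rows make the split only approximately 50/50.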
- ⚖️ Class Balance
-
- Roughly balanced (~50/50), as expected from a median split.
-
- 🧠 Business Reasoning
-
- Precision is more important than recall
-
- False Positives are more dangerous than False Negatives
- Predicting a movie as high-revenue when it won't be → wastes millions.
-
- 🤖 9. Classification Models
-
- Trained 3 classifiers:
-
- Logistic Regression
-
- Random Forest Classifier
-
- Gradient Boosting Classifier ← 🏆 Winner
-
- 🧪 Metrics Evaluated:
-
- Accuracy
-
- Precision
-
- Recall
-
- F1-score
-
- Classification report
-
- Confusion matrix
-
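The metrics listed above are all available in `sklearn.metrics`. The sketch below evaluates them on ten hand-written labels (not project data) and unpacks the confusion matrix so that the false-positive count, the error the business reasoning treats as most costly, is visible directly.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-written labels for illustration: 1 = high revenue, 0 = low revenue.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)

print(f"false positives: {fp}, false negatives: {fn}")
print(classification_report(y_true, y_pred))
```

A false positive lowers precision but not recall, which is why precision is the metric to watch when a wrong "high revenue" call is the expensive mistake.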
- 🏆 Winning Model: Gradient Boosting Classifier
-
- Highest precision (0.990)
-
- Highest F1-score (0.990)
-
- Lowest rate of harmful errors (false positives)
-
- Exported as:
- winning_classifier.pkl