Bardi-ya/Final_ML_Project

Bardi-ya committed on Sep 5, 2025

Commit

c296592

verified ·

1 Parent(s): 49a962c

Upload 51 files

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +3 -0
README.md +75 -13
app.py +63 -0
app/practical.py +115 -0
document.pdf +0 -0
models/recommender_knn_user_based.pkl +3 -0
models/recommender_merged_df_with_tfidf.pkl +3 -0
models/recommender_popular_movies_unique.pkl +3 -0
models/recommender_svd_mf.pkl +3 -0
models/recommender_unique_movies_reduced.pkl +3 -0
models/recommender_user_profiles.pkl +3 -0
notebooks/practical.ipynb +0 -0
report/images/budget_vs_revenue.png +0 -0
report/images/budget_vs_revenue_filtered.png +0 -0
report/images/df_missing.png +3 -0
report/images/movies_by_decade_pie.png +0 -0
report/images/popularity_distribution.png +0 -0
report/images/popularity_distribution_lt10.png +0 -0
report/images/popularity_distribution_lt100.png +0 -0
report/images/rating_distribution.png +0 -0
report/images/release_year_distribution.png +0 -0
report/images/runtime_distribution.png +0 -0
report/images/top_genres.png +0 -0
report/images/top_languages.png +0 -0
report/images/top_production_companies.png +0 -0
report/images/top_production_countries.png +0 -0
report/images/vote_average_distribution.png +0 -0
report/images/vote_count_distribution.png +0 -0
report/images/vote_count_vs_average.png +0 -0
report/images/wordcloud_overview.png +3 -0
report/images/wordcloud_title.png +3 -0
report/images/world_production_map.png +0 -0
requirements.txt +12 -0
src/__pycache__/collaborative.cpython-310.pyc +0 -0
src/__pycache__/collaborative.cpython-313.pyc +0 -0
src/__pycache__/content_based.cpython-310.pyc +0 -0
src/__pycache__/content_based.cpython-313.pyc +0 -0
src/__pycache__/eda.cpython-310.pyc +0 -0
src/__pycache__/eda.cpython-313.pyc +0 -0
src/__pycache__/evaluation.cpython-310.pyc +0 -0
src/__pycache__/feature_engineering.cpython-310.pyc +0 -0
src/__pycache__/feature_engineering.cpython-313.pyc +0 -0
src/__pycache__/hybrid.cpython-310.pyc +0 -0
src/__pycache__/modeling.cpython-310.pyc +0 -0
src/__pycache__/modeling.cpython-313.pyc +0 -0
src/__pycache__/preprocessing.cpython-310.pyc +0 -0
src/__pycache__/preprocessing.cpython-313.pyc +0 -0
src/eda.py +327 -0
src/evaluation.py +121 -0
src/feature_engineering.py +224 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+report/images/df_missing.png filter=lfs diff=lfs merge=lfs -text
+report/images/wordcloud_overview.png filter=lfs diff=lfs merge=lfs -text
+report/images/wordcloud_title.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED Viewed

@@ -1,13 +1,75 @@
----
-title: Final ML Project
-emoji: 🏆
-colorFrom: purple
-colorTo: purple
-sdk: gradio
-sdk_version: 5.44.1
-app_file: app.py
-pinned: false
-license: mit
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# MovieLens Movie Data Analysis
+This project provides a reproducible pipeline for preprocessing and exploratory data analysis (EDA) on the MovieLens movie dataset.
+## Project Structure
+```
+.
+├── app/
+│   └── practical.py         # Main entry point for running the pipeline
+├── src/
+│   ├── preprocessing.py     # Data loading, cleaning, merging
+│   └── eda.py               # EDA and visualization (plots saved to /report/images)
+├── notebooks/
+│   └── practical.ipynb      # Step-by-step notebook for exploration and prototyping
+├── report/
+│   └── images/              # Output directory for all generated plots and images
+├── data/
+│   ├── raw/                 # Raw input data (CSV files)
+│   ├── interim/             # Cleaned/intermediate CSVs
+│   └── processed/           # (Optional) Final processed data
+├── requirements.txt         # Python dependencies
+└── README.md                # This file
+```
+## How to Run
+1. **Install dependencies**
+   Make sure you have Python 3.8+ and run:
+   ```
+   pip install -r requirements.txt
+   ```
+2. **Prepare data**
+   Place the raw MovieLens CSV files in `data/raw/` as:
+   - `movies_metadata.csv`
+   - `credits.csv`
+   - `keywords.csv`
+   - `links.csv`
+   - `ratings.csv`
+3. **Run the pipeline**
+   ```
+   python app/practical.py
+   ```
+   This will:
+   - Clean and merge the data
+   - Save interim cleaned CSVs to `data/interim/`
+   - Generate all EDA plots and wordclouds, saving them to `report/images/`
+   - Save interactive Plotly plots as PNG (requires [kaleido](https://github.com/plotly/Kaleido)) or HTML fallback
+## Features
+- **Modular Preprocessing**: All data cleaning, merging, and type handling in `src/preprocessing.py`
+- **Automated EDA**: All plots and wordclouds generated and saved by `src/eda.py`
+- **Reproducibility**: One-command run for the entire workflow
+- **Notebook**: `notebooks/practical.ipynb` for step-by-step exploration
+## Requirements
+- pandas
+- numpy
+- matplotlib
+- seaborn
+- missingno
+- wordcloud
+- plotly
+- pycountry
+- kaleido (for static plotly image export)
+## Notes
+- If static Plotly image export fails, HTML versions of the plots are saved as a fallback.
+- All output images are saved in `report/images/`.
+- Adjust paths in `src/eda.py` and `src/preprocessing.py` if your

app.py ADDED Viewed

	@@ -0,0 +1,63 @@

+import gradio as gr
+import pickle
+import pandas as pd
+import os
+# Paths
+MODEL_DIR = "models"
+MOVIE_DATA_PATH = "data/movies.csv"  # adjust to your actual metadata file
+# Load models (choose what you want to demo)
+with open(os.path.join(MODEL_DIR, "recommender_svd_mf.pkl"), "rb") as f:
+    svd_model = pickle.load(f)
+# Load movie metadata
+movies_df = pd.read_csv(MOVIE_DATA_PATH)  # should include [movieId, title, poster_url, actors]
+def recommend(user_id, top_k=5):
+    """Generate top-k recommendations using SVD model."""
+    # Predict scores for all movies for this user
+    all_movie_ids = movies_df["movieId"].unique()
+    predictions = []
+    for mid in all_movie_ids:
+        try:
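+            # gr.Number yields a float (e.g. 1.0); cast to int so the raw id is "1", not "1.0"
+            # (assumes the model was trained on integer-like string ids)
+            est = svd_model.predict(str(int(user_id)), str(mid)).est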
+            predictions.append((mid, est))
+        except Exception:
+            continue
+    # Sort and pick top_k
+    top_movies = sorted(predictions, key=lambda x: x[1], reverse=True)[:int(top_k)]
+    # Build output
+    results = []
+    for mid, score in top_movies:
+        row = movies_df[movies_df["movieId"] == mid].iloc[0]
+        explanation = f"Because you liked movies with {row.get('actors', 'similar style')}."
+        results.append((row["title"], row.get("poster_url", None), explanation))
+    return results
+def format_output(results):
+    titles = [r[0] for r in results]
+    posters = [r[1] for r in results if r[1] is not None]
+    explanations = [r[2] for r in results]
+    return titles, posters, explanations
+demo = gr.Interface(
+    fn=lambda user_id, k: format_output(recommend(user_id, k)),
+    inputs=[
+        gr.Number(label="User ID"),
+        gr.Slider(1, 10, value=5, step=1, label="Top-K")
+    ],
+    outputs=[
+        gr.Textbox(label="Recommended Movies"),
+        gr.Gallery(label="Posters", columns=3, height="auto"),  # .style() was removed in Gradio 4+
+        gr.Textbox(label="Explanations")
+    ],
+    title="Movie Recommender System",
+    description="Enter your User ID to get top-K movie recommendations with posters and explanations."
+)
+if __name__ == "__main__":
+    demo.launch()
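
The `recommend` function above scores every movie and then sorts the full list to keep `top_k` entries. A minimal sketch of that selection step follows, using `heapq.nlargest` instead of a full sort. `predict_score`, `to_raw_id`, and `top_k_items` are hypothetical names introduced for illustration; `predict_score` stands in for `svd_model.predict(uid, iid).est`, and the sketch assumes the model was trained on integer-like string ids such as `"1"`.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def to_raw_id(value) -> str:
    """Turn a UI number such as 1.0 into the raw id string "1"."""
    return str(int(value))


def top_k_items(
    user_id,
    item_ids: Iterable,
    predict_score: Callable[[str, str], float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (item_id, score) pairs for one user."""
    uid = to_raw_id(user_id)
    scored = ((str(iid), predict_score(uid, str(iid))) for iid in item_ids)
    # nlargest keeps only k candidates at a time instead of sorting everything.
    return heapq.nlargest(int(k), scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical scores; a real run would call the loaded SVD model instead.
    fake_scores = {"10": 3.5, "20": 4.8, "30": 2.1, "40": 4.1}
    print(top_k_items(1.0, [10, 20, 30, 40], lambda u, i: fake_scores[i], k=2))
```

Normalizing the id once, before the loop, keeps the float-versus-int question in a single place, which may be easier to audit than repeated `str(...)` calls.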

app/practical.py ADDED Viewed

	@@ -0,0 +1,115 @@

+import sys
+import os
+# Add the parent directory to sys.path so 'src' can be imported
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+from src.preprocessing import Preprocessing
+from src.eda import EDA
+from src.feature_engineering import FeatureEngineering
+from src.modeling import RecommenderModels
+from src.evaluation import leave_one_out_by_timestamp, evaluate_all, summarize_results
+def main():
+    print("========== Step 1: Preprocessing ==========")
+    preprocessor = Preprocessing()
+    dfs = preprocessor.run_all()
+    # print("========== Step 2: Exploratory Data Analysis (EDA) ==========")
+    # eda = EDA(dfs)
+    # eda.run_all()
+    print("========== Step 3: Feature Engineering ==========")
+    fe = FeatureEngineering(dfs)
+    fe_outputs = fe.run_all()
+    merged_df = fe_outputs["merged_df"]
+    merged_df_with_tfidf = fe_outputs["merged_df_with_tfidf"]
+    unique_movies_reduced = fe_outputs["unique_movies_reduced"]
+    ratings_df = dfs["ratings_df"]
+    print("========== Step 4: Modeling & Recommendation ==========")
+    models = RecommenderModels(
+        merged_df_with_tfidf=merged_df_with_tfidf,
+        unique_movies_reduced=unique_movies_reduced,
+        ratings_df=ratings_df
+    )
+    models.fit_popularity()
+    models.fit_content_based()
+    models.fit_cf()
+    print("CF RMSEs (kNN, SVD):", models.evaluate_cf())
+    rmse_scores, best_alpha = models.tune_hybrid_alpha()
+    print("Best alpha:", best_alpha)
+    print("Hybrid RMSE:", models.evaluate_hybrid())
+    models.save_models()
+    # Example: get recommendations for user 1
+    print("Top 10 Content-Based Recommendations for user 1:")
+    print(models.get_content_based_recommendations(user_id=1, top_n=10))
+    print("========== Step 5: Evaluation ==========")
+    # Time-aware split
+    train_ratings, test_ratings = leave_one_out_by_timestamp(ratings_df)
+    all_items = set(merged_df_with_tfidf['movieId'].astype(str).unique())
+    item_popularity = merged_df_with_tfidf['movieId'].value_counts().to_dict()
+    item_popularity = {str(k): v for k, v in item_popularity.items()}
+    svd_cols = [col for col in unique_movies_reduced.columns if col.startswith("svd_")]
+    item_features = {
+        str(row.movieId): row[svd_cols].values
+        for _, row in unique_movies_reduced.iterrows()
+    }
+    # Generate predictions for each model
+    # Implement prediction methods if not present in RecommenderModels
+    def predict_content_based(models, test_df):
+        preds = []
+        for _, row in test_df.iterrows():
+            user_id = row['userId']
+            movie_id = row['movieId']
+            true_rating = row['rating']
+            pred_rating = models.get_content_based_score(user_id, movie_id)
+            preds.append((user_id, movie_id, true_rating, pred_rating, {}))
+        return preds
+    def predict_collaborative(models, test_df):
+        preds = []
+        for _, row in test_df.iterrows():
+            user_id = row['userId']
+            movie_id = row['movieId']
+            true_rating = row['rating']
+            # Use SVD as the collaborative model (or knn_user_based if you prefer)
+            try:
+                pred_rating = models.svd_mf.predict(str(user_id), str(movie_id)).est
+            except Exception:
+                pred_rating = 0
+            preds.append((user_id, movie_id, true_rating, pred_rating, {}))
+        return preds
+    def predict_hybrid(models, test_df, alpha):
+        preds = []
+        for _, row in test_df.iterrows():
+            user_id = row['userId']
+            movie_id = row['movieId']
+            true_rating = row['rating']
+            pred_rating = models.hybrid_prediction(user_id, movie_id, alpha)
+            preds.append((user_id, movie_id, true_rating, pred_rating, {}))
+        return preds
+    predictions_cb = predict_content_based(models, test_ratings)
+    predictions_cf = predict_collaborative(models, test_ratings)
+    predictions_hybrid = predict_hybrid(models, test_ratings, alpha=best_alpha)
+    # Evaluate
+    results_cb = evaluate_all(predictions_cb, test_ratings.values, all_items, item_popularity, item_features)
+    results_cf = evaluate_all(predictions_cf, test_ratings.values, all_items, item_popularity, item_features)
+    results_hybrid = evaluate_all(predictions_hybrid, test_ratings.values, all_items, item_popularity, item_features)
+    # Print summary table
+    summary = summarize_results({
+        "Content-Based": results_cb,
+        "Collaborative": results_cf,
+        "Hybrid": results_hybrid
+    })
+    print(summary)
+if __name__ == "__main__":
+    main()
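
The script above calls `models.tune_hybrid_alpha()` and `models.hybrid_prediction(user_id, movie_id, alpha)`, but `src/modeling.py` is not part of this view. The sketch below shows one common way such a hybrid could work: a linear blend of a collaborative score and a content-based score, with `alpha` chosen by a grid search on RMSE. This is an assumption about the technique, not the repository's implementation, and `hybrid_score`, `rmse`, and `tune_alpha` are hypothetical names.

```python
import math
from typing import Dict, Iterable, Sequence, Tuple


def hybrid_score(cf_score: float, cb_score: float, alpha: float) -> float:
    """Linear blend: alpha weights the collaborative score, (1 - alpha) the content score."""
    return alpha * cf_score + (1.0 - alpha) * cb_score


def rmse(pairs: Sequence[Tuple[float, float]]) -> float:
    """Root mean squared error over (true, predicted) pairs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


def tune_alpha(
    rows: Sequence[Tuple[float, float, float]],
    alphas: Iterable[float],
) -> Tuple[Dict[float, float], float]:
    """rows holds (true_rating, cf_score, cb_score); returns ({alpha: rmse}, best_alpha)."""
    scores = {
        a: rmse([(t, hybrid_score(cf, cb, a)) for t, cf, cb in rows])
        for a in alphas
    }
    best = min(scores, key=scores.get)
    return scores, best


if __name__ == "__main__":
    # Made-up validation rows, for illustration only.
    validation = [(4.0, 4.0, 2.0), (3.0, 3.0, 5.0)]
    print(tune_alpha(validation, [0.0, 0.5, 1.0]))
```

If `alpha` is tuned this way, it should be tuned on data that is separate from the final test split, otherwise the reported hybrid metrics would be optimistic.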

document.pdf ADDED Viewed

Binary file (68.8 kB). View file

models/recommender_knn_user_based.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2c2691486ff062618b6cc06aa00397ce7abc72d84a0f2f015e24d1c720ef9a6b
+size 5949691

models/recommender_merged_df_with_tfidf.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0c0ea5f78039d884abaf095cf23c3320976d8db9865e522136ae27a375c89662
+size 166955859

models/recommender_popular_movies_unique.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:095a203a1742406085858fc637e83a6f94e1860ac686c572ec75a2cae80511f6
+size 2922773

models/recommender_svd_mf.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:29962c5fd250f73d6d2003b88a13bc2b0ee452c93fa44ee2f69fcb410a2f8770
+size 9661411

models/recommender_unique_movies_reduced.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ddc6fc18ff578c47c86ff53abf985f7af9a74146b5fe265666ad99844b530c85
+size 21384963

models/recommender_user_profiles.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0fd803357ddf7dac92ecd6351beb6b980b56d7a514c540a78fa6261965170c68
+size 1490
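
Each `.pkl` entry above is a Git LFS pointer (a `version`, an `oid`, and a `size` line) rather than the model itself. Since `app.py` loads these files with `pickle.load`, which can execute arbitrary code, it may be worth checking a downloaded file against its pointer first. The following is a minimal sketch; `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, not part of the repository.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer: Dict[str, str]) -> bool:
    """Check a file's size and SHA-256 digest against a parsed pointer."""
    algo, _, expected = pointer["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large model files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return size == int(pointer["size"]) and digest.hexdigest() == expected


if __name__ == "__main__":
    pointer_text = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:0fd803357ddf7dac92ecd6351beb6b980b56d7a514c540a78fa6261965170c68\n"
        "size 1490\n"
    )
    print(parse_lfs_pointer(pointer_text))
```

A matching digest only shows that the file is the one that was committed; it does not make an untrusted pickle safe to load.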

notebooks/practical.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff

report/images/budget_vs_revenue.png ADDED Viewed

report/images/budget_vs_revenue_filtered.png ADDED Viewed

report/images/df_missing.png ADDED Viewed

Git LFS Details

SHA256: 4a62aab31beaa4b2578bb615811000c865e963e48f875fb32be1a907faad33b6
Pointer size: 131 Bytes
Size of remote file: 474 kB

report/images/movies_by_decade_pie.png ADDED Viewed

report/images/popularity_distribution.png ADDED Viewed

report/images/popularity_distribution_lt10.png ADDED Viewed

report/images/popularity_distribution_lt100.png ADDED Viewed

report/images/rating_distribution.png ADDED Viewed

report/images/release_year_distribution.png ADDED Viewed

report/images/runtime_distribution.png ADDED Viewed

report/images/top_genres.png ADDED Viewed

report/images/top_languages.png ADDED Viewed

report/images/top_production_companies.png ADDED Viewed

report/images/top_production_countries.png ADDED Viewed

report/images/vote_average_distribution.png ADDED Viewed

report/images/vote_count_distribution.png ADDED Viewed

report/images/vote_count_vs_average.png ADDED Viewed

report/images/wordcloud_overview.png ADDED Viewed

Git LFS Details

SHA256: b9ee4349be9564ed7c5161b5ab442159fa67031589eb87b84742a5f71a8be378
Pointer size: 131 Bytes
Size of remote file: 617 kB

report/images/wordcloud_title.png ADDED Viewed

Git LFS Details

SHA256: 96d11bd23abcb7b66230c0fb596c74e6a5a65d03398b939d7ee9c352d7435a4a
Pointer size: 131 Bytes
Size of remote file: 626 kB

report/images/world_production_map.png ADDED Viewed

requirements.txt ADDED Viewed

	@@ -0,0 +1,12 @@

+pandas
+numpy
+matplotlib
+seaborn
+missingno
+wordcloud
+plotly
+pycountry
+kaleido
+scikit-learn
+scikit-surprise
+gradio

src/__pycache__/collaborative.cpython-310.pyc ADDED Viewed

Binary file (2.92 kB). View file

src/__pycache__/collaborative.cpython-313.pyc ADDED Viewed

Binary file (4.77 kB). View file

src/__pycache__/content_based.cpython-310.pyc ADDED Viewed

Binary file (4.62 kB). View file

src/__pycache__/content_based.cpython-313.pyc ADDED Viewed

Binary file (8.07 kB). View file

src/__pycache__/eda.cpython-310.pyc ADDED Viewed

Binary file (11.9 kB). View file

src/__pycache__/eda.cpython-313.pyc ADDED Viewed

Binary file (25 kB). View file

src/__pycache__/evaluation.cpython-310.pyc ADDED Viewed

Binary file (5.54 kB). View file

src/__pycache__/feature_engineering.cpython-310.pyc ADDED Viewed

Binary file (14.7 kB). View file

src/__pycache__/feature_engineering.cpython-313.pyc ADDED Viewed

Binary file (25.6 kB). View file

src/__pycache__/hybrid.cpython-310.pyc ADDED Viewed

Binary file (2.44 kB). View file

src/__pycache__/modeling.cpython-310.pyc ADDED Viewed

Binary file (8.28 kB). View file

src/__pycache__/modeling.cpython-313.pyc ADDED Viewed

Binary file (12.5 kB). View file

src/__pycache__/preprocessing.cpython-310.pyc ADDED Viewed

Binary file (6.93 kB). View file

src/__pycache__/preprocessing.cpython-313.pyc ADDED Viewed

Binary file (13.2 kB). View file

src/eda.py ADDED Viewed

	@@ -0,0 +1,327 @@

+import matplotlib.pyplot as plt
+import seaborn as sns
+import os
+import pandas as pd
+from wordcloud import WordCloud, STOPWORDS
+import plotly.graph_objs as go
+import plotly.io as pio
+import pycountry
+class EDA:
+    def __init__(self, dfs):
+        self.df = dfs["df"]
+        self.credits_df = dfs["credits_df"]
+        self.keywords_df = dfs["keywords_df"]
+        self.links_df = dfs["links_df"]
+        self.ratings_df = dfs["ratings_df"]
+        self.merged_df = dfs["merged_df"]
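+        # Resolve report/images relative to the project root instead of a machine-specific absolute path
+        self.img_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "report", "images")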
+        os.makedirs(self.img_path, exist_ok=True)
+    def plot_rating_distribution(self):
+        plt.figure(figsize=(10, 6))
+        sns.histplot(self.merged_df['rating'], bins=10, kde=False)
+        plt.title('Distribution of Movie Ratings')
+        plt.xlabel('Rating')
+        plt.ylabel('Frequency')
+        plt.savefig(os.path.join(self.img_path, "rating_distribution.png"), bbox_inches='tight')
+        plt.close()
+    def plot_release_year_distribution(self):
+        df = self.merged_df.copy()
+        df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')
+        df['release_year'] = df['release_date'].dt.year
+        plt.figure(figsize=(12, 6))
+        sns.histplot(df['release_year'].dropna(), bins=50, kde=False)
+        plt.title('Distribution of Movie Release Years')
+        plt.xlabel('Release Year')
+        plt.ylabel('Number of Movies')
+        plt.savefig(os.path.join(self.img_path, "release_year_distribution.png"), bbox_inches='tight')
+        plt.close()
+    def plot_budget_vs_revenue(self):
+        plt.figure(figsize=(10, 6))
+        sns.scatterplot(data=self.merged_df, x='budget', y='revenue')
+        plt.title('Relationship between Movie Budget and Revenue')
+        plt.xlabel('Budget')
+        plt.ylabel('Revenue')
+        plt.savefig(os.path.join(self.img_path, "budget_vs_revenue.png"), bbox_inches='tight')
+        plt.close()
+        # Convert 'budget' and 'revenue' to numeric, coercing errors to NaN
+        self.merged_df['budget'] = pd.to_numeric(self.merged_df['budget'], errors='coerce')
+        self.merged_df['revenue'] = pd.to_numeric(self.merged_df['revenue'], errors='coerce')
+        # Fill NaN values in 'budget' and 'revenue' with 0 so missing values are caught by the zero filter below
+        self.merged_df['budget'] = self.merged_df['budget'].fillna(0)
+        self.merged_df['revenue'] = self.merged_df['revenue'].fillna(0)
+        # Filter out movies with zero budget AND zero revenue
+        filtered_df = self.merged_df[(self.merged_df['budget'] > 0) | (self.merged_df['revenue'] > 0)].copy()
+        plt.figure(figsize=(10, 6))
+        sns.scatterplot(data=filtered_df, x='budget', y='revenue')
+        plt.title('Relationship between Movie Budget and Revenue (Filtered)')
+        plt.xlabel('Budget')
+        plt.ylabel('Revenue')
+        plt.savefig(os.path.join(self.img_path, "budget_vs_revenue_filtered.png"), bbox_inches='tight')
+        plt.close()
+    def plot_genre_counts(self):
+        genre_counts = {}
+        for genres_list in self.df['genres'].dropna():
+            if isinstance(genres_list, str):
+                genres = [genre.strip() for genre in genres_list.split(',')]
+                for genre in genres:
+                    if genre:
+                        genre_counts[genre] = genre_counts.get(genre, 0) + 1
+        top_n = 15
+        top_genres = pd.Series(genre_counts).sort_values(ascending=False).head(top_n)
+        plt.figure(figsize=(12, 8))
+        sns.barplot(x=top_genres.index, y=top_genres.values, palette='viridis')
+        plt.title('Top Movie Genres by Frequency')
+        plt.xlabel('Genre')
+        plt.ylabel('Frequency')
+        plt.xticks(rotation=45, ha='right')
+        plt.tight_layout()
+        plt.savefig(os.path.join(self.img_path, "top_genres.png"), bbox_inches='tight')
+        plt.close()
+    def plot_popularity_distribution(self):
+        plt.figure(figsize=(10, 6))
+        sns.histplot(self.merged_df['popularity'], bins=50, kde=False)
+        plt.title('Distribution of Movie Popularity')
+        plt.xlabel('Popularity')
+        plt.ylabel('Frequency')
+        plt.savefig(os.path.join(self.img_path, "popularity_distribution.png"), bbox_inches='tight')
+        plt.close()
+        filtered_popularity_df = self.merged_df[self.merged_df['popularity'] < 100].copy()
+        plt.figure(figsize=(10, 6))
+        sns.histplot(filtered_popularity_df['popularity'], bins=50, kde=False)
+        plt.title('Distribution of Movie Popularity (Popularity < 100)')
+        plt.xlabel('Popularity')
+        plt.ylabel('Frequency')
+        plt.savefig(os.path.join(self.img_path, "popularity_distribution_lt100.png"), bbox_inches='tight')
+        plt.close()
+        filtered_popularity_df_low = self.merged_df[self.merged_df['popularity'] < 10].copy()
+        plt.figure(figsize=(10, 6))
+        sns.histplot(filtered_popularity_df_low['popularity'], bins=50, kde=False)
+        plt.title('Distribution of Movie Popularity (Popularity < 10)')
+        plt.xlabel('Popularity')
+        plt.ylabel('Frequency')
+        plt.savefig(os.path.join(self.img_path, "popularity_distribution_lt10.png"), bbox_inches='tight')
+        plt.close()
+    def plot_runtime_distribution(self):
+        plt.figure(figsize=(10, 6))
+        sns.histplot(self.merged_df['runtime'].dropna(), bins=50, kde=False)
+        plt.title('Distribution of Movie Runtimes')
+        plt.xlabel('Runtime (minutes)')
+        plt.ylabel('Frequency')
+        plt.savefig(os.path.join(self.img_path, "runtime_distribution.png"), bbox_inches='tight')
+        plt.close()
+    def plot_production_company_counts(self):
+        company_counts = {}
+        for companies_list in self.merged_df['production_companies'].dropna():
+            if isinstance(companies_list, str):
+                companies = [company.strip() for company in companies_list.split(',')]
+                for company in companies:
+                    if company and company != 'Unknown':
+                        company_counts[company] = company_counts.get(company, 0) + 1
+        top_n_companies = 15
+        top_companies = pd.Series(company_counts).sort_values(ascending=False).head(top_n_companies)
+        plt.figure(figsize=(14, 8))
+        sns.barplot(x=top_companies.index, y=top_companies.values, palette='viridis')
+        plt.title(f'Top {top_n_companies} Production Companies')
+        plt.xlabel('Production Company')
+        plt.ylabel('Frequency')
+        plt.xticks(rotation=45, ha='right')
+        plt.tight_layout()
+        plt.savefig(os.path.join(self.img_path, "top_production_companies.png"), bbox_inches='tight')
+        plt.close()
+    def plot_production_country_counts(self):
+        country_counts = {}
+        for countries_list in self.merged_df['production_countries'].dropna():
+            if isinstance(countries_list, str):
+                countries = [country.strip() for country in countries_list.split(',')]
+                for country in countries:
+                    if country and country != 'Unknown':
+                        country_counts[country] = country_counts.get(country, 0) + 1
+        top_n_countries = 15
+        top_countries = pd.Series(country_counts).sort_values(ascending=False).head(top_n_countries)
+        plt.figure(figsize=(14, 8))
+        sns.barplot(x=top_countries.index, y=top_countries.values, palette='magma')
+        plt.title(f'Top {top_n_countries} Production Countries')
+        plt.xlabel('Production Country')
+        plt.ylabel('Frequency')
+        plt.xticks(rotation=45, ha='right')
+        plt.tight_layout()
+        plt.savefig(os.path.join(self.img_path, "top_production_countries.png"), bbox_inches='tight')
+        plt.close()
+    def plot_language_counts(self):
+        language_counts = {}
+        for languages_list in self.merged_df['spoken_languages'].dropna():
+            if isinstance(languages_list, str):
+                languages = [lang.strip() for lang in languages_list.split(',')]
+                for lang in languages:
+                    if lang and lang != 'Unknown':
+                        language_counts[lang] = language_counts.get(lang, 0) + 1
+        language_counts_series = pd.Series(language_counts).sort_values(ascending=False)
+        top_languages = language_counts_series.head(15)
+        plt.figure(figsize=(12, 8))
+        sns.barplot(x=top_languages.index, y=top_languages.values, palette='viridis')
+        plt.title('Top 15 Spoken Languages')
+        plt.xlabel('Language')
+        plt.ylabel('Frequency')
+        plt.xticks(rotation=45, ha='right')
+        plt.tight_layout()
+        plt.savefig(os.path.join(self.img_path, "top_languages.png"), bbox_inches='tight')
+        plt.close()
+    def plot_vote_count_distribution(self):
+        plt.figure(figsize=(10, 6))
+        sns.histplot(self.merged_df['vote_count'], bins=50, kde=False)
+        plt.title('Distribution of Movie Vote Counts')
+        plt.xlabel('Vote Count')
+        plt.ylabel('Frequency')
+        plt.savefig(os.path.join(self.img_path, "vote_count_distribution.png"), bbox_inches='tight')
+        plt.close()
+    def plot_vote_average_distribution(self):
+        plt.figure(figsize=(10, 6))
+        sns.histplot(self.merged_df['vote_average'], bins=20, kde=False)
+        plt.title('Distribution of Movie Vote Averages')
+        plt.xlabel('Vote Average')
+        plt.ylabel('Frequency')
+        plt.savefig(os.path.join(self.img_path, "vote_average_distribution.png"), bbox_inches='tight')
+        plt.close()
+    def plot_vote_count_vs_average(self):
+        plt.figure(figsize=(10, 6))
+        sns.scatterplot(data=self.merged_df, x='vote_count', y='vote_average')
+        plt.title('Relationship between Vote Count and Vote Average')
+        plt.xlabel('Vote Count')
+        plt.ylabel('Vote Average')
+        plt.savefig(os.path.join(self.img_path, "vote_count_vs_average.png"), bbox_inches='tight')
+        plt.close()
+    def plot_wordclouds(self):
+        copy = self.df.copy()
+        copy['title'] = copy['title'].astype('str')
+        copy['overview'] = copy['overview'].astype('str')
+        title_corpus = ' '.join(copy['title'])
+        overview_corpus = ' '.join(copy['overview'])
+        title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(title_corpus)
+        plt.figure(figsize=(16,8))
+        plt.imshow(title_wordcloud)
+        plt.axis('off')
+        plt.tight_layout()
+        plt.savefig(os.path.join(self.img_path, "wordcloud_title.png"), bbox_inches='tight')
+        plt.close()
+        overview_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(overview_corpus)
+        plt.figure(figsize=(16,8))
+        plt.imshow(overview_wordcloud)
+        plt.axis('off')
+        plt.tight_layout()
+        plt.savefig(os.path.join(self.img_path, "wordcloud_overview.png"), bbox_inches='tight')
+        plt.close()
+    def plot_world_production_map(self):
+        copy = self.df.copy()
+        country_counts = copy['production_countries'].value_counts().reset_index()
+        country_counts.columns = ['country', 'num_movies']
+        country_counts = country_counts[country_counts['country'] != "United States of America"]
+        def get_iso3(country_name):
+            try:
+                return pycountry.countries.lookup(country_name).alpha_3
+            except LookupError:  # raised by pycountry when no country matches
+                return None
+        country_counts['iso_alpha'] = country_counts['country'].apply(get_iso3)
+        country_counts = country_counts.dropna(subset=['iso_alpha'])
+        data = [go.Choropleth(
+            locations = country_counts['iso_alpha'],
+            z = country_counts['num_movies'],
+            text = country_counts['country'],
+            colorscale = [[0,'rgb(255,255,255)'], [1,'rgb(255,0,0)']],
+            autocolorscale = False,
+            reversescale = False,
+            marker = dict(line = dict(color='rgb(180,180,180)', width=0.5)),
+            colorbar = dict(title='Production Countries')
+        )]
+        layout = dict(
+            title = 'Production Countries for the MovieLens Movies (Apart from US)',
+            geo = dict(
+                showframe = False,
+                showcoastlines = False,
+                projection = dict(type = 'mercator')
+            )
+        )
+        fig = go.Figure(data=data, layout=layout)
+        # Save as static image (requires kaleido)
+        try:
+            # Use plotly.io.write_image for better compatibility
+            pio.write_image(fig, os.path.join(self.img_path, "world_production_map.png"))
+        except Exception:
+            # As a fallback, save as HTML if static image export fails
+            try:
+                fig.write_html(os.path.join(self.img_path, "world_production_map.html"))
+            except Exception:
+                pass
+    def plot_decade_pie(self):
+        import plotly.express as px
+        copy = self.df.copy()
+        copy['release_date'] = pd.to_datetime(copy['release_date'], errors='coerce')
+        copy['decade'] = (copy['release_date'].dt.year // 10) * 10
+        decade_counts = copy['decade'].value_counts().sort_index().reset_index()
+        decade_counts.columns = ['decade', 'num_movies']
+        decade_counts['decade'] = decade_counts['decade'].astype(int).astype(str) + "s"
+        fig = px.pie(
+            decade_counts,
+            names='decade',
+            values='num_movies',
+            title="Movies Distribution by Decade (Release Date)",
+            color_discrete_sequence=px.colors.qualitative.Set3
+        )
+        # Save as static image (requires kaleido)
+        try:
+            # Use plotly.io.write_image for better compatibility
+            pio.write_image(fig, os.path.join(self.img_path, "movies_by_decade_pie.png"))
+        except Exception:
+            # As a fallback, save as HTML if static image export fails
+            try:
+                fig.write_html(os.path.join(self.img_path, "movies_by_decade_pie.html"))
+            except Exception:
+                pass
+    def run_all(self):
+        self.plot_rating_distribution()
+        self.plot_release_year_distribution()
+        self.plot_budget_vs_revenue()
+        self.plot_genre_counts()
+        self.plot_popularity_distribution()
+        self.plot_runtime_distribution()
+        self.plot_production_company_counts()
+        self.plot_production_country_counts()
+        self.plot_language_counts()
+        self.plot_vote_count_distribution()
+        self.plot_vote_average_distribution()
+        self.plot_vote_count_vs_average()
+        self.plot_wordclouds()
+        self.plot_world_production_map()
+        self.plot_decade_pie()
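
`plot_genre_counts`, `plot_production_company_counts`, `plot_production_country_counts`, and `plot_language_counts` each repeat the same loop: split a comma-separated string, strip the tokens, skip placeholders, and count. A minimal sketch of a shared helper built on `collections.Counter` follows. `count_tokens` is a hypothetical name, and the default `skip` tuple is an assumption (the genre method above skips only empty tokens, not `'Unknown'`).

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_tokens(
    values: Iterable,
    sep: str = ",",
    skip: Tuple[str, ...] = ("", "Unknown"),
    top_n: int = 15,
) -> List[Tuple[str, int]]:
    """Count tokens across separator-joined strings and return the top_n most common."""
    counts: Counter = Counter()
    for value in values:
        if not isinstance(value, str):
            continue  # ignores NaN/None, like the isinstance checks in the methods above
        for token in value.split(sep):
            token = token.strip()
            if token not in skip:
                counts[token] += 1
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Made-up genre strings, for illustration only.
    sample = ["Drama, Comedy", "Drama", None, "Unknown", "Comedy, Drama, "]
    print(count_tokens(sample))
```

The returned list of `(token, count)` pairs can be unpacked into the `x` and `y` arguments of a bar plot, so each plotting method would only need to set its title and labels.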

src/evaluation.py ADDED Viewed
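
`leave_one_out_by_timestamp` in this file holds out each user's most recent rating as the test set. The sketch below restates that split in plain Python on `(user_id, movie_id, rating, timestamp)` tuples, so the behaviour can be checked without pandas. `leave_last_out` and the `Rating` alias are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rating = Tuple[int, int, float, int]  # (user_id, movie_id, rating, timestamp)


def leave_last_out(ratings: List[Rating]) -> Tuple[List[Rating], List[Rating]]:
    """Split ratings so that each user's newest rating goes to test, the rest to train."""
    by_user: Dict[int, List[Rating]] = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train: List[Rating] = []
    test: List[Rating] = []
    for user_id in sorted(by_user):
        history = sorted(by_user[user_id], key=lambda r: r[3])
        # As in the pandas version, a user with a single rating contributes
        # that rating to test and nothing to train.
        train.extend(history[:-1])
        test.append(history[-1])
    return train, test


if __name__ == "__main__":
    # Made-up ratings, for illustration only.
    rows = [(1, 10, 4.0, 100), (1, 11, 3.0, 300), (1, 12, 5.0, 200), (2, 10, 2.0, 50)]
    print(leave_last_out(rows))
```

Users whose only rating lands in the test set have no training history, so collaborative models can only fall back to a default estimate for them; whether to keep such users in the evaluation is a design choice.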

	@@ -0,0 +1,121 @@

+import pandas as pd
+import numpy as np
+from collections import defaultdict
+from sklearn.metrics.pairwise import cosine_similarity
+def leave_one_out_by_timestamp(ratings_df):
+    ratings_df = ratings_df.sort_values(['userId', 'timestamp'])
+    train_idx, test_idx = [], []
+    for user, group in ratings_df.groupby('userId'):
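+        if len(group) > 1:
+            # Hold out the most recent rating and train on the rest.
+            test_idx.append(group.index[-1])
+            train_idx.extend(group.index[:-1])
+        else:
+            # A single-rating user has no history to learn from if held out, so keep it in train.
+            train_idx.extend(group.index)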
+    train = ratings_df.loc[train_idx]
+    test = ratings_df.loc[test_idx]
+    return train, test
+def precision_at_k(ranked_lists, k=10):
+    precisions = []
+    for uid, items in ranked_lists.items():
+        relevant = [r for _, _, r in items[:k] if r >= 4]
+        precisions.append(len(relevant) / k)
+    return np.mean(precisions)
+def recall_at_k(ranked_lists, test_truth, k=10):
+    recalls = []
+    truth = defaultdict(set)
+    # Accept both DataFrame and ndarray for test_truth
+    if isinstance(test_truth, pd.DataFrame):
+        for _, row in test_truth.iterrows():
+            uid, iid, r = row['userId'], row['movieId'], row['rating']
+            if r >= 4:
+                truth[uid].add(iid)
+    else:
+        for row in test_truth:
+            # row can be (uid, iid, r, ...) or (uid, iid, r)
+            uid, iid, r = row[:3]
+            if r >= 4:
+                truth[uid].add(iid)
+    for uid, items in ranked_lists.items():
+        recommended = {iid for iid, _, _ in items[:k]}
+        relevant = truth.get(uid, set())
+        if relevant:
+            recalls.append(len(recommended & relevant) / len(relevant))
+    return np.mean(recalls)
+def ndcg_at_k(ranked_lists, k=10):
+    ndcgs = []
+    for uid, items in ranked_lists.items():
+        dcg = 0.0
+        idcg = 0.0
+        rels = [1 if r >= 4 else 0 for _, _, r in items[:k]]
+        for i, rel in enumerate(rels):
+            dcg += (2**rel - 1) / np.log2(i + 2)
+        ideal_rels = sorted(rels, reverse=True)
+        for i, rel in enumerate(ideal_rels):
+            idcg += (2**rel - 1) / np.log2(i + 2)
+        if idcg > 0:
+            ndcgs.append(dcg / idcg)
+    return np.mean(ndcgs)
+def catalog_coverage(ranked_lists, all_items):
+    recommended = {iid for items in ranked_lists.values() for iid, _, _ in items}
+    return len(recommended) / len(all_items)
+def novelty(ranked_lists, item_popularity):
+    novelties = []
+    total = sum(item_popularity.values())
+    for items in ranked_lists.values():
+        for iid, _, _ in items:
+            p = item_popularity.get(iid, 1) / total
+            novelties.append(-np.log2(p + 1e-9))
+    return np.mean(novelties)
+def intra_list_diversity(ranked_lists, item_features):
+    diversities = []
+    for items in ranked_lists.values():
+        iids = [iid for iid, _, _ in items]
+        feats = [item_features[iid] for iid in iids if iid in item_features]
+        if len(feats) > 1:
+            sims = cosine_similarity(feats)
+            upper = sims[np.triu_indices_from(sims, k=1)]
+            diversities.append(1 - np.mean(upper))
+    return np.mean(diversities)
+def predictions_to_ranked_lists(predictions, k=20):
+    user_items = defaultdict(list)
+    for uid, iid, true_r, est, _ in predictions:
+        user_items[uid].append((iid, est, true_r))
+    ranked = {}
+    for uid, items in user_items.items():
+        ranked[uid] = sorted(items, key=lambda x: x[1], reverse=True)[:k]
+    return ranked
+def evaluate_all(predictions, testset, all_items, item_popularity, item_features, k_list=(10, 20)):
+    ranked_lists = predictions_to_ranked_lists(predictions, k=max(k_list))
+    results = {}
+    for k in k_list:
+        results[f'Precision@{k}'] = precision_at_k(ranked_lists, k)
+        results[f'Recall@{k}'] = recall_at_k(ranked_lists, testset, k)
+        results[f'NDCG@{k}'] = ndcg_at_k(ranked_lists, k)
+    results['Coverage'] = catalog_coverage(ranked_lists, all_items)
+    results['Novelty'] = novelty(ranked_lists, item_popularity)
+    results['Diversity'] = intra_list_diversity(ranked_lists, item_features)
+    return results
+def summarize_results(results_dict):
+    return pd.DataFrame(results_dict).T
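+def bootstrap_metric(metric_func, predictions, testset, all_items, item_popularity, item_features, n_bootstrap=100, k=10):
+    # Rank once per user; each bootstrap round then only resamples users.
+    # testset, all_items, item_popularity and item_features are not used here,
+    # so metric_func must accept (ranked_lists, k), e.g. precision_at_k or ndcg_at_k.
+    per_user = predictions_to_ranked_lists(predictions, k)
+    uids = list(per_user)
+    scores = []
+    for _ in range(n_bootstrap):
+        sampled = np.random.choice(len(uids), size=len(uids), replace=True)
+        # Key by draw position so a user drawn twice is counted twice.
+        ranked_lists = {pos: per_user[uids[j]] for pos, j in enumerate(sampled)}
+        scores.append(metric_func(ranked_lists, k))
+    return np.percentile(scores, [2.5, 97.5])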
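
Note on `ndcg_at_k`: the metric functions above expect each ranked list to hold `(iid, est, true_r)` tuples sorted by estimated score, and they treat a true rating of 4 or more as relevant. The sketch below reproduces the same binary-relevance NDCG for a single list so the arithmetic can be checked by hand. `binary_ndcg` is a hypothetical standalone helper, not part of this commit.

```python
import numpy as np

def binary_ndcg(true_ratings, threshold=4):
    """NDCG for one ranked list of true ratings, given in recommended order.

    Relevance is 1 when the rating is >= threshold, else 0. With binary
    relevance the gain 2**rel - 1 used in the diff reduces to rel itself.
    Returns None when the list holds no relevant item (IDCG is zero).
    """
    rels = [1 if r >= threshold else 0 for r in true_ratings]
    discounts = [1.0 / np.log2(i + 2) for i in range(len(rels))]
    dcg = sum(rel * d for rel, d in zip(rels, discounts))
    idcg = sum(rel * d for rel, d in zip(sorted(rels, reverse=True), discounts))
    return dcg / idcg if idcg > 0 else None

# Relevant items sit at ranks 2 and 3 instead of ranks 1 and 2.
print(round(binary_ndcg([3, 5, 4]), 4))
```

As in the diff, the ideal ordering is built from the retrieved list only, so a list that ranks its own relevant items first scores 1.0 even if other relevant items were never retrieved.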

src/feature_engineering.py ADDED

	@@ -0,0 +1,224 @@

+import pandas as pd
+import numpy as np
+from sklearn.feature_extraction.text import TfidfVectorizer
+import os
+from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
+from sklearn.decomposition import TruncatedSVD
+class FeatureEngineering:
+    def __init__(self, dfs, interim_path="D:/Uni/Term 6/Machine Learning/HomeWork/6/data/interim/"):
+        self.merged_df = dfs["merged_df"]
+        self.ratings_df = dfs["ratings_df"]
+        self.interim_path = interim_path
+        os.makedirs(self.interim_path, exist_ok=True)
+    def ordering(self):
+        self.merged_df = self.merged_df.drop(columns=['id', 'tmdbId', 'imdbId', 'imdb_id', 'original_title', 'video'])
+        desired_column_order = [
+            'movieId',
+            'title',
+            'release_date',
+            'runtime',
+            'status',
+            'adult',
+            'budget',
+            'revenue',
+            'popularity',
+            'vote_average',
+            'vote_count',
+            'overview',
+            'genres',
+            'keywords',
+            'cast',
+            'crew',
+            'production_companies',
+            'production_countries',
+            'original_language',
+            'userId',
+            'rating',
+        ]
+        self.merged_df = self.merged_df.reindex(columns=desired_column_order)
+    def outliers(self):
+        self.merged_df['budget'] = pd.to_numeric(self.merged_df['budget'], errors='coerce').fillna(0)
+        self.merged_df['revenue'] = pd.to_numeric(self.merged_df['revenue'], errors='coerce').fillna(0)
+        self.merged_df = self.merged_df[self.merged_df['runtime'] > 0]
+        self.merged_df = self.merged_df[self.merged_df['budget'] >= 0]
+        self.merged_df = self.merged_df[self.merged_df['revenue'] >= 0]
+        for col in ['budget', 'revenue']:
+            upper = self.merged_df[col].quantile(0.995)
+            self.merged_df = self.merged_df[self.merged_df[col] <= upper]
+    def add_budget_to_revenue_ratio(self):
+        self.merged_df['budget'] = pd.to_numeric(self.merged_df['budget'], errors='coerce').fillna(0)
+        self.merged_df['revenue'] = pd.to_numeric(self.merged_df['revenue'], errors='coerce').fillna(0)
+        self.merged_df['budget_to_revenue_ratio'] = self.merged_df.apply(
+            lambda row: row['budget'] / row['revenue'] if row['revenue'] > 0 else 0, axis=1
+        )
+    def add_top_genre_onehot(self, top_n=5):
+        genre_dummies = self.merged_df['genres'].str.get_dummies(sep=', ')
+        top_genres = genre_dummies.sum().sort_values(ascending=False).head(top_n).index
+        for genre in top_genres:
+            self.merged_df[f"genre_{genre}"] = genre_dummies[genre]
+    def add_log_features(self):
+        for col in ['budget', 'revenue', 'popularity', 'vote_count']:
+            self.merged_df[f'log_{col}'] = np.log1p(self.merged_df[col])
+    def add_interaction_features(self):
+        self.merged_df['budget_x_popularity'] = self.merged_df['budget'] * self.merged_df['popularity']
+        self.merged_df['budget_x_vote_count'] = self.merged_df['budget'] * self.merged_df['vote_count']
+    def add_count_features(self):
+        self.merged_df['num_genres'] = self.merged_df['genres'].fillna('').apply(lambda x: len([g for g in x.split(',') if g.strip()]))
+        self.merged_df['num_keywords'] = self.merged_df['keywords'].fillna('').apply(lambda x: len([k for k in x.split(',') if k.strip()]))
+        self.merged_df['num_cast'] = self.merged_df['cast'].fillna('').apply(lambda x: len([c for c in x.split(',') if c.strip()]))
+        self.merged_df['num_crew'] = self.merged_df['crew'].fillna('').apply(lambda x: len([c for c in x.split(',') if c.strip()]))
+    def add_text_length_features(self):
+        self.merged_df['overview_length'] = self.merged_df['overview'].fillna('').apply(len)
+        self.merged_df['title_length'] = self.merged_df['title'].fillna('').apply(len)
+    def add_genre_mean_encoding(self):
+        genre_ratings = {}
+        for genre in self.merged_df['genres'].str.split(',').explode().str.strip().unique():
+            if genre and genre != 'Unknown':
+                mask = self.merged_df['genres'].str.contains(rf'\b{genre}\b', regex=True)
+                genre_ratings[genre] = self.merged_df.loc[mask, 'vote_average'].mean()
+        for genre in list(genre_ratings.keys())[:10]:
+            self.merged_df[f'genre_{genre}_mean_vote'] = self.merged_df['genres'].apply(
+                lambda x: genre_ratings[genre] if genre in x else np.nan
+            )
+    def add_release_date_features(self):
+        self.merged_df['release_date'] = pd.to_datetime(self.merged_df['release_date'], errors='coerce')
+        self.merged_df['release_year'] = self.merged_df['release_date'].dt.year
+        self.merged_df.drop(columns=['release_date'], inplace=True)
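+    def add_adult_flag(self):
+        if 'adult' in self.merged_df.columns:
+            # Cast to str first so both boolean and string 'True'/'False' values map correctly.
+            self.merged_df['is_adult'] = self.merged_df['adult'].astype(str).map({'True': 1, 'False': 0})
+            # Drop only when the column exists; an unconditional drop raises KeyError otherwise.
+            self.merged_df.drop(columns=['adult'], inplace=True)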
+    def add_multi_hot_keywords(self, top_n=20):
+        keywords_split = self.merged_df['keywords'].fillna('').apply(lambda x: [k.strip() for k in x.split(',') if k.strip()])
+        mlb = MultiLabelBinarizer()
+        top_keywords = pd.Series([k for sublist in keywords_split for k in sublist]).value_counts().head(top_n).index
+        keywords_filtered = keywords_split.apply(lambda x: [k for k in x if k in top_keywords])
+        keyword_dummies = pd.DataFrame(mlb.fit_transform(keywords_filtered), columns=[f'kw_{k}' for k in mlb.classes_], index=self.merged_df.index)
+        self.merged_df = pd.concat([self.merged_df, keyword_dummies], axis=1)
+    def add_cast_crew_features(self, top_n_cast=5, top_n_crew=5):
+        cast_split = self.merged_df['cast'].fillna('').apply(lambda x: [c.strip() for c in x.split(',') if c.strip()])
+        crew_split = self.merged_df['crew'].fillna('').apply(lambda x: [c.strip() for c in x.split(',') if c.strip()])
+        mlb_cast = MultiLabelBinarizer()
+        mlb_crew = MultiLabelBinarizer()
+        top_cast = pd.Series([c for sublist in cast_split for c in sublist]).value_counts().head(top_n_cast).index
+        top_crew = pd.Series([c for sublist in crew_split for c in sublist]).value_counts().head(top_n_crew).index
+        cast_filtered = cast_split.apply(lambda x: [c for c in x if c in top_cast])
+        crew_filtered = crew_split.apply(lambda x: [c for c in x if c in top_crew])
+        cast_dummies = pd.DataFrame(mlb_cast.fit_transform(cast_filtered), columns=[f'cast_{c}' for c in mlb_cast.classes_], index=self.merged_df.index)
+        crew_dummies = pd.DataFrame(mlb_crew.fit_transform(crew_filtered), columns=[f'crew_{c}' for c in mlb_crew.classes_], index=self.merged_df.index)
+        self.merged_df = pd.concat([self.merged_df, cast_dummies, crew_dummies], axis=1)
+    def add_company_country_features(self, top_n_company=5, top_n_country=5):
+        company_split = self.merged_df['production_companies'].fillna('').apply(lambda x: [c.strip() for c in x.split(',') if c.strip()])
+        country_split = self.merged_df['production_countries'].fillna('').apply(lambda x: [c.strip() for c in x.split(',') if c.strip()])
+        mlb_company = MultiLabelBinarizer()
+        mlb_country = MultiLabelBinarizer()
+        top_company = pd.Series([c for sublist in company_split for c in sublist]).value_counts().head(top_n_company).index
+        top_country = pd.Series([c for sublist in country_split for c in sublist]).value_counts().head(top_n_country).index
+        company_filtered = company_split.apply(lambda x: [c for c in x if c in top_company])
+        country_filtered = country_split.apply(lambda x: [c for c in x if c in top_country])
+        company_dummies = pd.DataFrame(mlb_company.fit_transform(company_filtered), columns=[f'company_{c}' for c in mlb_company.classes_], index=self.merged_df.index)
+        country_dummies = pd.DataFrame(mlb_country.fit_transform(country_filtered), columns=[f'country_{c}' for c in mlb_country.classes_], index=self.merged_df.index)
+        self.merged_df = pd.concat([self.merged_df, company_dummies, country_dummies], axis=1)
+    def add_target_encoding(self, col, target='vote_average', top_n=10):
+        split_values = self.merged_df[col].fillna('').apply(lambda x: [i.strip() for i in x.split(',') if i.strip()])
+        values = pd.Series([v for sublist in split_values for v in sublist])
+        top_values = values.value_counts().head(top_n).index
+        for v in top_values:
+            # Exact token membership instead of a regex: a name such as "Warner Bros." ends in a
+            # non-word character, so an unescaped \b...\b pattern never matches it and the mean becomes NaN.
+            mask = split_values.apply(lambda items: v in items)
+            mean_val = self.merged_df.loc[mask, target].mean()
+            self.merged_df[f'{col}_{v}_mean_{target}'] = mask.astype(int) * mean_val
+    def coding(self):
+        self.add_target_encoding(col='genres')
+        self.add_target_encoding(col='production_companies')
+    def Tfidf(self):
+        tfidf_overview_vectorizer = TfidfVectorizer(max_features=2100, stop_words='english')
+        tfidf_overview_matrix = tfidf_overview_vectorizer.fit_transform(self.merged_df['overview'].fillna(''))
+        self.tfidf_overview_df = pd.DataFrame(tfidf_overview_matrix.toarray(), columns=[f'overview_tfidf_{col}' for col in tfidf_overview_vectorizer.get_feature_names_out()], index=self.merged_df.index)
+    def merging_Tfidf(self):
+        # Combine the original dataframe with the TF-IDF features
+        self.merged_df_with_tfidf = pd.concat([self.merged_df, self.tfidf_overview_df], axis=1)
+    def presvd(self):
+        columns_for_svd = self.merged_df_with_tfidf.select_dtypes(include=np.number).columns.tolist()
+        columns_for_svd = [col for col in columns_for_svd if col not in ['rating', 'movieId', 'userId', 'timestamp', 'release_year']] # Exclude non-feature columns and year
+        for col in columns_for_svd:
+            if self.merged_df_with_tfidf[col].isnull().any():
+                median_val = self.merged_df_with_tfidf[col].median()
+                self.merged_df_with_tfidf[col] = self.merged_df_with_tfidf[col].fillna(median_val)
+        if 'production_companies_Warner Bros._mean_vote_average' in self.merged_df_with_tfidf.columns:
+            self.merged_df_with_tfidf['production_companies_Warner Bros._mean_vote_average'] = self.merged_df_with_tfidf['production_companies_Warner Bros._mean_vote_average'].fillna(0)
+    def svd(self):
+        unique_movies_df = self.merged_df_with_tfidf.groupby('movieId').first().reset_index()
+        columns_for_svd_unique = unique_movies_df.select_dtypes(include=np.number).columns.tolist()
+        columns_for_svd_unique = [col for col in columns_for_svd_unique if col not in ['rating', 'movieId', 'userId', 'timestamp', 'release_year', 'vote_average', 'vote_count']]
+        # Fill NaNs with median for all SVD columns
+        for col in columns_for_svd_unique:
+            if unique_movies_df[col].isnull().any():
+                median_val = unique_movies_df[col].median()
+                unique_movies_df[col] = unique_movies_df[col].fillna(median_val)
+        # Extra: fill any remaining NaNs with 0 (safety for SVD)
+        unique_movies_df[columns_for_svd_unique] = unique_movies_df[columns_for_svd_unique].fillna(0)
+        if 'production_companies_Warner Bros._mean_vote_average' in unique_movies_df.columns:
+            unique_movies_df['production_companies_Warner Bros._mean_vote_average'] = unique_movies_df['production_companies_Warner Bros._mean_vote_average'].fillna(0)
+        n_components = 150
+        svd = TruncatedSVD(n_components=n_components, random_state=42)
+        svd_matrix_unique = svd.fit_transform(unique_movies_df[columns_for_svd_unique])
+        svd_df_unique = pd.DataFrame(svd_matrix_unique, columns=[f'svd_{i+1}' for i in range(n_components)], index=unique_movies_df.index)
+        columns_to_drop_after_svd_unique = [col for col in columns_for_svd_unique if col not in ['vote_average', 'vote_count']]
+        self.unique_movies_reduced = unique_movies_df.drop(columns=columns_to_drop_after_svd_unique).copy()
+        self.unique_movies_reduced = pd.concat([self.unique_movies_reduced, svd_df_unique], axis=1)
+    def run_all(self):
+        self.ordering()
+        self.outliers()
+        self.add_budget_to_revenue_ratio()
+        self.add_top_genre_onehot()
+        self.add_log_features()
+        self.add_interaction_features()
+        self.add_count_features()
+        self.add_text_length_features()
+        self.add_genre_mean_encoding()
+        self.add_release_date_features()
+        self.add_adult_flag()
+        self.add_multi_hot_keywords()
+        self.add_cast_crew_features()
+        self.add_company_country_features()
+        self.coding()
+        self.Tfidf()
+        self.merging_Tfidf()
+        self.presvd()
+        self.svd()
+        return {
+            "merged_df": self.merged_df,
+            "merged_df_with_tfidf": self.merged_df_with_tfidf,
+            "unique_movies_reduced": self.unique_movies_reduced
+        }
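
Note on name matching: the word-boundary pattern `rf'\b{genre}\b'` in `add_genre_mean_encoding` works for plain genre names, but the same idea breaks for names that end in punctuation. The sketch below shows the difference on made-up sample data (the three company strings are assumptions for illustration, not rows from the dataset).

```python
import pandas as pd

# Made-up sample rows; comma-separated lists as in the merged frame.
companies = pd.Series(["Warner Bros., Legendary", "Paramount Pictures", "Warner Bros."])
name = "Warner Bros."

# Regex with word boundaries: the trailing "." is not a word character, so the
# closing \b can never match right after it, and every row comes back False.
regex_mask = companies.str.contains(rf"\b{name}\b", regex=True)

# Exact token membership: split on commas, strip, then test for the full name.
tokens = companies.apply(lambda s: [t.strip() for t in s.split(",") if t.strip()])
token_mask = tokens.apply(lambda items: name in items)

print(regex_mask.tolist())
print(token_mask.tolist())
```

An all-False mask yields a NaN mean, which is one plausible reason a column like `production_companies_Warner Bros._mean_vote_average` would need a separate `fillna(0)`.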