Matan Kriel committed on
Commit 879ede5 · 1 Parent(s): 450331a

updated data generation pipeline
README.md CHANGED
@@ -1,114 +1,78 @@
  ---
- title: Social Media Virality
- emoji: 📺
- colorFrom: green
- colorTo: yellow
  sdk: gradio
  sdk_version: 5.9.0
  app_file: app.py
  pinned: false
  ---
- # Social Media Virality Prediction & Optimization Project

- **Course**: Data Science & Machine Learning Applications
- **Project**: Viral Content Assistant
-
- ## 1. Project Overview
- This project aims to develop a data-driven system capable of predicting the viral potential of short-form video content (e.g., TikTok) and optimizing it using Generative AI. By leveraging Natural Language Processing (NLP) and Machine Learning (ML), the system analyzes video descriptions and metadata to forecast view counts and prescribes actionable improvements to maximize engagement.
-
- The core solution consists of a machine learning pipeline for virality prediction and a web application (Gradio) for real-time user interaction.
-
- ## 2. Data Science Methodology
-
- ### 2.1 Data Acquisition & Generation
- Due to privacy restrictions and API limitations of social platforms, we simulated a realistic dataset reflecting 2025 social media trends.
- * **Source**: Synthetic generation using the `Faker` library and `numpy` probabilistic distributions.
- * **Volume**: 10,000 samples.
- * **Features**:
-   * **Textual**: Video descriptions rich in slang (e.g., "Skibidi", "Girl Dinner"), hashtags, and emojis.
-   * **Temporal**: Upload hour, day of week.
-   * **Meta**: Video duration, category (Gaming, Beauty, etc.).
- * **Target Variable**: `views` (log-normally distributed to mimic real-world viral discrepancies).
-
- ### 2.2 Exploratory Data Analysis (EDA)
- We analyzed the distribution of the target variable and feature correlations.
- * **Observation**: View counts follow a power-law distribution; most videos have few views, while a few "viral hits" capture the majority.
- * **Preprocessing**: We applied a log transformation (`np.log1p`) to the `views` variable to normalize the distribution for regression models.
-
- ![Views Distribution](project_plots/eda_distribution.png)
- *Figure 1: Distribution of Log-Transformed View Counts.*
-
- ### 2.3 Feature Engineering
- * **Text Embeddings**: We used **TF-IDF Vectorization** (top 2,000 features) to convert unstructured text descriptions into numerical vectors.
- * **Meta Features**: Encoded `is_weekend`, `hour_of_day`, and `video_duration_sec`.
- * **Data Splitting**: A **temporal split** (80/20) was used instead of a random split to prevent data leakage, ensuring the model predicts future videos based on past trends.
-
- ## 3. Model Development & Evaluation
-
- We evaluated three distinct algorithms to solve the regression problem (predicting log-views):
-
- 1. **Linear Regression**: Baseline model for interpretability.
- 2. **Random Forest Regressor**: Ensemble method to capture non-linear relationships.
- 3. **XGBoost Regressor**: Gradient boosting machine known for state-of-the-art tabular performance.
-
- ### 3.1 Comparative Metrics
- Models were assessed using:
- * **RMSE (Root Mean Squared Error)**: The primary metric for regression accuracy.
- * **R² (Coefficient of Determination)**: Explains the variance captured by the model.
- * **F1-Score**: Used as a proxy for classification performance (predicting whether a video hits the "viral threshold", i.e., the top 20%).
-
- ![Model Leaderboard](project_plots/model_leaderboard.png)
- *Figure 2: Performance comparison across different architectures.*
-
- ### 3.2 Results
- The **XGBoost Regressor** outperformed the other models, achieving the lowest RMSE on the test set. This model was selected for the final deployment.
-
- ## 4. Advanced Analysis: Embeddings & Semantic Search
-
- Beyond simple regression, we implemented a semantic search engine using **SentenceTransformers** (`all-MiniLM-L6-v2`).
- * **Purpose**: To retrieve historical viral hits conceptually similar to the user's new idea.
- * **Clustering**: We visualized the semantic space using PCA (Principal Component Analysis).
-
- ![Embedding Clusters](project_plots/embedding_clusters.png)
- *Figure 3: Semantic clustering of video descriptions.*
-
- ## 5. Application & Deployment
-
- The final deliverable is an interactive web application built with **Gradio**.
-
- ### 5.1 System Architecture
- The system is decoupled into two main components:
- 1. **Training Pipeline (`model-prep.py`)**: Runs offline to generate synthetic data, train the XGBoost model, and create the vector database. It saves these artifacts (`viral_model.json`, `tfidf_vectorizer.pkl`, `tiktok_knowledge_base.parquet`).
- 2. **Inference App (`app.py`)**: A lightweight Gradio app that loads the pre-trained artifacts to serve real-time predictions without needing to retrain.
-
- **Data Flow**:
- 1. **Input**: User-provided video description.
- 2. **Inference**: Loaded XGBoost model predicts view count.
- 3. **Retrieval**: App searches the pre-computed Parquet knowledge base for similar viral videos.
- 4. **Generative Optimization**: **Google Gemini 2.5 Flash Lite** rewrites the draft.
- 5. **Output**: Predictions, similar videos, and AI-optimized content.
-
- ### 5.2 Usage Instructions
-
- To run the project locally for assessment:
-
- 1. **Environment Setup**:
  ```bash
- python3 -m venv .venv
- source .venv/bin/activate
  pip install -r requirements.txt
  ```
- 2. **Configuration**:
- Ensure the `.env` file contains a valid `GEMINI_API_KEY`.
- 3. **Execution**:
  ```bash
  python app.py
  ```
- Access the UI at `http://localhost:7860`.
-
- ## 6. Conclusion
- This project demonstrates a complete end-to-end Data Science workflow: from synthetic data creation and rigorous model evaluation to the deployment of a user-facing AI application. The integration of predictive analytics (XGBoost) with generative AI (Gemini) provides a robust tool for content creators.
-
- ## 🏆 Credits
- * **Project Author:** Matan Kriel
- * **Project Author:** Odeya Shmuel
  ---
+ title: Social Media Virality Assistant
+ emoji: 🚀
+ colorFrom: indigo
+ colorTo: purple
  sdk: gradio
  sdk_version: 5.9.0
  app_file: app.py
  pinned: false
  ---

+ # 🚀 Social Media Virality Assistant
+
+ A Data Science project that uses **Large Language Models (LLMs)** and **Machine Learning** to predict and optimize social media content virality.
+
+ ## 🌟 Project Overview
+ This tool helps content creators go viral by:
+ 1. **Predicting Views**: Analyzing video descriptions to forecast performance.
+ 2. **Optimizing Content**: Using **Google Gemini AI** to rewrite drafts with viral hooks (slang, hashtags).
+ 3. **Learning from History**: Retrieving similar successful videos using **Semantic Search**.
+
+ ## 🧠 Data Science Methodology
+
+ ### 1. Synthetic Data Generation (LLM-Based)
+ Since real-world TikTok data is private, we simulated a "viral environment":
+ * **Generator**: Used `tiiuae/falcon-rw-1b` (via `transformers`) to generate **10,000 realistic video descriptions**.
+ * **Diversity**: Prompted the LLM with various scenarios ("POV", "GRWM", "Storytime") to ensure distinct content clusters.
+ * **Ground-Truth Logic**: Developed a scoring function that assigns views based on linguistic patterns (e.g., questions, emotional triggers) and metadata (time of day, duration), creating a learnable signal for the ML models.
+
+ ### 2. Model Development & Comparison
+ We treated this as a **regression problem** (predicting log-views) and compared three algorithms to find the best predictor:
+ * **Linear Regression**: Baseline model.
+ * **Random Forest**: Good for non-linear interactions.
+ * **XGBoost (Winner)**: Gradient boosting provided the best accuracy (lowest RMSE).
+
+ **Metrics Used:**
+ * **RMSE (Root Mean Squared Error)**: Primary metric for model selection.
+ * **MAE (Mean Absolute Error)**: Average view-count error.
+ * **MAPE (Mean Absolute Percentage Error)**: Average percentage error.
+
+ ### 3. Advanced Analysis (Plots)
+ * **`diversity_plot.png`**: A PCA visualization showing the semantic spread of the 10,000 generated descriptions.
+ * **`model_comparison.png`**: Bar chart comparing RMSE across models and ROC curves for viral-classification validity.
+ * **`feature_importance.png`**: The top 20 words and metadata features that drive virality in our simulated world.
+
+ ## 🛠️ Tech Stack
+ * **Core**: Python, Pandas, NumPy, Scikit-Learn
+ * **AI/LLM**: `transformers` (Falcon-RW-1B), `google-generativeai` (Gemini 2.5)
+ * **ML**: XGBoost, Sentence-Transformers (embeddings)
+ * **App**: Gradio (web UI)
+ * **Hardware**: Optimized for Apple Silicon (MPS).
+
+ ## 📂 Project Structure
+ ```bash
+ ├── app.py                        # Inference App (Gradio)
+ ├── model-prep.py                 # Training Pipeline (Data Gen -> Train -> Save)
+ ├── requirements.txt              # Dependencies
+ ├── tiktok_knowledge_base.parquet # Semantic Search Index
+ ├── viral_model.pkl               # Trained ML Model (Pickle)
+ ├── tfidf_vectorizer.pkl          # Text Processor
+ └── project_plots/                # Generated Analysis Plots
+ ```
+
+ ## 🚀 How to Run
+ 1. **Install Dependencies**:
  ```bash
  pip install -r requirements.txt
  ```
+ 2. **Train & Generate Data** (downloads the ~2.6 GB Falcon model):
+ ```bash
+ python model-prep.py
+ ```
+ 3. **Run the App**:
  ```bash
+ export GEMINI_API_KEY="your_key_here"
  python app.py
  ```
 
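The "Ground-Truth Logic" described in the README (views assigned from linguistic hooks plus posting metadata) can be sketched as a small standalone scoring function. The multipliers below mirror the values committed in `model-prep.py`; the function name itself is illustrative:

```python
import numpy as np

def score_views(text, hour, duration_sec, is_weekend, rng=None):
    """Synthetic 'ground truth': a heavy-tailed base virality draw scaled by
    multiplicative boosts for hooks and posting metadata (as in model-prep.py)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    base = rng.lognormal(mean=9.5, sigma=1.8)  # log-normal base virality
    mult = 1.0
    lower = text.lower()
    if "?" in text:
        mult *= 1.2                       # question hook
    if "you" in lower or "pov" in lower:
        mult *= 1.4                       # direct address / POV framing
    if is_weekend:
        mult *= 1.25                      # weekend boost
    if duration_sec < 15:
        mult *= 1.3                       # short-form boost
    if 17 <= hour <= 21:
        mult *= 1.15                      # prime-time boost
    return int(base * mult)

views = score_views("POV: you find the viral hack", hour=19,
                    duration_sec=12, is_weekend=True)
```

Because every rule is multiplicative on a log-normal base, the resulting target stays heavy-tailed while remaining learnable from the text and meta features.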
app.py CHANGED
@@ -40,9 +40,10 @@ def initialize_app():
      knowledge_df = pd.read_parquet(parquet_path)

      # 2. Load Model
-     print("🧠 Loading XGBoost Model...")
-     model = XGBRegressor()
-     model.load_model("viral_model.json")
+     print("🧠 Loading Prediction Model (Pickle)...")
+     with open("viral_model.pkl", "rb") as f:
+         model = pickle.load(f)
+     print(f"   -> Loaded model type: {type(model).__name__}")

      # 3. Load Vectorizer
      print("🔀 Loading TF-IDF Vectorizer...")
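The change above swaps XGBoost's format-specific `load_model` for generic pickling, so `app.py` no longer needs to know which of the three candidate models won. A minimal sketch of that round-trip, using `LinearRegression` as a stand-in for whichever estimator `model-prep.py` saves (bytes round-trip shown instead of the file for brevity):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a tiny stand-in model (y = 2x + 1).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# model-prep.py writes the winner with pickle.dump(...);
# app.py reads it back with pickle.load(...) without naming the class.
blob = pickle.dumps(model)
loaded = pickle.loads(blob)

pred = float(loaded.predict(np.array([[4.0]]))[0])  # exact linear fit -> 9.0
print(f" -> Loaded model type: {type(loaded).__name__}")
```

Any object exposing `fit`/`predict` round-trips the same way, which is exactly why the comparison step can hand back a `LinearRegression`, `RandomForestRegressor`, or `XGBRegressor` interchangeably.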
model-prep.py CHANGED
@@ -5,6 +5,7 @@ import seaborn as sns
  import warnings
  import os
  import torch
  import google.generativeai as genai
  from faker import Faker
  from datetime import datetime, timedelta
@@ -20,7 +21,7 @@ from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.ensemble import RandomForestRegressor
  from xgboost import XGBRegressor
  from sklearn.linear_model import LinearRegression
- from sklearn.metrics import mean_squared_error, f1_score
  from sklearn.decomposition import PCA
  from sentence_transformers import SentenceTransformer

@@ -40,64 +41,165 @@ if not os.path.exists('project_plots'):
  # ---------------------------------------------------------
  # 1. DATA GENERATION (With 2025 Trends)
  # ---------------------------------------------------------
- def generate_enhanced_data(n_rows=10000):
-     print(f"\n[1/8] Generating {n_rows} rows of Real-World 2025 Data...")
-     fake = Faker()
-
-     trends = [
-         'Delulu', 'Girl Dinner', 'Roman Empire', 'Silent Slay', 'Soft Life',
-         'Grimace Shake', 'Wes Anderson Style', 'Beige Flag', 'Canon Event',
-         'NPC Stream', 'Skibidi', 'Fanum Tax', 'Yapping', 'Glow Up', 'Fit Check'
-     ]
-     formats = [
-         'POV: You realize...', 'GRWM for...', 'Day in the life:',
-         'Storytime:', 'Trying the viral...', 'ASMR packing orders',
-         'Rating my exes...', 'Turn the lights off challenge'
      ]
-     categories = ['Gaming', 'Beauty', 'Comedy', 'Edutainment', 'Lifestyle', 'Food']

      data = []
      start_date = datetime(2024, 1, 1)

-     for _ in range(n_rows):
-         upload_time = start_date + timedelta(days=np.random.randint(0, 365), hours=np.random.randint(0, 23))
-         trend = np.random.choice(trends)
-         fmt = np.random.choice(formats)
-         cat = np.random.choice(categories)
-
-         description = f"{fmt} {trend} edition! {fake.sentence(nb_words=6)}"
-         tags = ['#fyp', '#foryou', '#viral', f'#{trend.replace(" ", "").lower()}', f'#{cat.lower()}']
-         if np.random.random() > 0.5: tags.append('#trending2025')
-
-         full_text = f"{description} {' '.join(tags)}"
-
-         # Meta Features
-         duration = np.random.randint(5, 180)
-         hour = upload_time.hour
-         is_weekend = 1 if upload_time.weekday() >= 5 else 0
-
-         # View Count Logic
-         base_virality = np.random.lognormal(mean=9.5, sigma=1.8)
-         multiplier = 1.0
-         if is_weekend: multiplier *= 1.2
-         if duration < 15: multiplier *= 1.4
-         if "Delulu" in full_text or "POV" in full_text: multiplier *= 1.6
-         if hour >= 18: multiplier *= 1.1
-
-         views = int(base_virality * multiplier)
-
-         data.append({
-             'upload_date': upload_time,
-             'description': full_text,
-             'category': cat,
-             'video_duration_sec': duration,
-             'hour_of_day': hour,
-             'is_weekend': is_weekend,
-             'hashtag_count': len(tags),
-             'views': views
-         })
      df = pd.DataFrame(data)
      df = df.sort_values('upload_date').reset_index(drop=True)
      threshold = df['views'].quantile(0.80)
      df['is_viral_binary'] = (df['views'] > threshold).astype(int)
@@ -123,27 +225,130 @@ def process_data_pipeline(df):
      tfidf = TfidfVectorizer(max_features=2000, stop_words='english')
      X_text = tfidf.fit_transform(df['description']).toarray()
      num_cols = ['video_duration_sec', 'hour_of_day', 'is_weekend', 'hashtag_count']
      X_num = df[num_cols].values

      X = np.hstack((X_text, X_num))
      y = df['log_views'].values
-     y_bin = df['is_viral_binary'].values
-
      split_idx = int(len(df) * 0.80)
-     return X[:split_idx], X[split_idx:], y[:split_idx], y[split_idx:], y_bin[split_idx:], tfidf

  # ---------------------------------------------------------
- # 3. TRAINING
  # ---------------------------------------------------------
- def train_best_model(X_train, y_train, X_test, y_test):
-     print("\n[3/8] Training Model (XGBoost)...")
-     model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6, n_jobs=-1)
-     model.fit(X_train, y_train)
-
-     rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
-     print(f"   - Model RMSE: {rmse:.3f}")
-     return model

  # ---------------------------------------------------------
  # 4. EMBEDDINGS GENERATION (For Search)
@@ -267,20 +472,27 @@ def optimize_content_with_gemini(user_input, model, vectorizer, knowledge_df, st
  # MAIN EXECUTION
  # ---------------------------------------------------------
  if __name__ == "__main__":
-     # 1. Pipeline
-     df, _ = generate_enhanced_data(10000)
-     X_train, X_test, y_train, y_test, _, tfidf = process_data_pipeline(df)

-     # 2. Train Prediction Model
-     best_model = train_best_model(X_train, y_train, X_test, y_test)

      # 3. Create Knowledge Base (Embeddings)
      knowledge_df, st_model = create_search_index(df)

-     # 4. Save Artifacts for App
-     print("\n[5/8] Saving Model Artifacts for Production...")
-     best_model.save_model("viral_model.json")
-     print("   - Model saved to 'viral_model.json'")

      with open("tfidf_vectorizer.pkl", "wb") as f:
          pickle.dump(tfidf, f)
 
  import warnings
  import os
  import torch
+ from transformers import pipeline
  import google.generativeai as genai
  from faker import Faker
  from datetime import datetime, timedelta

  from sklearn.ensemble import RandomForestRegressor
  from xgboost import XGBRegressor
  from sklearn.linear_model import LinearRegression
+ from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score
  from sklearn.decomposition import PCA
  from sentence_transformers import SentenceTransformer
 
  # ---------------------------------------------------------
  # 1. DATA GENERATION (With 2025 Trends)
  # ---------------------------------------------------------
+ # ---------------------------------------------------------
+ # 1. DATA GENERATION (With LLM - Falcon-RW-1B)
+ # ---------------------------------------------------------
+ def generate_synthetic_data_llm(n_rows=10000):
+     print(f"\n[1/8] Generating {n_rows} rows of Real-World Data using LLM (Falcon-RW-1B)...")

+     # Setup Pipeline
+     print("   -> Loading Falcon model... (This may take a moment)")
+
+     # MPS Optimization Logic
+     # The 'device' variable is already set globally (cpu or mps).
+     # Pipelines usually take device=0 for GPU, or device="mps".
+
+     pipeline_kwargs = {
+         "task": "text-generation",
+         "model": "tiiuae/falcon-rw-1b",
+         "device": device  # "mps" or "cpu"
+     }
+
+     # Optimizations for Apple Silicon
+     if device == "mps":
+         print("   -> 🍎 Optimization: Using Apple Silicon (MPS) with float16")
+         pipeline_kwargs["torch_dtype"] = torch.float16
+     elif device == "cuda":
+         pipeline_kwargs["device"] = 0  # transformers often prefers an int index for CUDA
+         pipeline_kwargs["torch_dtype"] = torch.float16
+
+     try:
+         generator = pipeline(**pipeline_kwargs)
+     except Exception as e:
+         print(f"   -> Error loading model: {e}")
+         return pd.DataFrame(), 0
+
+     print(f"   -> ✅ Model Loaded on {device.upper()}")
+
+     # Diversity Prompts
+     prompts = [
+         "TikTok Description: POV you realize",
+         "TikTok Description: GRWM for",
+         "TikTok Description: Day in the life of",
+         "TikTok Description: Trying the viral",
+         "TikTok Description: Storytime about",
+         "TikTok Description: ASMR",
+         "TikTok Description: My skincare routine",
+         "TikTok Description: Cooking a healthy",
+         "TikTok Description: Coding a new",
+         "TikTok Description: Travel vlog to"
      ]

      data = []
+     fake = Faker()
      start_date = datetime(2024, 1, 1)

+     # Generate in batches to balance memory use and speed, printing
+     # progress periodically since n_rows is large.
+     print(f"   -> Starting generation of {n_rows} items...")
+
+     # To speed things up, we request multiple sequences per prompt call
+     # and loop until we have n_rows in total.
+     rows_generated = 0
+     batch_size = 5  # Generate 5 variations per prompt call
+
+     while rows_generated < n_rows:
+         prompt = np.random.choice(prompts)

+         try:
+             outputs = generator(
+                 prompt,
+                 max_new_tokens=40,
+                 num_return_sequences=batch_size,
+                 do_sample=True,
+                 temperature=0.9,
+                 top_k=50,
+                 top_p=0.95,
+                 truncation=True,
+                 pad_token_id=50256  # Falcon-RW default pad token, usually
+             )
+
+             for o in outputs:
+                 if rows_generated >= n_rows: break
+
+                 raw_text = o['generated_text']
+                 # Clean up: keep the full description (prompt prefix included)
+                 # and strip newlines.
+                 clean_text = raw_text.replace("\n", " ").strip()
+
+                 # Add some synthetic tags if missing (the LLM might not add enough)
+                 if "#" not in clean_text:
+                     clean_text += " #fyp #viral #trending"
+
+                 # --- SOPHISTICATED VIEW COUNT LOGIC ---
+                 # We inject "ground truth" rules so the model can learn real patterns.
+
+                 # Base distribution
+                 base_virality = np.random.lognormal(mean=9.5, sigma=1.8)
+                 multiplier = 1.0
+
+                 # 1. Linguistic Patterns (The "Text" Signal)
+                 full_lower = clean_text.lower()
+
+                 # Boost for "Hooks" (questions, direct address)
+                 if "?" in clean_text: multiplier *= 1.2
+                 if "you" in full_lower or "pov" in full_lower: multiplier *= 1.4
+
+                 # Boost for Emotional/Urgent words
+                 viral_triggers = ['secret', 'hack', 'wait for it', 'won\'t believe', 'shocking', 'obsessed']
+                 if any(w in full_lower for w in viral_triggers): multiplier *= 1.3
+
+                 # Boost for Niche Keywords (targeting specific audiences)
+                 niche_keywords = ['coding', 'recipe', 'tutorial', 'routine', 'haul']
+                 if any(w in full_lower for w in niche_keywords): multiplier *= 1.2
+
+                 # 2. Metadata Signals
+                 upload_time = start_date + timedelta(days=np.random.randint(0, 365), hours=np.random.randint(0, 23))
+                 duration = np.random.randint(5, 180)
+                 hour = upload_time.hour
+                 is_weekend = 1 if upload_time.weekday() >= 5 else 0
+
+                 if is_weekend: multiplier *= 1.25   # Weekends are slightly better
+                 if duration < 15: multiplier *= 1.3  # Short content is king
+                 if hour >= 17 and hour <= 21: multiplier *= 1.15  # Prime-time boost
+
+                 # Calculate Final Views
+                 views = int(base_virality * multiplier)
+
+                 data.append({
+                     'upload_date': upload_time,
+                     'description': clean_text,
+                     'category': 'General',
+                     'video_duration_sec': duration,
+                     'hour_of_day': hour,
+                     'is_weekend': is_weekend,
+                     'hashtag_count': clean_text.count('#'),
+                     'views': views
+                 })
+                 rows_generated += 1
+
+             # Print one example per batch for quality control
+             if len(outputs) > 0:
+                 print(f"   👀 Sample: {data[-1]['description'][:100]}...")
+
+             if rows_generated % 100 == 0:
+                 print(f"   -> Generated {rows_generated}/{n_rows} rows...")
+
+         except Exception as e:
+             print(f"   ⚠️ Generation Error: {e}")
+             break
+
      df = pd.DataFrame(data)
+
+     # --- SAVE RAW DATA ---
+     raw_save_path = 'raw_social_media_data.parquet'
+     df.to_parquet(raw_save_path)
+     print(f"   -> 💾 Raw Data Saved to {raw_save_path}")
+
+     # Process for training (targets)
      df = df.sort_values('upload_date').reset_index(drop=True)
      threshold = df['views'].quantile(0.80)
      df['is_viral_binary'] = (df['views'] > threshold).astype(int)
 
      tfidf = TfidfVectorizer(max_features=2000, stop_words='english')
      X_text = tfidf.fit_transform(df['description']).toarray()

+     # --- NEW: Data Diversity Plot (PCA) ---
+     print("   -> 🎨 Generating Diversity Plot...")
+     from sklearn.decomposition import PCA
+
+     # 2D projection of text features
+     pca = PCA(n_components=2)
+     X_pca = pca.fit_transform(X_text)
+
+     plt.figure(figsize=(10, 6))
+     plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['log_views'], cmap='viridis', alpha=0.5)
+     plt.colorbar(label='Log Views')
+     plt.title('Semantic Diversity of Generated Content (PCA)')
+     plt.xlabel('Principal Component 1')
+     plt.ylabel('Principal Component 2')
+     plt.savefig('project_plots/diversity_plot.png')
+     plt.close()
+     print("   -> Plot saved to 'project_plots/diversity_plot.png'")
+     # --------------------------------------
+
      num_cols = ['video_duration_sec', 'hour_of_day', 'is_weekend', 'hashtag_count']
      X_num = df[num_cols].values

      X = np.hstack((X_text, X_num))
      y = df['log_views'].values
      split_idx = int(len(df) * 0.80)
+     return X[:split_idx], X[split_idx:], y[:split_idx], y[split_idx:], tfidf
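The feature assembly used by `process_data_pipeline` (a dense TF-IDF text matrix horizontally stacked with the four numeric meta columns) can be reproduced in isolation; the toy descriptions below are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "description": [
        "POV: trying the viral hack #fyp",
        "ASMR packing orders tonight #viral",
        "GRWM for a night out with friends",
    ],
    "video_duration_sec": [12, 45, 90],
    "hour_of_day": [19, 9, 21],
    "is_weekend": [1, 0, 1],
    "hashtag_count": [1, 1, 0],
})

# Text features: sparse TF-IDF densified, as in the pipeline above.
tfidf = TfidfVectorizer(max_features=2000, stop_words="english")
X_text = tfidf.fit_transform(df["description"]).toarray()

# Meta features appended as extra columns after the text block.
num_cols = ["video_duration_sec", "hour_of_day", "is_weekend", "hashtag_count"]
X = np.hstack((X_text, df[num_cols].values))
```

The column order matters later: `plot_feature_importance` concatenates the TF-IDF vocabulary and the same four meta names in exactly this order to label importances.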

  # ---------------------------------------------------------
+ # 3. MODEL COMPARISON & TRAINING
  # ---------------------------------------------------------
+ def compare_and_train_best_model(X_train, y_train, X_test, y_test):
+     print("\n[3/8] Comparing 3 Models to find the best one...")
+
+     models = {
+         "Linear Regression": LinearRegression(),
+         "Random Forest": RandomForestRegressor(n_estimators=50, max_depth=10, n_jobs=-1),
+         "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, n_jobs=-1)
+     }
+
+     results = {}
+     best_name = None
+     best_score = float('inf')  # RMSE, so lower is better
+     best_model_obj = None
+
+     print(f"{'Model':<20} | {'RMSE':<10} | {'MAE':<10} | {'MAPE':<10} | {'R²':<10}")
+     print("-" * 70)
+
+     for name, model in models.items():
+         model.fit(X_train, y_train)
+         preds_log = model.predict(X_test)
+
+         # Invert the log transform to compute metrics on real view counts
+         preds_real = np.expm1(preds_log)
+         y_test_real = np.expm1(y_test)
+
+         rmse = np.sqrt(mean_squared_error(y_test_real, preds_real))
+         mae = mean_absolute_error(y_test_real, preds_real)
+         mape = mean_absolute_percentage_error(y_test_real, preds_real)
+         r2 = r2_score(y_test, preds_log)
+
+         results[name] = {'RMSE': rmse, 'MAE': mae, 'MAPE': mape, 'R2': r2}
+
+         print(f"{name:<20} | {rmse:.0f} | {mae:.0f} | {mape:.2%} | {r2:.3f}")
+
+         if rmse < best_score:
+             best_score = rmse
+             best_name = name
+             best_model_obj = model
+
+     print("-" * 70)
+     print(f"🏆 Winner: {best_name} (RMSE: {best_score:.0f})")
+
+     # --- PLOTTING ---
+     plt.figure(figsize=(8, 5))
+
+     # Comparison Bar Chart (RMSE)
+     names = list(results.keys())
+     rmse_scores = [results[n]['RMSE'] for n in names]
+     plt.bar(names, rmse_scores, color=['gray', 'gray', 'green'])
+     plt.title('Model Comparison (RMSE - Lower is Better)')
+     plt.ylabel('RMSE (Views)')
+
+     plt.tight_layout()
+     plt.savefig('project_plots/model_comparison.png')
+     plt.close()
+     print("   -> Comparison plot saved to 'project_plots/model_comparison.png'")
+
+     return best_model_obj
+
+ def plot_feature_importance(model, vectorizer, output_path='project_plots/feature_importance.png'):
+     print("   -> 📊 Generating Feature Importance Plot...")
+
+     # 1. Get Feature Names
+     # TF-IDF features
+     tfidf_names = vectorizer.get_feature_names_out()
+     # Numeric features (hardcoded to match process_data_pipeline)
+     meta_names = ['video_duration_sec', 'hour_of_day', 'is_weekend', 'hashtag_count']
+     all_features = np.concatenate([tfidf_names, meta_names])
+
+     # 2. Get Importances
+     if hasattr(model, 'feature_importances_'):
+         # XGBoost / Random Forest
+         importances = model.feature_importances_
+         title = f"Top 20 Features ({type(model).__name__})"
+     elif hasattr(model, 'coef_'):
+         # Linear Regression
+         importances = np.abs(model.coef_)  # Magnitude matters
+         title = f"Top 20 Feature Coefficients ({type(model).__name__})"
+     else:
+         print("   ⚠️ Model type does not support feature importance extraction.")
+         return
+
+     # 3. Sort and Plot Top 20
+     indices = np.argsort(importances)[-20:]
+
+     plt.figure(figsize=(10, 8))
+     plt.title(title)
+     plt.barh(range(len(indices)), importances[indices], align='center', color='teal')
+     plt.yticks(range(len(indices)), [all_features[i] for i in indices])
+     plt.xlabel('Relative Importance')
+     plt.tight_layout()
+     plt.savefig(output_path)
+     plt.close()
+     print(f"   -> Feature Importance saved to '{output_path}'")

  # ---------------------------------------------------------
  # 4. EMBEDDINGS GENERATION (For Search)
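The `hasattr` dispatch in `plot_feature_importance` (tree ensembles expose `feature_importances_`, linear models expose `coef_`) can be exercised on toy data; the helper name below is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1]  # feature 0 dominates by construction

def importance_vector(model):
    """Generic importance extraction, mirroring the dispatch above."""
    if hasattr(model, "feature_importances_"):
        return model.feature_importances_        # tree ensembles
    if hasattr(model, "coef_"):
        return np.abs(model.coef_)               # linear models: magnitude matters
    raise TypeError("model exposes neither feature_importances_ nor coef_")

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)
linear = LinearRegression().fit(X, y)

imp_f = importance_vector(forest)
imp_l = importance_vector(linear)
```

Both paths return one value per input column, so the same sort-and-barh plotting code works regardless of which model won the comparison.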
 
  # MAIN EXECUTION
  # ---------------------------------------------------------
  if __name__ == "__main__":
+     # 1. Pipeline (LLM)
+     print("🚀 Starting Production Run: Generating 10,000 rows...")
+     df, _ = generate_synthetic_data_llm(10000)
+     X_train, X_test, y_train, y_test, tfidf = process_data_pipeline(df)

+     # 2. Train Prediction Model (COMPARISON Step)
+     best_model = compare_and_train_best_model(X_train, y_train, X_test, y_test)

      # 3. Create Knowledge Base (Embeddings)
      knowledge_df, st_model = create_search_index(df)

+     # 4. Save Artifacts for App & Plot Importance
+     print("\n[5/8] Saving Model Artifacts & Finalizing Plots...")
+
+     # Plot Feature Importance (now that we have the winner)
+     plot_feature_importance(best_model, tfidf)
+
+     # Use Pickle for the model (generic across estimator types)
+     with open("viral_model.pkl", "wb") as f:
+         pickle.dump(best_model, f)
+     print("   - Model saved to 'viral_model.pkl'")

      with open("tfidf_vectorizer.pkl", "wb") as f:
          pickle.dump(tfidf, f)
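One subtlety in the comparison loop above: the models are fit on log views, but RMSE and MAE are reported in real view counts after inverting the `log1p` transform with `np.expm1`. A toy check of that inversion, with hypothetical view counts:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Hypothetical true views and predictions, moved into log space with log1p.
y_test = np.log1p(np.array([100.0, 1000.0, 10000.0]))
preds_log = np.log1p(np.array([110.0, 900.0, 12000.0]))

# Invert back to view counts before computing errors in "views" units.
y_real = np.expm1(y_test)        # recovers 100, 1000, 10000 (up to float error)
preds_real = np.expm1(preds_log)

rmse = float(np.sqrt(mean_squared_error(y_real, preds_real)))
mae = float(mean_absolute_error(y_real, preds_real))
```

Reporting in view units makes RMSE dominated by the largest videos (the 2,000-view miss here), which is exactly why the leaderboard numbers are far larger than log-space errors would suggest.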
model-search.py DELETED
@@ -1,21 +0,0 @@
- import google.generativeai as genai
- import os
- from dotenv import load_dotenv
-
- # 1. Load your API key
- load_dotenv()
- api_key = os.getenv("GEMINI_API_KEY")
-
- if not api_key:
-     print("Error: API key not found. Make sure it is in your .env file.")
- else:
-     genai.configure(api_key=api_key)
-
-     print("--- Available Gemini Models ---")
-     # 2. List all models and filter for those that generate content (text/chat)
-     for m in genai.list_models():
-         if 'generateContent' in m.supported_generation_methods:
-             print(f"Name: {m.name}")
-             print(f"   - Display Name: {m.display_name}")
-             print(f"   - Input Limit: {m.input_token_limit} tokens")
-             print("-" * 30)
project_plots/{model_leaderboard.png → diversity_plot.png} RENAMED
File without changes
project_plots/eda_distribution.png CHANGED

Git LFS Details (before)
  • SHA256: 18de2fc7c92289b75a7c77d7e8ff57ae5d462269f8cdd99d584cad6377367f14
  • Pointer size: 130 Bytes
  • Size of remote file: 13.7 kB

Git LFS Details (after)
  • SHA256: 96cf3d4ff0977409b642050164d9ded8a43e2565437d19bd3fb26234c434ba6f
  • Pointer size: 130 Bytes
  • Size of remote file: 13.9 kB

project_plots/feature_importance.png CHANGED

Git LFS Details (before)
  • SHA256: 215f9eabe41998d7222c3b5104993a60bff7be958cd224edf9d1a42108070db6
  • Pointer size: 130 Bytes
  • Size of remote file: 81.9 kB

Git LFS Details (after)
  • SHA256: a0ac14b476322d1a9d53149a728ff0b3a6002763d1157478379fa68ea701ae04
  • Pointer size: 130 Bytes
  • Size of remote file: 32 kB
project_plots/{embedding_clusters.png → model_comparison.png} RENAMED
File without changes
tiktok_test_data_embeddings.parquet → raw_social_media_data.parquet RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9bd222284d380a2ddb7f29f27e1ff77eff7b379eed34674c959804f70ef80ede
- size 3862601
+ oid sha256:92939908f14b69157b0a99ee186ef1f0ff70d54974bfcf14235468674f73d450
+ size 1185030
requirements.txt CHANGED
@@ -1,4 +1,3 @@
- gradio>=5.0
  pandas
  numpy
  xgboost
@@ -7,3 +6,6 @@ sentence-transformers
  google-generativeai
  python-dotenv
  faker
+ transformers
+ torch
+ accelerate
tfidf_vectorizer.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:96a8c9e0af89e51756ab34cd1219571dea1cd2ca9f2558895fcd00250a3a6c8b
- size 29989
+ oid sha256:ebe788182b51f7023d3b94676a566553723f10b0a6795e191f827bda12136339
+ size 73096
tiktok_knowledge_base.parquet CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:12a2bf4192ada1bb32d08a87f9e811c28a933c40ae0d4831143f1f1fdccd6579
- size 16651296
+ oid sha256:82dacd5da6cc1e8f9a62db8e8b6d68f5d5e466300d94dc7707c7afd342a97594
+ size 17274184
viral_model.json DELETED
The diff for this file is too large to render. See raw diff
 
viral_model.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:92f2ca0ca3bf30dd6a5d7e84d8ebff5612134ff895124e03cb51586a000d9527
+ size 214620