Matan Kriel committed on
Commit
0c8eab5
·
1 Parent(s): a840261

updated files full data run

README.md CHANGED
@@ -11,77 +11,81 @@ pinned: false
11
 
12
  # 🚀 Social Media Virality Assistant
13
 
14
- A Data Science project that uses **Large Language Models (LLMs)** and **Machine Learning** to predict and optimize social media content virality.
15
 
16
- ## 🌟 Project Overview
17
- This tool helps content creators go viral by:
18
- 1. **Predicting Views**: Analyzing video descriptions to forecast performance.
19
- 2. **Optimizing Content**: Using **Google Gemini AI** to rewrite drafts with viral hooks (slang, hashtags).
20
- 3. **Learning from History**: Retrieving similar successful videos using **Semantic Search**.
21
 
22
- ## 🧠 Data Science Methodology
23
 
24
- ### 1. Synthetic Data Generation (LLM-Based)
25
- Since real-world TikTok data is private, we simulated a "Viral Environment":
26
- * **Generator**: Utilized `tiiuae/falcon-rw-1b` (via `transformers`) to generate **10,000 realistic video descriptions**.
27
- * **Diversity**: Prompted the LLM with various scenarios ("POV", "GRWM", "Storytime") to ensure distinct content clusters.
28
- * **Ground Truth Logic**: Developed a scoring function that assigns "Views" based on linguistic patterns (e.g., questions, emotional triggers) and metadata (time of day, duration), creating a learnable signal for the ML models.
29
 
30
- ### 2. Model Development & Comparison
31
- We treated this as a **Regression Problem** (Predicting Log-Views).
32
- We compared three algorithms to find the best predictor:
33
- * **Linear Regression**: Baseline model.
34
- * **Random Forest**: Good for non-linear interactions.
35
- * **XGBoost (Winner)**: Gradient boosting provided the best accuracy (Lowest RMSE).
36
 
37
- **Metrics Used:**
38
- * **RMSE (Root Mean Squared Error)**: Primary metric for model selection.
39
- * **MAE (Mean Absolute Error)**: Average view count error.
40
- * **MAPE**: Average percentage error.
41
 
42
- ### 3. Advanced Analysis (Plots)
43
 
44
- #### Semantic Diversity (PCA)
45
- ![Diversity Plot](project_plots/diversity_plot.png)
46
- *A PCA visualization showing the semantic spread of the 10,000 generated descriptions.*
47
 
48
- #### Model Performance
49
- ![Model Comparison](project_plots/model_comparison.png)
50
- *Bar chart comparing RMSE across models and ROC curves for viral classification validity.*
51
 
52
- #### Feature Importance
53
- ![Feature Importance](project_plots/feature_importance.png)
54
- *The top 20 words and metadata features that drive virality in our simulated world.*
55
 
56
  ## 🛠️ Tech Stack
57
- * **Core**: Python, Pandas, Numpy, Scikit-Learn
58
- * **AI/LLM**: `transformers` (Falcon-1B), `google-genai` (Gemini 2.5)
59
- * **ML**: XGBoost, Sentence-Transformers (Embeddings)
60
- * **App**: Gradio (Web UI)
61
- * **Hardware**: Optimized for Apple Silicon (MPS).
62
-
63
- ## 📂 Project Structure
64
- ```bash
65
- β”œβ”€β”€ app.py # Inference App (Gradio)
66
- β”œβ”€β”€ model-prep.py # Training Pipeline (Data Gen -> Train -> Save)
67
- β”œβ”€β”€ requirements.txt # Dependencies
68
- β”œβ”€β”€ tiktok_knowledge_base.parquet # Semantic Search Index
69
- β”œβ”€β”€ viral_model.pkl # Trained ML Model (Pickle)
70
- β”œβ”€β”€ tfidf_vectorizer.pkl # Text Processor
71
- └── project_plots/ # Generated Analysis Plots
72
- ```
73
 
74
  ## 🚀 How to Run
 
75
  1. **Install Dependencies**:
76
  ```bash
77
  pip install -r requirements.txt
78
  ```
79
- 2. **Train & Generate Data** (downloads a 2.6 GB model):
80
  ```bash
81
- python model-prep.py
82
  ```
83
- 3. **Run the App**:
84
  ```bash
85
- export GEMINI_API_KEY="your_key_here"
86
- python app.py
87
  ```
 
11
 
12
  # 🚀 Social Media Virality Assistant
13
 
14
+ A machine-learning-powered tool that helps content creators predict and optimize their videos' virality potential using **XGBoost** and **Google Gemini AI**.
15
 
16
+ ## πŸ—οΈ Architecture & Pipeline
 
 
 
 
17
 
18
+ This project consists of two main components: a training pipeline (`model-prep.py`) and an inference application (`app.py`).
19
 
20
+ ### 1. Training Pipeline (`model-prep.py`)
21
+ The `model-prep.py` script handles the end-to-end model creation process:
22
 
23
+ 1. **Cloud Data Loading**: It fetches the latest synthetic dataset directly from **Hugging Face** (`MatanKriel/social-assitent-synthetic-data`).
24
+ 2. **Embedding Benchmark**: It evaluates three state-of-the-art embedding models (`MiniLM`, `mpnet-base`, `bge-small`) to find the best text encoder.
25
+ * *Metrics*: Encoding Speed vs. Clustering Quality (Silhouette Score).
26
+ * *Winner*: Defaults to `sentence-transformers/all-mpnet-base-v2`.
27
+ 3. **Feature Engineering**:
28
+ * Encodes categorical inputs: `category`, `gender`, `day_of_week`, `age`.
29
+ * Combines text embeddings with metadata (`followers`, `duration`, `hour`).
30
+ 4. **Model Training**: Trains and compares three regression algorithms:
31
+ * Linear Regression
32
+ * Random Forest
33
+ * **XGBoost (Winner)**: Selected for having the lowest RMSE.
34
+ 5. **Artifact Generation**: Saves the trained model locally (`viral_model.pkl`) and generates performance plots (`project_plots/`).
35
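The feature-engineering step described above (label-encoded categoricals stacked next to text embeddings) can be sketched as follows. This is a minimal illustration, not code from `model-prep.py`; the tiny random matrix merely stands in for real sentence-transformer embeddings:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows standing in for the real dataset
df = pd.DataFrame({
    "category": ["Comedy", "Beauty", "Comedy"],
    "followers": [1200, 50000, 800],
    "duration": [15, 60, 30],
})

# Stand-in for sentence-transformer embeddings (mpnet's real ones are 768-dim)
X_text = np.random.rand(len(df), 4)

# Label-encode the categorical column, then stack it with numeric metadata
df["category_encoded"] = LabelEncoder().fit_transform(df["category"].astype(str))
X_meta = df[["followers", "duration", "category_encoded"]].values
X = np.hstack((X_text, X_meta))
print(X.shape)  # (3, 7): 4 embedding dims + 3 metadata columns
```

Tree-based models such as XGBoost can consume the label-encoded integers directly; one-hot encoding would be the safer alternative for the linear baseline.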
 
36
+ ### 2. Inference Application (`app.py`)
37
+ The `app.py` script runs a **Gradio** web interface that pulls artifacts from the cloud at startup:
38
 
39
+ 1. **Initialization**:
40
+ * Downloads the trained `viral_model.pkl` from Hugging Face (`MatanKriel/social-assitent-viral-predictor`).
41
+ * Downloads the dataset to build a Knowledge Base.
42
+ * Generates embeddings on-the-fly for the Knowledge Base.
43
+ 2. **Core Features**:
44
+ * **Virality Prediction**: Predicts raw view counts based on your draft description and stats.
45
+ * **AI Optimization**: Uses **Google Gemini** to rewrite your description with viral hooks and hashtags.
46
+ * **Semantic Search**: Finds similar successful videos from the knowledge base using Cosine Similarity.
47
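The semantic-search feature boils down to a cosine-similarity lookup over the stored embedding matrix. A minimal sketch with toy 2-D vectors in place of real embeddings (`top_k_similar` is an illustrative helper, not a function from `app.py`):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(query_vec, kb_vecs, k=3):
    """Indices of the k knowledge-base vectors most similar to the query."""
    sims = cosine_similarity(query_vec.reshape(1, -1), kb_vecs)[0]
    return np.argsort(sims)[-k:][::-1]  # highest similarity first

kb = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])  # stored embeddings
query = np.array([1.0, 0.05])                        # encoded user draft
print(top_k_similar(query, kb, k=2))  # [0 2]
```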
 
48
+ ---
49
+
50
+ ## 📊 Model Performance
51
 
52
+ The training script (`model-prep.py`) automatically generates these benchmarks:
53
 
54
+ ### Embedding Model Comparison
55
+ We selected the embedding model that best balances speed and semantic understanding.
56
+ ![Embedding Benchmark](project_plots/embedding_benchmark.png)
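For intuition, the Silhouette Score used in this benchmark rewards embeddings whose points sit tightly within their own category and far from other categories. A toy example with synthetic points (not the project's real scores):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated toy "content clusters" with their category labels
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array(["POV", "POV", "GRWM", "GRWM"])

# Close to 1.0 = tight, well-separated clusters; near 0 = overlapping
print(round(silhouette_score(emb, labels), 3))
```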
57
+
58
+ ### Regression Model Comparison
59
+ We chose the regressor with the lowest error (RMSE) and highest explained variance (RΒ²).
60
+ ![Model Comparison](project_plots/regression_comparison.png)
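The two selection criteria can be computed with scikit-learn as below (toy log-view values, not the project's actual results):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([8.0, 10.0, 12.0, 9.0])  # actual log-views
y_pred = np.array([8.5, 9.5, 11.5, 9.0])   # a model's predictions

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large misses
r2 = r2_score(y_true, y_pred)                       # explained variance
print(f"RMSE={rmse:.3f}, R2={r2:.3f}")  # RMSE=0.433, R2=0.914
```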
61
+
62
+ ---
63
 
64
  ## 🛠️ Tech Stack
65
+ This project is built using:
66
+ * **App**: `gradio`, `google-generativeai`
67
+ * **ML**: `xgboost`, `scikit-learn`, `sentence-transformers`
68
+ * **Data**: `pandas`, `numpy`
69
+ * **Cloud**: `huggingface_hub`, `datasets`
70
+
71
+ ---
72
 
73
  ## 🚀 How to Run
74
+
75
  1. **Install Dependencies**:
76
  ```bash
77
  pip install -r requirements.txt
78
  ```
79
+
80
+ 2. **Run the App**:
81
+ The app will automatically download the necessary data and models from Hugging Face.
82
  ```bash
83
+ export GEMINI_API_KEY="your_api_key_here"
84
+ python app.py
85
  ```
86
+
87
+ 3. **(Optional) Retrain the Model**:
88
+ If you want to re-run the benchmarks and training using the latest data:
89
  ```bash
90
+ python model-prep.py
91
  ```
model-prep.py CHANGED
@@ -5,510 +5,286 @@ import seaborn as sns
5
  import warnings
6
  import os
7
  import torch
8
- from transformers import pipeline
9
- import google.generativeai as genai
10
- from faker import Faker
11
- from datetime import datetime, timedelta
12
- from sklearn.metrics.pairwise import cosine_similarity
13
  import pickle
14
- from dotenv import load_dotenv
15
-
16
- # Load environment variables from the .env filea monk
17
- load_dotenv()
18
-
19
- # Machine Learning Imports
20
- from sklearn.feature_extraction.text import TfidfVectorizer
21
  from sklearn.ensemble import RandomForestRegressor
22
  from xgboost import XGBRegressor
23
  from sklearn.linear_model import LinearRegression
24
- from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score
25
- from sklearn.decomposition import PCA
26
- from sentence_transformers import SentenceTransformer
27
 
28
- # ---------------------------------------------------------
29
- # 0. SETUP & CONFIGURATION
30
- # ---------------------------------------------------------
 
31
  warnings.filterwarnings('ignore')
32
  pd.set_option('display.max_columns', None)
33
-
34
- # OPTIMIZATION: Check for Apple Silicon (MPS)
35
  device = "mps" if torch.backends.mps.is_available() else "cpu"
 
36
  print(f"🚀 Optimization: Running on {device.upper()} device")
37
 
38
  if not os.path.exists('project_plots'):
39
  os.makedirs('project_plots')
40
 
 
41
  # ---------------------------------------------------------
42
- # 1. DATA GENERATION (With 2025 Trends)
43
- # ---------------------------------------------------------
44
- # ---------------------------------------------------------
45
- # 1. DATA GENERATION (With LLM - Falcon-RW-1B)
46
  # ---------------------------------------------------------
47
- def generate_synthetic_data_llm(n_rows=10000):
48
- print(f"\n[1/8] Generating {n_rows} rows of Real-World Data using LLM (Falcon-RW-1B)...")
49
-
50
- # Setup Pipeline
51
- print(" -> Loading Falcon model... (This may take a moment)")
52
-
53
- # MPS Optimization Logic
54
- # 'device' variable is already set globally (cpu or mps)
55
- # Pipelines usually take device=0 for GPU, or device="mps"
56
-
57
- pipeline_kwargs = {
58
- "task": "text-generation",
59
- "model": "tiiuae/falcon-rw-1b",
60
- "device": device # "mps" or "cpu"
61
- }
62
-
63
- # Optimizations for Apple Silicon
64
- if device == "mps":
65
- print(" -> 🍎 Optimization: Using Apple Silicon (MPS) with float16")
66
- pipeline_kwargs["torch_dtype"] = torch.float16
67
- elif device == "cuda":
68
- pipeline_kwargs["device"] = 0 # Transformers often prefers int for CUDA
69
- pipeline_kwargs["torch_dtype"] = torch.float16
70
-
71
  try:
72
- generator = pipeline(**pipeline_kwargs)
73
  except Exception as e:
74
- print(f" -> Error loading model: {e}")
75
- return pd.DataFrame(), 0
76
 
77
- print(f" -> ✅ Model Loaded on {device.upper()}")
78
 
79
- # Diversity Prompts
80
- prompts = [
81
- "TikTok Description: POV you realize",
82
- "TikTok Description: GRWM for",
83
- "TikTok Description: Day in the life of",
84
- "TikTok Description: Trying the viral",
85
- "TikTok Description: Storytime about",
86
- "TikTok Description: ASMR",
87
- "TikTok Description: My skincare routine",
88
- "TikTok Description: Cooking a healthy",
89
- "TikTok Description: Coding a new",
90
- "TikTok Description: Travel vlog to"
91
  ]
92
 
93
- data = []
94
- fake = Faker()
95
- start_date = datetime(2024, 1, 1)
96
-
97
- # We generate in batches to manage memory/speed better or just loop
98
- # Given n_rows is large, a progress bar or simple print every N is good.
99
-
100
- print(f" -> Starting generation of {n_rows} items...")
101
 
102
- # To speed up, we can ask for multiple sequences per prompt,
103
- # but we need total n_rows.
104
-
105
- rows_generated = 0
106
- batch_size = 5 # Generate 5 variations per prompt call
107
-
108
- while rows_generated < n_rows:
109
- prompt = np.random.choice(prompts)
110
-
111
  try:
112
- outputs = generator(
113
- prompt,
114
- max_new_tokens=40,
115
- num_return_sequences=batch_size,
116
- do_sample=True,
117
- temperature=0.9,
118
- top_k=50,
119
- top_p=0.95,
120
- truncation=True,
121
- pad_token_id=50256 # Falcon-RW default pad token usually
122
- )
123
 
124
- for o in outputs:
125
- if rows_generated >= n_rows: break
126
-
127
- raw_text = o['generated_text']
128
- # Clean up: remove the prompt prefix if desired, or keep it.
129
- # Usually we want the full description.
130
- # Let's clean newlines.
131
- clean_text = raw_text.replace("\n", " ").strip()
132
-
133
- # Add some synthetic tags if missing (LLM might not add enough)
134
- if "#" not in clean_text:
135
- clean_text += " #fyp #viral #trending"
136
-
137
- # --- SOPHISTICATED VIEW COUNT LOGIC ---
138
- # We inject "ground truth" rules so the model can learn real patterns.
139
-
140
- # Base distribution
141
- base_virality = np.random.lognormal(mean=9.5, sigma=1.8)
142
- multiplier = 1.0
143
-
144
- # 1. Linguistic Patterns (The "Text" Signal)
145
- full_lower = clean_text.lower()
146
-
147
- # Boost for "Hooks" (Questions, direct address)
148
- if "?" in clean_text: multiplier *= 1.2
149
- if "you" in full_lower or "pov" in full_lower: multiplier *= 1.4
150
-
151
- # Boost for Emotional/Urgent words
152
- viral_triggers = ['secret', 'hack', 'wait for it', 'won\'t believe', 'shocking', 'obsessed']
153
- if any(w in full_lower for w in viral_triggers): multiplier *= 1.3
154
-
155
- # Boost for Niche Keywords (Targeting specific audiences)
156
- niche_keywords = ['coding', 'recipe', 'tutorial', 'routine', 'haul']
157
- if any(w in full_lower for w in niche_keywords): multiplier *= 1.2
158
-
159
- # 2. Metadata Signals
160
- upload_time = start_date + timedelta(days=np.random.randint(0, 365), hours=np.random.randint(0, 23))
161
- duration = np.random.randint(5, 180)
162
- hour = upload_time.hour
163
- is_weekend = 1 if upload_time.weekday() >= 5 else 0
164
-
165
- if is_weekend: multiplier *= 1.25 # Weekends are slightly better
166
- if duration < 15: multiplier *= 1.3 # Short content is king
167
- if hour >= 17 and hour <= 21: multiplier *= 1.15 # Prime time boost
168
-
169
- # Calculate Final Views
170
- views = int(base_virality * multiplier)
171
-
172
- data.append({
173
- 'upload_date': upload_time,
174
- 'description': clean_text,
175
- 'category': 'General',
176
- 'video_duration_sec': duration,
177
- 'hour_of_day': hour,
178
- 'is_weekend': is_weekend,
179
- 'hashtag_count': clean_text.count('#'),
180
- 'views': views
181
- })
182
- rows_generated += 1
183
-
184
- # Print one example per batch for quality control
185
- if len(outputs) > 0:
186
- print(f" 👀 Sample: {data[-1]['description'][:100]}...")
187
-
188
- if rows_generated % 100 == 0:
189
- print(f" -> Generated {rows_generated}/{n_rows} rows...")
190
 
191
  except Exception as e:
192
- print(f" ⚠️ Generation Error: {e}")
193
- break
194
 
195
- df = pd.DataFrame(data)
196
 
197
- # --- SAVE RAW DATA ---
198
- raw_save_path = 'raw_social_media_data.parquet'
199
- df.to_parquet(raw_save_path)
200
- print(f" -> 💾 Raw Data Saved to {raw_save_path}")
201
 
202
- # Process for training (Targets)
203
- df = df.sort_values('upload_date').reset_index(drop=True)
204
- threshold = df['views'].quantile(0.80)
205
- df['is_viral_binary'] = (df['views'] > threshold).astype(int)
206
- df['log_views'] = np.log1p(df['views'])
207
 
208
- return df, threshold
209
 
210
  # ---------------------------------------------------------
211
- # 2. EDA & PREPROCESSING
212
  # ---------------------------------------------------------
213
- def process_data_pipeline(df):
214
- print("\n[2/8] Processing Data Pipeline...")
215
 
216
- # Simple EDA Save
217
- clean_df = df[df['video_duration_sec'] > 0].copy()
218
- plt.figure(figsize=(6,4))
219
- sns.histplot(clean_df['log_views'], color='teal')
220
- plt.title('Log Views Distribution')
221
- plt.savefig('project_plots/eda_distribution.png')
222
- plt.close()
223
 
224
- # TF-IDF & Split
225
- tfidf = TfidfVectorizer(max_features=2000, stop_words='english')
226
- X_text = tfidf.fit_transform(df['description']).toarray()
 
227
 
228
- # --- NEW: Data Diversity Plot (PCA) ---
229
- print(" -> 🎨 Generating Diversity Plot...")
230
- from sklearn.decomposition import PCA
231
 
232
- # 2D Projection of text features
233
- pca = PCA(n_components=2)
234
- X_pca = pca.fit_transform(X_text)
235
 
236
- plt.figure(figsize=(10, 6))
237
- plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['log_views'], cmap='viridis', alpha=0.5)
238
- plt.colorbar(label='Log Views')
239
- plt.title('Semantic Diversity of Generated Content (PCA)')
240
- plt.xlabel('Principal Component 1')
241
- plt.ylabel('Principal Component 2')
242
- plt.savefig('project_plots/diversity_plot.png')
243
- plt.close()
244
- print(" -> Plot saved to 'project_plots/diversity_plot.png'")
245
- # --------------------------------------
246
 
247
- num_cols = ['video_duration_sec', 'hour_of_day', 'is_weekend', 'hashtag_count']
248
- X_num = df[num_cols].values
 
249
 
250
- X = np.hstack((X_text, X_num))
251
- y = df['log_views'].values
252
- split_idx = int(len(df) * 0.80)
253
- return X[:split_idx], X[split_idx:], y[:split_idx], y[split_idx:], tfidf
254
-
255
- # ---------------------------------------------------------
256
- # 3. MODEL COMPARISON & TRAINING
257
- # ---------------------------------------------------------
258
- def compare_and_train_best_model(X_train, y_train, X_test, y_test):
259
- print("\n[3/8] Comparing 3 Models to find the best one...")
260
 
 
261
  models = {
262
- "Linear Regression": LinearRegression(),
263
- "Random Forest": RandomForestRegressor(n_estimators=50, max_depth=10, n_jobs=-1),
264
- "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, n_jobs=-1)
265
  }
266
 
267
- results = {}
268
- best_name = None
269
- best_score = float('inf') # RMSE so lower is better
270
- best_model_obj = None
271
 
272
- print(f"{'Model':<20} | {'RMSE':<10} | {'MAE':<10} | {'MAPE':<10} | {'RΒ²':<10}")
273
- print("-" * 70)
274
 
275
  for name, model in models.items():
276
  model.fit(X_train, y_train)
277
  preds_log = model.predict(X_test)
278
 
279
- # Invert log for real metrics
280
  preds_real = np.expm1(preds_log)
281
- y_test_real = np.expm1(y_test)
282
 
283
- rmse = np.sqrt(mean_squared_error(y_test_real, preds_real))
284
- mae = mean_absolute_error(y_test_real, preds_real)
285
- mape = mean_absolute_percentage_error(y_test_real, preds_real)
286
- r2 = r2_score(y_test, preds_log)
287
 
288
- results[name] = {'RMSE': rmse, 'MAE': mae, 'MAPE': mape, 'R2': r2}
289
 
290
- print(f"{name:<20} | {rmse:.0f} | {mae:.0f} | {mape:.2%} | {r2:.3f}")
291
 
292
- if rmse < best_score:
293
- best_score = rmse
294
- best_name = name
295
- best_model_obj = model
296
 
297
- print("-" * 70)
298
- print(f"🏆 Winner: {best_name} (RMSE: {best_score:.0f})")
299
 
300
  # --- PLOTTING ---
301
- plt.figure(figsize=(8, 5))
302
-
303
- # Comparison Bar Chart (RMSE)
304
- names = list(results.keys())
305
- rmse_scores = [results[n]['RMSE'] for n in names]
306
- plt.bar(names, rmse_scores, color=['gray', 'gray', 'green'])
307
- plt.title('Model Comparison (RMSE - Lower is Better)')
308
- plt.ylabel('RMSE (Views)')
309
-
310
- plt.tight_layout()
311
- plt.savefig('project_plots/model_comparison.png')
312
- plt.close()
313
- print(" -> Comparison plot saved to 'project_plots/model_comparison.png'")
314
-
315
- return best_model_obj
316
-
317
- def plot_feature_importance(model, vectorizer, output_path='project_plots/feature_importance.png'):
318
- print(" -> 📊 Generating Feature Importance Plot...")
319
-
320
- # 1. Get Feature Names
321
- # TF-IDF features
322
- tfidf_names = vectorizer.get_feature_names_out()
323
- # Numeric features (Hardcoded based on process_data_pipeline)
324
- meta_names = ['video_duration_sec', 'hour_of_day', 'is_weekend', 'hashtag_count']
325
- all_features = np.concatenate([tfidf_names, meta_names])
326
-
327
- # 2. Get Importances
328
- if hasattr(model, 'feature_importances_'):
329
- # XGBoost / Random Forest
330
- importances = model.feature_importances_
331
- title = f"Top 20 Features ({type(model).__name__})"
332
- elif hasattr(model, 'coef_'):
333
- # Linear Regression
334
- importances = np.abs(model.coef_) # Magnitude matters
335
- title = f"Top 20 Feature Coefficients ({type(model).__name__})"
336
- else:
337
- print(" ⚠️ Model type does not support feature importance extraction.")
338
- return
339
-
340
- # 3. Sort and Plot Top 20
341
- indices = np.argsort(importances)[-20:]
342
-
343
- plt.figure(figsize=(10, 8))
344
- plt.title(title)
345
- plt.barh(range(len(indices)), importances[indices], align='center', color='teal')
346
- plt.yticks(range(len(indices)), [all_features[i] for i in indices])
347
- plt.xlabel('Relative Importance')
348
- plt.tight_layout()
349
- plt.savefig(output_path)
350
- plt.close()
351
- print(f" -> Feature Importance saved to '{output_path}'")
352
-
353
- # ---------------------------------------------------------
354
- # 4. EMBEDDINGS GENERATION (For Search)
355
- # ---------------------------------------------------------
356
- def create_search_index(df):
357
- print("\n[4/8] Creating Vector Search Index...")
358
- # Generate embeddings for ALL data so we can search the whole history
359
- st_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
360
- embeddings = st_model.encode(df['description'].tolist(), convert_to_numpy=True, show_progress_bar=True)
361
-
362
- df['embedding'] = list(embeddings)
363
-
364
- # Save to Parquet (The Knowledge Base)
365
- save_path = 'tiktok_knowledge_base.parquet'
366
- df.to_parquet(save_path)
367
- print(f" - Knowledge Base saved to {save_path}")
368
- return df, st_model
369
-
370
- # ---------------------------------------------------------
371
- # 5. RETRIEVAL & IMPROVEMENT ENGINE (The Magic Step)
372
- # ---------------------------------------------------------
373
- def optimize_content_with_gemini(user_input, model, vectorizer, knowledge_df, st_model):
374
- """
375
- 1. Scores original idea.
376
- 2. Finds top 3 similar VIRAL videos.
377
- 3. Asks Gemini to rewrite the idea.
378
- 4. Re-scores the new idea.
379
- """
380
- print("\n" + "="*50)
381
- print("🚀 VIRAL OPTIMIZATION ENGINE")
382
- print("="*50)
383
-
384
- # --- STEP 1: INITIAL SCORE ---
385
- text_vec = vectorizer.transform([user_input]).toarray()
386
- # Assume default meta for prediction (15s, 6 PM, weekday)
387
- meta_vec = np.array([[15, 18, 0, user_input.count('#')]])
388
- feat_vec = np.hstack((text_vec, meta_vec))
389
-
390
- initial_log = model.predict(feat_vec)[0]
391
- initial_views = int(np.expm1(initial_log))
392
-
393
- print(f"\n📝 ORIGINAL IDEA: {user_input}")
394
- print(f"📊 Predicted Views: {initial_views:,}")
395
-
396
- # --- STEP 2: VECTOR SEARCH (Find similar successful videos) ---
397
- print("\n🔍 Searching for similar viral hits in Parquet file...")
398
-
399
- # Filter only for successful videos (e.g., top 25% of views)
400
- high_performance_df = knowledge_df[knowledge_df['views'] > knowledge_df['views'].quantile(0.75)].copy()
401
-
402
- # Encode user input
403
- user_embedding = st_model.encode([user_input], convert_to_numpy=True)
404
-
405
- # Stack embeddings from the dataframe into a matrix
406
- target_embeddings = np.stack(high_performance_df['embedding'].values)
407
-
408
- # Calculate Cosine Similarity
409
- similarities = cosine_similarity(user_embedding, target_embeddings)
410
-
411
- # Get Top 3 indices
412
- top_3_indices = similarities[0].argsort()[-3:][::-1]
413
- top_3_videos = high_performance_df.iloc[top_3_indices]['description'].tolist()
414
-
415
- print(" -> Found 3 similar viral videos to learn from:")
416
- for i, vid in enumerate(top_3_videos, 1):
417
- print(f" {i}. {vid[:80]}...")
418
-
419
- # --- STEP 3: GEMINI OPTIMIZATION ---
420
- api_key = os.getenv("GEMINI_API_KEY")
421
- if not api_key:
422
- print("\n⚠️ SKIPPING AI REWRITE: No 'GEMINI_API_KEY' found in environment variables.")
423
- print(" (Set it via 'export GEMINI_API_KEY=your_key' in terminal)")
424
- return
425
-
426
- print("\n🤖 Sending context to Gemini LLM for optimization...")
427
- genai.configure(api_key=api_key)
428
- llm = genai.GenerativeModel('gemini-2.5-flash-lite')
429
-
430
- prompt = f"""
431
- You are a TikTok Virality Expert.
432
-
433
- My Draft Description: "{user_input}"
434
-
435
- Here are 3 successful, viral videos that are similar to my topic:
436
- 1. {top_3_videos[0]}
437
- 2. {top_3_videos[1]}
438
- 3. {top_3_videos[2]}
439
-
440
- Task: Rewrite my draft description to make it go viral.
441
- Use the slang, hashtag style, and structure of the successful examples provided.
442
- Keep it under 20 words plus hashtags. Return ONLY the new description.
443
- """
444
-
445
- try:
446
- response = llm.generate_content(prompt)
447
- improved_idea = response.text.strip()
448
-
449
- print(f"\n✨ IMPROVED IDEA (By Gemini): {improved_idea}")
450
 
451
- # --- STEP 4: RE-EVALUATION ---
452
- new_text_vec = vectorizer.transform([improved_idea]).toarray()
453
- # Update hashtag count for new features
454
- new_meta_vec = np.array([[15, 18, 0, improved_idea.count('#')]])
455
- new_feat_vec = np.hstack((new_text_vec, new_meta_vec))
456
 
457
- new_log = model.predict(new_feat_vec)[0]
458
- new_views = int(np.expm1(new_log))
459
 
460
- print(f"📊 New Predicted Views: {new_views:,}")
461
-
462
- improvement = ((new_views - initial_views) / initial_views) * 100
463
- if improvement > 0:
464
- print(f"🚀 POTENTIAL UPLIFT: +{improvement:.1f}%")
465
- else:
466
- print(f"😐 No significant uplift predicted (Model is strict!).")
467
-
468
- except Exception as e:
469
- print(f"❌ Error calling Gemini API: {e}")
470
-
471
- # ---------------------------------------------------------
472
- # MAIN EXECUTION
473
- # ---------------------------------------------------------
474
- if __name__ == "__main__":
475
- # 1. Pipeline (LLM)
476
- print("🚀 Starting Production Run: Generating 10,000 rows...")
477
- df, _ = generate_synthetic_data_llm(10000)
478
- X_train, X_test, y_train, y_test, tfidf = process_data_pipeline(df)
479
 
480
- # 2. Train Prediction Model (COMPARISON Step)
481
- best_model = compare_and_train_best_model(X_train, y_train, X_test, y_test)
 
 
482
 
483
- # 3. Create Knowledge Base (Embeddings)
484
- knowledge_df, st_model = create_search_index(df)
 
 
 
485
 
486
- # 4. Save Artifacts for App & Plot Importance
487
- print("\n[5/8] Saving Model Artifacts & Finalizing Plots...")
 
488
 
489
- # Plot Feature Importance (Now that we have the winner)
490
- plot_feature_importance(best_model, tfidf)
491
-
492
- # Use Pickle for Model (Generic)
493
- with open("viral_model.pkl", "wb") as f:
494
- pickle.dump(best_model, f)
495
- print(" - Model saved to 'viral_model.pkl'")
496
 
497
- with open("tfidf_vectorizer.pkl", "wb") as f:
498
- pickle.dump(tfidf, f)
499
- print(" - Vectorizer saved to 'tfidf_vectorizer.pkl'")
500
 
501
- # 5. User Interaction Loop
502
- while True:
503
- print("\n" + "-"*30)
504
- user_input = input("Enter your video idea (or 'q' to quit): ")
505
- if user_input.lower() == 'q':
506
- break
507
-
508
- optimize_content_with_gemini(
509
- user_input=user_input,
510
- model=best_model,
511
- vectorizer=tfidf,
512
- knowledge_df=knowledge_df,
513
- st_model=st_model
514
- )
 
5
  import warnings
6
  import os
7
  import torch
8
+ import time
9
  import pickle
10
+ import google.generativeai as genai
11
+ from datasets import load_dataset
12
+ from sentence_transformers import SentenceTransformer
13
+ from sklearn.metrics import silhouette_score, mean_squared_error, mean_absolute_error, r2_score
14
  from sklearn.ensemble import RandomForestRegressor
15
+ from sklearn.preprocessing import LabelEncoder
16
+ from sklearn.metrics.pairwise import cosine_similarity
17
  from xgboost import XGBRegressor
18
  from sklearn.linear_model import LinearRegression
19
+ from dotenv import load_dotenv
20
 
21
+ # Load environment variables
22
+ load_dotenv()
23
+
24
+ # Configuration
25
  warnings.filterwarnings('ignore')
26
  pd.set_option('display.max_columns', None)
27
  device = "mps" if torch.backends.mps.is_available() else "cpu"
28
+
29
  print(f"🚀 Optimization: Running on {device.upper()} device")
30
 
31
  if not os.path.exists('project_plots'):
32
  os.makedirs('project_plots')
33
 
34
+
35
  # ---------------------------------------------------------
36
+ # 1. LOAD DATA
37
  # ---------------------------------------------------------
38
+ def load_data():
39
+ print(f"\n[1/5] Loading Dataset from Hugging Face...")
40
  try:
41
+ dataset = load_dataset("MatanKriel/social-assitent-synthetic-data")
42
+ if 'train' in dataset:
43
+ df = dataset['train'].to_pandas()
44
+ else:
45
+ df = dataset.to_pandas()
46
+
47
+ print(f" -> ✅ Loaded {len(df)} rows.")
48
+
49
+ # Basic Preprocessing
50
+ if 'views' in df.columns:
51
+ # Create Log Targets for better regression
52
+ df['log_views'] = np.log1p(df['views'])
53
+
54
+ return df
55
  except Exception as e:
56
+ print(f" ❌ Error loading data: {e}")
57
+ return pd.DataFrame()
58
 
59
+ # ---------------------------------------------------------
60
+ # 2. EMBEDDING BENCHMARK
61
+ # ---------------------------------------------------------
62
+ def benchmark_and_select_model(df):
63
+ print("\n[2/5] Benchmarking Embedding Models...")
64
 
65
+ models = [
66
+ "sentence-transformers/all-MiniLM-L6-v2",
67
+ "sentence-transformers/all-mpnet-base-v2",
68
+ "BAAI/bge-small-en-v1.5"
 
 
 
 
 
 
 
 
69
  ]
70
 
71
+ results = []
72
 
73
+ # We need labels for the Silhouette Score (cluster quality);
74
+ # 'category' serves as the ground truth for semantic clusters.
75
+ if 'category' not in df.columns:
76
+ print("⚠️ No 'category' column. Using a single dummy label.")
77
+ df['category'] = 'Unknown'
78
+ labels = df['category'].values
79
+
80
+ # Sample for speed if dataset is huge (>5k)
81
+ sample_df = df.sample(min(len(df), 3000), random_state=42)
82
+ sample_texts = sample_df['description'].fillna("").tolist()
83
+ sample_labels = sample_df['category'].values
85
+
86
+ print(f"{'Model':<40} | {'Time (s)':<10} | {'Silhouette':<10}")
87
+ print("-" * 65)
88
+
89
+ best_score = -2
90
+ best_model_name = models[0] # Default
91
+
92
+ for model_name in models:
93
  try:
94
+ st_model = SentenceTransformer(model_name, device=device)
95
 
96
+ # Measure Encoding Time
97
+ start_t = time.time()
98
+ embeddings = st_model.encode(sample_texts, convert_to_numpy=True, show_progress_bar=False)
99
+ time_taken = time.time() - start_t
100
+
101
+ # Measure Cluster Quality
102
+ score = silhouette_score(embeddings, sample_labels)
103
+
104
+ results.append({
105
+ "Model": model_name.split('/')[-1], # Short name
106
+ "Time (s)": time_taken,
107
+ "Silhouette Score": score
108
+ })
109
+
110
+ print(f"{model_name:<40} | {time_taken:.2f} | {score:.4f}")
111
+
112
+ if score > best_score:
113
+ best_score = score
114
+ best_model_name = model_name
115
 
116
  except Exception as e:
117
+ print(f"❌ Error with {model_name}: {e}")
 
118
 
119
+ print("-" * 65)
120
+ print(f"🏆 Winner: {best_model_name} (Score: {best_score:.4f})")
121
+
122
+ # --- PLOTTING ---
123
+ if results:
124
+ res_df = pd.DataFrame(results)
125
+ fig, axes = plt.subplots(1, 2, figsize=(14, 6))
126
+
127
+ # 1. Time Plot
128
+ sns.barplot(data=res_df, x='Model', y='Time (s)', ax=axes[0], palette='Blues_d')
129
+ axes[0].set_title('Encoding Speed (Lower is Better)')
130
+ axes[0].tick_params(axis='x', rotation=45)
131
+
132
+ # 2. Quality Plot
133
+ sns.barplot(data=res_df, x='Model', y='Silhouette Score', ax=axes[1], palette='Greens_d')
134
+ axes[1].set_title('Clustering Quality (Higher is Better)')
135
+ axes[1].tick_params(axis='x', rotation=45)
136
+
137
+ plt.tight_layout()
138
+ plt.savefig('project_plots/embedding_benchmark.png')
139
+ plt.close()
140
+ print(" -> 📊 Plot saved: 'project_plots/embedding_benchmark.png'")
141
+
142
+ # Save the winner name for app.py
143
+ with open("embedding_model_name.txt", "w") as f:
144
+ f.write(best_model_name)
145
+
146
+ return best_model_name
147
+
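The benchmark above picks the embedding model whose vectors best separate the content clusters, using the silhouette score. As a toy illustration (not code from this repo; the 2-D points and labels are made up), the score approaches 1 when same-label points sit close together and far from other clusters, and drops below 0 when labels straddle clusters:

```python
# Toy sketch: silhouette_score rewards tight, well-separated clusters.
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated "clusters" of fake 2-D embeddings
tight = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
good = silhouette_score(tight, [0, 0, 1, 1])   # close to 1

# The same labels assigned across overlapping blobs score poorly
mixed = np.array([[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]])
bad = silhouette_score(mixed, [0, 0, 1, 1])    # negative

print(f"separated: {good:.2f}, overlapping: {bad:.2f}")
```

This is why a higher bar in the "Clustering Quality" plot indicates a better embedding model for the retrieval step.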
+# ---------------------------------------------------------
+# 3. GENERATE KNOWLEDGE BASE
+# ---------------------------------------------------------
+def generate_knowledge_base(df, model_name):
+    print(f"\n[3/5] Generating Embeddings with Winner ({model_name})...")
+
+    st_model = SentenceTransformer(model_name, device=device)
+
+    # Encode ALL descriptions
+    embeddings = st_model.encode(df['description'].fillna("").tolist(),
+                                 convert_to_numpy=True,
+                                 show_progress_bar=True)
+
+    # Store in DataFrame
+    df['embedding'] = list(embeddings)
+    return df, st_model
 # ---------------------------------------------------------
+# 4. TRAIN REGRESSION MODEL
 # ---------------------------------------------------------
+def train_regressor(df):
+    print("\n[4/5] Training View Prediction Model...")
+
+    # Feature Engineering
+    # 1. Semantic Features (The Embeddings)
+    X_text = np.stack(df['embedding'].values)
+
+    # 2. Meta Features (Duration, etc.)
+    num_cols = ['duration', 'hour_of_day', 'followers']
+    cat_cols = ['category', 'gender', 'day_of_week', 'age']
+
+    # Fill missing numerics
+    for c in num_cols:
+        if c not in df.columns:
+            df[c] = 0
+
+    # Process Categoricals (Label Encoding)
+    for c in cat_cols:
+        if c not in df.columns:
+            df[c] = 'Unknown'
+        le = LabelEncoder()
+        df[c + '_encoded'] = le.fit_transform(df[c].astype(str))
+
+    # Combine all numeric features (original numeric + encoded categorical)
+    final_meta_cols = num_cols + [c + '_encoded' for c in cat_cols]
+    print(f" -> Features used: Embeddings + {final_meta_cols}")
+
+    X_meta = df[final_meta_cols].values
+
+    # Combine
+    X = np.hstack((X_text, X_meta))
+    y = df['log_views'].values
+
+    # Split (80/20)
+    split = int(len(df) * 0.8)
+    X_train, X_test = X[:split], X[split:]
+    y_train, y_test = y[:split], y[split:]
+
+    # Model Comparison
     models = {
+        "RandomForest": RandomForestRegressor(n_estimators=100, max_depth=10, n_jobs=-1),
+        "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, n_jobs=-1),
+        "LinearReg": LinearRegression()
     }
+
+    best_model = None
+    best_rmse = float('inf')
+    results = []  # Store for plotting
+
+    print(f"{'Model':<15} | {'RMSE (Views)':<15} | {'RΒ²':<10}")
+    print("-" * 45)
+
     for name, model in models.items():
         model.fit(X_train, y_train)
         preds_log = model.predict(X_test)
+
+        # Convert log predictions back to real views
         preds_real = np.expm1(preds_log)
+        y_real = np.expm1(y_test)
+
+        rmse = np.sqrt(mean_squared_error(y_real, preds_real))
+        r2 = r2_score(y_test, preds_log)
+
+        results.append({
+            "Model": name,
+            "RMSE": rmse,
+            "R2": r2
+        })
+
+        print(f"{name:<15} | {rmse:,.0f} | {r2:.3f}")
+
+        if rmse < best_rmse:
+            best_rmse = rmse
+            best_model = model
+
+    print("-" * 45)
+    print(f"πŸ† Best Regressor: {type(best_model).__name__}")
+
     # --- PLOTTING ---
+    if results:
+        res_df = pd.DataFrame(results)
+        fig, axes = plt.subplots(1, 2, figsize=(14, 6))
+
+        # 1. RMSE Plot
+        sns.barplot(data=res_df, x='Model', y='RMSE', ax=axes[0], palette='Reds_d')
+        axes[0].set_title('Prediction Error (RMSE) - Lower is Better')
+
+        # 2. R2 Plot
+        sns.barplot(data=res_df, x='Model', y='R2', ax=axes[1], palette='Greens_d')
+        axes[1].set_title('Explained Variance (RΒ²) - Higher is Better')
+        axes[1].set_ylim(0, 1)  # RΒ² is usually 0-1
+
+        plt.tight_layout()
+        plt.savefig('project_plots/regression_comparison.png')
+        plt.close()
+        print(" -> πŸ“Š Plot saved: 'project_plots/regression_comparison.png'")
+
+    # Save Model
+    with open("viral_model.pkl", "wb") as f:
+        pickle.dump(best_model, f)
+    print(" -> βœ… Model saved to 'viral_model.pkl'")
+
+    return best_model
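`train_regressor` fits on `log_views` and maps predictions back to raw views with `np.expm1`. Assuming `log_views` was produced upstream with `np.log1p` (that step is not shown in this hunk), the pair is an exact inverse, which is why RMSE can be reported on the real view scale. A minimal sketch:

```python
# Sketch of the log-view transform assumed by train_regressor:
# view counts are heavy-tailed, so the target is log1p(views),
# and predictions are mapped back with expm1 (the exact inverse).
import numpy as np

views = np.array([100, 1_000, 1_000_000])
log_views = np.log1p(views)      # compressed, learnable scale
recovered = np.expm1(log_views)  # exact round trip

print(np.allclose(recovered, views))  # True
```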
273
+
274
+
275
+ if __name__ == "__main__":
276
+ # EXECUTION PIPELINE
277
 
278
+ # 1. Load
279
+ df = load_data()
280
+ if df.empty: exit()
281
 
282
+ # 2. Benchmark
283
+ best_emb_model = benchmark_and_select_model(df)
 
 
 
 
 
284
 
285
+ # 3. Generate Knowledge Base
286
+ df, st_model = generate_knowledge_base(df, best_emb_model)
 
287
 
288
+ # 4. Train
289
+ reg_model = train_regressor(df)
290
+
 
 
 
 
 
 
 
 
 
 
 
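The pipeline leaves two artifacts behind: `viral_model.pkl` and `embedding_model_name.txt`. A hypothetical sketch of the consuming side (the function name and paths are assumptions, not code from this commit) shows how an inference script such as `app.py` could reload them:

```python
# Hypothetical loader (names assumed, not from this commit):
# reads the pickled regressor and the winning embedding model name
# that the training pipeline wrote to disk.
import pickle

def load_artifacts(model_path="viral_model.pkl",
                   name_path="embedding_model_name.txt"):
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(name_path) as f:
        embedding_model_name = f.read().strip()
    return model, embedding_model_name
```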
project_plots/diversity_plot.png DELETED

Git LFS Details

  • SHA256: 84b06bd4e538740e3057e7f48631ece2f140b31c376f93e602653b92d8cade26
  • Pointer size: 131 Bytes
  • Size of remote file: 163 kB
tfidf_vectorizer.pkl β†’ project_plots/embedding_benchmark.png RENAMED
File without changes
project_plots/feature_importance.png DELETED

Git LFS Details

  • SHA256: a0ac14b476322d1a9d53149a728ff0b3a6002763d1157478379fa68ea701ae04
  • Pointer size: 130 Bytes
  • Size of remote file: 32 kB
project_plots/model_comparison.png DELETED

Git LFS Details

  • SHA256: a6c178f39a12baa6335e4442ac83bdac07594cfec6e4b2550819f873eda0b2ef
  • Pointer size: 130 Bytes
  • Size of remote file: 17.8 kB
project_plots/{eda_distribution.png β†’ regression_comparison.png} RENAMED
File without changes
raw_social_media_data.parquet DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:92939908f14b69157b0a99ee186ef1f0ff70d54974bfcf14235468674f73d450
3
- size 1185030
 
 
 
 
tiktok_knowledge_base.parquet DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:82dacd5da6cc1e8f9a62db8e8b6d68f5d5e466300d94dc7707c7afd342a97594
- size 17274184
upload_model_to_hf.py ADDED
@@ -0,0 +1,40 @@
+import os
+from huggingface_hub import HfApi, login
+
+# Configuration
+MODEL_FILE = "viral_model.pkl"
+REPO_ID = "MatanKriel/social-assitent-viral-predictor"
+
+def upload_model():
+    print(f"πŸš€ Preparing to upload '{MODEL_FILE}' to Hugging Face...")
+
+    if not os.path.exists(MODEL_FILE):
+        print(f"❌ Error: {MODEL_FILE} not found. Run model-prep.py first.")
+        return
+
+    try:
+        api = HfApi()
+
+        # Create repo if it doesn't exist
+        print(f"πŸ“¦ Checking repository '{REPO_ID}'...")
+        api.create_repo(repo_id=REPO_ID, exist_ok=True, repo_type="model")
+
+        # Upload file
+        print(f"πŸ“€ Uploading {MODEL_FILE}...")
+        api.upload_file(
+            path_or_fileobj=MODEL_FILE,
+            path_in_repo=MODEL_FILE,
+            repo_id=REPO_ID,
+            repo_type="model"
+        )
+
+        print("\nβœ… Upload Complete!")
+        print(f"πŸ”— Model available at: https://huggingface.co/{REPO_ID}")
+        print("πŸ’‘ You can now run 'python app.py' and it will download this model.")
+
+    except Exception as e:
+        print(f"\n❌ Error during upload: {e}")
+        print("πŸ’‘ Tip: Ensure you are logged in. Run 'huggingface-cli login' if needed.")
+
+if __name__ == "__main__":
+    upload_model()
viral_model.pkl CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:92f2ca0ca3bf30dd6a5d7e84d8ebff5612134ff895124e03cb51586a000d9527
- size 214620
+ oid sha256:301463409cf3d6c05f45fe8a31244fafe1ea7bb88619b1fafa35fcab4e207acc
+ size 320476