---
title: Social Media Virality Assistant
emoji: 🚀
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
pinned: false
---

# 🚀 Social Media Virality Assistant

A machine learning-powered tool that helps content creators predict and optimize the virality potential of their videos using a trained **XGBoost** model and **Google Gemini AI**.

## 🏗️ Architecture & Pipeline

This project consists of two main components: a training pipeline (`model-prep.py`) and an inference application (`app.py`).

### 1. Training Pipeline (`model-prep.py`)

The `model-prep.py` script handles the end-to-end model creation process:

1. **Cloud Data Loading**: Fetches the latest synthetic dataset directly from **Hugging Face** (`MatanKriel/social-assitent-synthetic-data`).
2. **Embedding Benchmark**: Evaluates 3 state-of-the-art embedding models (`MiniLM`, `mpnet-base`, `bge-small`) using the **Silhouette Score** on **Composite Labels** (`Category_ViralClass`).
    * *Why?* Instead of just clustering by topic (e.g., "Gaming"), this forces the model to distinguish between "Viral Gaming Videos" and "Average Gaming Videos".
    * *Selection*: Automatically picks the best-scoring model for this high-resolution task.
3. **Feature Engineering**:
    * Encodes categorical inputs: `category`, `gender`, `day_of_week`, `age`.
    * Combines text embeddings with metadata (`followers`, `duration`, `hour`).
4. **Model Training**: Trains and compares three regression algorithms:
    * Linear Regression
    * Random Forest
    * **XGBoost (Winner)**: Selected for having the lowest RMSE.
5. **Artifact Generation**: Saves the trained model locally (`viral_model.pkl`) and generates performance plots (`project_plots/`).

### 2. Inference Application (`app.py`)

The `app.py` script runs a **Gradio** web interface that pulls artifacts from the cloud at startup:

1. **Initialization**:
    * Downloads the trained `viral_model.pkl` from Hugging Face (`MatanKriel/social-assitent-viral-predictor`).
    * Downloads the dataset to build a Knowledge Base.
    * Generates embeddings on the fly for the Knowledge Base.
2. **Core Features**:
    * **Virality Prediction**: Predicts raw view counts from your draft description and stats.
    * **AI Optimization**: Uses **Google Gemini** to rewrite your description with viral hooks and hashtags, using the top 3 similar videos from the dataset as context.
    * **Semantic Search**: Finds similar successful videos in the Knowledge Base using cosine similarity.

---

## 📊 Model Performance

The training script (`model-prep.py`) automatically generates these benchmarks:

### Embedding Model Comparison

We selected the embedding model that best balances speed and semantic understanding.

![Embedding Benchmark](project_plots/embedding_benchmark.png)

### Regression Model Comparison

We chose the regressor with the lowest error (RMSE) and highest explained variance (R²).

![Model Comparison](project_plots/regression_comparison.png)

---

## 🛠️ Tech Stack

This project is built using:

* **App**: `gradio`, `google-generativeai`
* **ML**: `xgboost`, `scikit-learn`, `sentence-transformers`
* **Data**: `pandas`, `numpy`
* **Cloud**: `huggingface_hub`, `datasets`

---
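
As a rough illustration of the **Semantic Search** feature described under Core Features, the sketch below ranks Knowledge Base entries by cosine similarity to a query vector and keeps the top 3. It is a minimal, standard-library-only sketch, not the code in `app.py`: the function names `cosine_similarity` and `top_k_similar` are hypothetical, and the 3-dimensional vectors are toy values standing in for real sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, kb_vectors, k=3):
    """Return (index, score) pairs for the k most similar Knowledge Base rows."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(kb_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not real model output).
kb = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]

# Indices of the 3 closest Knowledge Base rows, best match first.
results = top_k_similar(query, kb, k=3)
for index, score in results:
    print(f"row {index}: similarity {score:.4f}")
```

In a real pipeline the vectors would come from the selected embedding model, and a vectorized implementation (for example with `numpy`) would likely be preferable for a Knowledge Base of any meaningful size.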