---
title: Social Media Virality
emoji: 📺
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
pinned: false
---

# Social Media Virality Prediction & Optimization Project

**Course**: Data Science & Machine Learning Applications
**Project**: Viral Content Assistant

## 1. Project Overview

This project develops a data-driven system that predicts the viral potential of short-form video content (e.g., TikTok) and optimizes it using Generative AI. Leveraging Natural Language Processing (NLP) and Machine Learning (ML), the system analyzes video descriptions and metadata to forecast view counts and prescribes actionable improvements to maximize engagement.

The core solution consists of a machine learning pipeline for virality prediction and a Gradio web application for real-time user interaction.

## 2. Data Science Methodology

### 2.1 Data Acquisition & Generation

Due to privacy restrictions and API limitations of social platforms, we simulated a realistic dataset reflecting 2025 social media trends.

* **Source**: Synthetically generated using the `Faker` library and `numpy` probabilistic distributions.
* **Volume**: 10,000 samples.
* **Features**:
    * **Textual**: Video descriptions rich in slang (e.g., "Skibidi", "Girl Dinner"), hashtags, and emojis.
    * **Temporal**: Upload hour, day of week.
    * **Meta**: Video duration, category (Gaming, Beauty, etc.).
* **Target Variable**: `views` (log-normally distributed to mimic real-world viral discrepancies).

### 2.2 Exploratory Data Analysis (EDA)

We analyzed the distribution of the target variable and feature correlations.

* **Observation**: View counts follow a power-law-like distribution; most videos receive few views, while a handful of viral hits capture the majority.
* **Preprocessing**: We applied a log transformation (`np.log1p`) to the `views` variable to normalize the distribution for regression models.
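The generation and preprocessing steps above can be sketched in a few lines of `numpy`. This is a minimal illustration, not the actual `model-prep.py`: the distribution parameters (`mean=8.0`, `sigma=2.0`) and feature ranges are assumptions chosen only to reproduce the heavy right skew described in Section 2.2.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 10_000  # sample count from Section 2.1

# Temporal/meta features drawn from simple distributions.
hour_of_day = rng.integers(0, 24, size=N)
day_of_week = rng.integers(0, 7, size=N)
video_duration_sec = rng.integers(5, 181, size=N)

# Log-normal views: most videos get few views, a handful go viral.
views = rng.lognormal(mean=8.0, sigma=2.0, size=N).astype(int)

# EDA preprocessing step: log1p-transform the target for regression.
log_views = np.log1p(views)

# The raw target is heavily right-skewed (mean >> median);
# the transformed one is roughly symmetric.
print(f"raw:   mean={views.mean():.0f}  median={np.median(views):.0f}")
print(f"log1p: mean={log_views.mean():.2f}  median={np.median(log_views):.2f}")
```

Because the mean of the raw sample sits far above its median while the `log1p` values are near-symmetric, this also shows why RMSE is computed on log-views rather than raw counts.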
![Views Distribution](project_plots/eda_distribution.png)
*Figure 1: Distribution of log-transformed view counts.*

### 2.3 Feature Engineering

* **Text Embeddings**: We used **TF-IDF Vectorization** (top 2,000 features) to convert unstructured text descriptions into numerical vectors.
* **Meta Features**: Encoded `is_weekend`, `hour_of_day`, and `video_duration_sec`.
* **Data Splitting**: A **temporal split** (80/20) was used instead of a random split to prevent data leakage, ensuring the model predicts future videos from past trends.

## 3. Model Development & Evaluation

We evaluated three distinct algorithms for the regression problem (predicting log-views):

1. **Linear Regression**: Baseline model for interpretability.
2. **Random Forest Regressor**: Ensemble method to capture non-linear relationships.
3. **XGBoost Regressor**: Gradient boosting machine known for state-of-the-art tabular performance.

### 3.1 Comparative Metrics

Models were assessed using:

* **RMSE (Root Mean Squared Error)**: The primary metric for regression accuracy.
* **R² (Coefficient of Determination)**: The share of variance explained by the model.
* **F1-Score**: A proxy for classification performance, predicting whether a video crosses the "viral threshold" (top 20% of views).

![Model Leaderboard](project_plots/model_leaderboard.png)
*Figure 2: Performance comparison across different architectures.*

### 3.2 Results

The **XGBoost Regressor** outperformed the other models, achieving the lowest RMSE on the test set, and was selected for the final deployment.

## 4. Advanced Analysis: Embeddings & Semantic Search

Beyond regression, we implemented a semantic search engine using **SentenceTransformers** (`all-MiniLM-L6-v2`).

* **Purpose**: Retrieve historical viral hits conceptually similar to the user's new idea.
* **Clustering**: We visualized the semantic space using PCA (Principal Component Analysis).
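The retrieval step behind the semantic search can be sketched as a cosine-similarity ranking over embedding vectors. The helper `top_k_similar` below is hypothetical (not from the project's code), and random vectors stand in for the 384-dimensional `all-MiniLM-L6-v2` embeddings, which in the real app would come from `SentenceTransformer.encode`.

```python
import numpy as np

def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return indices of the k corpus vectors most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity of each corpus row vs. the query
    return np.argsort(sims)[::-1][:k]

# Stand-in for the embedded knowledge base: 1,000 videos x 384 dims.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 384))

# A "new idea" embedding close to corpus row 42, so it should rank first.
query = corpus[42] + rng.normal(scale=0.01, size=384)

print(top_k_similar(query, corpus))
```

In production the same ranking would run against the embeddings stored in `tiktok_knowledge_base.parquet`, returning the historical viral hits shown to the user.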
![Embedding Clusters](project_plots/embedding_clusters.png)
*Figure 3: Semantic clustering of video descriptions.*

## 5. Application & Deployment

The final deliverable is an interactive web application built with **Gradio**.

### 5.1 System Architecture

The system is decoupled into two main components:

1. **Training Pipeline (`model-prep.py`)**: Runs offline to generate synthetic data, train the XGBoost model, and build the vector database. It saves these artifacts: `viral_model.json`, `tfidf_vectorizer.pkl`, and `tiktok_knowledge_base.parquet`.
2. **Inference App (`app.py`)**: A lightweight Gradio app that loads the pre-trained artifacts to serve real-time predictions without retraining.

**Data Flow**:

1. **Input**: User-provided video description.
2. **Inference**: The loaded XGBoost model predicts the view count.
3. **Retrieval**: The app searches the pre-computed Parquet knowledge base for similar viral videos.
4. **Generative Optimization**: **Google Gemini 2.5 Flash Lite** rewrites the draft.
5. **Output**: Predictions, similar videos, and AI-optimized content.

### 5.2 Usage Instructions

To run the project locally for assessment:

1. **Environment Setup**:
    ```bash
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    ```
2. **Configuration**: Ensure the `.env` file contains a valid `GEMINI_API_KEY`.
3. **Execution**:
    ```bash
    python app.py
    ```
    Access the UI at `http://localhost:7860`.

## 6. Conclusion

This project demonstrates a complete end-to-end Data Science workflow: from synthetic data creation and rigorous model evaluation to the deployment of a user-facing AI application. The integration of predictive analytics (XGBoost) with generative AI (Gemini) provides a robust tool for content creators.

## 🏆 Credits

* **Project Author:** Matan Kriel
* **Project Author:** Odeya Shmuel