title: Social Media Virality
emoji: 📺
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
pinned: false
Social Media Virality Prediction & Optimization Project
Course: Data Science & Machine Learning Applications
Project: Viral Content Assistant
1. Project Overview
This project aims to develop a data-driven system capable of predicting the viral potential of short-form video content (e.g., TikTok) and optimizing it using Generative AI. By leveraging Natural Language Processing (NLP) and Machine Learning (ML), the system analyzes video descriptions and metadata to forecast view counts and prescribes actionable improvements to maximize engagement.
The core solution consists of a machine learning pipeline for virality prediction and a web application (Gradio) for real-time user interaction.
2. Data Science Methodology
2.1 Data Acquisition & Generation
Due to privacy restrictions and API limitations of social platforms, we simulated a realistic dataset reflecting 2025 social media trends.
- Source: Synthetic generation using the `Faker` library and `numpy` probabilistic distributions.
- Volume: 10,000 samples.
- Features:
- Textual: Video descriptions rich in slang (e.g., "Skibidi", "Girl Dinner"), hashtags, and emojis.
- Temporal: Upload hour, day of week.
- Meta: Video duration, category (Gaming, Beauty, etc.).
- Target Variable: `views` (log-normally distributed to mimic real-world viral discrepancies).
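A minimal sketch of how such a dataset could be drawn with `numpy` alone. The function names, distribution parameters, duration range, and category list are illustrative assumptions, not the values used in the project; the `Faker`-generated descriptions are omitted.

```python
import numpy as np


def make_synthetic_views(n_samples=10_000, seed=0):
    """Draw view counts from a log-normal distribution (heavy right tail)."""
    rng = np.random.default_rng(seed)
    # mean and sigma describe the underlying normal; both values are assumptions.
    raw = rng.lognormal(mean=8.0, sigma=1.5, size=n_samples)
    return np.floor(raw).astype(np.int64)


def make_synthetic_meta(n_samples=10_000, seed=0):
    """Draw temporal and meta features uniformly at random (an assumption)."""
    rng = np.random.default_rng(seed)
    return {
        "hour_of_day": rng.integers(0, 24, size=n_samples),
        "day_of_week": rng.integers(0, 7, size=n_samples),
        "video_duration_sec": rng.integers(5, 181, size=n_samples),
        "category": rng.choice(["Gaming", "Beauty"], size=n_samples),
    }


views = make_synthetic_views()
meta = make_synthetic_meta()
```

Because the draw is log-normal, the sample mean of `views` sits well above its median, which is the skew the target variable is meant to reproduce.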
2.2 Exploratory Data Analysis (EDA)
We analyzed the distribution of the target variable and feature correlations.
- Observation: View counts are heavily right-skewed, resembling a "power law" distribution; most videos have few views, while a few "viral hits" capture the majority.
- Preprocessing: We applied a log transformation (`np.log1p`) to the `views` variable to normalize the distribution for regression models.
Figure 1: Distribution of Log-Transformed View Counts.
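The transform and its inverse can be illustrated on a few made-up view counts. `np.log1p` computes log(1 + x), so a video with zero views maps to 0 rather than negative infinity, and `np.expm1` converts a prediction back to raw views.

```python
import numpy as np

# Illustrative view counts, including a zero-view video.
views = np.array([0, 9, 99, 999_999], dtype=np.float64)

log_views = np.log1p(views)     # log(1 + x), defined at x = 0
restored = np.expm1(log_views)  # inverse transform back to raw counts
```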
2.3 Feature Engineering
- Text Embeddings: We used TF-IDF Vectorization (Top 2,000 features) to convert unstructured text descriptions into numerical vectors.
- Meta Features: Encoded
is_weekend,hour_of_day, andvideo_duration_sec. - Data Splitting: A Temporal Split (80/20) was used instead of a random split to prevent data leakage, ensuring the model predicts future videos based on past trends.
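A temporal split can be sketched as sorting by upload time and cutting at the 80% mark. The helper names and the Monday-is-0 day convention below are assumptions made for illustration.

```python
import numpy as np


def temporal_split(timestamps, train_fraction=0.8):
    """Return (train_idx, test_idx) so that no training row is later than any test row."""
    order = np.argsort(timestamps, kind="stable")
    cut = int(len(order) * train_fraction)
    return order[:cut], order[cut:]


def is_weekend(day_of_week):
    """Encode weekend as 1, weekday as 0 (assumes 0 = Monday ... 6 = Sunday)."""
    return (np.asarray(day_of_week) >= 5).astype(np.int8)


# Toy upload times in arbitrary units, deliberately out of order.
timestamps = np.array([5, 1, 4, 2, 3, 10, 9, 8, 7, 6])
train_idx, test_idx = temporal_split(timestamps)
```

A random split would let the model see videos uploaded after some of its test rows, which is the leakage the temporal cut avoids.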
3. Model Development & Evaluation
We evaluated three distinct algorithms to solve the regression problem (predicting log-views):
- Linear Regression: Baseline model for interpretability.
- Random Forest Regressor: Ensemble method to capture non-linear relationships.
- XGBoost Regressor: Gradient boosting machine known for state-of-the-art tabular performance.
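A hedged sketch of how the three candidates might be trained side by side. The hyperparameters and the toy data are assumptions, not the project's configuration, and the XGBoost model is only included when the `xgboost` package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def build_candidates(seed=0):
    """Return a name -> estimator mapping; hyperparameters are illustrative."""
    models = {
        "linear_regression": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=50, random_state=seed),
    }
    try:
        from xgboost import XGBRegressor
        models["xgboost"] = XGBRegressor(n_estimators=100, random_state=seed)
    except ImportError:
        pass  # xgboost not installed; compare the remaining models only
    return models


def fit_and_score(models, X_train, y_train, X_test, y_test):
    """Fit each model and return its test-set RMSE."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores[name] = float(np.sqrt(np.mean((y_test - pred) ** 2)))
    return scores


# Toy regression data standing in for the TF-IDF and meta feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)
scores = fit_and_score(build_candidates(), X[:160], y[:160], X[160:], y[160:])
```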
3.1 Comparative Metrics
Models were assessed using:
- RMSE (Root Mean Squared Error): The primary metric for regression accuracy.
- R² (Coefficient of Determination): The proportion of variance in the target explained by the model.
- F1-Score: Used as a proxy for classification performance, i.e. predicting whether a video reaches the "Viral Threshold" (top 20%).
Figure 2: Performance comparison across different architectures.
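The three metrics can be written out directly with `numpy`. How the viral threshold is derived is an assumption here: the sketch takes the 80th percentile of the true values and applies the same cutoff to the predictions.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def viral_f1(y_true, y_pred, top_fraction=0.2):
    """F1 for the 'is viral' label, thresholding at the top-20% cutoff of y_true."""
    threshold = np.quantile(y_true, 1.0 - top_fraction)
    actual = y_true >= threshold
    predicted = y_pred >= threshold
    tp = np.sum(actual & predicted)
    fp = np.sum(~actual & predicted)
    fn = np.sum(actual & ~predicted)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Toy log-view targets and a prediction that is off by exactly 1 everywhere.
y_true = np.arange(10, dtype=np.float64)
y_pred = y_true + 1.0
```

On this toy input the constant offset gives an RMSE of 1.0, and the over-prediction flags one extra video as viral, which lowers precision while recall stays perfect.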
3.2 Result
The XGBoost Regressor outperformed other models, achieving the lowest RMSE on the test set. This model was selected for the final deployment.
4. Advanced Analysis: Embeddings & Semantic Search
Beyond simple regression, we implemented a semantic search engine using SentenceTransformers (all-MiniLM-L6-v2).
- Purpose: To retrieve historical viral hits conceptually similar to the user's new idea.
- Clustering: We visualized the semantic space using PCA (Principal Component Analysis).
Figure 3: Semantic clustering of video descriptions.
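The retrieval and visualization steps reduce to cosine similarity and a two-component PCA over the embedding matrix. The sketch below uses toy 3-dimensional vectors in place of real `all-MiniLM-L6-v2` embeddings, and the function names are hypothetical.

```python
import numpy as np


def top_k_similar(query_vec, corpus_vecs, k=3):
    """Return indices and cosine similarities of the k corpus rows closest to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]


def pca_2d(vectors):
    """Project row vectors onto their first two principal components via SVD."""
    centered = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Toy stand-ins for sentence embeddings of four historical videos.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.0])

indices, similarities = top_k_similar(query, corpus, k=2)
points = pca_2d(corpus)
```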
5. Application & Deployment
The final deliverable is an interactive web application built with Gradio.
5.1 System Architecture
The system is decoupled into two main components:
- Training Pipeline (`model-prep.py`): Runs offline to generate synthetic data, train the XGBoost model, and create the vector database. It saves these artifacts (`viral_model.json`, `tfidf_vectorizer.pkl`, `tiktok_knowledge_base.parquet`).
- Inference App (`app.py`): A lightweight Gradio app that loads the pre-trained artifacts to serve real-time predictions without needing to retrain.
Data Flow:
- Input: User-provided video description.
- Inference: Loaded XGBoost model predicts view count.
- Retrieval: App searches the pre-computed Parquet knowledge base for similar viral videos.
- Generative Optimization: Google Gemini 2.5 Flash Lite rewrites the draft.
- Output: Predictions, Similar Videos, and AI-Optimized content.
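The data flow above can be sketched as one function that chains the three stages. Everything named here (`AssistantResult`, `run_assistant`, and the three callables) is hypothetical; in the real app the callables would wrap the loaded XGBoost model, the Parquet knowledge base, and the Gemini call.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AssistantResult:
    predicted_views: int
    similar_videos: List[str]
    optimized_description: str


def run_assistant(
    description: str,
    predict_log_views: Callable[[str], float],
    retrieve_similar: Callable[[str], List[str]],
    rewrite: Callable[[str, List[str]], str],
) -> AssistantResult:
    """Run inference, retrieval, and generative optimization in order."""
    # The regressor is trained on log1p(views), so invert with expm1.
    log_views = predict_log_views(description)
    predicted_views = max(0, round(math.expm1(log_views)))
    similar = retrieve_similar(description)
    optimized = rewrite(description, similar)
    return AssistantResult(predicted_views, similar, optimized)


# Stub stages standing in for the real model, knowledge base, and LLM.
result = run_assistant(
    "my draft",
    predict_log_views=lambda d: math.log1p(1000),
    retrieve_similar=lambda d: ["a", "b"],
    rewrite=lambda d, similar: d.upper(),
)
```

Passing the stages in as callables keeps the flow testable without loading any model artifacts or calling an external API.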
5.2 Usage Instructions
To run the project locally for assessment:
- Environment Setup:
      python3 -m venv .venv
      source .venv/bin/activate
      pip install -r requirements.txt
- Configuration: Ensure the `.env` file contains a valid `GEMINI_API_KEY`.
- Execution:
      python app.py
  Access the UI at http://localhost:7860.
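A small, hypothetical pre-flight check for the configuration step. It only inspects the process environment; how `app.py` actually loads the `.env` file is not shown in this document, so that part is left out.

```python
import os


def get_gemini_api_key(environ=None):
    """Return GEMINI_API_KEY from the environment, or raise if it is missing or blank."""
    env = os.environ if environ is None else environ
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; add it to the .env file or export it."
        )
    return key
```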
6. Conclusion
This project demonstrates a complete end-to-end Data Science workflow: from synthetic data creation and rigorous model evaluation to the deployment of a user-facing AI application. The integration of predictive analytics (XGBoost) with generative AI (Gemini) provides a robust tool for content creators.
🏆 Credits
- Project Author: Matan Kriel
- Project Author: Odeya Shmuel