---
title: Social Media Virality
emoji: 📺
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.9.0
app_file: app.py
pinned: false
---

# Social Media Virality Prediction & Optimization Project

**Course:** Data Science & Machine Learning Applications
**Project:** Viral Content Assistant

## 1. Project Overview

This project aims to develop a data-driven system capable of predicting the viral potential of short-form video content (e.g., TikTok) and optimizing it using Generative AI. By leveraging Natural Language Processing (NLP) and Machine Learning (ML), the system analyzes video descriptions and metadata to forecast view counts and prescribes actionable improvements to maximize engagement.

The core solution consists of a machine learning pipeline for virality prediction and a web application (Gradio) for real-time user interaction.

## 2. Data Science Methodology

### 2.1 Data Acquisition & Generation

Due to privacy restrictions and API limitations of social platforms, we simulated a realistic dataset reflecting 2025 social media trends.

- **Source:** Synthetic generation using the Faker library and NumPy probabilistic distributions.
- **Volume:** 10,000 samples.
- **Features:**
  - **Textual:** Video descriptions rich in slang (e.g., "Skibidi", "Girl Dinner"), hashtags, and emojis.
  - **Temporal:** Upload hour, day of week.
  - **Meta:** Video duration, category (Gaming, Beauty, etc.).
- **Target variable:** `views` (log-normally distributed to mimic real-world viral disparities).
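As a self-contained sketch of this generation step (using NumPy only, with a fixed slang list standing in for Faker's text fields; all column names here are illustrative, not the pipeline's exact schema):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N = 1_000  # the real pipeline generates 10,000 samples

# Illustrative stand-ins for the Faker-generated text fields.
slang = ["Skibidi", "Girl Dinner", "POV", "no cap"]
categories = ["Gaming", "Beauty", "Food", "Fitness"]

df = pd.DataFrame({
    "description": [f"{rng.choice(slang)} moment #fyp" for _ in range(N)],
    "upload_hour": rng.integers(0, 24, N),
    "day_of_week": rng.integers(0, 7, N),
    "video_duration_sec": rng.integers(5, 180, N),
    "category": rng.choice(categories, N),
    # A log-normal target mimics the heavy-tailed view counts.
    "views": rng.lognormal(mean=8, sigma=2, size=N).astype(int),
})
```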

### 2.2 Exploratory Data Analysis (EDA)

We analyzed the distribution of the target variable and feature correlations.

- **Observation:** View counts follow a power-law-like distribution: most videos receive few views, while a handful of viral hits capture the majority of total views.
- **Preprocessing:** We applied a log transformation (`np.log1p`) to the `views` variable to normalize the distribution for regression models.

*Figure 1: Distribution of log-transformed view counts.*
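A minimal illustration of the transform and its inverse (the view counts here are made up):

```python
import numpy as np

views = np.array([120, 3_500, 2_000_000])  # illustrative raw view counts

# log1p handles zero-view videos safely (log1p(0) == 0).
log_views = np.log1p(views)

# After predicting in log space, expm1 inverts the transform
# to recover a view count on the original scale.
recovered = np.expm1(log_views)
```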

### 2.3 Feature Engineering

- **Text embeddings:** TF-IDF vectorization (top 2,000 features) converts the unstructured text descriptions into numerical vectors.
- **Meta features:** Encoded `is_weekend`, `hour_of_day`, and `video_duration_sec`.
- **Data splitting:** A temporal split (80/20) was used instead of a random split to prevent data leakage, ensuring the model predicts future videos from past trends.
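The two steps above can be sketched with scikit-learn (the toy corpus and upload timestamps are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Skibidi gaming clip #fyp",
    "Girl Dinner recipe hack",
    "POV morning routine",
    "no cap beauty tutorial",
]

# Top-2,000 vocabulary, as in the report (a toy corpus yields far fewer terms).
vectorizer = TfidfVectorizer(max_features=2000)
X_text = vectorizer.fit_transform(descriptions)

# Temporal split: sort by upload time, then cut at the 80% mark so the
# test set contains only videos newer than anything in the training set.
upload_ts = np.array([3, 1, 4, 2])  # illustrative upload order
order = np.argsort(upload_ts)
cut = int(len(order) * 0.8)
train_idx, test_idx = order[:cut], order[cut:]
```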

## 3. Model Development & Evaluation

We evaluated three distinct algorithms for the regression problem (predicting log-views):

1. **Linear Regression:** Baseline model for interpretability.
2. **Random Forest Regressor:** Ensemble method that captures non-linear relationships.
3. **XGBoost Regressor:** Gradient-boosting machine known for state-of-the-art tabular performance.
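A minimal sketch of the comparison loop on synthetic data (not the project's actual training script; it falls back to scikit-learn's gradient boosting if `xgboost` is not installed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

try:
    from xgboost import XGBRegressor
except ImportError:  # stand-in when xgboost is unavailable
    from sklearn.ensemble import GradientBoostingRegressor as XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Stand-in for log-views: a linear term plus a non-linear one.
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=0),
}
rmse = {}
for name, model in models.items():
    model.fit(X[:160], y[:160])        # first 80% as "past" data
    pred = model.predict(X[160:])      # last 20% as "future" data
    rmse[name] = float(np.sqrt(np.mean((pred - y[160:]) ** 2)))
```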

### 3.1 Comparative Metrics

Models were assessed using:

- **RMSE (Root Mean Squared Error):** The primary metric for regression accuracy.
- **R² (coefficient of determination):** The share of variance explained by the model.
- **F1-score:** A proxy for classification performance, treating videos in the top 20% of views as above the "viral threshold".
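One way to compute all three on simulated predictions (the data here is synthetic; the 80th-percentile cutoff mirrors the top-20% viral threshold described above):

```python
import numpy as np
from sklearn.metrics import f1_score, r2_score

rng = np.random.default_rng(1)
y_true = rng.lognormal(8, 2, 500)              # simulated actual views
y_pred = y_true * rng.lognormal(0, 0.3, 500)   # noisy simulated predictions

# Regression metrics are computed in log space, matching the training target.
log_true, log_pred = np.log1p(y_true), np.log1p(y_pred)
rmse = float(np.sqrt(np.mean((log_pred - log_true) ** 2)))
r2 = r2_score(log_true, log_pred)

# "Viral" = top 20% of actual views; threshold both series at that cutoff.
threshold = np.quantile(y_true, 0.8)
f1 = f1_score(y_true >= threshold, y_pred >= threshold)
```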

*Figure 2: Performance comparison across model architectures.*

### 3.2 Results

The XGBoost Regressor outperformed the other models, achieving the lowest RMSE on the test set, and was selected for the final deployment.

## 4. Advanced Analysis: Embeddings & Semantic Search

Beyond regression, we implemented a semantic search engine using SentenceTransformers (`all-MiniLM-L6-v2`).

- **Purpose:** Retrieve historical viral hits conceptually similar to the user's new idea.
- **Clustering:** We visualized the semantic space using PCA (Principal Component Analysis).
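The retrieval mechanics reduce to cosine similarity over embedding vectors. A self-contained sketch (random vectors stand in for the real MiniLM embeddings, which would come from `SentenceTransformer("all-MiniLM-L6-v2").encode(...)`):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the knowledge-base and query embeddings
# (384 dimensions, matching all-MiniLM-L6-v2's output size).
kb_embeddings = rng.normal(size=(100, 384))
query = rng.normal(size=384)

def top_k_similar(query, kb, k=5):
    """Return indices of the k knowledge-base rows most cosine-similar to the query."""
    kb_norm = kb / np.linalg.norm(kb, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = kb_norm @ q_norm
    return np.argsort(scores)[::-1][:k]

hits = top_k_similar(query, kb_embeddings)
```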

*Figure 3: Semantic clustering of video descriptions.*

## 5. Application & Deployment

The final deliverable is an interactive web application built with Gradio.

### 5.1 System Architecture

The system is decoupled into two main components:

1. **Training pipeline (`model-prep.py`):** Runs offline to generate the synthetic data, train the XGBoost model, and build the vector database, saving the artifacts `viral_model.json`, `tfidf_vectorizer.pkl`, and `tiktok_knowledge_base.parquet`.
2. **Inference app (`app.py`):** A lightweight Gradio app that loads the pre-trained artifacts to serve real-time predictions without retraining.

**Data flow:**

1. **Input:** User-provided video description.
2. **Inference:** The loaded XGBoost model predicts the view count.
3. **Retrieval:** The app searches the pre-computed Parquet knowledge base for similar viral videos.
4. **Generative optimization:** Google Gemini 2.5 Flash Lite rewrites the draft.
5. **Output:** Prediction, similar videos, and AI-optimized content.
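The flow above can be sketched as a single function; every component below is a stand-in (the real app loads `viral_model.json`, `tfidf_vectorizer.pkl`, and `tiktok_knowledge_base.parquet`, and calls the Gemini API), so only the wiring is meaningful here:

```python
import numpy as np

def predict_views(description: str) -> int:
    # Stand-in: a real model would predict log-views; expm1 inverts the log transform.
    return int(np.expm1(10.0))

def retrieve_similar(description: str) -> list[str]:
    # Stand-in for the cosine-similarity search over the Parquet knowledge base.
    return ["Similar viral video #1", "Similar viral video #2"]

def optimize_draft(description: str) -> str:
    # Stand-in for the Gemini rewrite step.
    return description + " #fyp"

def assistant(description: str) -> dict:
    """Mirror the five-step flow: input -> inference -> retrieval -> rewrite -> output."""
    return {
        "predicted_views": predict_views(description),
        "similar_videos": retrieve_similar(description),
        "optimized": optimize_draft(description),
    }

result = assistant("POV: my cat judges my girl dinner")
```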

### 5.2 Usage Instructions

To run the project locally for assessment:

1. **Environment setup:**

   ```bash
   python3 -m venv .venv
   source .venv/bin/activate
   pip install -r requirements.txt
   ```

2. **Configuration:** Ensure the `.env` file contains a valid `GEMINI_API_KEY`.
3. **Execution:**

   ```bash
   python app.py
   ```

   Access the UI at http://localhost:7860.

## 6. Conclusion

This project demonstrates a complete end-to-end Data Science workflow: from synthetic data creation and rigorous model evaluation to the deployment of a user-facing AI application. The integration of predictive analytics (XGBoost) with generative AI (Gemini) provides a robust tool for content creators.

## 🏆 Credits

- **Project Authors:** Matan Kriel & Odeya Shmuel