TikTok Viral Hook Predictor - Data Science Walkthrough

📺 Presentation Video


1. Exploratory Data Analysis (EDA) - The Story of the Data

Our journey began by understanding the massive variance in TikTok engagement.

Key Insights:

  • The Long Tail: Most videos have low engagement, while a few outliers reach millions, necessitating a log-scale analysis.
  • Category Performance: Although "Health & Fitness" is a saturated category, "Navigation" and "Kids" content lead in average views per video.
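The long-tail point can be illustrated with a minimal sketch. It assumes synthetic, lognormally distributed view counts (not the project's data) and a hypothetical column name `views`. It shows how a `log1p` transform reduces the skew that a few viral outliers create.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed view counts (illustrative only, not the project's data).
rng = np.random.default_rng(0)
views = pd.Series(rng.lognormal(mean=8.0, sigma=2.0, size=5000), name="views").round()

# log1p(x) = log(1 + x), so zero-view videos map to 0 instead of -inf.
log_views = np.log1p(views)

# Skewness near 0 means a roughly symmetric distribution.
raw_skew = views.skew()
log_skew = log_views.skew()
print(f"skew of raw views:    {raw_skew:.2f}")
print(f"skew of log1p(views): {log_skew:.2f}")
```

On the log scale, the bulk of low-engagement videos and the rare viral ones become visible in the same histogram.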

2. The Baseline Challenge - Linear Regression Failure

Initially, we attempted to predict the exact view count using a standard Linear Regression model.

  • The Problem: The model suffered from "Mean-Seeking" behavior. It predicted the mean value for almost all videos and failed to capture any viral spikes, resulting in a poor fit.
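The mean-seeking effect can be reproduced in a small sketch. The data is synthetic and the features are placeholders, chosen so that the target is heavy-tailed and only weakly related to the inputs. None of this is the project's actual feature set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (assumption): three placeholder features, heavy-tailed target.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
y = np.exp(8 + 0.3 * X[:, 0] + rng.normal(scale=2.0, size=n))

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# A low R^2 and a narrow prediction spread both indicate mean-seeking behavior.
r2 = r2_score(y, pred)
spread_ratio = pred.std() / y.std()
print(f"R^2: {r2:.3f}")
print(f"std(predictions) / std(actual): {spread_ratio:.3f}")
print(f"largest actual: {y.max():,.0f}  largest predicted: {pred.max():,.0f}")
```

When the spread ratio is far below 1, the predictions cluster around the mean and the viral spikes are never reached.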

3. Unsupervised Learning - Clustering Viral Patterns

To discover hidden structures, we applied K-Means Clustering.

  • PCA Visualization: By reducing dimensionality, we confirmed that viral videos (yellow cluster) possess distinct metadata profiles compared to standard content. We used these clusters as a new feature to guide our improved models.
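A minimal sketch of this step follows. It assumes synthetic data with two groups, `n_clusters=2`, and standardized features. The walkthrough does not state the cluster count or the preprocessing, so those choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic metadata (assumption): a large "standard" group and a small distinct group.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(300, 5)),
    rng.normal(4.0, 1.0, size=(60, 5)),
])

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Project to 2D for plotting, e.g. plt.scatter(coords[:, 0], coords[:, 1], c=labels).
coords = PCA(n_components=2).fit_transform(X_scaled)

# Append the cluster label as an extra feature column for downstream models.
X_with_cluster = np.column_stack([X, labels])
print(coords.shape, X_with_cluster.shape)
```

To avoid leakage in a real pipeline, the scaler and K-Means would be fitted on the training split only, then applied to the test split with `transform` and `predict`.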

4. Model Competition & Selection

We tested multiple architectures to find the best fit for this complex data.

  • Winner: The Random Forest model clearly outperformed the other candidates, achieving the highest $R^2$ (0.636) and the lowest error rates.
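A comparison like this can be sketched with cross-validation. The candidate list below is an assumption, since the walkthrough names only the Random Forest. The data comes from scikit-learn's synthetic `make_friedman1` generator, so the scores will not match the 0.636 reported above.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic nonlinear regression problem (stand-in for the engineered features).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# Hypothetical candidate set; only the Random Forest is confirmed by the text.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold cross-validated R^2 per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:20s} R^2 = {score:.3f}")
print("best:", best)
```

Scoring every model on the same folds keeps the comparison fair. Error metrics such as MAE can be added by changing the `scoring` argument (for example `"neg_mean_absolute_error"`).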

5. The Winner Model - Regression Performance

By using engineered features and the cluster labels from step 3, our final regression model aligned much more closely with actual engagement levels than the baseline in step 2.

6. Final Solution - Classification Strategy

Recognizing that success tiers are more actionable than exact numbers, we reframed the problem into Classification.

Target Transformation: Quantile Binning

To handle the continuous nature of the view counts, we implemented a Quantile Binning strategy. The target variable was divided into three balanced (equal-frequency) quantiles representing:

  • Low Performance
  • Medium Performance
  • Viral Hits

This ensures that the model learns to identify relative success and remains robust against extreme outliers.
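The binning step can be sketched with `pandas.qcut`. The view counts are synthetic and the tier labels are taken from the list above. Whether the original implementation used `qcut` is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed view counts (illustrative only).
rng = np.random.default_rng(0)
views = pd.Series(rng.lognormal(mean=8.0, sigma=2.0, size=3000), name="views")

tier_names = ["Low Performance", "Medium Performance", "Viral Hits"]

# q=3 puts the bin edges at the 1/3 and 2/3 quantiles, so classes are balanced.
tiers = pd.qcut(views, q=3, labels=tier_names)
counts = tiers.value_counts()
print(counts)

# Robustness check: inflating the single largest video does not move the
# interior bin edges, so no video changes tier.
views_outlier = views.copy()
views_outlier.loc[views.idxmax()] *= 1000
tiers_outlier = pd.qcut(views_outlier, q=3, labels=tier_names)
same_tiers = bool((tiers.astype(str) == tiers_outlier.astype(str)).all())
print("tiers unchanged after inflating the top outlier:", same_tiers)
```

Because the bin edges depend on ranks rather than magnitudes, each tier means "top third", "middle third", or "bottom third" of the dataset, not a fixed view threshold.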

Performance Analysis:

The Random Forest Classifier shows a strong diagonal in the confusion matrix, indicating that it separates viral hits from regular content effectively.
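A confusion matrix of this kind can be produced as in the sketch below. The features and view counts are synthetic, and the hyperparameters are scikit-learn defaults rather than the project's settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): views depend on two of four placeholder features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 4))
views = np.exp(8 + 1.5 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1500))

# Integer tiers: 0 = low, 1 = medium, 2 = viral.
y = pd.qcut(views, q=3, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows are true tiers, columns are predicted tiers; the diagonal holds correct predictions.
cm = confusion_matrix(y_test, pred, labels=[0, 1, 2])
recall_per_tier = cm.diagonal() / cm.sum(axis=1)
print(cm)
print("recall per tier (low, medium, viral):", np.round(recall_per_tier, 3))
```

The last entry of `recall_per_tier` is the share of true viral hits that the model recovers, which is the figure most relevant to the claim above.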


Author: Ohad
Environment: Python 3 (Scikit-Learn, Pandas, Matplotlib).
