How to use Ohad777/Tiktok-views-predictor with Scikit-learn:

```python
from huggingface_hub import hf_hub_download
import joblib

# Only load pickle files from sources you trust.
# Read more about it here: https://skops.readthedocs.io/en/stable/persistence.html
model = joblib.load(
    hf_hub_download("Ohad777/Tiktok-views-predictor", "sklearn_model.joblib")
)
```
TikTok Viral Hook Predictor - Data Science Walkthrough
📺 Presentation Video
1. Exploratory Data Analysis (EDA) - The Story of the Data
Our journey began by understanding the massive variance in TikTok engagement.
Key Insights:
- The Long Tail: Most videos have low engagement, while a few outliers reach millions, necessitating a log-scale analysis.
- Category Performance: Interestingly, while "Health & Fitness" is a saturated category, Navigation and Kids content lead in average views per video.
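To illustrate why a log scale helps with long-tailed engagement, here is a minimal sketch. It uses synthetic log-normal view counts as a stand-in (an assumption, not the project's dataset), and the `skewness` helper is hypothetical, written only for this example.

```python
import numpy as np

# Synthetic, heavy-tailed "view counts" (assumption: a log-normal stand-in,
# not the real TikTok data).
rng = np.random.default_rng(0)
views = rng.lognormal(mean=8.0, sigma=2.0, size=10_000).round()

# log1p is safe for zero-view videos because log1p(0) == 0.
log_views = np.log1p(views)


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


raw_skew = skewness(views)
log_skew = skewness(log_views)

print(f"skew on raw scale: {raw_skew:.2f}")
print(f"skew on log scale: {log_skew:.2f}")
print(f"mean / median on raw scale: {views.mean() / np.median(views):.1f}")
```

On the raw scale a handful of outliers dominates the mean, while the log-transformed values are close to symmetric, which makes histograms and scatter plots far easier to read.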
2. The Baseline Challenge - Linear Regression Failure
Initially, we attempted to predict the exact view count using a standard Linear Regression model.
- The Problem: The model suffered from "Mean-Seeking" behavior. It predicted the mean value for almost all videos and failed to capture any viral spikes, resulting in a poor fit.
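The mean-seeking behavior can be reproduced on toy data. The sketch below is an illustration under stated assumptions: the features and the heavy-tailed target are synthetic, and the variable names are hypothetical rather than taken from the project notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical metadata features and a heavy-tailed target (synthetic).
X = rng.normal(size=(n, 3))
views = np.exp(8.0 + 0.8 * X[:, 0] + rng.normal(scale=1.5, size=n))

X_train, X_test, y_train, y_test = train_test_split(
    X, views, test_size=0.25, random_state=42
)

# Baseline: ordinary least squares on the raw (untransformed) view counts.
baseline = LinearRegression().fit(X_train, y_train)
pred = baseline.predict(X_test)

r2 = r2_score(y_test, pred)
pred_spread = pred.std()
actual_spread = y_test.std()

print(f"R^2 on test set:        {r2:.3f}")
print(f"std of predictions:     {pred_spread:,.0f}")
print(f"std of actual views:    {actual_spread:,.0f}")
print(f"largest prediction:     {pred.max():,.0f}")
print(f"largest actual value:   {y_test.max():,.0f}")
```

Comparing the spread of the predictions with the spread of the actual values is a quick diagnostic: when the predictions cluster in a narrow band around the mean, the model is not capturing the viral spikes.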
3. Unsupervised Learning - Clustering Viral Patterns
To discover hidden structures, we applied K-Means Clustering.
- PCA Visualization: By reducing dimensionality, we confirmed that viral videos (yellow cluster) possess distinct metadata profiles compared to standard content. We used these clusters as a new feature to guide our improved models.
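A minimal sketch of this clustering step might look as follows. The two synthetic groups, the choice of two clusters, and all variable names are assumptions made for illustration; the source does not state how many clusters were used.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic groups of video metadata (hypothetical stand-ins).
regular = rng.normal(loc=0.0, scale=1.0, size=(300, 4))
viral = rng.normal(loc=4.0, scale=1.0, size=(60, 4))
X = np.vstack([regular, viral])

# K-Means is distance-based, so the features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Assumption: two clusters, chosen only for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# Project to 2D for plotting; the colors would come from cluster_labels.
coords_2d = PCA(n_components=2).fit_transform(X_scaled)

# Append the cluster id as an extra column for downstream models.
X_with_cluster = np.column_stack([X_scaled, cluster_labels])

print(coords_2d.shape, X_with_cluster.shape)
print("cluster sizes:", np.bincount(cluster_labels))
```

Plotting `coords_2d` colored by `cluster_labels` gives the kind of PCA view described above, and `X_with_cluster` shows one simple way to pass the cluster id to a supervised model.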
4. Model Competition & Selection
We tested multiple architectures to find the best fit for this complex data.
- Winner: The Random Forest model outperformed the other candidates, achieving the highest $R^2$ (0.636) and the lowest error rates.
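A comparison of this kind can be sketched as a simple loop over candidate models. The candidate list, the synthetic data, and the metrics dictionary below are assumptions for illustration; the source does not list which other models were tested, and the numbers produced here are unrelated to the reported 0.636.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic non-linear target standing in for log-views (assumption).
X = rng.uniform(-2, 2, size=(1500, 4))
y = 3 * np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=1500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Hypothetical candidate set; the real project may have used other models.
candidates = {
    "linear_regression": LinearRegression(),
    "decision_tree": DecisionTreeRegressor(random_state=1),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=1),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }

# Pick the model with the highest held-out R^2.
best = max(results, key=lambda k: results[k]["r2"])

for name, metrics in results.items():
    print(f"{name:18s} R^2={metrics['r2']:.3f}  MAE={metrics['mae']:.3f}")
print("best by R^2:", best)
```

Evaluating every candidate on the same held-out split keeps the comparison fair; cross-validation would give a more stable ranking on small datasets.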
5. The Winner Model - Regression Performance
By utilizing engineered features and clustering data, our final regression model achieved much better alignment with actual engagement levels compared to the baseline shown in step 2.
6. Final Solution - Classification Strategy
Recognizing that success tiers are more actionable than exact numbers, we reframed the problem into Classification.
Target Transformation: Quantile Binning
To handle the continuous view counts, we applied a quantile binning strategy with three classes. The target variable was divided into three balanced quantiles representing:
- Low Performance
- Medium Performance
- Viral Hits
Because the bin edges are quantiles rather than fixed thresholds, the model learns to identify relative success, and extreme outliers cannot distort the class boundaries.
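Quantile binning of this kind can be sketched with `pandas.qcut`. The view counts below are synthetic and the tier labels are shortened names chosen for this example, so this is an illustration rather than the project's exact preprocessing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic, heavy-tailed view counts (assumption: not the real data).
views = pd.Series(rng.lognormal(mean=8.0, sigma=2.0, size=3_000), name="views")

# Split into three equally sized tiers using the 1/3 and 2/3 quantiles.
tiers = pd.qcut(views, q=3, labels=["Low", "Medium", "Viral"])
counts = tiers.value_counts().sort_index()
print(counts)

# Robustness check: inflate the single largest video by 1000x.
views_outlier = views.copy()
views_outlier.loc[views.idxmax()] *= 1000
tiers_outlier = pd.qcut(views_outlier, q=3, labels=["Low", "Medium", "Viral"])

# The interior quantile edges do not depend on the maximum,
# so every video keeps its tier.
unchanged = (tiers.astype(str) == tiers_outlier.astype(str)).all()
print("tiers unchanged after inflating the top outlier:", unchanged)
```

The second half of the sketch shows the robustness argument directly: making the largest video a thousand times bigger moves no class boundary, whereas it would shift the mean used by a regression target considerably.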
Performance Analysis:
The Random Forest Classifier shows a strong diagonal in the confusion matrix, indicating that it separates viral hits from regular content effectively.
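For readers who want to reproduce this kind of check, here is a minimal sketch of computing a confusion matrix for a three-tier Random Forest classifier. The features, the target, and the derived metrics (`diag_share`, `viral_recall`) are hypothetical and built on synthetic data, so the numbers will not match the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic features and a log-views stand-in driven by two of them.
X = rng.normal(size=(1500, 4))
log_views = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1500)

# Three balanced tiers: 0 = low, 1 = medium, 2 = viral.
tiers = pd.qcut(log_views, q=3, labels=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, tiers, test_size=0.25, stratify=tiers, random_state=3
)

clf = RandomForestClassifier(n_estimators=100, random_state=3)
clf.fit(X_train, y_train)

# Rows are true tiers, columns are predicted tiers.
cm = confusion_matrix(y_test, clf.predict(X_test), labels=[0, 1, 2])

# Share of all test videos on the diagonal (overall accuracy).
diag_share = np.trace(cm) / cm.sum()
# Share of truly viral videos that were predicted as viral.
viral_recall = cm[2, 2] / cm[2].sum()

print(cm)
print(f"diagonal share: {diag_share:.2f}, viral recall: {viral_recall:.2f}")
```

A strong diagonal corresponds to a high `diag_share`, while the last row of the matrix shows specifically how often viral videos are confused with the medium or low tiers.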
Author: Ohad
Environment: Python 3 (Scikit-Learn, Pandas, Matplotlib).