Fitness YouTube Comment Classifier – RoBERTa

Fine-tuned roberta-base that classifies YouTube comments from fitness influencer videos into 5 categories: fitness, nutrition, motivational, challenge, product.

Part of a three-experiment study measuring the effect of data volume and model size on classification performance, using a self-scraped fitness influencer comment dataset.


Quick Start

from transformers import pipeline

classifier = pipeline(
    'text-classification',
    model='Krat6s/fitness-comment-classifier-roberta'
)

classifier("This protein shake changed my life, amazing with oat milk")
# [{'label': 'nutrition', 'score': 0.956}]

classifier("I've been doing this workout for 30 days and I can see abs forming!")
# [{'label': 'fitness', 'score': 0.965}]

Model Description

  • Base model: roberta-base (FacebookAI, 125M parameters)
  • Task: Multi-class text classification (5 classes)
  • Domain: YouTube comments from fitness influencer channels
  • Language: English (non-English comments present in dataset but not handled)

Dataset

Self-scraped YouTube comments collected via the YouTube Data API v3 for MSc dissertation research on fitness influencer sentiment and thematic analysis.

  • Total dataset: 92,223 comments across 94 fitness influencer channels
  • Top channels: Noel Deyzel, Browney, Jeff Nippard, Renaissance Periodization, ATHLEAN-X
  • HuggingFace dataset: Krat6s/fitness-youtube-comments

Class Distribution (Full Dataset)

Class          Count
challenge      20,923
nutrition      20,506
fitness        19,990
motivational   19,928
product        10,749

Training

Data Splits (20,000-row stratified sample)

Split Size
Train 14,000
Validation 3,000
Test 3,000
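The 70/15/15 stratified split above can be sketched with scikit-learn. The label counts below are hypothetical, chosen only to roughly mirror the full-dataset class distribution; the actual labels would come from the dataset's label column.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labels standing in for the 20,000-row sample.
labels = (["challenge"] * 4537 + ["nutrition"] * 4447 + ["fitness"] * 4335
          + ["motivational"] * 4321 + ["product"] * 2360)
rows = list(range(len(labels)))

# First carve off the 14,000-row training portion, stratifying on the label...
train_idx, rest_idx, train_y, rest_y = train_test_split(
    rows, labels, train_size=14_000, stratify=labels, random_state=42
)
# ...then split the remaining 6,000 rows 50/50 into validation and test.
val_idx, test_idx, val_y, test_y = train_test_split(
    rest_idx, rest_y, test_size=0.5, stratify=rest_y, random_state=42
)
print(len(train_idx), len(val_idx), len(test_idx))  # 14000 3000 3000
```

Stratifying both splits keeps each class's share roughly constant across train, validation, and test.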

Hyperparameters

Parameter             Value
Learning rate         2e-5
Epochs                3
Batch size (train)    16
Batch size (eval)     32
Max sequence length   128
Warmup steps          50
Weight decay          0.01
Optimizer             AdamW
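These hyperparameters map directly onto transformers.TrainingArguments. The sketch below is an assumed setup, not the exact training script; note that the max sequence length is a tokenizer setting rather than a TrainingArguments field, and AdamW is the Trainer default optimizer.

```python
# Hyperparameters from the table above, keyed by TrainingArguments field names.
hparams = {
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "warmup_steps": 50,
    "weight_decay": 0.01,
}

# Usage sketch (requires transformers):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **hparams)
# Max sequence length is applied at tokenization time instead:
# tokenizer(texts, truncation=True, max_length=128)
```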

Training Curve

Epoch   Train Loss   Val Loss   Accuracy   F1
1       2.495        2.126      0.592      0.595
2       1.934        2.059      0.607      0.609
3       1.638        2.102      0.614      0.614

Hardware: Kaggle T4 x2 GPU
Training time: 643 seconds (~10.7 minutes)


Evaluation Results (Test Set – 3,000 samples)

Overall

Metric Score
Accuracy 62.5%
F1 (weighted) 62.5%

Per-Class

Class          Precision   Recall   F1     Support
challenge      0.62        0.58     0.60   685
fitness        0.63        0.67     0.65   647
motivational   0.56        0.66     0.61   641
nutrition      0.69        0.65     0.67   671
product        0.65        0.53     0.58   356
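As a sanity check, the weighted F1 reported in the overall table can be recovered from the per-class rows as the support-weighted average:

```python
# Per-class (F1, support) pairs from the table above.
per_class = {
    "challenge":    (0.60, 685),
    "fitness":      (0.65, 647),
    "motivational": (0.61, 641),
    "nutrition":    (0.67, 671),
    "product":      (0.58, 356),
}
total = sum(s for _, s in per_class.values())  # 3,000 test samples
weighted_f1 = sum(f * s for f, s in per_class.values()) / total
print(f"{weighted_f1:.3f}")  # 0.626, matching the reported 62.5% within rounding
```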

Baseline Comparisons

Model                                 Accuracy
Majority class baseline               22.8%
Pretrained RoBERTa (no fine-tuning)   21.6%
Fine-tuned RoBERTa (this model)       62.5%
Improvement over majority baseline    +39.7pp
Improvement from fine-tuning          +40.9pp
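The majority-class baseline follows directly from the test-set class supports: a classifier that always predicts the most frequent class (challenge) is correct exactly on that class's share of the test set.

```python
# Test-set class supports, from the per-class table.
support = {"challenge": 685, "fitness": 647, "motivational": 641,
           "nutrition": 671, "product": 356}

# Always predicting the largest class is right on 685 of 3,000 comments.
majority_acc = max(support.values()) / sum(support.values())
print(f"{majority_acc:.1%}")  # 22.8%
```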

Experiment Comparison – Data Scaling + Model Scaling

Three experiments run on the same dataset and evaluation pipeline, changing one variable at a time.

Model                  Parameters   Training Data   Accuracy   F1      Train Time
DistilBERT             66M          5,000 rows      53.6%      53.8%   81s
DistilBERT             66M          20,000 rows     60.4%      60.4%   327s
RoBERTa (this model)   125M         20,000 rows     62.5%      62.5%   643s

Key findings:

  • Data scaling (5K β†’ 20K rows): +6.8pp accuracy, 4x training time
  • Model scaling (DistilBERT β†’ RoBERTa): +2.1pp accuracy, 2x training time
  • Data volume had a larger impact than model size on this task

Per-Class F1 Across All Experiments

Class          DistilBERT 5K   DistilBERT 20K   RoBERTa 20K
challenge      0.48            0.60             0.60
fitness        0.54            0.63             0.65
motivational   0.51            0.58             0.61
nutrition      0.62            0.63             0.67
product        0.54            0.56             0.58

Inference Examples

Comment Predicted Confidence
"This protein shake recipe changed my life, tastes amazing with oat milk" nutrition 95.6%
"I've been doing this workout for 30 days and I can see abs forming!" fitness 96.5%
"Never give up on your dreams, the grind is worth it" motivational 86.0%
"Is this pre-workout worth buying? I've heard mixed reviews" product 90.6%
"Day 7 of the squat challenge complete 🔥" fitness ✗ 89.3%

Note: the final example is a known failure case. "Day 7 of the squat challenge" is correctly a challenge comment, but RoBERTa predicts fitness at high confidence, since "squat" has strong fitness associations in the training data. This illustrates a known failure mode of larger models: higher confidence on incorrect predictions. DistilBERT correctly predicted challenge here at lower confidence (50.8%).
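Since confidence alone is unreliable here, a simple triage rule that flags low-confidence or near-tie predictions can route some of these cases to human review. The function and thresholds below are illustrative assumptions, not part of the model; high-confidence mistakes like the 89.3% example above would still slip through.

```python
def triage(all_scores, min_conf=0.70, max_margin=0.20):
    """Given scores for all 5 classes (e.g. classifier(text, top_k=None)),
    return the top label and whether the prediction should be reviewed."""
    ranked = sorted(all_scores, key=lambda d: d["score"], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    needs_review = (top["score"] < min_conf
                    or top["score"] - runner_up["score"] < max_margin)
    return top["label"], needs_review

# Mocked scores resembling DistilBERT's near-tie on the squat-challenge comment.
scores = [{"label": "challenge", "score": 0.508},
          {"label": "fitness", "score": 0.410},
          {"label": "motivational", "score": 0.050},
          {"label": "nutrition", "score": 0.020},
          {"label": "product", "score": 0.012}]
print(triage(scores))  # ('challenge', True) - low confidence and a near-tie
```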


Limitations

Challenge/motivational confusion persists across all three model variants: 129 challenge comments were predicted as motivational in the test set despite the larger model and more training data. This is a label ambiguity problem intrinsic to the task, since challenge and motivational videos share workout-encouragement language. The confusion is unlikely to be resolved by more data or a larger model alone without incorporating the video title or other metadata alongside the comment text.

Product class underrepresentation: product has roughly half the examples of the other classes. Its F1 of 0.58 is the lowest across classes despite competitive precision (0.65), driven by low recall (0.53): the model misses nearly half of actual product comments.

High-confidence errors: RoBERTa's stronger language associations produce higher confidence scores overall, including on incorrect predictions. The challenge → fitness misclassification at 89.3% confidence is an example.

Non-English comments: approximately 15% of the dataset contains non-English comments. These produce unreliable predictions.
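One way to reduce this failure mode is to pre-filter comments before inference. The heuristic below is an assumption for illustration, not something used in this project: it treats a comment as English when most of its alphabetic characters are ASCII. A proper language-ID model would be more robust, since this misses non-English text written in plain ASCII.

```python
def looks_english(text, min_ascii_ratio=0.8):
    """Crude pre-filter: treat a comment as English if most of its
    alphabetic characters are ASCII. Misses ASCII-written languages
    (e.g. Spanish without accents); a real language-ID model is better."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return True  # emoji-only or numeric comments: let them through
    ascii_letters = sum(c.isascii() for c in letters)
    return ascii_letters / len(letters) >= min_ascii_ratio

print(looks_english("Great workout, day 7 done!"))    # True
print(looks_english("Отличная тренировка, спасибо"))  # False
```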


Next Steps

  • YouTuber-stratified train/test split β€” train on 80 channels, test on 14 held-out channels to measure generalisation to unseen creators
  • Sentiment classification using human-labelled subset to replace VADER dissertation baseline
  • Incorporate video title as additional input feature to resolve challenge/motivational ambiguity

Citation

Dataset: Self-scraped YouTube comments from 94 fitness influencer channels
Collected via YouTube Data API v3 for MSc dissertation research
HuggingFace dataset: Krat6s/fitness-youtube-comments