Fitness YouTube Comment Classifier – RoBERTa

Fine-tuned roberta-base that classifies YouTube comments from fitness influencer videos into 5 categories: fitness, nutrition, motivational, challenge, product.

Part of a three-experiment study measuring the effect of data volume and model size on classification performance, using a self-scraped fitness influencer comment dataset.


Quick Start

from transformers import pipeline

classifier = pipeline(
    'text-classification',
    model='Krat6s/fitness-comment-classifier-roberta'
)

classifier("This protein shake changed my life, amazing with oat milk")
# [{'label': 'nutrition', 'score': 0.956}]

classifier("I've been doing this workout for 30 days and I can see abs forming!")
# [{'label': 'fitness', 'score': 0.965}]

Model Description

  • Base model: roberta-base (FacebookAI, 125M parameters)
  • Task: Multi-class text classification (5 classes)
  • Domain: YouTube comments from fitness influencer channels
  • Language: English (non-English comments present in dataset but not handled)

Dataset

Self-scraped YouTube comments collected via the YouTube Data API v3 for MSc dissertation research on fitness influencer sentiment and thematic analysis.

  • Total dataset: 92,223 comments across 94 fitness influencer channels
  • Top channels: Noel Deyzel, Browney, Jeff Nippard, Renaissance Periodization, ATHLEAN-X
  • HuggingFace dataset: Krat6s/fitness-youtube-comments

Class Distribution (Full Dataset)

Class          Count
challenge      20,923
nutrition      20,506
fitness        19,990
motivational   19,928
product        10,749

Training

Data Splits (20,000-row stratified sample)

Split Size
Train 14,000
Validation 3,000
Test 3,000
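The 70/15/15 stratified split above can be sketched with scikit-learn. The label counts below are hypothetical, chosen only to roughly mirror the full-dataset class distribution; the actual labels would come from the dataset's label column.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labels standing in for the 20,000-row sample.
labels = (["challenge"] * 4537 + ["nutrition"] * 4447 + ["fitness"] * 4335
          + ["motivational"] * 4321 + ["product"] * 2360)
rows = list(range(len(labels)))

# First carve off the 14,000-row training portion, stratifying on the label...
train_idx, rest_idx, train_y, rest_y = train_test_split(
    rows, labels, train_size=14_000, stratify=labels, random_state=42
)
# ...then split the remaining 6,000 rows 50/50 into validation and test.
val_idx, test_idx, val_y, test_y = train_test_split(
    rest_idx, rest_y, test_size=0.5, stratify=rest_y, random_state=42
)
print(len(train_idx), len(val_idx), len(test_idx))  # 14000 3000 3000
```

Stratifying both splits keeps each class's share roughly constant across train, validation, and test.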

Hyperparameters

Parameter             Value
Learning rate         2e-5
Epochs                3
Batch size (train)    16
Batch size (eval)     32
Max sequence length   128
Warmup steps          50
Weight decay          0.01
Optimizer             AdamW
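These hyperparameters map directly onto transformers.TrainingArguments. The sketch below is an assumed setup, not the exact training script; note that the max sequence length is a tokenizer setting rather than a TrainingArguments field, and AdamW is the Trainer default optimizer.

```python
# Hyperparameters from the table above, keyed by TrainingArguments field names.
hparams = {
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 32,
    "warmup_steps": 50,
    "weight_decay": 0.01,
}

# Usage sketch (requires transformers):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="out", **hparams)
# Max sequence length is applied at tokenization time instead:
# tokenizer(texts, truncation=True, max_length=128)
```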

Training Curve

Epoch   Train Loss   Val Loss   Accuracy   F1
1       2.495        2.126      0.592      0.595
2       1.934        2.059      0.607      0.609
3       1.638        2.102      0.614      0.614

Hardware: Kaggle T4 x2 GPU
Training time: 643 seconds (~10.7 minutes)


Evaluation Results (Test Set – 3,000 samples)

Overall

Metric Score
Accuracy 62.5%
F1 (weighted) 62.5%

Per-Class

Class          Precision   Recall   F1     Support
challenge      0.62        0.58     0.60   685
fitness        0.63        0.67     0.65   647
motivational   0.56        0.66     0.61   641
nutrition      0.69        0.65     0.67   671
product        0.65        0.53     0.58   356
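As a sanity check, the weighted F1 reported in the overall table can be recovered from the per-class rows as the support-weighted average:

```python
# Per-class (F1, support) pairs from the table above.
per_class = {
    "challenge":    (0.60, 685),
    "fitness":      (0.65, 647),
    "motivational": (0.61, 641),
    "nutrition":    (0.67, 671),
    "product":      (0.58, 356),
}
total = sum(s for _, s in per_class.values())  # 3,000 test samples
weighted_f1 = sum(f * s for f, s in per_class.values()) / total
print(f"{weighted_f1:.3f}")  # 0.626, matching the reported 62.5% within rounding
```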

Baseline Comparisons

Model                                 Accuracy
Majority class baseline               22.8%
Pretrained RoBERTa (no fine-tuning)   21.6%
Fine-tuned RoBERTa (this model)       62.5%
Improvement over majority baseline    +39.7pp
Improvement from fine-tuning          +40.9pp
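The majority-class baseline follows directly from the test-set class supports: a classifier that always predicts the most frequent class (challenge) is correct exactly on that class's share of the test set.

```python
# Test-set class supports, from the per-class table.
support = {"challenge": 685, "fitness": 647, "motivational": 641,
           "nutrition": 671, "product": 356}

# Always predicting the largest class is right on 685 of 3,000 comments.
majority_acc = max(support.values()) / sum(support.values())
print(f"{majority_acc:.1%}")  # 22.8%
```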

Experiment Comparison – Data Scaling + Model Scaling

Three experiments run on the same dataset and evaluation pipeline, changing one variable at a time.

Model                  Parameters   Training Data   Accuracy   F1      Train Time
DistilBERT             66M          5,000 rows      53.6%      53.8%   81s
DistilBERT             66M          20,000 rows     60.4%      60.4%   327s
RoBERTa (this model)   125M         20,000 rows     62.5%      62.5%   643s

Key findings:

  • Data scaling (5K β†’ 20K rows): +6.8pp accuracy, 4x training time
  • Model scaling (DistilBERT β†’ RoBERTa): +2.1pp accuracy, 2x training time
  • Data volume had a larger impact than model size on this task

Per-Class F1 Across All Experiments

Class          DistilBERT 5K   DistilBERT 20K   RoBERTa 20K
challenge      0.48            0.60             0.60
fitness        0.54            0.63             0.65
motivational   0.51            0.58             0.61
nutrition      0.62            0.63             0.67
product        0.54            0.56             0.58

Inference Examples

Comment Predicted Confidence
"This protein shake recipe changed my life, tastes amazing with oat milk" nutrition 95.6%
"I've been doing this workout for 30 days and I can see abs forming!" fitness 96.5%
"Never give up on your dreams, the grind is worth it" motivational 86.0%
"Is this pre-workout worth buying? I've heard mixed reviews" product 90.6%
"Day 7 of the squat challenge complete 🔥" fitness ✗ 89.3%

Note: the final example is a known failure case. "Day 7 of the squat challenge" is correctly a challenge comment, but RoBERTa predicts fitness at high confidence, since "squat" has strong fitness associations in the training data. This illustrates a known failure mode of larger models: higher confidence on incorrect predictions. DistilBERT correctly predicted challenge here at lower confidence (50.8%).
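Since confidence alone is unreliable here, a simple triage rule that flags low-confidence or near-tie predictions can route some of these cases to human review. The function and thresholds below are illustrative assumptions, not part of the model; high-confidence mistakes like the 89.3% example above would still slip through.

```python
def triage(all_scores, min_conf=0.70, max_margin=0.20):
    """Given scores for all 5 classes (e.g. classifier(text, top_k=None)),
    return the top label and whether the prediction should be reviewed."""
    ranked = sorted(all_scores, key=lambda d: d["score"], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    needs_review = (top["score"] < min_conf
                    or top["score"] - runner_up["score"] < max_margin)
    return top["label"], needs_review

# Mocked scores resembling DistilBERT's near-tie on the squat-challenge comment.
scores = [{"label": "challenge", "score": 0.508},
          {"label": "fitness", "score": 0.410},
          {"label": "motivational", "score": 0.050},
          {"label": "nutrition", "score": 0.020},
          {"label": "product", "score": 0.012}]
print(triage(scores))  # ('challenge', True) - low confidence and a near-tie
```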


Limitations

Challenge/motivational confusion persists across all three model variants: 129 challenge comments were predicted as motivational in the test set despite the larger model and more training data. This is a label ambiguity problem intrinsic to the task, since challenge and motivational videos share workout-encouragement language. The confusion is unlikely to be resolved by more data or a larger model alone without incorporating the video title or other metadata alongside the comment text.

Product class underrepresentation: product has roughly half the examples of the other classes. Its F1 of 0.58 is the lowest across classes despite competitive precision (0.65), driven by low recall (0.53): the model misses nearly half of actual product comments.

High-confidence errors: RoBERTa's stronger language associations produce higher confidence scores overall, including on incorrect predictions. The challenge → fitness misclassification at 89.3% confidence is an example.

Non-English comments: approximately 15% of the dataset contains non-English comments. These produce unreliable predictions.
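One way to reduce this failure mode is to pre-filter comments before inference. The heuristic below is an assumption for illustration, not something used in this project: it treats a comment as English when most of its alphabetic characters are ASCII. A proper language-ID model would be more robust, since this misses non-English text written in plain ASCII.

```python
def looks_english(text, min_ascii_ratio=0.8):
    """Crude pre-filter: treat a comment as English if most of its
    alphabetic characters are ASCII. Misses ASCII-written languages
    (e.g. Spanish without accents); a real language-ID model is better."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return True  # emoji-only or numeric comments: let them through
    ascii_letters = sum(c.isascii() for c in letters)
    return ascii_letters / len(letters) >= min_ascii_ratio

print(looks_english("Great workout, day 7 done!"))    # True
print(looks_english("Отличная тренировка, спасибо"))  # False
```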


Next Steps

  • YouTuber-stratified train/test split β€” train on 80 channels, test on 14 held-out channels to measure generalisation to unseen creators
  • Sentiment classification using human-labelled subset to replace VADER dissertation baseline
  • Incorporate video title as additional input feature to resolve challenge/motivational ambiguity

Citation

Dataset: Self-scraped YouTube comments from 94 fitness influencer channels
Collected via YouTube Data API v3 for MSc dissertation research
HuggingFace dataset: Krat6s/fitness-youtube-comments