---
language: en
tags:
- twitch
- roberta
- domain-adaptation
- nlp
- masked-language-modeling
license: mit
base_model: cardiffnlp/twitter-roberta-base-sentiment-latest
---
# Twitch-RoBERTa-Base (Domain Adapted)
This is a domain-adapted RoBERTa-base model, further pre-trained on ~1.1 million real Twitch chat messages.

It addresses the "domain shift" problem, where standard NLP models (trained on Wikipedia or Twitter) fail to understand gaming slang. For example, standard models often classify "cracked" as negative (broken) or "cap" as neutral (a hat). This model understands that, in a gaming context, "cracked" means highly skilled and "cap" means a lie.
## Model Performance
| Metric (lower is better) | Baseline (Twitter-RoBERTa) | Twitch-RoBERTa (this model) |
|---|---|---|
| Perplexity | ~21,375 | ~5.5 |
| Loss | 9.97 | 1.7 |
Result: loss falls from 9.97 to 1.7, a ~83% reduction, which drops perplexity from ~21,375 to ~5.5. In effect, continued pre-training taught the model the specific vocabulary, syntax, and emote usage patterns of the Twitch community.
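The two table rows are mutually consistent: perplexity is the exponential of the mean cross-entropy loss. This can be checked directly from the reported losses (the numbers below come from the table above):

```python
import math

# Perplexity = exp(mean cross-entropy loss)
baseline_loss, adapted_loss = 9.97, 1.7

print(f"baseline perplexity: {math.exp(baseline_loss):,.0f}")          # ≈ 21,375
print(f"adapted perplexity:  {math.exp(adapted_loss):.2f}")            # ≈ 5.47
print(f"loss reduction:      {1 - adapted_loss / baseline_loss:.0%}")  # ≈ 83%
```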
## Architecture & Training
- Base Architecture: `roberta-base` (125M parameters)
- Training Task: Masked Language Modeling (MLM)
- Dataset: ~1.1 million diverse Twitch messages, aggregated from a range of popular Twitch channels to ensure generalization.
- Optimization:
  - Precision: FP16 mixed precision
  - Batch Strategy (on a local GPU): gradient accumulation (per-device batch size 4 × 4 accumulation steps = effective batch size 16)
  - Masking: dynamic masking (probability 0.15), re-sampled on every pass to reduce overfitting
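The dynamic-masking step can be illustrated with a small framework-free sketch. This is not the training code (the card does not show it); it is a hypothetical re-implementation of the standard BERT/RoBERTa recipe on token strings: each token is selected with probability 0.15, and of the selected tokens 80% become `<mask>`, 10% are swapped for a random token, and 10% are left unchanged. Because selection is re-sampled on every call, the same message yields a different training example each epoch.

```python
import random

MASK_PROB = 0.15      # masking probability from the model card
MASK_TOKEN = "<mask>"

def dynamic_mask(tokens, rng):
    """Return (masked_tokens, labels); labels is None at unselected positions."""
    vocab = sorted(set(tokens))
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            labels.append(tok)            # position contributes to the MLM loss
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with <mask>
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)           # position ignored by the MLM loss
            masked.append(tok)
    return masked, labels

rng = random.Random(0)
tokens = "that play was absolutely cracked no cap".split()
# Re-sampling produces a different masking of the same message each time:
print(dynamic_mask(tokens, rng)[0])
print(dynamic_mask(tokens, rng)[0])
```

In the real pipeline the same idea operates on token IDs inside the data collator, so the masking pattern changes every epoch without duplicating the dataset.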
## Intended Use
Recommended Use Cases:
- Fine-Tuning: Use this model as the base for training a Sentiment Classifier, Toxicity Detector, or Spam Filter for Twitch chat. It will converge significantly faster and with higher accuracy than a generic BERT model.
- Masked Prediction: Auto-completing gaming messages or understanding slang context.
## Usage Example
```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="osamuyiohenhen/twitch-roberta-base",
    tokenizer="osamuyiohenhen/twitch-roberta-base",
)

# Test the model's understanding of slang
result = fill_mask("That play was absolutely <mask>.")
print(result)
# Likely predictions: "cracked", "insane", "nuts", "crazy"
```
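The pipeline returns a list of candidate dictionaries whose keys include `token_str` and `score`. A small helper (the function name and the hard-coded `sample` scores below are illustrative, not real model output) shows how to rank the candidates for downstream use:

```python
def top_tokens(predictions, k=3):
    """Return the k highest-scoring token strings from a fill-mask result."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]

# Same shape as a Hugging Face fill-mask result (scores are made up):
sample = [
    {"token_str": " cracked", "score": 0.31},
    {"token_str": " insane", "score": 0.22},
    {"token_str": " nuts", "score": 0.12},
    {"token_str": " crazy", "score": 0.09},
]
print(top_tokens(sample))  # ['cracked', 'insane', 'nuts']
```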
## Limitations
- Context Window: 128 tokens (optimized for short chat messages).