DistilBERT — Superhero Text Classification (DC vs Marvel)
Model: Fine-tuned distilbert-base-uncased
Dataset: rlogh/superhero-texts
Task: Binary text classification — identify whether a description belongs to DC or Marvel superhero universe.
1. Dataset
- Source: Classmate dataset (not self-authored)
- Splits:
  - original: 100 manually written descriptions
  - augmented: 1000 synthetically generated samples (EDA + character-level noise)
- Labels:
  - 0 = DC
  - 1 = Marvel
We used the augmented split for training/validation and reserved part of the original split for held-out testing.
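The augmented split is described as "EDA + character-level noise". The sketch below illustrates what such augmentation can look like, assuming EDA refers to Easy Data Augmentation-style word operations (random deletion and random swap here). It is not the dataset author's script. All function names and probabilities are hypothetical.

```python
import random

def char_noise(text, p, rng):
    """Swap adjacent letters with probability p per position (typo-style noise)."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a character moves at most once
        else:
            i += 1
    return "".join(chars)

def random_deletion(words, p, rng):
    """EDA-style deletion: drop each word with probability p, keeping at least one."""
    kept = [w for w in words if rng.random() >= p]
    if not kept and words:
        kept = [rng.choice(words)]
    return kept

def random_swap(words, rng):
    """EDA-style swap: exchange two randomly chosen word positions."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def augment(text, rng, p_delete=0.1, p_char=0.05):
    """Apply word-level operations first, then character-level noise."""
    words = random_swap(random_deletion(text.split(), p_delete, rng), rng)
    return char_noise(" ".join(words), p_char, rng)

# Hypothetical settings; a fixed seed keeps the variants reproducible.
rng = random.Random(42)
sample = "A vigilante hero who protects Gotham at night."
variants = [augment(sample, rng) for _ in range(3)]
for v in variants:
    print(v)
```

Each variant keeps the label of its source description, so ten variants per original description would yield a split of the size reported above.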
2. Model & Training
- Base model: distilbert-base-uncased (66M params)
- Head: Linear classification layer → 2 classes
- Optimizer: AdamW (lr=2e-5)
- Batch size: 16
- Epochs: 5
- Weight decay: 0.01
- Seed: 42
- Hardware: Google Colab (T4 GPU)
- Frameworks: 🤗 Transformers 4.56.2, PyTorch 2.8.0
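The card does not say which training loop was used. As a rough sketch, assuming the Hugging Face `Trainer`, the hyperparameters listed above map onto `TrainingArguments` keyword names as follows. The `output_dir` value in the comment is a hypothetical placeholder.

```python
# Hyperparameters copied from the list above. The key names follow
# transformers.TrainingArguments; using Trainer at all is an assumption.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "seed": 42,
}

# With transformers installed, these could be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="outputs", **training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

`Trainer` uses AdamW by default, so no explicit optimizer setting is needed under this assumption.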
3. Results
| Metric | Validation | Test (original split) |
|---|---|---|
| Accuracy | 1.00 | ~0.98 |
| F1 (macro) | 1.00 | ~0.98 |
Confusion Matrix (test split)
| | Pred: DC | Pred: Marvel |
|---|---|---|
| True DC | 49 | 1 |
| True Marvel | 1 | 49 |
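The reported test metrics can be checked directly against the confusion matrix above. The following is a minimal sketch in plain Python, with rows as true labels and columns as predictions. The helper names are illustrative.

```python
# Rows are true labels, columns are predictions, in the order [DC, Marvel].
confusion = [
    [49, 1],
    [1, 49],
]

def accuracy(cm):
    """Fraction of samples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, true label differs
        fn = sum(cm[k]) - tp                             # true k, predicted otherwise
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

print(f"accuracy = {accuracy(confusion):.2f}")  # accuracy = 0.98
print(f"macro F1 = {macro_f1(confusion):.2f}")  # macro F1 = 0.98
```

Both classes have 49 true positives, 1 false positive, and 1 false negative, so each per-class F1 is 98/100 and the macro average equals the accuracy.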
Error Analysis
- 2 misclassifications:
  - One DC description that emphasized “spider powers” was misclassified as Marvel (likely confused with Spider-Man).
  - One Marvel description with a “dark knight” theme was misclassified as DC.
- Observation: Errors happen when descriptions use overlapping archetypes (e.g., darkness, flying, spiders).
4. Intended Uses
- Educational purposes: understanding text classification with transformers.
- Research demo: exploring how data augmentation affects transformer fine-tuning.
⚠️ Limitations:
- Dataset is small and synthetic → high chance of overfitting.
- Labels restricted to DC vs Marvel → not generalizable to broader superhero or pop-culture domains.
- Augmented data may bias toward specific linguistic patterns.
5. How to Use
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace YOUR_USERNAME with the Hub account that hosts the model.
model_id = "YOUR_USERNAME/outputs_distilbert_superhero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "A vigilante hero who protects Gotham at night."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    logits = model(**inputs).logits

# Label mapping: 0 = DC, 1 = Marvel
pred = torch.argmax(logits, dim=-1).item()
print("Prediction:", "DC" if pred == 0 else "Marvel")
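The snippet above prints only the argmax label. Given the overlapping archetypes noted in the error analysis, it can be useful to look at the class probability as well. The sketch below shows the softmax step in plain Python. The logit values are made up for illustration, and `describe` is a hypothetical helper, not part of the model's API.

```python
import math

ID2LABEL = {0: "DC", 1: "Marvel"}  # label mapping from the Dataset section

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def describe(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[pred], probs[pred]

# Hypothetical logits; with the real model, pass logits[0].tolist() instead.
label, confidence = describe([2.1, -1.3])
print(f"{label} ({confidence:.1%})")
```

A probability near 0.5 marks a description the model finds ambiguous, which is where a manual check is most worthwhile.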
6. Files
- pytorch_model.bin: fine-tuned model weights
- config.json: model config
- tokenizer.json / vocab.txt: tokenizer
- README.md: this model card
7. License & AI Usage
Dataset license: Apache-2.0 (per dataset card)
Model license: Apache-2.0
AI usage disclosure: Augmented dataset created with Python EDA scripts. Model card drafted with AI assistance, final content reviewed and curated by the student.