DistilBERT — Superhero Text Classification (DC vs Marvel)

Model: Fine-tuned distilbert-base-uncased
Dataset: rlogh/superhero-texts
Task: Binary text classification, identifying whether a description belongs to the DC or the Marvel superhero universe.

1. Dataset

Source: Classmate dataset (not self-authored)
Splits:
- original: 100 manually written descriptions
- augmented: 1000 synthetically generated samples (EDA + character-level noise)
Labels:
- 0 = DC
- 1 = Marvel

We used the augmented split for training/validation and reserved part of the original split for held-out testing.
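The augmented split is described above as "EDA + character-level noise". The dataset author's actual script is not part of this card, so the following is only a minimal sketch of what such augmentation could look like. It assumes two EDA-style word operations (random swap and random deletion) plus adjacent-character swaps as the noise. All function names and probabilities are hypothetical.

```python
import random

def random_swap(words, rng):
    """EDA-style random swap: exchange two randomly chosen word positions."""
    out = list(words)
    if len(out) < 2:
        return out
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, rng, p=0.1):
    """EDA-style random deletion: drop each word with probability p, keeping at least one."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

def char_noise(word, rng, p=0.05):
    """Character-level noise: swap adjacent characters with probability p per position."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

def augment(text, rng):
    """Apply the word-level operations, then the character-level noise."""
    words = random_deletion(random_swap(text.split(), rng), rng)
    return " ".join(char_noise(w, rng) for w in words)

rng = random.Random(42)  # assumed seed, chosen here only for reproducibility
original = "A vigilante hero who protects Gotham at night."
variants = [augment(original, rng) for _ in range(10)]
for v in variants:
    print(v)
```

Ten variants per description is an assumption chosen to match the 100 to 1000 ratio between the two splits.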

2. Model & Training

Base model: distilbert-base-uncased (66M params)
Head: Linear classification layer → 2 classes
Optimizer: AdamW (lr=2e-5)
Batch size: 16
Epochs: 5
Weight decay: 0.01
Seed: 42
Hardware: Google Colab (T4 GPU)
Frameworks: 🤗 Transformers 4.56.2, PyTorch 2.8.0
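As a rough illustration of what these settings imply, the sketch below collects the listed hyperparameters under the names used by the matching `transformers.TrainingArguments` parameters and estimates the number of optimizer updates. The training-set size of 800 is an assumption (an 80/20 split of the 1000 augmented samples); the card does not state the actual train/validation ratio.

```python
import math

# Hyperparameters as listed above; the keys mirror the names of the
# corresponding transformers.TrainingArguments parameters.
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "seed": 42,
}

def total_optimizer_steps(num_train_examples, batch_size, epochs):
    """Optimizer updates on a single device without gradient accumulation."""
    return math.ceil(num_train_examples / batch_size) * epochs

# Hypothetical value: 80% of the 1000 augmented samples used for training.
ASSUMED_TRAIN_EXAMPLES = 800

steps = total_optimizer_steps(
    ASSUMED_TRAIN_EXAMPLES,
    training_kwargs["per_device_train_batch_size"],
    training_kwargs["num_train_epochs"],
)
print(steps)  # 250 under the assumed split
```

Under that assumption the model sees only a few hundred updates, which fits a short fine-tuning run on a single T4.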

3. Results

Metric	Validation	Test (original split)
Accuracy	1.00	~0.98
F1 (macro)	1.00	~0.98

Confusion Matrix (test split)

              Pred: DC   Pred: Marvel
True DC          49           1
True Marvel       1          49
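The reported test metrics can be recomputed from the confusion matrix, which totals 100 test descriptions. The sketch below does this in plain Python; the helper names are illustrative and this is not the evaluation code used for the card.

```python
# Rows are true labels, columns are predicted labels, in the order [DC, Marvel].
confusion = [
    [49, 1],
    [1, 49],
]

def accuracy(cm):
    """Fraction of examples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(len(cm))) - tp  # predicted k, true label differs
        fn = sum(cm[k]) - tp                             # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

print(f"accuracy = {accuracy(confusion):.2f}")  # accuracy = 0.98
print(f"macro F1 = {macro_f1(confusion):.2f}")  # macro F1 = 0.98
```

Because the two errors are symmetric (one per class), precision and recall are both 49/50 for each class, so accuracy and macro F1 coincide.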

Error Analysis

Two misclassifications:
- One DC description that emphasized “spider powers” was misclassified as Marvel (likely confused with Spider-Man).
- One Marvel description with a “dark knight” theme was misclassified as DC.
Observation: Errors occur when descriptions use archetypes shared by both universes (e.g., darkness, flying, spiders).

4. Intended Uses

Educational purposes: understanding text classification with transformers.
Research demo: exploring how data augmentation affects transformer fine-tuning.

⚠️ Limitations:

Dataset is small and synthetic → high chance of overfitting.
Labels restricted to DC vs Marvel → not generalizable to broader superhero or pop-culture domains.
Augmented data may bias toward specific linguistic patterns.

5. How to Use

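from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace YOUR_USERNAME with the Hub account that hosts the model.
model_id = "YOUR_USERNAME/outputs_distilbert_superhero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode (disables dropout)

text = "A vigilante hero who protects Gotham at night."
# Truncate inputs that exceed the model's maximum sequence length.
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # gradients are not needed for inference
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()

# Label mapping from the dataset: 0 = DC, 1 = Marvel
print("Prediction:", "DC" if pred == 0 else "Marvel")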
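The snippet above reports only the predicted class. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below shows the arithmetic in plain Python on made-up logits; with PyTorch the equivalent call would be `torch.softmax(logits, dim=-1)`. The `ID2LABEL` dictionary is a hypothetical helper that mirrors the label mapping from the Dataset section.

```python
import math

# Hypothetical helper mirroring the label mapping: 0 = DC, 1 = Marvel.
ID2LABEL = {0: "DC", 1: "Marvel"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for illustration; real values would come from
# model(**inputs).logits[0].tolist().
example_logits = [2.0, -1.0]

probs = softmax(example_logits)
pred = max(range(len(probs)), key=probs.__getitem__)
print(f"{ID2LABEL[pred]} (p = {probs[pred]:.3f})")
```

Given the small, synthetic training data, these probabilities are likely to be overconfident and are best read as a relative ranking.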

6. Files

pytorch_model.bin — fine-tuned model weights

config.json — model config

tokenizer.json / vocab.txt — tokenizer

README.md — this model card

7. License & AI Usage

Dataset license: Apache-2.0 (per dataset card)

Model license: Apache-2.0

AI usage disclosure: The augmented dataset was created with Python EDA scripts. The model card was drafted with AI assistance; the final content was reviewed and curated by the student.
