DistilBERT — Superhero Text Classification (DC vs Marvel)

Model: Fine-tuned distilbert-base-uncased
Dataset: rlogh/superhero-texts
Task: Binary text classification (identify whether a description belongs to the DC or the Marvel superhero universe).


1. Dataset

  • Source: Classmate dataset (not self-authored)
  • Splits:
    • original: 100 manually written descriptions
    • augmented: 1000 synthetically generated samples (EDA + character-level noise)
  • Labels:
    • 0 = DC
    • 1 = Marvel
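The augmented split is described as EDA plus character-level noise. The exact scripts are not part of this card, so the following is only a minimal sketch of what such augmentation might look like. All function names, the probabilities, and the choice of adjacent-letter swaps as the noise type are assumptions made for illustration.

```python
import random

def random_swap(words, rng, n_swaps=1):
    """EDA-style random swap: exchange two word positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, rng, p=0.1):
    """EDA-style random deletion: drop each word with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    # Never return an empty sentence; fall back to one random word.
    return kept if kept else [rng.choice(words)]

def char_noise(text, rng, p=0.02):
    """Character-level noise (assumed form): swap adjacent letters with probability p."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)

def augment(text, seed=42):
    """Apply swap, deletion, then character noise with a seeded generator."""
    rng = random.Random(seed)
    words = random_deletion(random_swap(text.split(), rng), rng)
    return char_noise(" ".join(words), rng)
```

Calling `augment(description, seed=k)` for several values of `k` would yield several noisy variants of one original description, and each variant keeps the label of its source text.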

We used the augmented split for training/validation and reserved part of the original split for held-out testing.
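A hedged sketch of how this split usage could be reproduced with the 🤗 Datasets library. The helper name `load_splits`, the 20% validation fraction, and the seed are assumptions. The card does not say which part of the original split was held out, so the sketch simply returns the whole original split as the test set.

```python
def load_splits(val_fraction=0.2, seed=42):
    """Return (train, validation, test) datasets; needs the `datasets` package and network access."""
    from datasets import load_dataset

    ds = load_dataset("rlogh/superhero-texts")
    # Train/validation come from the augmented split (fraction is an assumption).
    parts = ds["augmented"].train_test_split(test_size=val_fraction, seed=seed)
    # Held-out test data comes from the manually written original split.
    return parts["train"], parts["test"], ds["original"]
```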


2. Model & Training

  • Base model: distilbert-base-uncased (66M params)
  • Head: Linear classification layer → 2 classes
  • Optimizer: AdamW (lr=2e-5)
  • Batch size: 16
  • Epochs: 5
  • Weight decay: 0.01
  • Seed: 42
  • Hardware: Google Colab (T4 GPU)
  • Frameworks: 🤗 Transformers 4.56.2, PyTorch 2.8.0
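The hyperparameters above map fairly directly onto the 🤗 Transformers `Trainer` API, whose default optimizer is AdamW. The following is a sketch of one possible setup, not the original training script. The helper name, the output directory, and the per-epoch evaluation strategy are assumptions.

```python
# Hyperparameters as listed above.
HPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "seed": 42,
}
ID2LABEL = {0: "DC", 1: "Marvel"}

def build_model_and_args(output_dir="outputs_distilbert_superhero"):
    """Create the classifier and training arguments; needs `transformers` and `torch`."""
    from transformers import AutoModelForSequenceClassification, TrainingArguments

    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=2,
        id2label=ID2LABEL,
        label2id={name: idx for idx, name in ID2LABEL.items()},
    )
    args = TrainingArguments(
        output_dir=output_dir,   # assumed directory name
        eval_strategy="epoch",   # assumption: evaluate once per epoch
        **HPARAMS,
    )
    return model, args
```

Storing `id2label` in the model config means downstream users can read label names from `model.config.id2label` instead of hard-coding them.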

3. Results

Metric        Validation   Test (original split)
Accuracy      1.00         ~0.98
F1 (macro)    1.00         ~0.98

Confusion Matrix (test split)

              Pred: DC   Pred: Marvel
True DC       49         1
True Marvel   1          49
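As a consistency check, the reported test accuracy and macro F1 can be recomputed from the confusion matrix alone. The helper names below are illustrative. The matrix values are the ones shown above.

```python
# Rows are true labels, columns are predictions; index 0 = DC, 1 = Marvel.
CONFUSION = [[49, 1],
             [1, 49]]

def accuracy(cm):
    """Fraction of samples on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(len(cm))) - tp  # predicted k, true other
        fn = sum(cm[k]) - tp                             # true k, predicted other
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

print(f"accuracy={accuracy(CONFUSION):.2f}, macro F1={macro_f1(CONFUSION):.2f}")
```

Both values come out at 0.98, matching the table: 98 of 100 test descriptions are classified correctly, and each class has precision and recall of 49/50.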

Error Analysis

  • 2 misclassifications:
    • One DC description that emphasized “spider powers” was misclassified as Marvel (likely confused with Spider-Man).
    • One Marvel description with a “dark knight” theme was misclassified as DC (likely confused with Batman).
  • Observation: Errors happen when descriptions use overlapping archetypes (e.g., darkness, flying, spiders).

4. Intended Uses

  • Educational purposes: understanding text classification with transformers.
  • Research demo: exploring how data augmentation affects transformer fine-tuning.

⚠️ Limitations:

  • Dataset is small and synthetic → high chance of overfitting.
  • Labels restricted to DC vs Marvel → not generalizable to broader superhero or pop-culture domains.
  • Augmented data may bias toward specific linguistic patterns.

5. How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace YOUR_USERNAME with the account that hosts the model.
model_id = "YOUR_USERNAME/outputs_distilbert_superhero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "A vigilante hero who protects Gotham at night."
# Truncate inputs that exceed the model's maximum sequence length.
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # no gradients are needed for inference
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()

# Label mapping from the dataset: 0 = DC, 1 = Marvel.
print("Prediction:", "DC" if pred == 0 else "Marvel")
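To report a confidence alongside the predicted label, the two logits can be passed through a softmax. The helper below is a small standard-library sketch. The names `softmax` and `label_with_confidence` are illustrative and not part of the model's API.

```python
import math

LABELS = ("DC", "Marvel")  # index 0 = DC, index 1 = Marvel

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]
```

With the snippet above, `label_with_confidence(logits[0].tolist())` would return the label and its probability for the single input text. Given the small, synthetic training data, these probabilities should be read as rough scores rather than calibrated confidences.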

6. Files

  • pytorch_model.bin: fine-tuned model weights
  • config.json: model config
  • tokenizer.json / vocab.txt: tokenizer
  • README.md: this model card

7. License & AI Usage

Dataset license: Apache-2.0 (per dataset card)

Model license: Apache-2.0

AI usage disclosure: Augmented dataset created with Python EDA scripts. Model card drafted with AI assistance, final content reviewed and curated by the student.
