DistilBERT — Superhero Text Classification (DC vs Marvel)

Model: Fine-tuned distilbert-base-uncased
Dataset: rlogh/superhero-texts
Task: Binary text classification (identify whether a description belongs to the DC or the Marvel superhero universe).


1. Dataset

  • Source: Classmate dataset (not self-authored)
  • Splits:
    • original: 100 manually written descriptions
    • augmented: 1000 synthetically generated samples (EDA + character-level noise)
  • Labels:
    • 0 = DC
    • 1 = Marvel

We used the augmented split for training/validation and reserved part of the original split for held-out testing.
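
The augmented split is described as EDA plus character-level noise. The sketch below shows what such a pipeline could look like; it is not the dataset author's script. The function names, the choice of EDA operations (random swap and random deletion), the noise rates, and the figure of 10 variants per description (which would turn 100 originals into 1000 samples) are all assumptions made for illustration.

```python
import random

def random_swap(words, rng):
    """EDA-style random swap: exchange the positions of two words."""
    if len(words) < 2:
        return words[:]
    i, j = rng.sample(range(len(words)), 2)
    out = words[:]
    out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, rng, p=0.1):
    """EDA-style random deletion: drop each word with probability p (assumed rate)."""
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]  # never return an empty text

def char_noise(text, rng, p=0.02):
    """Character-level noise: replace a letter with a random one with probability p."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if c.isalpha() and rng.random() < p else c
        for c in text
    )

def augment(text, label, n_variants=10, seed=42):
    """Return n_variants noisy copies of (text, label); the label is unchanged."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        op = rng.choice([random_swap, random_deletion])
        words = op(text.split(), rng)
        variants.append((char_noise(" ".join(words), rng), label))
    return variants

variants = augment("A vigilante hero who protects Gotham at night.", label=0)
```

Because every variant is derived from an original description, a test example taken from the original split should not also be the source of augmented training examples.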


2. Model & Training

  • Base model: distilbert-base-uncased (66M params)
  • Head: Linear classification layer → 2 classes
  • Optimizer: AdamW (lr=2e-5)
  • Batch size: 16
  • Epochs: 5
  • Weight decay: 0.01
  • Seed: 42
  • Hardware: Google Colab (T4 GPU)
  • Frameworks: 🤗 Transformers 4.56.2, PyTorch 2.8.0
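
As a rough sketch, the hyperparameters above could be passed to the Transformers Trainer API as follows. The card does not say whether Trainer or a custom loop was used, so treat this as one possible setup. The evaluation batch size and the output directory name are assumptions; Trainer uses AdamW by default.

```python
# Hyperparameters taken from the list above.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,  # assumption: eval batch size is not stated
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "seed": 42,
}

def build_training_args(output_dir="outputs_distilbert_superhero"):
    """Build TrainingArguments from HYPERPARAMS (output_dir is a placeholder)."""
    # Imported lazily so the dict above can be inspected without transformers installed.
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```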

3. Results

Metric        Validation   Test (original split)
Accuracy      1.00         ~0.98
F1 (macro)    1.00         ~0.98

Confusion Matrix (test split)

              Pred: DC   Pred: Marvel
True DC       49         1
True Marvel   1          49
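
The reported test metrics can be re-derived from the confusion matrix counts (49, 1, 1, 49). The helper below is a minimal sketch in plain Python; the function name is illustrative and rows are assumed to be true labels, columns predicted labels.

```python
def metrics_from_confusion(cm):
    """Return (accuracy, macro F1) for a square confusion matrix.

    cm[i][j] is the number of examples with true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return accuracy, sum(f1_scores) / n

cm = [[49, 1],   # True DC
      [1, 49]]   # True Marvel
accuracy, macro_f1 = metrics_from_confusion(cm)
print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
# accuracy=0.98, macro F1=0.98
```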

Error Analysis

  • 2 misclassifications:
    • One DC description that emphasized “spider powers” was misclassified as Marvel (likely confused with Spider-Man).
    • One Marvel description with a “dark knight” theme was misclassified as DC.
  • Observation: Errors happen when descriptions use overlapping archetypes (e.g., darkness, flying, spiders).

4. Intended Uses

  • Educational purposes: understanding text classification with transformers.
  • Research demo: exploring how data augmentation affects transformer fine-tuning.

⚠️ Limitations:

  • Dataset is small and synthetic → high chance of overfitting.
  • Labels restricted to DC vs Marvel → not generalizable to broader superhero or pop-culture domains.
  • Augmented data may bias toward specific linguistic patterns.

5. How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace with the actual Hub repository ID of this model.
model_id = "YOUR_USERNAME/outputs_distilbert_superhero"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "A vigilante hero who protects Gotham at night."
# Truncate so that long descriptions stay within the model's maximum input length.
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
pred = torch.argmax(logits, dim=-1).item()

# Label mapping: 0 = DC, 1 = Marvel
print("Prediction:", "DC" if pred == 0 else "Marvel")
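
To report a confidence value alongside the label, the raw logits can be passed through a softmax. The helper below is a small sketch in plain Python; `logits_to_prediction` and `ID2LABEL` are illustrative names, not part of the model repository, and the logits in the example are made up.

```python
import math

ID2LABEL = {0: "DC", 1: "Marvel"}  # label mapping from the Dataset section

def logits_to_prediction(logits):
    """Convert a list of raw logits into (label, probability) with a stable softmax."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]

# With the snippet above, call: logits_to_prediction(logits[0].tolist())
label, prob = logits_to_prediction([2.0, -1.0])  # made-up logits for illustration
print(label, round(prob, 3))
# DC 0.953
```

Given the small, synthetic training set, these probabilities are best read as a relative ranking, not as calibrated confidence.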

6. Files

  • pytorch_model.bin: fine-tuned model weights
  • config.json: model config
  • tokenizer.json / vocab.txt: tokenizer
  • README.md: this model card

7. License & AI Usage

Dataset license: Apache-2.0 (per dataset card)

Model license: Apache-2.0

AI usage disclosure: The augmented dataset was created with Python EDA scripts. The model card was drafted with AI assistance; the final content was reviewed and curated by the student.
