v9: distilbert emotion classifier (8 labels, 0.9356 acc)
language:
  - en
license: apache-2.0
tags:
  - distilbert
  - emotion
  - text-classification
  - emobooks
base_model: distilbert-base-uncased
pipeline_tag: text-classification

EmoBooks: Emotion Classifier

A distilbert-base-uncased model fine-tuned to classify English (and Singlish-normalized) user utterances into 8 emotion labels for the emoBooks Sinhala novel recommender.

Labels

sadness, joy, love, anger, fear, surprise, disgust, calm

The runtime additionally maps these to lonely and anxious via simple keyword rules (see emobooks/classifier.py::LABEL_ALIAS).
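
As a rough illustration of what such a keyword rule could look like, here is a minimal sketch. The rule table, its keywords, and the function name `refine_label` are hypothetical and are not taken from emobooks/classifier.py; the sadness-to-lonely pairing follows the inference example in this card, while the fear-to-anxious pairing is an assumption.

```python
import re

# Hypothetical rule table: base label -> (alias, trigger keywords).
# The real LABEL_ALIAS in emobooks/classifier.py may be structured differently.
ALIAS_RULES = {
    "sadness": ("lonely", ("lonely", "alone")),
    "fear": ("anxious", ("anxious", "worried")),  # assumed pairing
}


def refine_label(label: str, text: str) -> str:
    """Return the alias if the base label has a rule and a keyword appears in the text."""
    rule = ALIAS_RULES.get(label)
    if rule is None:
        return label
    alias, keywords = rule
    words = set(re.findall(r"[a-z']+", text.lower()))
    return alias if words & set(keywords) else label


print(refine_label("sadness", "i feel really lonely tonight"))
print(refine_label("sadness", "i miss my old school"))
```

Keeping the aliases outside the model means the classifier head stays at 8 classes, and the alias vocabulary can change without retraining.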

Training

| Parameter | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Dataset | 42 k / 2.5 k / 2.5 k (train/val/test); dataset/training.csv etc. in the emobooks repo |
| Epochs | 4 |
| Batch size | 32 |
| Max seq len | 160 |
| Learning rate | 2.0e-5 (cosine schedule, 6% warmup) |
| Weight decay | 0.01 |
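
To make the learning-rate setting concrete, the sketch below works out the step counts and the schedule shape in plain Python. It assumes that "42 k" means exactly 42,000 training examples, that there is no gradient accumulation and a single device, that warmup steps are rounded up, and that the schedule is a linear warmup followed by a half-cosine decay to zero. None of these details are confirmed by the card, and the function name `lr_at` is illustrative.

```python
import math

TRAIN_EXAMPLES = 42_000   # assumption: "42 k" taken literally
BATCH_SIZE = 32
EPOCHS = 4
PEAK_LR = 2.0e-5
WARMUP_RATIO = 0.06

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # rounding up is an assumption


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```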

Test metrics (held-out 2.5 k split)

| Metric | Value |
|---|---|
| eval_accuracy | 0.9356 |
| eval_loss | 0.2372 |
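
For a sense of scale, these two numbers can be translated into counts with a few lines of arithmetic. The sketch assumes the test split holds exactly 2,500 examples and that eval_loss is the mean cross-entropy in nats; neither is stated explicitly in this card.

```python
import math

TEST_SIZE = 2500        # assumption: "2.5 k" taken literally
EVAL_ACCURACY = 0.9356
EVAL_LOSS = 0.2372      # assumed to be mean cross-entropy in nats

# Accuracy times split size gives the number of correctly classified utterances.
correct = round(EVAL_ACCURACY * TEST_SIZE)
errors = TEST_SIZE - correct

# exp(-mean cross-entropy) is the geometric mean of the probability
# the model assigned to the true label across the split.
geo_mean_true_prob = math.exp(-EVAL_LOSS)

print(correct, errors, round(geo_mean_true_prob, 3))
```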

Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier; eval() disables dropout.
tok = AutoTokenizer.from_pretrained("DiyRex/emobooks-emotion-classifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "DiyRex/emobooks-emotion-classifier"
).eval()

text = "i feel really lonely tonight"
# Truncate to the maximum sequence length used in training (160 tokens).
inputs = tok(text, return_tensors="pt", truncation=True, max_length=160)
with torch.no_grad():
    logits = model(**inputs).logits
# Map the highest-scoring class index to its label name.
label = model.config.id2label[int(logits.argmax(-1))]
print(label)  # → "sadness"  (then mapped to "lonely" by the runtime)
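
If a confidence score or a ranked list is more useful than a single label, the logits can be turned into probabilities with a softmax. The sketch below is self-contained and uses made-up logit values; the label order in `LABELS` is an assumption, so in practice `model.config.id2label` should be used instead, with the logits taken from the model (for example `logits[0].tolist()`).

```python
import math

# Assumed label order, for illustration only; read the real one from model.config.id2label.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise", "disgust", "calm"]
ID2LABEL = dict(enumerate(LABELS))


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


example_logits = [3.1, -0.5, 0.2, -1.0, 1.4, -2.0, -0.8, 0.1]  # made-up values
for name, prob in top_k(example_logits, ID2LABEL):
    print(f"{name}: {prob:.3f}")
```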

Singlish input

The runtime pre-normalises Singlish/Sinhala affect tokens to English hints before this model runs (see emobooks/normalize.py):

  • mata hari duka → i feel sad. mata hari sad → sadness
  • mata satutui → i feel happy. mata happy → joy
  • mata loku bayak tiyenne → fear-cue prepended → fear
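
The examples above can be approximated with a small token-lookup pass. This is only a sketch of the idea, not the code in emobooks/normalize.py: the two lookup tables, the function name `normalize`, and the exact wording of the fear cue are assumptions chosen to reproduce the first two examples.

```python
# Hypothetical lookup tables; the real normaliser may use different tokens and rules.
REPLACEMENTS = {"duka": "sad", "satutui": "happy"}  # token is swapped, hint is prepended
CUE_ONLY = {"bayak": "scared"}                      # hint is prepended, text is kept (assumed)


def normalize(text: str) -> str:
    """Prepend English emotion hints and swap known affect tokens for English words."""
    hints, out = [], []
    for token in text.lower().split():
        if token in REPLACEMENTS:
            english = REPLACEMENTS[token]
            hints.append(f"i feel {english}.")
            out.append(english)
        elif token in CUE_ONLY:
            hints.append(f"i feel {CUE_ONLY[token]}.")
            out.append(token)
        else:
            out.append(token)
    return " ".join(hints + [" ".join(out)])


for sample in ("mata hari duka", "mata satutui", "mata loku bayak tiyenne"):
    print(normalize(sample))
```

Prepending a plain English hint gives the English-only base model a signal it was trained on, while the rest of the utterance is left in place.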

Place in the stack

user text
   ↓ normalize       (Singlish → English hints)
   ↓ this classifier (one of 8 emotion labels)
   ↓ retrieve        (xlm-roberta-base mean-pooled, cosine)
   ↓ filter          (emotion → tone/pacing/theme rules)
   ↓ dialog          (state machine)
   ↓ respond         (Llama-3-8B + DiyRex/emobooks-llama3-lora)
   ↓ guardrail       (catalog index check; no fake books)
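
The retrieve step is described as mean-pooled embeddings compared by cosine similarity. The sketch below shows that computation on toy 2-dimensional vectors in plain Python. The function names, the catalog entries, and the vectors are made up for illustration; a real run would use per-token hidden states from xlm-roberta-base together with the tokenizer's attention mask.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    count = max(1, len(kept))
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, catalog):
    """Return catalog keys ordered from most to least similar to the query."""
    return sorted(catalog, key=lambda key: cosine(query_vec, catalog[key]), reverse=True)


# Toy data: the third token is padding and must not affect the pooled vector.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
catalog = {"book_a": [1.0, 1.0], "book_b": [1.0, -1.0]}  # hypothetical entries

query_vec = mean_pool(query_tokens, query_mask)
print(query_vec, rank(query_vec, catalog))
```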

License

Apache 2.0