v9 — distilbert emotion classifier (8 labels, 0.9356 acc)

	---
	language:
	- en
	license: apache-2.0
	tags:
	- distilbert
	- emotion
	- text-classification
	- emobooks
	base_model: distilbert-base-uncased
	pipeline_tag: text-classification
	---

	# EmoBooks — Emotion Classifier

	A `distilbert-base-uncased` model fine-tuned to classify English (and
	Singlish-normalized) user utterances into 8 emotion labels for the
	[emoBooks](https://huggingface.co/DiyRex/emobooks-llama3-lora) Sinhala
	novel recommender.

	## Labels

	`sadness`, `joy`, `love`, `anger`, `fear`, `surprise`, `disgust`, `calm`

	The runtime additionally maps these to `lonely` and `anxious` via simple
	keyword rules (see `emobooks/classifier.py::LABEL_ALIAS`).
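
The alias step can be pictured with the sketch below. It is not the contents of `emobooks/classifier.py`: the structure of `LABEL_ALIAS`, the keyword lists, the `fear` to `anxious` pairing, and the helper name `apply_alias` are all assumptions made for illustration.

```python
# Hypothetical alias table: alias -> (base label it refines, trigger keywords).
# The real LABEL_ALIAS in emobooks/classifier.py may be shaped differently.
LABEL_ALIAS = {
    "lonely": ("sadness", ("lonely", "alone", "isolated")),
    "anxious": ("fear", ("anxious", "worried", "nervous")),
}


def apply_alias(label: str, text: str) -> str:
    """Return an alias when the base label matches and a keyword occurs in the text."""
    lowered = text.lower()
    for alias, (base, keywords) in LABEL_ALIAS.items():
        if label == base and any(keyword in lowered for keyword in keywords):
            return alias
    # No rule fired: keep the classifier's own label.
    return label
```

Under these assumed rules, `apply_alias("sadness", "i feel really lonely tonight")` returns `"lonely"`, while a `joy` prediction on the same text is left unchanged.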

	## Training

	| Parameter | Value |
	|---|---|
	| Base model | `distilbert-base-uncased` |
	| Dataset | 42 k / 2.5 k / 2.5 k (train/val/test); `dataset/training.csv` etc. in the [emobooks repo](https://github.com/) |
	| Epochs | 4 |
	| Batch size | 32 |
	| Max seq len | 160 |
	| Learning rate | 2.0e-5 (cosine, 6% warmup) |
	| Weight decay | 0.01 |
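
To make the learning-rate row concrete, here is a minimal sketch of a linear-warmup, cosine-decay schedule using the values in the table. It assumes exactly 42,000 training examples, a single device, no gradient accumulation, and warmup rounded up to a whole step; the actual training script may differ on any of these, and the function name `lr_at` is hypothetical.

```python
import math

TRAIN_EXAMPLES = 42_000  # "42 k" from the table, taken as exact (assumption)
BATCH_SIZE = 32
EPOCHS = 4
PEAK_LR = 2.0e-5
WARMUP_RATIO = 0.06

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)  # 1313
total_steps = steps_per_epoch * EPOCHS                    # 5252
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)      # 316


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate climbs from 0 to 2.0e-5 over the first 316 steps and then decays along a half cosine until step 5252.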

	## Test metrics (held-out 2.5 k split)

	| Metric | Value |
	|---|---|
	| eval_accuracy | 0.9356 |
	| eval_loss | 0.2372 |

	## Inference

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load the tokenizer and the fine-tuned classifier in eval mode.
	tok = AutoTokenizer.from_pretrained("DiyRex/emobooks-emotion-classifier")
	model = AutoModelForSequenceClassification.from_pretrained(
	    "DiyRex/emobooks-emotion-classifier"
	).eval()

	text = "i feel really lonely tonight"
	# Truncate to the same maximum sequence length used in training.
	ids = tok(text, return_tensors="pt", truncation=True, max_length=160)
	with torch.no_grad():
	    logits = model(**ids).logits
	# Pick the highest-scoring class and look up its label name.
	label = model.config.id2label[int(logits.argmax(-1))]
	print(label)  # → "sadness" (then mapped to "lonely" by the runtime)
	```
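
If a ranked list with probabilities is more useful than a single label, the logits can be passed through a softmax. The sketch below works on a plain list of floats (for example `logits[0].tolist()` from the snippet above). The id order in `ID2LABEL` is an assumption that simply follows the order of the Labels section; in practice use `model.config.id2label`. The helper name `rank_labels` is hypothetical.

```python
import math

# Assumed id order, copied from the Labels section; prefer model.config.id2label.
ID2LABEL = dict(enumerate(
    ["sadness", "joy", "love", "anger", "fear", "surprise", "disgust", "calm"]
))


def rank_labels(logits, id2label=ID2LABEL, top_k=3):
    """Softmax the logits and return the top_k (label, probability) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[i], p) for i, p in ranked[:top_k]]
```

A low top probability can be a useful signal for the dialog layer to ask a clarifying question instead of committing to one emotion, though whether the emoBooks runtime does this is not stated here.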

	## Singlish input

	The runtime pre-normalises Singlish/Sinhala affect tokens to English
	hints before this model runs (see `emobooks/normalize.py`):

	- `mata hari duka` → `i feel sad. mata hari sad` → sadness
	- `mata satutui` → `i feel happy. mata happy` → joy
	- `mata loku bayak tiyenne` → fear-cue prepended → fear
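
A minimal sketch of how such a normalizer might reproduce the first two examples above. It is not the code in `emobooks/normalize.py`: the lexicon `AFFECT_HINTS`, the hint word chosen for `bayak`, and the function name `normalize` are assumptions for illustration.

```python
# Hypothetical lexicon: Singlish affect token -> English hint word.
AFFECT_HINTS = {
    "duka": "sad",
    "satutui": "happy",
    "bayak": "scared",  # assumed; the card only says a fear cue is prepended
}


def normalize(text: str) -> str:
    """Swap known affect tokens for English hints and prepend an 'i feel ...' cue."""
    hints = []
    rewritten = []
    for token in text.lower().split():
        hint = AFFECT_HINTS.get(token)
        if hint is None:
            rewritten.append(token)
            continue
        rewritten.append(hint)
        if hint not in hints:
            hints.append(hint)
    if not hints:
        return text  # nothing recognised: pass the input through unchanged
    cue = " ".join(f"i feel {hint}." for hint in hints)
    return f"{cue} {' '.join(rewritten)}"
```

The prepended cue gives the English-only classifier a sentence it has seen many variants of, while the rewritten original keeps the remaining context.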

	## Place in the stack

	```
	user text
	↓ normalize (Singlish → English hints)
	↓ this classifier (one of 8 emotion labels)
	↓ retrieve (xlm-roberta-base mean-pooled, cosine)
	↓ filter (emotion → tone/pacing/theme rules)
	↓ dialog (state machine)
	↓ respond (Llama-3-8B + DiyRex/emobooks-llama3-lora)
	↓ guardrail (catalog index check; no fake books)
	```
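
The retrieve stage is described as mean-pooled embeddings compared by cosine similarity. The sketch below shows those two operations on plain Python lists, assuming per-token vectors and an attention mask are already available from the encoder and that at least one token is unmasked. The function names are hypothetical and this is not the emoBooks retrieval code.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

For example, `mean_pool([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [1, 1, 0])` averages only the first two vectors and yields `[2.0, 1.0]`.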

	## License
	Apache 2.0