---
language:
- en
license: apache-2.0
tags:
- distilbert
- emotion
- text-classification
- emobooks
base_model: distilbert-base-uncased
pipeline_tag: text-classification
---

# EmoBooks: Emotion Classifier

A `distilbert-base-uncased` model fine-tuned to classify English (and
Singlish-normalized) user utterances into 8 emotion labels for the
[emoBooks](https://huggingface.co/DiyRex/emobooks-llama3-lora) Sinhala
novel recommender.

## Labels

`sadness`, `joy`, `love`, `anger`, `fear`, `surprise`, `disgust`, `calm`

The runtime additionally maps these labels to `lonely` and `anxious` via simple
keyword rules (see `emobooks/classifier.py::LABEL_ALIAS`).

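To make the alias step concrete, here is a minimal sketch of how a keyword rule could refine a model label. The rule table, the keyword lists, and the `apply_alias` helper are all hypothetical; the actual contents of `LABEL_ALIAS` are not shown in this card and may be structured differently.

```python
import re

# Hypothetical alias rules: runtime label -> (base model label, trigger keywords).
# The real table lives in emobooks/classifier.py::LABEL_ALIAS and may differ.
ALIAS_RULES = {
    "lonely": ("sadness", ("lonely", "alone", "lonesome")),
    "anxious": ("fear", ("anxious", "worried", "nervous")),
}


def apply_alias(label: str, text: str) -> str:
    """Return a finer-grained runtime label when a keyword rule fires."""
    # Lowercase and split on non-letters so "worried!" still matches "worried".
    words = set(re.findall(r"[a-z']+", text.lower()))
    for alias, (base_label, keywords) in ALIAS_RULES.items():
        if label == base_label and words.intersection(keywords):
            return alias
    # No rule matched: keep the classifier's own label.
    return label


print(apply_alias("sadness", "i feel really lonely tonight"))
print(apply_alias("joy", "what a great day"))
```

In this sketch a rule only fires when the keyword appears and the classifier already predicted the matching base label, so a stray keyword cannot override an unrelated prediction.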

## Training

| Parameter | Value |
|---|---|
| Base model | `distilbert-base-uncased` |
| Dataset | 42 k / 2.5 k / 2.5 k (train/val/test), `dataset/training.csv` etc. in the [emobooks repo](https://github.com/) |
| Epochs | 4 |
| Batch size | 32 |
| Max seq len | 160 |
| Learning rate | 2.0e-5 (cosine, 6% warmup) |
| Weight decay | 0.01 |

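The learning-rate row can be unpacked with a little arithmetic. The sketch below derives step counts from the table and evaluates a linear-warmup-then-cosine curve. It assumes "42 k" means exactly 42,000 examples, a single device, no gradient accumulation, and that the last partial batch is kept. The training script itself is not shown here, so treat the numbers as illustrative.

```python
import math

TRAIN_EXAMPLES = 42_000   # "42 k" from the table, taken as exactly 42,000 (assumption)
BATCH_SIZE = 32
EPOCHS = 4
PEAK_LR = 2.0e-5
WARMUP_FRACTION = 0.06    # "6% warmup"

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the rate climbs from 0 to 2.0e-5 over the warmup steps and then decays smoothly to 0 by the final step.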

## Test metrics (held-out 2.5 k split)

| Metric | Value |
|---|---|
| eval_accuracy | **0.9356** |
| eval_loss | 0.2372 |


## Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and the fine-tuned classifier; eval() disables dropout.
tok = AutoTokenizer.from_pretrained("DiyRex/emobooks-emotion-classifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "DiyRex/emobooks-emotion-classifier"
).eval()

text = "i feel really lonely tonight"
# Truncate to the same maximum sequence length used in training (160).
ids = tok(text, return_tensors="pt", truncation=True, max_length=160)
with torch.no_grad():
    logits = model(**ids).logits
# Map the highest-scoring class index back to its label name.
label = model.config.id2label[int(logits.argmax(-1))]
print(label)  # → "sadness" (then mapped to "lonely" by the runtime)
```

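If a confidence score is needed rather than a single label, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python and made-up logits so that it runs without downloading the model. The label order shown is only the order in which this card lists the labels, which is an assumption; in real use, read the mapping from `model.config.id2label`.

```python
import math

# Assumed order for illustration only; the model's id2label is authoritative.
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise", "disgust", "calm"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits standing in for one row of model(**ids).logits.
example_logits = [4.1, -0.3, 0.2, -1.0, 1.5, -2.2, -1.7, 0.4]
for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```

With the real model, `logits[0].tolist()` from the inference snippet could be passed in place of `example_logits`.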

## Singlish input

The runtime pre-normalises Singlish/Sinhala affect tokens to English
hints before this model runs (see `emobooks/normalize.py`):

- `mata hari duka` → `i feel sad. mata hari sad` → **sadness**
- `mata satutui` → `i feel happy. mata happy` → **joy**
- `mata loku bayak tiyenne` → fear-cue prepended → **fear**

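The first two examples above are consistent with a simple rule: swap the affect token for an English hint word and prepend an `i feel <hint>.` cue. The following is a minimal sketch of that rule, not the code in `emobooks/normalize.py`. The `AFFECT_HINTS` lexicon and the `normalize` function are hypothetical, and the fear example is left out because the card does not show its exact output.

```python
# Hypothetical lexicon: Singlish affect token -> English hint word.
# Only the two tokens whose full output appears in the examples are included.
AFFECT_HINTS = {
    "duka": "sad",
    "satutui": "happy",
}


def normalize(text: str) -> str:
    """Replace known affect tokens with English hints and prepend a cue sentence."""
    hints = []
    rewritten = []
    for token in text.lower().split():
        hint = AFFECT_HINTS.get(token)
        if hint is None:
            rewritten.append(token)  # Unknown token: keep as-is.
        else:
            rewritten.append(hint)   # Known affect token: swap in the hint.
            if hint not in hints:
                hints.append(hint)
    body = " ".join(rewritten)
    if not hints:
        return body
    cue = " ".join(f"i feel {hint}." for hint in hints)
    return f"{cue} {body}"


print(normalize("mata hari duka"))
print(normalize("mata satutui"))
```

Text without any known affect token passes through unchanged apart from lowercasing, which is a design choice of this sketch rather than documented behaviour.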

## Place in the stack

```
user text
  → normalize (Singlish → English hints)
  → this classifier (one of 8 emotion labels)
  → retrieve (xlm-roberta-base mean-pooled, cosine)
  → filter (emotion → tone/pacing/theme rules)
  → dialog (state machine)
  → respond (Llama-3-8B + DiyRex/emobooks-llama3-lora)
  → guardrail (catalog index check; no fake books)
```

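The retrieve step is described as mean-pooled embeddings compared by cosine similarity. The sketch below illustrates those two operations on toy 2-dimensional vectors in plain Python. It does not load `xlm-roberta-base`, and the `book_a`/`book_b`/`book_c` identifiers and their vectors are placeholders rather than entries from the real catalog.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is ignored)."""
    total = [0.0] * len(token_vectors[0])
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / max(count, 1) for value in total]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, catalog):
    """Return catalog keys ordered from most to least similar to the query."""
    return sorted(catalog, key=lambda key: cosine(query_vec, catalog[key]), reverse=True)


# Toy token embeddings; the third row is padding and is masked out.
query_tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query_mask = [1, 1, 0]
query_vec = mean_pool(query_tokens, query_mask)

# Placeholder catalog of already-pooled book vectors.
catalog = {"book_a": [1.0, 1.0], "book_b": [1.0, -1.0], "book_c": [1.0, 0.0]}
print(rank(query_vec, catalog))
```

Masking before averaging matters because padded positions would otherwise pull the pooled vector toward whatever values the padding rows hold.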

## License
Apache 2.0