	## Error analysis (held-out test set)

	Evaluation of the fine-tuned weights on the `test` split of `dair-ai/emotion`: 2,000 examples the model never saw in training. To reproduce, run `python -m emotion.error_report`.

	| metric | score |
	|---|---|
	| accuracy | 0.920 |
	| macro F1 | 0.874 |
	| weighted F1 | 0.920 |

	### Per-class

	| class | precision | recall | F1 | support |
	|---|---|---|---|---|
	| sadness | 0.957 | 0.960 | 0.959 | 581 |
	| joy | 0.930 | 0.944 | 0.937 | 695 |
	| love | 0.810 | 0.805 | 0.808 | 159 |
	| anger | 0.953 | 0.895 | 0.923 | 275 |
	| fear | 0.867 | 0.929 | 0.897 | 224 |
	| surprise | 0.786 | 0.667 | 0.721 | 66 |

	### Confusion matrix

	![Confusion matrix](assets/confusion_matrix.png)

	<details><summary>Raw counts (rows = true, cols = predicted)</summary>

	| true ↓ / pred → | sadness | joy | love | anger | fear | surprise | recall |
	|---|---|---|---|---|---|---|---|
	| sadness | 558 | 10 | 2 | 4 | 7 | 0 | 0.96 |
	| joy | 6 | 656 | 28 | 3 | 1 | 1 | 0.94 |
	| love | 0 | 28 | 128 | 3 | 0 | 0 | 0.81 |
	| anger | 13 | 4 | 0 | 246 | 12 | 0 | 0.89 |
	| fear | 3 | 0 | 0 | 2 | 208 | 11 | 0.93 |
	| surprise | 3 | 7 | 0 | 0 | 12 | 44 | 0.67 |

	</details>
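The per-class and aggregate scores in this report can be recomputed from the raw counts alone. The block below is a minimal, stdlib-only sketch of that arithmetic, not the code behind `emotion.error_report` (which is not shown here); the names `LABELS`, `CM`, `per_class_scores`, and `summarize` are illustrative assumptions.

```python
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Raw counts copied from the table above: rows = true class, cols = predicted.
CM = [
    [558, 10, 2, 4, 7, 0],
    [6, 656, 28, 3, 1, 1],
    [0, 28, 128, 3, 0, 0],
    [13, 4, 0, 246, 12, 0],
    [3, 0, 0, 2, 208, 11],
    [3, 7, 0, 0, 12, 44],
]


def per_class_scores(cm):
    """Return a list of (precision, recall, f1, support), one tuple per class."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        support = sum(cm[k])                   # row sum: examples whose true class is k
        predicted = sum(row[k] for row in cm)  # column sum: examples predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1, support))
    return scores


def summarize(cm):
    """Return (accuracy, macro F1, support-weighted F1) for a confusion matrix."""
    scores = per_class_scores(cm)
    total = sum(support for _, _, _, support in scores)
    accuracy = sum(cm[k][k] for k in range(len(cm))) / total
    # Macro F1: plain mean over classes, so rare classes count as much as common ones.
    macro_f1 = sum(f1 for _, _, f1, _ in scores) / len(scores)
    # Weighted F1: mean over classes weighted by support.
    weighted_f1 = sum(f1 * support for _, _, f1, support in scores) / total
    return accuracy, macro_f1, weighted_f1


if __name__ == "__main__":
    for label, (p, r, f1, n) in zip(LABELS, per_class_scores(CM)):
        print(f"{label:<9} P={p:.3f} R={r:.3f} F1={f1:.3f} n={n}")
    acc, macro, weighted = summarize(CM)
    print(f"accuracy={acc:.3f} macro_f1={macro:.3f} weighted_f1={weighted:.3f}")
```

Rounded to three decimals, the output should match the metric and per-class tables, which makes it a quick consistency check between the tables and the raw counts.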

	### Where it fails

	The three largest confusions are joy → love (28), love → joy (28), and anger → sadness (13). The largest error axis is joy ↔ love, with 56 mutual misclassifications (28 in each direction). Both classes consist of short, affect-positive messages. The raw counts are symmetric but the rates are not: 28 errors are 17.6% of `love` (n=159) and only 4.0% of `joy` (n=695), so the rarer class absorbs most of the damage, consistent with the model leaning toward the higher-frequency neighbour. The weakest classes are `surprise` (F1 0.72, n=66) and `love` (F1 0.81, n=159), the two rarest in the data. That is why macro F1 (0.874) sits below accuracy (0.920): macro F1 weights every class equally and so exposes the rare-class weakness that accuracy hides. `surprise` leaks mainly into `fear` (12) and `joy` (7). Most mistakes fall between semantically adjacent classes rather than at random: the model loses mainly on the low-support classes and is not misfiring broadly.
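The ranking of confusions can be read off the raw counts mechanically. The sketch below lists every off-diagonal cell with its count and its rate relative to the true class's support; the function name `ranked_confusions` and the choice to break ties by row-major order are assumptions made for illustration.

```python
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Raw counts from the confusion matrix: rows = true class, cols = predicted.
CM = [
    [558, 10, 2, 4, 7, 0],
    [6, 656, 28, 3, 1, 1],
    [0, 28, 128, 3, 0, 0],
    [13, 4, 0, 246, 12, 0],
    [3, 0, 0, 2, 208, 11],
    [3, 7, 0, 0, 12, 44],
]


def ranked_confusions(cm, labels):
    """List off-diagonal cells as (true, predicted, count, rate), largest count first.

    rate is the count divided by the true class's support (its row sum).
    """
    cells = []
    for i, row in enumerate(cm):
        support = sum(row)
        for j, count in enumerate(row):
            if i != j and count > 0:
                cells.append((labels[i], labels[j], count, count / support))
    # sorted() is stable, so cells with equal counts keep their row-major order.
    return sorted(cells, key=lambda cell: cell[2], reverse=True)


if __name__ == "__main__":
    for true, pred, count, rate in ranked_confusions(CM, LABELS)[:5]:
        print(f"{true} -> {pred}: {count} ({rate:.1%} of {true})")
```

Sorting by `rate` instead of `count` gives a different view: it surfaces the confusions that hurt small classes most, which raw counts understate.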

	### Confidently wrong (highest-confidence mistakes)

	The most useful slice for debugging: cases the model got wrong and was sure about.

	| true | predicted | conf | text |
	|---|---|---|---|
	| joy | sadness | 0.99 | i feel very saddened that the king whom i once quite respected as far as monarchs go was i… |
	| love | joy | 0.99 | i feel affirmed gracious sensuous and will have less self doubt when a href http generatio… |
	| sadness | joy | 0.99 | i first started reading city of dark magic i thought it would be a challenge to actually e… |
	| anger | sadness | 0.98 | i actually was in a meeting last week where someone yelled at an older lady because her ph… |
	| sadness | joy | 0.98 | i felt a stronger wish to be free from self cherishing through my refuge practice and a re… |
	| anger | sadness | 0.98 | i really dont like quinn because i feel like she will just end up hurting barney and i hat… |
	| anger | joy | 0.98 | whenever i put myself in others shoes and try to make the person happy |
	| sadness | joy | 0.98 | i remain hopeful that the feeling i have is actually excitement a long missed friend |
	| sadness | anger | 0.98 | i hate these feelings in my heart i hate that work stressed me out i hate that cornelius w… |
	| love | joy | 0.97 | i walked to school he felt the bounce in his step the overjoyed feelings of youth and the … |
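A slice like this can be produced by filtering to misclassified examples and sorting by the probability assigned to the predicted class. The sketch below assumes that `conf` means the largest class probability (for example, the maximum softmax output); the report does not state how it is defined. The helper `confident_errors` and the `TOY` inputs are hypothetical and are not rows from the test set.

```python
LABELS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def confident_errors(examples, labels, top_k=10):
    """Return the top_k misclassified examples, most confident first.

    examples: iterable of (text, true_label, probs), where probs is a list of
    class probabilities aligned with labels. Confidence is assumed to be the
    probability of the predicted (argmax) class.
    """
    errors = []
    for text, true_label, probs in examples:
        pred_idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
        predicted = labels[pred_idx]
        if predicted != true_label:
            errors.append((true_label, predicted, probs[pred_idx], text))
    errors.sort(key=lambda e: e[2], reverse=True)
    return errors[:top_k]


# Made-up inputs for illustration only.
TOY = [
    ("example a", "joy", [0.01, 0.97, 0.02, 0.0, 0.0, 0.0]),   # correct
    ("example b", "love", [0.0, 0.90, 0.10, 0.0, 0.0, 0.0]),   # wrong, conf 0.90
    ("example c", "anger", [0.99, 0.0, 0.0, 0.01, 0.0, 0.0]),  # wrong, conf 0.99
    ("example d", "fear", [0.0, 0.0, 0.0, 0.1, 0.5, 0.4]),     # correct
]

if __name__ == "__main__":
    for true, pred, conf, text in confident_errors(TOY, LABELS, top_k=2):
        print(f"| {true} | {pred} | {conf:.2f} | {text} |")
```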
