emotion-clf-original -- Baseline Emotion Classifier (Original Training Data)

Baseline emotion classification model trained on original, unrefined informal Twitter-style text. Part of the SDVM before/after comparison suite.

Cross-Evaluation Results (2x2 Matrix)

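Both models were evaluated on both the original and the SDVM-refined test data (30 samples). The refined-trained model scores higher on both test sets, which suggests that SDVM data refinement improves model quality on original inputs as well as refined ones. With 30 test samples, each prediction is worth 3.33 percentage points, so the gaps below correspond to one or two predictions.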

Model \ Test Data                        Original Test (Accuracy)   Refined Test (Accuracy)
Original-trained (this model)            40.00%                     43.33%
Refined-trained (emotion-clf-refined)    43.33%                     46.67%

Model \ Test Data                        Original Test (Macro F1)   Refined Test (Macro F1)
Original-trained (this model)            0.3881                     0.4281
Refined-trained (emotion-clf-refined)    0.3952                     0.4481
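
The 2x2 layout above can be sketched as a loop that scores every model on every test set. This is only an illustration of the evaluation pattern, not the code in train_compare.py: the names `cross_evaluate`, `accuracy`, `Predictor`, and `TestSet` are hypothetical, and the toy "models" below are constant-output stand-ins rather than the real classifiers.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A "model" here is any callable mapping a sequence of texts to a list of labels.
Predictor = Callable[[Sequence[str]], List[str]]
# A test set is a (texts, gold_labels) pair.
TestSet = Tuple[Sequence[str], Sequence[str]]


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cross_evaluate(
    models: Dict[str, Predictor],
    test_sets: Dict[str, TestSet],
) -> Dict[Tuple[str, str], float]:
    """Score every model on every test set, keyed by (model, test set)."""
    results = {}
    for model_name, predict in models.items():
        for test_name, (texts, gold) in test_sets.items():
            results[(model_name, test_name)] = accuracy(gold, predict(texts))
    return results


# Toy stand-ins (hypothetical): constant-output "models" and two-item test sets.
models = {
    "original-trained": lambda texts: ["joy"] * len(texts),
    "refined-trained": lambda texts: ["sadness"] * len(texts),
}
test_sets = {
    "original": (["so happy rn", "feeling low"], ["joy", "sadness"]),
    "refined": (["I am so happy right now.", "I am feeling low."], ["joy", "sadness"]),
}

matrix = cross_evaluate(models, test_sets)
for (model_name, test_name), acc in matrix.items():
    print(f"{model_name} on {test_name}: {acc:.2%}")
```

With real scikit-learn pipelines, `pipe.predict` could be passed directly as the predictor, since it accepts a list of texts and returns one label per text.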

Key takeaways:

  1. The refined-trained model wins on both test splits -- 43.33% on original test, 46.67% on refined test
  2. Both models improve on refined test data -- cleaning input helps even this original-trained model (+3.33pp)
  3. Best result: refined model + refined test = 46.67% -- a 16.7% relative improvement over this baseline (40%)
  4. The gain does not appear to be style overfitting -- the refined-trained model also scores higher on original data (43.33% vs 40.00%)
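
The percentage-point and relative figures in these takeaways can be rechecked with a few lines of arithmetic. The sketch below assumes the correct-prediction counts implied by the reported accuracies on 30 test samples (12, 13, and 14 out of 30); the counts are derived here and are not stated in the card itself.

```python
N_TEST = 30  # test-set size stated in the model card

# Correct-prediction counts implied by the reported accuracies (k / 30).
correct = {
    ("original-trained", "original"): 12,  # 40.00%
    ("original-trained", "refined"): 13,   # 43.33%
    ("refined-trained", "original"): 13,   # 43.33%
    ("refined-trained", "refined"): 14,    # 46.67%
}


def pct(k: int, n: int = N_TEST) -> float:
    """Accuracy in percent for k correct out of n."""
    return 100.0 * k / n


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


baseline = correct[("original-trained", "original")]

# Absolute gain (percentage points) from refining only the test input.
pp_gain = pct(correct[("original-trained", "refined")]) - pct(baseline)

# Relative gains over the 40% baseline.
rel_best = relative_gain(correct[("refined-trained", "refined")], baseline)
rel_refined_on_original = relative_gain(correct[("refined-trained", "original")], baseline)

print(f"refined input, original model: +{pp_gain:.2f}pp")
print(f"refined model + refined test:  +{rel_best:.1f}% relative")
print(f"refined model, original test:  +{rel_refined_on_original:.1f}% relative")
```

Percentage points (pp) are absolute differences between accuracies, while the relative figures divide that difference by the 40% baseline, which is why a one-sample change reads as +3.33pp but +8.3% relative.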

Model Details

Property            Value
Architecture        TF-IDF (1-2 gram, 10K features) + Logistic Regression
Reference           NLP with Transformers Ch. 2 baseline
Training samples    90 (15 per class x 6 classes)
Test samples        30 (5 per class)
Classes             joy, sadness, anger, fear, surprise, love
Training data       Original informal text (contractions, abbreviations, typos)
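
For readers who want a feel for the architecture, here is a minimal scikit-learn sketch of a TF-IDF plus logistic regression pipeline. Only the n-gram range and the 10K feature cap come from the table above; `build_pipeline` is a hypothetical helper, `max_iter=1000` and all other settings are assumptions or library defaults, and the four training texts are the samples shown in this card rather than the full 90-sample set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_pipeline():
    """TF-IDF (unigrams + bigrams, at most 10K features) feeding logistic regression."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), max_features=10_000),
        LogisticRegression(max_iter=1000),  # assumption: not stated in the card
    )


# Tiny illustrative training set taken from the samples in this card.
train_texts = [
    "omg i just got the job i cant believe it im literally shaking rn",
    "just had the best day ever with my fav people honestly life is so good",
    "i keep having nightmares and idk why im so scared rn",
    "my roommate ate my food AGAIN im literally gonna lose it",
]
train_labels = ["joy", "joy", "fear", "anger"]

pipe = build_pipeline()
pipe.fit(train_texts, train_labels)

pred = pipe.predict(["im so scared of the dark"])
print("predicted:", pred[0])
print("classes:  ", list(pipe.classes_))
```

Wrapping the vectorizer and classifier in one pipeline means raw strings can be passed to `fit` and `predict`, and the whole object can be saved and loaded as a single joblib file.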

Performance

Metric Value
Accuracy 40.00%
Macro F1 0.3881
Log-loss 1.6826
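
To put these numbers in context, it can help to compare them with what uniform guessing would score on a balanced six-class test set. The sketch below assumes the reported log-loss uses the natural logarithm (the scikit-learn default); that is an assumption, since the card does not say.

```python
import math

NUM_CLASSES = 6  # joy, sadness, anger, fear, surprise, love

# Expected accuracy of uniform random guessing on a balanced test set.
chance_accuracy = 1 / NUM_CLASSES

# Log-loss of a predictor that always assigns probability 1/6 to every class:
# -ln(1/6) = ln(6), independent of the true labels.
uniform_log_loss = math.log(NUM_CLASSES)

reported_accuracy = 0.40    # from the table above
reported_log_loss = 1.6826  # from the table above

print(f"chance accuracy:  {chance_accuracy:.2%}  (reported: {reported_accuracy:.2%})")
print(f"uniform log-loss: {uniform_log_loss:.4f} (reported: {reported_log_loss:.4f})")
```

Under that assumption, the reported accuracy is well above chance, while the reported log-loss sits only slightly below the uniform-guess value.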

Per-Class F1

Emotion F1
joy 0.4000
sadness 0.2500
anger 0.3333
fear 0.6154
surprise 0.4444
love 0.2857
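
Macro F1 is the unweighted mean of the per-class F1 scores, so the value reported under Performance can be recomputed from the table above:

```python
# Per-class F1 scores copied from the table above.
per_class_f1 = {
    "joy": 0.4000,
    "sadness": 0.2500,
    "anger": 0.3333,
    "fear": 0.6154,
    "surprise": 0.4444,
    "love": 0.2857,
}

# Macro averaging gives each class equal weight regardless of its frequency.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

print(f"macro F1 = {macro_f1:.4f}")
```

Because the test set is balanced (5 samples per class), macro and weighted averaging would coincide here.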

Training Data Sample

Label Text (original)
joy omg i just got the job i cant believe it im literally shaking rn
joy just had the best day ever with my fav people honestly life is so good
fear i keep having nightmares and idk why im so scared rn
anger my roommate ate my food AGAIN im literally gonna lose it

Usage

import joblib
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="SDVM/emotion-clf-original", filename="model.joblib")
pipe = joblib.load(model_path)

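texts = ["i cant believe i got the job omg im so happy rn", "feeling really low today idk why"]
predictions = pipe.predict(texts)  # one predicted label per input text
print(predictions)  # e.g. ['joy' 'sadness']; actual labels depend on the model

# Class probabilities: one row per text, columns ordered as in pipe.classes_
probas = pipe.predict_proba(texts)
classes = pipe.classes_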
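
Each row of `probas` lines up with `classes`, so the most likely emotions for a text can be read off by pairing the two and sorting. The helper below is a hypothetical sketch; `top_k` is not part of this repository, and the probability values are made up for illustration.

```python
from typing import List, Sequence, Tuple


def top_k(
    classes: Sequence[str],
    proba_row: Sequence[float],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    pairs = sorted(zip(classes, proba_row), key=lambda pair: pair[1], reverse=True)
    return [(label, float(p)) for label, p in pairs[:k]]


# Illustrative values only; in practice use pipe.classes_ and a row of probas.
classes = ["anger", "fear", "joy", "love", "sadness", "surprise"]
row = [0.10, 0.12, 0.35, 0.20, 0.15, 0.08]

ranked = top_k(classes, row, k=2)
print(ranked)
```

With the loaded model, a call such as `top_k(pipe.classes_, probas[0])` would rank the emotions for the first input text.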

Reproduce

The full training pipeline is included in train_compare.py. To reproduce:

pip install sdvm scikit-learn
export SDVM_API_KEY="your-key-here"
python train_compare.py

This generates labeled emotion samples, refines training samples using the SDVM API, trains both original and refined classifiers, evaluates on a held-out test set, and saves results to train_results.json.

Comparison

See SDVM/emotion-clf-refined for the model trained on SDVM-refined data -- it achieves 43.33% accuracy (+8.3% relative improvement) on original test data, and 46.67% on refined test data.

About SDVM

SDVM (Synthetic Data Vending Machine) refines NLP training datasets using proprietary AI models, improving grammar, spelling, and fluency while preserving labels and meaning.
