Lithuanian E-commerce Sentiment Classifier — MNB (TF-IDF, binary)
A Multinomial Naïve Bayes classifier trained from scratch on 47,131 Lithuanian e-commerce reviews. Predicts binary sentiment polarity (positive / negative).
Model details
- Developed by: Austėja Rušėnaitė
- Algorithm: `sklearn.naive_bayes.MultinomialNB`, `alpha = 1.0` (Laplace smoothing).
- Features: TF-IDF (5,000 features, unigrams + bigrams, `min_df = 2`, `sublinear_tf = True`) plus four engineered features (`exclamation_count`, `review_word_count`, `avg_word_length`, `capital_count`) MinMax-scaled to [0, 1].
- Labels: binary (`positive` and `negative`). The corpus is naturally three-class (positive / neutral / negative) but the neutral class was dropped at training time.
- Training data size: 47,131 reviews (positive 37,269; negative 9,862).
- Class handling: `sample_weight = compute_sample_weight("balanced", y)`. SMOTE was deliberately not used because interpolating in sparse TF-IDF space produces vectors that do not correspond to plausible Lithuanian sentences.
- Language: Lithuanian (`lt`).
- Domain: Lithuanian e-commerce reviews.
- Saved with scikit-learn: 1.8.0.
- License: Apache 2.0
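As a worked illustration of the class handling listed above, the sketch below reproduces by hand what "balanced" weighting computes, namely `n_samples / (n_classes * class_count)`, using the class counts from this card. The variable names are illustrative only and are not taken from the training code.

```python
# Class counts as reported in this card.
counts = {"positive": 37_269, "negative": 9_862}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" weighting: every sample of a class gets
# n_samples / (n_classes * number_of_samples_in_that_class).
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    # Total weighted mass per class ends up equal across classes.
    print(f"{label}: weight per sample {weight:.4f}, total mass {weight * counts[label]:.1f}")
```

Each negative review therefore counts roughly 3.8 times as much as a positive one during fitting, which is consistent with the high negative-class recall reported in the evaluation.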
Files in this repository
| File | Purpose |
|---|---|
| `mnb_sentiment_binary_model.joblib` | trained MNB classifier |
| `mnb_tfidf_binary_vectorizer.joblib` | fitted TfidfVectorizer |
| `mnb_features_binary_scaler.joblib` | fitted MinMaxScaler for engineered features |
| `README.md` | this card |
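A minimal sketch of loading the three artifacts, assuming they have already been downloaded into a local directory. The helper `load_artifacts` and the dictionary keys are hypothetical names introduced here for illustration; only the file names come from the table above.

```python
from pathlib import Path

import joblib

# File names as listed in this repository.
ARTIFACT_NAMES = {
    "model": "mnb_sentiment_binary_model.joblib",
    "vectorizer": "mnb_tfidf_binary_vectorizer.joblib",
    "scaler": "mnb_features_binary_scaler.joblib",
}


def load_artifacts(directory):
    """Load the classifier, vectorizer and scaler from a local directory.

    Hypothetical helper, not part of the repository.
    """
    directory = Path(directory)
    return {key: joblib.load(directory / name) for key, name in ARTIFACT_NAMES.items()}
```

Because joblib files are pickles, they should only be loaded from a trusted source, and ideally with the scikit-learn version the card states they were saved with (1.8.0).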
Intended use
Polarity classification of Lithuanian e-commerce reviews — for example, as a feature for downstream aggregation across the set of reviews available for a given merchant.
Three-class sentiment is not supported by this checkpoint. Domains other than Lithuanian e-commerce reviews are out of scope: the vocabulary and engineered features are dataset-specific.
Training procedure
Two-stage preprocessing applied to the raw review text before vectorisation:
- Lower-cased, Lithuanian-letters-only normalisation (`[^a-ząčęėįšųūž\s]` stripped).
- spaCy `lt_core_news_md` lemmatisation and stop-word removal.
The four engineered features are computed from the raw (pre-normalisation) text so that exclamation_count, capital_count and friends are preserved.
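The first preprocessing stage and the engineered features might look roughly like the sketch below. The regular expression is the one quoted in this card; everything else is an assumption: the card does not give exact feature definitions, so whitespace tokenisation, counting `!` characters and counting upper-case characters are guesses, and the spaCy lemmatisation stage is omitted so that the example runs without a language model.

```python
import re

# Character class quoted in the card: anything that is not a lower-case
# Lithuanian letter or whitespace is stripped.
NON_LT_LETTERS = re.compile(r"[^a-ząčęėįšųūž\s]")


def normalise(text):
    """Stage 1 only: lower-case, then strip non-letter characters."""
    return NON_LT_LETTERS.sub("", text.lower())


def engineered_features(raw_text):
    """Compute the four features from the raw, pre-normalisation text.

    The definitions here are assumptions, not the author's exact code.
    """
    words = raw_text.split()
    return {
        "exclamation_count": raw_text.count("!"),
        "review_word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "capital_count": sum(ch.isupper() for ch in raw_text),
    }


raw = "Puiki PREKĖ!!! 10/10"
print(normalise(raw).split())
print(engineered_features(raw))
```

Computing the features before normalisation matters here: after lower-casing and stripping, both the exclamation marks and the capitals in `raw` would be gone.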
The TF-IDF matrix is hstacked with the scaled engineered features to produce a 5,004-dimensional sparse input matrix, on which MNB is fitted with balanced sample weights.
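The stacking and fitting step can be sketched on toy data as follows. The texts, labels and engineered values are made up for illustration, and mapping "5,000 features" to `max_features=5000` is an assumption about how the vectorizer was configured; with the real corpus the stacked matrix would have 5,004 columns rather than the handful produced here.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils.class_weight import compute_sample_weight

# Toy, already-lemmatised reviews (hypothetical data).
texts = [
    "puikus prekė greitas pristatymas",
    "puikus kokybė rekomenduoti",
    "blogas kokybė prekė sugesti",
    "greitas pristatymas puikus prekė",
    "blogas aptarnavimas prekė sugesti",
    "puikus aptarnavimas rekomenduoti",
]
y = np.array(["positive", "positive", "negative", "positive", "negative", "positive"])

# Made-up engineered features, one row per review:
# exclamation_count, review_word_count, avg_word_length, capital_count.
engineered = np.array(
    [[1, 4, 6.2, 1], [0, 3, 7.0, 1], [2, 4, 5.5, 3],
     [0, 4, 6.0, 0], [3, 4, 7.2, 5], [0, 3, 8.0, 1]],
    dtype=float,
)

# Hyperparameters as stated in the card (max_features is an assumed mapping).
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=2, sublinear_tf=True)
X_tfidf = vectorizer.fit_transform(texts)

# MinMax scaling keeps the engineered features non-negative, as MNB requires.
scaler = MinMaxScaler()
X_eng = csr_matrix(scaler.fit_transform(engineered))

# Column-wise stack: TF-IDF columns first, then the four engineered columns.
X = hstack([X_tfidf, X_eng]).tocsr()

model = MultinomialNB(alpha=1.0)
model.fit(X, y, sample_weight=compute_sample_weight("balanced", y))
print(X.shape, model.classes_)
```

At inference time the same fitted vectorizer and scaler would be applied with `transform` rather than `fit_transform`, and the columns stacked in the same order.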
Evaluation
Reported on 5-fold stratified cross-validation over the full labelled dataset (every review appears as a test instance exactly once). No held-out split.
| Metric | Value |
|---|---|
| Accuracy | 0.8952 ± 0.0039 |
| macro-F1 (headline) | 0.8618 ± 0.0043 |
| weighted-F1 | 0.9013 ± 0.0034 |
| macro-precision | 0.8320 ± 0.0044 |
| macro-recall | 0.9208 ± 0.0016 |
| ROC-AUC | 0.9743 ± 0.0011 |
| Average precision | 0.9933 ± 0.0003 |
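For readers who want to reproduce the protocol, the following is a rough sketch of 5-fold stratified cross-validation with balanced sample weights, run on synthetic count data rather than the review corpus. The shuffling, the random seed, the use of the population standard deviation, and whether the vectorizer was refitted per fold are not stated in this card, so those choices are assumptions.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: 160 "positive" (1) and
# 40 "negative" (0) samples with class-dependent Poisson count rates.
y = np.array([1] * 160 + [0] * 40)
rates = np.where(y[:, None] == 1, [3.0, 1.0, 1.0], [1.0, 3.0, 1.0])
X = rng.poisson(rates)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # assumed settings
scores, tested = [], []
for train_idx, test_idx in skf.split(X, y):
    model = MultinomialNB(alpha=1.0)
    # Weights are recomputed from the training fold only.
    fold_weights = compute_sample_weight("balanced", y[train_idx])
    model.fit(X[train_idx], y[train_idx], sample_weight=fold_weights)
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx]), average="macro"))
    tested.extend(test_idx.tolist())

print(f"macro-F1: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

The `tested` list collects every test index, which makes it easy to confirm that each sample is evaluated exactly once across the five folds.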
Per-class breakdown (mean ± std across folds):
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| Negative | 0.6746 ± 0.0095 | 0.9647 ± 0.0023 | 0.7939 ± 0.0057 |
| Positive | 0.9895 ± 0.0006 | 0.8768 ± 0.0055 | 0.9297 ± 0.0028 |
Aggregated confusion matrix across all five folds:
| | Predicted: Negative | Predicted: Positive |
|---|---|---|
| Actual: Negative | 9,514 | 348 |
| Actual: Positive | 4,592 | 32,677 |
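As a consistency check, the headline metrics can be recomputed from the aggregated confusion matrix. The sketch below pools all five folds, so its results can differ in the fourth decimal from the per-fold means reported above; the helper `prf` is a hypothetical name.

```python
# Aggregated confusion matrix from the card (rows: actual, columns: predicted).
neg_as_neg, neg_as_pos = 9_514, 348
pos_as_neg, pos_as_pos = 4_592, 32_677


def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)


negative = prf(tp=neg_as_neg, fp=pos_as_neg, fn=neg_as_pos)
positive = prf(tp=pos_as_pos, fp=neg_as_pos, fn=pos_as_neg)

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total
macro_f1 = (negative[2] + positive[2]) / 2

print("negative P/R/F1:", [round(v, 4) for v in negative])
print("positive P/R/F1:", [round(v, 4) for v in positive])
print("accuracy:", round(accuracy, 4), "macro-F1:", round(macro_f1, 4))
```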
Limitations and caveats
- Binary scheme. The neutral class was dropped because Lithuanian three-class sentiment is dominated by neutral-class confusion. Reviews whose sentiment is genuinely mixed or neutral will be forced into the closer of the two binary classes.
- Class imbalance. The corpus is ~79 % positive / ~21 % negative, and the model is trained with balanced sample weights. The resulting operating point has high recall on the minority class (0.9647) at the cost of lower precision (0.6746). Downstream consumers requiring a different operating point can move along the precision–recall curve without retraining.
- Domain specificity. The TF-IDF vocabulary is fitted on Lithuanian e-commerce review text. Application to other text genres is not recommended without retraining.
- No language detection. Inputs in languages other than Lithuanian will be processed by the same preprocessing pipeline and will produce nonsense predictions. Language detection is the responsibility of the caller.
- Bag-of-words limitations. As a bag-of-words model, the classifier does not represent word order beyond the bigram features and is not sensitive to negation scope. Reviews of the form "I expected it to be terrible but it was actually excellent" can be misclassified.
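Moving along the precision–recall curve, as mentioned under class imbalance, amounts to replacing the default decision rule with a custom threshold on the predicted probability of the class of interest. The sketch below is one possible way to pick such a threshold on a labelled validation set; the helper `threshold_for_precision` and the toy scores are hypothetical, and in practice the scores would come from the negative-class column of `model.predict_proba(X)`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_precision(y_true, scores, target_precision):
    """Return the lowest threshold whose precision meets the target, or None.

    y_true uses 1 for the class of interest (here: negative reviews) and
    scores are that class's predicted probabilities. Hypothetical helper.
    """
    precision, _, thresholds = precision_recall_curve(y_true, scores)
    # precision has one more entry than thresholds; the last one has no threshold.
    meets_target = np.flatnonzero(precision[:-1] >= target_precision)
    return float(thresholds[meets_target[0]]) if meets_target.size else None


# Toy validation data: 1 = negative review, scores = P(negative).
y_val = np.array([1, 1, 1, 0, 1, 0, 0, 0])
p_negative = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10])

threshold = threshold_for_precision(y_val, p_negative, target_precision=0.9)
flag_as_negative = p_negative >= threshold
print(threshold, flag_as_negative.sum())
```

Choosing the lowest qualifying threshold keeps recall as high as possible for the requested precision; the threshold should be selected on data that was not used for fitting.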