metadata
title: Matcha Sentiment
emoji: 🍡
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.0.1
app_file: app.py
python_version: '3.12'
pinned: false
license: mit

Matcha Sentiment

Indonesian-language sentiment analysis for Matchaya/IKUYO reviews. The dataset is cleaned into a binary Negative/Positive classification task, then classical machine learning baselines are compared against fine-tuning of 9 Indonesian Transformer models.

Dashboard

Summary

| Area | Result |
| --- | --- |
| Final dataset | 2028 reviews |
| Labels | 1014 Negative, 1014 Positive |
| Labels removed | 14 Neutral |
| Duplicates dropped | 219 texts |
| Best classical | TF-IDF + Linear SVM |
| Best Transformer | indolem/indobert-base-uncased |
| Runtime | Docker + NVIDIA GPU |
| Dashboard | Gradio, ready for Hugging Face Spaces |
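The cleaning step summarized above (dropping Neutral labels and duplicate texts) could look roughly like the sketch below. This is not the project's actual pipeline: the `text` and `label` keys, the `Netral` label spelling, the lowercase/whitespace normalization used to detect duplicates, and the order of the two steps are all assumptions made for illustration.

```python
from collections import Counter

NEUTRAL_LABEL = "Netral"  # assumed spelling of the neutral label in the raw data


def clean_reviews(rows, text_key="text", label_key="label"):
    """Drop neutral rows, then keep only the first occurrence of each text."""
    seen = set()
    cleaned = []
    for row in rows:
        if row[label_key] == NEUTRAL_LABEL:
            continue
        # Assumed duplicate rule: same text after lowercasing and collapsing whitespace.
        normalized = " ".join(row[text_key].lower().split())
        if normalized in seen:
            continue
        seen.add(normalized)
        cleaned.append(row)
    return cleaned


def label_counts(rows, label_key="label"):
    """Count how many rows carry each label, to check class balance."""
    return Counter(row[label_key] for row in rows)


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    {"text": "Matcha-nya enak", "label": "Positif"},
    {"text": "matcha-nya  enak", "label": "Positif"},  # duplicate after normalization
    {"text": "Biasa saja", "label": "Netral"},         # neutral, dropped
    {"text": "Antrean lama", "label": "Negatif"},
]
cleaned = clean_reviews(sample)
print(len(cleaned), dict(label_counts(cleaned)))
```

Running `label_counts` on the cleaned data is a quick way to confirm the two classes are the same size before training.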

Push note: the best Transformer model is stored in models/best_transformer and tracked with Git LFS. Candidate weights at models/transformers/*/model/model.safetensors are ignored because they can be regenerated from the training pipeline.

Main Results

Transformer

| Model | Accuracy | Precision | Recall | F1 | ROC AUC |
| --- | --- | --- | --- | --- | --- |
| indolem/indobert-base-uncased | 0.9951 | 0.9902 | 1.0000 | 0.9951 | 0.9998 |
| naufalihsan/indonesian-sbert-large | 0.9901 | 0.9806 | 1.0000 | 0.9902 | 0.9998 |
| flax-community/indonesian-roberta-base | 0.9901 | 0.9806 | 1.0000 | 0.9902 | 0.9997 |
| sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 0.9901 | 0.9806 | 1.0000 | 0.9902 | 0.9996 |
| indobenchmark/indobert-base-p1 | 0.9901 | 0.9806 | 1.0000 | 0.9902 | 0.9989 |
| ChristopherA08/IndoELECTRA | 0.9901 | 0.9806 | 1.0000 | 0.9902 | 0.9989 |
| cahya/distilbert-base-indonesian | 0.9901 | 0.9901 | 0.9901 | 0.9901 | 0.9998 |
| indolem/indobertweet-base-uncased | 0.9852 | 0.9804 | 0.9901 | 0.9852 | 0.9994 |
| w11wo/indonesian-roberta-base-sentiment-classifier | 0.9803 | 0.9619 | 1.0000 | 0.9806 | 1.0000 |
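As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below does this for three rows copied from the table; the helper name `f1_from` is only illustrative, and small differences are expected because the published inputs are already rounded to four decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the Transformer table.
reported = {
    "indolem/indobert-base-uncased": (0.9902, 1.0000, 0.9951),
    "cahya/distilbert-base-indonesian": (0.9901, 0.9901, 0.9901),
    "w11wo/indonesian-roberta-base-sentiment-classifier": (0.9619, 1.0000, 0.9806),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    print(f"{name}: recomputed={recomputed:.4f} reported={f1:.4f}")
```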

The best model is saved at:

models/best_transformer
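One possible way to load the saved model for inference is the Hugging Face `transformers` text-classification pipeline. The sketch below assumes that models/best_transformer holds both model and tokenizer files in `from_pretrained` format and that the label names come from the saved config; neither is confirmed here. `load_classifier` and `predict_sentiment` are hypothetical helper names, not functions from this repository.

```python
def load_classifier(model_dir="models/best_transformer"):
    """Build a text-classification pipeline from a local model directory."""
    # Imported lazily so predict_sentiment can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


def predict_sentiment(texts, classifier):
    """Return (text, label, score) tuples for a batch of review texts.

    `classifier` is any callable that maps a list of strings to a list of
    {"label": ..., "score": ...} dicts, which is the shape the pipeline returns.
    """
    texts = list(texts)
    results = classifier(texts)
    return [(text, r["label"], float(r["score"])) for text, r in zip(texts, results)]
```

Typical use would be `predict_sentiment(["Matcha-nya enak"], load_classifier())`; passing the classifier in as an argument keeps the model loaded once and makes the helper easy to exercise with a stub.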

Classical Machine Learning

TF-IDF and Word2Vec features were evaluated with stratified 10-fold cross-validation. Full results:

| Feature | Model | Accuracy | Precision | Recall | F1 | ROC AUC |
| --- | --- | --- | --- | --- | --- | --- |
| TF-IDF | Linear SVM | 0.9684 | 0.9788 | 0.9576 | 0.9681 | 0.9951 |
| Word2Vec | Logistic Regression | 0.9635 | 0.9663 | 0.9606 | 0.9634 | 0.9939 |
| Word2Vec | Extra Trees | 0.9625 | 0.9653 | 0.9596 | 0.9624 | 0.9940 |
| TF-IDF | Logistic Regression | 0.9610 | 0.9756 | 0.9458 | 0.9604 | 0.9933 |
| Word2Vec | Linear SVM | 0.9596 | 0.9660 | 0.9527 | 0.9593 | 0.9924 |
| Word2Vec | Random Forest | 0.9591 | 0.9632 | 0.9546 | 0.9589 | 0.9927 |
| Word2Vec | Gradient Boosting | 0.9571 | 0.9603 | 0.9536 | 0.9570 | 0.9933 |
| TF-IDF | Extra Trees | 0.9522 | 0.9580 | 0.9458 | 0.9519 | 0.9918 |
| TF-IDF | Random Forest | 0.9443 | 0.9353 | 0.9546 | 0.9449 | 0.9882 |
| TF-IDF | Gradient Boosting | 0.9147 | 0.9304 | 0.8964 | 0.9131 | 0.9735 |
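For readers who want to reproduce the evaluation protocol, here is a minimal scikit-learn sketch of stratified k-fold cross-validation for the TF-IDF + Linear SVM combination. It is not the repository's training script: the default vectorizer and SVM settings, the shuffling seed, and the 0/1 label encoding (1 = Positive) are assumptions, and `evaluate_tfidf_svm` is a hypothetical name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

METRICS = ["accuracy", "precision", "recall", "f1", "roc_auc"]


def evaluate_tfidf_svm(texts, labels, n_splits=10, seed=42):
    """Return the mean of each metric over stratified k-fold cross-validation.

    `labels` are expected to be 0/1 integers (assumed: 1 = Positive).
    """
    # The vectorizer sits inside the pipeline so it is refit on each training fold,
    # which keeps vocabulary and IDF weights from leaking out of the test fold.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_validate(model, texts, labels, cv=folds, scoring=METRICS)
    return {name: float(scores[f"test_{name}"].mean()) for name in METRICS}
```

With the real dataset this would be called as `evaluate_tfidf_svm(texts, labels)`, leaving `n_splits` at 10 to match the protocol described above.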

Visual Evaluation

Results Gallery

Dashboard

Screenshots: Prediction (Dashboard Home), Visual (Dashboard Visual), Keywords (Dashboard Keywords).

Detail Plot

Plots: Training Loss, Confusion Matrix, ROC AUC, Top Words, Positive Word Cloud, Negative Word Cloud.

Meaningful Keywords

Some of the words that help most in reading the direction of sentiment:

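| Word | Positive Docs | Negative Docs | Dominant |
| --- | --- | --- | --- |
| enak (tasty) | 173 | 10 | Positive |
| nyaman (comfortable) | 54 | 10 | Positive |
| ramah (friendly) | 37 | 9 | Positive |
| terbaik (best) | 18 | 0 | Positive |
| mahal (expensive) | 10 | 24 | Negative |
| harga (price) | 10 | 28 | Negative |
| buruk (bad) | 0 | 19 | Negative |
| antrean (queue) | 0 | 19 | Negative |
| lama (long wait) | 1 | 16 | Negative |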

The full file is at:

artifacts/classical/keyword_counts.csv
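Counts like these can be produced by tallying, for each word, how many reviews of each label contain it at least once. The sketch below illustrates that idea only: whitespace tokenization, the `text` and `label` keys, and the helper names are assumptions, and the script that actually generated keyword_counts.csv may differ.

```python
from collections import Counter, defaultdict


def keyword_doc_counts(rows, text_key="text", label_key="label"):
    """Map each word to a Counter of how many documents per label contain it."""
    counts = defaultdict(Counter)
    for row in rows:
        # A set ensures a word repeated inside one review is counted once.
        for word in set(row[text_key].lower().split()):
            counts[word][row[label_key]] += 1
    return counts


def dominant_label(word_counts):
    """Return the label whose documents contain the word most often."""
    return word_counts.most_common(1)[0][0]


# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {"text": "enak dan nyaman", "label": "Positif"},
    {"text": "enak enak", "label": "Positif"},
    {"text": "mahal dan lama", "label": "Negatif"},
]
counts = keyword_doc_counts(rows)
print(dict(counts["enak"]), dominant_label(counts["mahal"]))
```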

Project Structure

.
├── app.py
├── Dockerfile
├── docker-compose.yml
├── INSTALL_DOCKER.md
├── data/processed/matcha_sentiment_binary.csv
├── docs/images/
├── artifacts/
├── models/best_transformer/
├── models/classical/best_model.joblib
├── scripts/
└── src/matcha_sentiment/

Quick Start

docker build -t matcha-sentiment .
docker run --rm --gpus all -p 7860:7860 -v "${PWD}:/workspace" matcha-sentiment

Open:

http://localhost:7860

A guide that goes from scratch through deployment is in INSTALL_DOCKER.md.

Notes

The evaluation scores are very high because the dataset is still small and its domain is narrow. The model is good enough for demos, dashboards, and sentiment analysis experiments on matcha reviews, but for production use across brands or categories it should be trained with additional, more diverse data.
