---
datasets:
- kornwtp/indonlu-smsa
language:
- id
metrics:
- accuracy
base_model:
- indobenchmark/indobert-base-p1
pipeline_tag: text-classification
---

# Indonesian Text Sentiment Analysis

## Overview

This project fine-tunes **IndoBERT** (`indobenchmark/indobert-base-p1`), a transformer-based model, to classify the sentiment of Indonesian text as positive, negative, or neutral.

## Data Collection

The fine-tuning data comes from the **IndoNLU** benchmark, specifically the [SmSA (IndoNLU) Dataset](https://metatext.io/datasets/smsa-(indonlu)), available on the Hugging Face Hub as `kornwtp/indonlu-smsa`.

## Data Preparation

- **Tokenization**:
  - Text was tokenized with the **IndoBERT** tokenizer.
- **Train-Test Split**:
  - The dataset already ships with train, validation, and test splits, so no manual splitting was needed.

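Tokenization for this model pads or truncates every input to a fixed length (`max_length = 128` in the training parameters and the usage example). The sketch below illustrates that idea in plain Python. It is a simplified illustration, not the tokenizer's actual implementation: the token IDs, the `PAD_ID` value, and the `pad_or_truncate` helper are all hypothetical, and a real BERT-style tokenizer also preserves special tokens such as the final separator when it truncates.

```python
from typing import List, Tuple

MAX_LENGTH = 128  # same value as max_length in the training parameters
PAD_ID = 0        # assumed padding ID, for illustration only


def pad_or_truncate(
    token_ids: List[int],
    max_length: int = MAX_LENGTH,
    pad_id: int = PAD_ID,
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]      # truncation: drop anything past max_length
    n_pad = max_length - len(kept)     # padding: fill the remainder with pad_id
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token IDs: one short sequence and one that exceeds the limit.
short_ids, short_mask = pad_or_truncate([2, 101, 102, 103, 3])
long_ids, long_mask = pad_or_truncate(list(range(1, 201)))

print(len(short_ids), sum(short_mask))  # fixed length, number of real tokens
print(len(long_ids), sum(long_mask))
```

The attention mask is what lets the model ignore padded positions, which is why the tokenizer returns it alongside the input IDs.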
## Fine-Tuning & Results

The model was fine-tuned with the **TensorFlow** implementation of **Hugging Face Transformers**.

### **Evaluation Metrics**

| **Epoch** | **Train Loss** | **Train Accuracy** | **Eval Loss** | **Eval Accuracy** | **Training Time** | **Validation Time** |
|-----------|----------------|---------------------|---------------|-------------------|-------------------|---------------------|
| **1** | `0.2471` | `88.15%` | `0.2107` | `91.31%` | `7:55 min` | `10 sec` |
| **2** | `0.1844` | `90.41%` | `0.2107` | `92.39%` | `7:50 min` | `10 sec` |
| **3** | `0.1502` | `91.66%` | `0.2135` | `93.14%` | `7:51 min` | `9 sec` |
| **4** | `0.1285` | `92.50%` | `0.2192` | `93.69%` | `7:50 min` | `10 sec` |
| **5** | `0.1101` | `93.13%` | `0.2367` | `94.14%` | `7:48 min` | `9 sec` |

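The table above can be read two ways: evaluation accuracy keeps improving through epoch 5, while evaluation loss is lowest in the first two epochs and rises afterwards. The short sketch below makes that comparison explicit. The numbers are copied from the table; the variable names and the choice of "best epoch" criteria are illustrative assumptions, not part of the original training script.

```python
# Per-epoch results copied from the evaluation table above.
history = [
    {"epoch": 1, "train_loss": 0.2471, "eval_loss": 0.2107, "eval_acc": 91.31},
    {"epoch": 2, "train_loss": 0.1844, "eval_loss": 0.2107, "eval_acc": 92.39},
    {"epoch": 3, "train_loss": 0.1502, "eval_loss": 0.2135, "eval_acc": 93.14},
    {"epoch": 4, "train_loss": 0.1285, "eval_loss": 0.2192, "eval_acc": 93.69},
    {"epoch": 5, "train_loss": 0.1101, "eval_loss": 0.2367, "eval_acc": 94.14},
]

# Two possible checkpoint-selection criteria (min/max return the earliest epoch on ties).
best_by_loss = min(history, key=lambda row: row["eval_loss"])
best_by_acc = max(history, key=lambda row: row["eval_acc"])

# Gap between evaluation and training loss; a widening gap is one common
# sign that the model is starting to fit the training set more closely.
gaps = [round(row["eval_loss"] - row["train_loss"], 4) for row in history]

print(f"Lowest eval loss:      epoch {best_by_loss['epoch']}")
print(f"Highest eval accuracy: epoch {best_by_acc['epoch']}")
print(f"Eval - train loss gap per epoch: {gaps}")
```

Which criterion to prefer depends on the use case: accuracy matters most if only the predicted label is used, while the loss is more relevant if the confidence scores are also consumed downstream.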
## Training Parameters

- `epochs = 5`
- `learning_rate = 5e-5`
- `seed_val = 42`
- `max_length = 128`
- `batch_size = 32`
- `eval_batch_size = 32`

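One way to keep these values together is a small configuration object. The sketch below is an assumption about how they could be organized, not the original training script: the `TrainingConfig` class and its helper methods are hypothetical, and the dataset size passed in at the end is an arbitrary example rather than the actual size of the SmSA training split.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters listed in the Training Parameters section."""
    epochs: int = 5
    learning_rate: float = 5e-5
    seed_val: int = 42
    max_length: int = 128
    batch_size: int = 32
    eval_batch_size: int = 32

    def steps_per_epoch(self, num_examples: int) -> int:
        # The last batch may be partial, hence the ceiling.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        return self.epochs * self.steps_per_epoch(num_examples)


config = TrainingConfig()

# Hypothetical dataset size, used only to show the arithmetic.
example_size = 10_000
print(config.steps_per_epoch(example_size))  # 313
print(config.total_steps(example_size))      # 1565
```

The total step count is useful when configuring a learning-rate schedule, since most schedulers are defined in optimizer steps rather than epochs.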
## How to use

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model_name = "feverlash/Indonesian-SentimentAnalysis-Model"  # Replace with the path to your saved model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)


# Run sentiment prediction on a single text
def predict(text):
    # Map class indices to sentiment labels
    sentiment_mapping = {
        0: "negative",
        1: "positive",
        2: "neutral",
    }

    # Tokenize the text
    inputs = tokenizer(
        text,
        return_tensors="tf",
        truncation=True,
        padding="max_length",
        max_length=128,
    )

    # Run the model
    outputs = model(inputs)
    logits = outputs.logits

    # Convert logits to probabilities
    probabilities = tf.nn.softmax(logits, axis=-1).numpy()

    # Pick the most likely label
    predicted_index = int(tf.argmax(probabilities, axis=1).numpy()[0])
    predicted_label = sentiment_mapping.get(predicted_index, "unknown")

    # Confidence of the prediction
    confidence = probabilities[0][predicted_index]

    print(f"Text: {text}")
    print(f"Predicted label: {predicted_label} (Confidence: {confidence:.2f})")


# Example usage ("I am taking a walk around Yogyakarta")
text = "aku sedang jalan-jalan di Yogyakarta"
predict(text)
```