🔎 IndoBERT Hoax Detection API

This Space provides a FastAPI API that classifies Indonesian-language news text as hoax or not hoax, using a fine-tuned IndoBERT model.

The model was trained on a combination of several sources:

- Summarized_CNN.csv: valid (non-hoax) news from CNN Indonesia
- Summarized_Detik.csv: valid news from Detik
- Summarized_Kompas.csv: valid news from Kompas
- Summarized_TurnBackHoax.csv: hoax/false-claim news curated by TurnBackHoax
- merged_clean_filtered_2020plus_halfNaT.csv: additional non-hoax news (2020–2025) that has been cleaned and normalized

All news from before 2020 was dropped to reduce temporal bias (the model learning "old news = hoax"), and roughly 50% of the rows with unparseable dates (NaT) were filtered out as well.
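The date filtering described above might look like the following minimal sketch. The row layout (a dict with a `date` key), the function name `filter_by_date`, and the uniform random sampling of undated rows are assumptions made for illustration; the card does not state how the NaT rows were selected.

```python
import random
from datetime import date


def filter_by_date(rows, cutoff_year=2020, nat_keep_fraction=0.5, seed=42):
    """Drop rows dated before cutoff_year and keep a fraction of undated rows.

    Each row is a dict whose "date" value is a datetime.date, or None when
    the date could not be parsed (the NaT case).
    """
    rng = random.Random(seed)
    dated = [r for r in rows if r["date"] is not None and r["date"].year >= cutoff_year]
    undated = [r for r in rows if r["date"] is None]
    # Assumption: surviving undated rows are a uniform random sample.
    n_keep = int(len(undated) * nat_keep_fraction)
    return dated + rng.sample(undated, n_keep)


rows = [
    {"text": "old 1", "date": date(2018, 5, 1)},
    {"text": "old 2", "date": date(2019, 12, 31)},
    {"text": "new 1", "date": date(2020, 1, 1)},
    {"text": "new 2", "date": date(2022, 7, 15)},
    {"text": "new 3", "date": date(2025, 3, 9)},
    {"text": "undated 1", "date": None},
    {"text": "undated 2", "date": None},
    {"text": "undated 3", "date": None},
    {"text": "undated 4", "date": None},
]
kept = filter_by_date(rows)
print(len(kept))
```

Fixing the seed keeps the sample of undated rows reproducible between runs.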

📊 Model Architecture

Backbone: indolem/indobert-base-uncased
Task: binary classification
- 0 → not_hoax
- 1 → hoax
Maximum input length: 256 tokens
Training:
- Optimizer: AdamW
- Learning rate: 2e-5
- Weight decay: 0.01
- Epochs: 3
- Batch size:
  - Train: 64 (gradient_accumulation_steps=2, effective batch ≈ 128)
  - Eval: 256
- The train set is oversampled so that the classes are balanced:
  - before balancing: 0 ≈ 114,987, 1 ≈ 8,367
  - after balancing: 0 = 114,987, 1 = 114,987
- The validation and test sets are not oversampled, so their distribution stays imbalanced (≈93% non-hoax, 7% hoax)
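As an illustration of the balancing step, here is a minimal sketch of random oversampling with replacement, applied to the train split only. The function name `oversample_to_balance` and the `(text, label)` tuple layout are hypothetical; the card does not say which oversampling routine was actually used.

```python
import random
from collections import Counter


def oversample_to_balance(examples, seed=42):
    """Duplicate minority-class examples at random until all classes match the largest.

    `examples` is a list of (text, label) pairs.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority count.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


train = [(f"valid news {i}", 0) for i in range(12)] + [("hoax news", 1)]
balanced = oversample_to_balance(train)
counts = Counter(label for _, label in balanced)
print(counts[0], counts[1])
```

Because validation and test are left untouched, the metrics reported for them still reflect the original imbalanced distribution.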

Total data after merging and deduplication:

177,288 rows
Labels:
- not_hoax: 164,268
- hoax: 11,953

Split:

- Train: 123,354
- Validation: 26,433
- Test: 26,434
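The split sizes above add up to 176,221 labeled rows and work out to roughly 70% / 15% / 15%, with a similar hoax share in each split. A stratified split along those lines could be sketched as follows. This is an assumption inferred from the counts, not the author's confirmed procedure, and `stratified_split` is a hypothetical helper.

```python
import random


def stratified_split(examples, fractions=(0.70, 0.15, 0.15), seed=42):
    """Split (text, label) pairs into train/validation/test, preserving label ratios."""
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        # The remainder goes to test, so no example is lost to rounding.
        test.extend(group[n_train + n_val:])
    return train, val, test


data = [(f"valid {i}", 0) for i in range(180)] + [(f"hoax {i}", 1) for i in range(20)]
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))
```

Splitting per label before concatenating is what keeps the minority class represented in validation and test.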

📊 Evaluation Results

All metrics below are computed with hoax as the positive class.

Validation Set (26,433 samples)

Accuracy: 0.9983
F1 (hoax): 0.9877
Precision (hoax): 0.9921
Recall (hoax): 0.9833
Weighted F1: 0.9983

Confusion Matrix: Validation

| | Pred: not_hoax | Pred: hoax |
|---|---|---|
| True not_hoax | 24,626 | 14 |
| True hoax | 30 | 1,763 |

In other words, on the validation set:

- Of 24,640 non-hoax articles, only 14 were wrongly flagged as hoax.
- Of 1,793 hoax articles, only 30 slipped through as non-hoax.
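The validation metrics can be rederived from the four confusion-matrix cells. The sketch below uses the standard definitions with hoax as the positive class; `binary_metrics` is a hypothetical helper name, not part of the API.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Validation confusion matrix from this card (positive class = hoax).
val_metrics = binary_metrics(tn=24626, fp=14, fn=30, tp=1763)
for name, value in val_metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, the results agree with the accuracy, precision, recall and F1 listed for the validation set.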


Test Set (26,434 samples)

Accuracy: 0.9983
F1 (hoax): 0.9874
Precision (hoax): 0.9938
Recall (hoax): 0.9810
Weighted F1: 0.9983

Confusion Matrix: Test

| | Pred: not_hoax | Pred: hoax |
|---|---|---|
| True not_hoax | 24,630 | 11 |
| True hoax | 34 | 1,759 |

Interpretation:

- Of 24,641 non-hoax articles in the test set, only 11 were wrongly flagged as "hoax" (false positive rate ≈ 0.045%).
- Of 1,793 hoax articles, only 34 went undetected (false negative rate ≈ 1.9%).
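The two error rates quoted for the test set follow directly from the confusion-matrix counts. A short sketch of the arithmetic, where `error_rates` is a hypothetical helper name:

```python
def error_rates(tn, fp, fn, tp):
    """Return (false positive rate, false negative rate) for the positive class."""
    fpr = fp / (fp + tn)  # share of true non-hoax flagged as hoax
    fnr = fn / (fn + tp)  # share of true hoax that went undetected
    return fpr, fnr


# Test confusion matrix from this card (positive class = hoax).
fpr, fnr = error_rates(tn=24630, fp=11, fn=34, tp=1759)
print(f"FPR: {fpr:.3%}")  # FPR: 0.045%
print(f"FNR: {fnr:.1%}")  # FNR: 1.9%
```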

The consistency between the validation and test results suggests that the model is not severely overfitting and that it generalizes well within the same data distribution.

🌐 API Endpoints

Base URL of this Space (example):

```text
https://-.hf.space
```

GET /health

Checks whether the API is up. Response:

```json
{ "status": "ok" }
```

GET /

Brief information about the model in use. Example response:

```json
{
  "message": "Indo Hoax Detector API is running.",
  "model_id": "fjrmhri/hoaks-detection",
  "subfolder": null,
  "labels": { "0": "not_hoax", "1": "hoax" }
}
```

POST /predict

Prediction for a single text. Request body:

```json
{ "text": "News text to check..." }
```

Response:

```json
{
  "label": "hoax",
  "score": 0.9873,
  "probabilities": { "not_hoax": 0.0127, "hoax": 0.9873 }
}
```
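For callers working in Python, the /predict request could be sent with the standard library alone. This is a minimal sketch: the helper names `build_predict_request` and `predict` are hypothetical, and `https://example.invalid` is a stand-in host, since the real base URL of the Space is not spelled out in this card.

```python
import json
import urllib.request


def build_predict_request(base_url, text):
    """Build a POST request for the /predict endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url, text, timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(base_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; the host is a placeholder.
req = build_predict_request("https://example.invalid/", "News text to check...")
print(req.get_method(), req.full_url)

# With the real base URL of the Space, a call would look like:
# result = predict(BASE_URL, "News text to check...")
# print(result["label"], result["score"])
```

Separating request construction from sending makes the payload easy to inspect or unit-test without a running Space.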
POST /predict-batch

Prediction for several texts at once. Request body:

```json
{ "texts": ["First news item ...", "Second news item ..."] }
```

Response:

```json
{
  "results": [
    { "label": "not_hoax", "score": 0.9981, "probabilities": { "not_hoax": 0.9981, "hoax": 0.0019 } },
    { "label": "hoax", "score": 0.9734, "probabilities": { "not_hoax": 0.0266, "hoax": 0.9734 } }
  ]
}
```

💻 Usage Examples (curl & JavaScript)

curl

```bash
curl -X POST "https://-.hf.space/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "News text to check..."}'
```

JavaScript (e.g. Next.js / Vercel)

```js
const res = await fetch(
  `${process.env.NEXT_PUBLIC_API_BASE_URL}/predict`,
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: newsText }),
  }
);

const data = await res.json();

console.log(data.label); // "hoax" or "not_hoax"
console.log(data.score); // confidence
console.log(data.probabilities); // all labels + probabilities
```

⚠️ Notes & Model Limitations

- All hoax examples come from the TurnBackHoax dataset.
- All non-hoax examples come from mainstream news portals and several additional cleaned datasets.
- The data spans roughly 2020–2025 (news from before 2020 was dropped).
- The model is best suited for:
  - medium to long news articles,
  - debunks from TurnBackHoax,
  - news from mainstream Indonesian portals.
- For other domains (e.g. short tweets, WhatsApp chats, memes), performance may differ and should be tested first on real samples.

🔧 Model Configuration in the Space

This Space uses Docker (sdk: docker) with the following environment variables:

- MODEL_ID: ID of the model on the Hugging Face Hub (default: fjrmhri/hoaks-detection)
- MODEL_SUBFOLDER: empty if the model sits at the repo root; set it if the model is stored in a subfolder
- MAX_LENGTH: maximum token length (default: 256)

You can change these values under: Space → Settings → Repository → Variables and secrets
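On the server side, these three variables could be read at startup roughly as follows. This is a sketch under assumptions: `ModelConfig` and `load_config` are hypothetical names, and the actual application code in the Space may read its environment differently. The defaults mirror the ones listed above.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    model_id: str
    subfolder: Optional[str]
    max_length: int


def load_config(env=None):
    """Read model settings from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    # An empty MODEL_SUBFOLDER means the model lives at the repo root.
    subfolder = env.get("MODEL_SUBFOLDER", "").strip()
    return ModelConfig(
        model_id=env.get("MODEL_ID", "fjrmhri/hoaks-detection"),
        subfolder=subfolder or None,
        max_length=int(env.get("MAX_LENGTH", "256")),
    )


default_config = load_config({})
custom_config = load_config({"MODEL_SUBFOLDER": "checkpoints", "MAX_LENGTH": "128"})
print(default_config)
print(custom_config)
```

Mapping an empty subfolder to `None` matches the `"subfolder": null` value shown in the `GET /` example response.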
