IndoELECTRA NER - Singgalang Dataset
A Named Entity Recognition (NER) model for Indonesian, based on IndoELECTRA and fine-tuned on the SINGGALANG dataset.
Model Description
The model detects three types of entities in Indonesian text:
- Person: names of people
- Place: names of places or locations
- Organisation: names of organisations or companies
Labels
The model uses the BIO (Begin-Inside-Outside) tagging format, where B- marks the first token of an entity and I- marks each token that continues it:
- O: not an entity
- B-Person, I-Person: Person entity
- B-Place, I-Place: Place entity
- B-Organisation, I-Organisation: Organisation entity
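To make the BIO scheme concrete, here is a minimal sketch that merges per-token labels into whole entities. The helper name `group_entities` is hypothetical and not part of this model's code. Treating a stray `I-` tag (one with no matching `B-` before it) as the start of a new entity is an assumption made here for leniency.

```python
def group_entities(tagged_tokens):
    """Merge (token, BIO label) pairs into (entity_text, entity_type) tuples.

    Hypothetical helper for illustration; not part of the published model.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, label in tagged_tokens:
        # "B-Person" -> ("B", "-", "Person"); "O" -> ("O", "", "")
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if prefix in ("B", "I"):
            # Assumption: a stray I- tag is treated like B- (lenient decoding)
            current_tokens, current_type = [token], entity_type
        else:
            current_tokens, current_type = [], None
    # Close an entity that runs to the end of the sentence
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tagged = [
    ("Joko", "B-Person"), ("Widodo", "I-Person"), ("bertemu", "O"),
    ("dengan", "O"), ("Prabowo", "B-Person"), ("di", "O"), ("Jakarta", "B-Place"),
]
print(group_entities(tagged))
```

The function takes a list of `(token, label)` pairs, so it can be applied to any per-word prediction in that shape.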
Training Details
- Base Model: ChristopherA08/IndoELECTRA
- Dataset: SINGGALANG (oversampled)
- Training Strategy: Parameter-efficient fine-tuning
  - Classifier head + last 2 encoder layers (unfrozen)
  - Remaining layers frozen
- Class Weighting: Applied to handle class imbalance
- Max Sequence Length: 128 tokens
- Batch Size: 16 (with gradient accumulation steps=4)
- Learning Rate: 3e-5
- Epochs: 12 (with early stopping patience=3)
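The freezing strategy and class weighting listed above could be set up roughly as follows. This is a sketch under stated assumptions, not the team's training script: the helper names are hypothetical, the parameter-name prefixes (`classifier.` and `encoder.layer.N.`) are assumed to follow the usual Hugging Face ELECTRA naming, and the inverse-frequency formula is one common weighting choice, since the card does not say which scheme was used.

```python
import re
from collections import Counter


def is_trainable(param_name, num_layers, unfrozen_layers=2):
    """True for the classifier head and the last `unfrozen_layers` encoder layers.

    Assumes parameter names such as "classifier.weight" and
    "electra.encoder.layer.11.output.dense.weight".
    """
    if param_name.startswith("classifier."):
        return True
    match = re.search(r"encoder\.layer\.(\d+)\.", param_name)
    if match is None:
        return False  # embeddings and anything else stay frozen
    return int(match.group(1)) >= num_layers - unfrozen_layers


def freeze_for_partial_finetuning(model, num_layers, unfrozen_layers=2):
    """Set requires_grad on every parameter; return the names left trainable."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = is_trainable(name, num_layers, unfrozen_layers)
        if param.requires_grad:
            trainable.append(name)
    return trainable


def inverse_frequency_weights(label_ids, num_labels):
    """Per-class weights: total / (num_labels * count).

    Assumption: labels never seen in training get a neutral weight of 1.0.
    """
    counts = Counter(label_ids)
    total = len(label_ids)
    return [
        total / (num_labels * counts[i]) if counts[i] else 1.0
        for i in range(num_labels)
    ]
```

With a loaded model, `freeze_for_partial_finetuning(model, model.config.num_hidden_layers)` would apply the freezing, and the weight list could be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights), ignore_index=-100)` so that rare entity tags contribute more to the loss than the dominant `O` tag.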
Performance
The model performs well on the validation set, achieving high F1-scores for Person, Place, and Organisation entities.
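The card does not state how the F1-score was computed. One common convention for NER is entity-level F1 with exact span matching, sketched below. The function names are hypothetical and this is not necessarily the evaluation used for this model.

```python
def extract_spans(labels):
    """Return a set of (start, end_exclusive, entity_type) spans from BIO labels."""
    spans = set()
    start, current = None, None
    # A trailing "O" sentinel closes any entity that runs to the end
    for i, label in enumerate(list(labels) + ["O"]):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current:
            continue  # still inside the open entity
        if current is not None:
            spans.add((start, i, current))
        if prefix in ("B", "I"):
            start, current = i, entity_type
        else:
            start, current = None, None
    return spans


def entity_f1(gold_sequences, predicted_sequences):
    """Micro-averaged precision, recall and F1 over exactly matching spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        tp += len(gold_spans & pred_spans)  # same boundaries and same type
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-Person", "I-Person", "O", "B-Place"]]
pred = [["B-Person", "I-Person", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the prediction finds the Person entity but misses the Place entity, so precision is 1.0 and recall is 0.5.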
Usage
Installation
pip install transformers torch
Inference
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the model and tokenizer
model_name = "ecaaa09/IndoELECTRA-NER-Singgalang"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Predict one NER label per whitespace-separated word
def predict_ner(sentence):
    tokens = sentence.split()
    inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Take the highest-scoring label for each subword of the single input sentence
    predictions = torch.argmax(outputs.logits, dim=-1)[0].tolist()
    results = []
    word_ids = inputs.word_ids()
    prev_word = None
    for idx, word_idx in enumerate(word_ids):
        # Skip special tokens and non-initial subword pieces
        if word_idx is None or word_idx == prev_word:
            continue
        label = model.config.id2label[predictions[idx]]
        results.append((tokens[word_idx], label))
        prev_word = word_idx
    return results

# Example usage
sentence = "Joko Widodo bertemu dengan Prabowo di Jakarta"
results = predict_ner(sentence)
for token, label in results:
    print(f"{token:<20} {label}")
Output Example
Joko                 B-Person
Widodo               I-Person
bertemu              O
dengan               O
Prabowo              B-Person
di                   O
Jakarta              B-Place
Team
Major Assignment (Tugas Besar), Natural Language Processing - Institut Teknologi Sumatera
| Name | NIM (Student ID) |
|---|---|
| Rayhan Fatih Gunawan | 122140134 |
| Elsa Elisa Yohana Sianturi | 122140135 |
| Nashwa Putri Laisya | 122140180 |
| Anisa Fitriyani | 122450019 |
| Siti Nur Aarifah | 122450006 |
| Muhammad Nelwan Fakhri | 122140173 |
| Raditya Erza Farandi | 122140209 |
Citation
@misc{indoelectra-ner-singgalang,
  author = {Rayhan Fatih Gunawan and others},
  title = {IndoELECTRA NER - Singgalang Dataset},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ecaaa09/IndoELECTRA-NER-Singgalang}}
}
License
Apache 2.0