IndoELECTRA NER - Singgalang Dataset

A Named Entity Recognition (NER) model for Indonesian, built by fine-tuning IndoELECTRA on the SINGGALANG dataset.

📋 Model Description

The model detects three entity types in Indonesian text:

  • Person: names of people
  • Place: names of places and locations
  • Organisation: names of organisations and companies

🎯 Label

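The model uses the BIO (Begin-Inside-Outside) tagging format:

  • O: not part of any entity
  • B-Person, I-Person: first and subsequent tokens of a Person entity
  • B-Place, I-Place: first and subsequent tokens of a Place entity
  • B-Organisation, I-Organisation: first and subsequent tokens of an Organisation entity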
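To make the tagging scheme concrete, the sketch below groups per-token BIO labels into whole entities. The helper name `bio_to_spans` is hypothetical and not part of this model or of `transformers`. It also makes one lenient assumption: an `I-` tag with no matching open entity starts a new entity instead of raising an error.

```python
from typing import List, Tuple


def bio_to_spans(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (token, BIO label) pairs into (entity text, entity type) spans."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    ent_type = None

    # A trailing "O" sentinel closes any entity still open at the end.
    for token, label in list(tagged) + [("", "O")]:
        prefix, _, name = label.partition("-")

        # I-X directly after B-X or I-X extends the open entity.
        if prefix == "I" and name == ent_type:
            tokens.append(token)
            continue

        # Any other label closes the open entity, if there is one.
        if tokens:
            spans.append((" ".join(tokens), ent_type))

        if prefix in ("B", "I"):
            # Start a new entity (a stray I-X is treated like B-X).
            tokens, ent_type = [token], name
        else:
            tokens, ent_type = [], None

    return spans


example = [
    ("Joko", "B-Person"), ("Widodo", "I-Person"), ("bertemu", "O"),
    ("dengan", "O"), ("Prabowo", "B-Person"), ("di", "O"),
    ("Jakarta", "B-Place"),
]
print(bio_to_spans(example))
```

The input has the same shape as a list of (token, label) pairs, so per-word predictions can be passed in directly.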

🔧 Training Details

  • Base Model: ChristopherA08/IndoELECTRA
  • Dataset: SINGGALANG (oversampled)
  • Training Strategy: Parameter-efficient fine-tuning
    • Classifier head + last 2 encoder layers (unfrozen)
    • Remaining layers frozen
  • Class Weighting: Applied to handle class imbalance
  • Max Sequence Length: 128 tokens
  • Batch Size: 16 (with gradient accumulation steps=4)
  • Learning Rate: 3e-5
  • Epochs: 12 (with early stopping patience=3)

📊 Performance

The model performs well on the validation set, with high F1-scores for Person, Place, and Organisation entities; exact scores are not listed in this card.

💻 Usage

Installation

pip install transformers torch

Inference

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load the model and tokenizer
model_name = "ecaaa09/IndoELECTRA-NER-Singgalang"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

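# Predict one BIO label per whitespace-separated word
def predict_ner(sentence):
    tokens = sentence.split()
    # Truncate to the 128-token limit used in training; words past it get no label
    inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt",
                       truncation=True, max_length=128)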
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    predictions = torch.argmax(outputs.logits, dim=-1).squeeze().tolist()
    
    results = []
    word_ids = inputs.word_ids()
    prev_word = None
    
    for idx, word_idx in enumerate(word_ids):
        # Skip special tokens and keep only the first sub-token of each word
        if word_idx is None or word_idx == prev_word:
            continue
        label = model.config.id2label[predictions[idx]]
        results.append((tokens[word_idx], label))
        prev_word = word_idx
    
    return results

# Example usage
sentence = "Joko Widodo bertemu dengan Prabowo di Jakarta"
results = predict_ner(sentence)

for token, label in results:
    print(f"{token:<20} {label}")

Output Example

Joko                 B-Person
Widodo               I-Person
bertemu              O
dengan               O
Prabowo              B-Person
di                   O
Jakarta              B-Place

👥 Team

Natural Language Processing course project (Tugas Besar) - Institut Teknologi Sumatera

Name                          NIM (student ID)
Rayhan Fatih Gunawan          122140134
Elsa Elisa Yohana Sianturi    122140135
Nashwa Putri Laisya           122140180
Anisa Fitriyani               122450019
Siti Nur Aarifah              122450006
Muhammad Nelwan Fakhri        122140173
Raditya Erza Farandi          122140209

📝 Citation

@misc{indoelectra-ner-singgalang,
  author = {Gunawan, Rayhan Fatih and others},
  title = {IndoELECTRA NER - Singgalang Dataset},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ecaaa09/IndoELECTRA-NER-Singgalang}}
}

📄 License

Apache 2.0
