Upload folder using huggingface_hub

Files changed (7)

README.md +141 -1
config.json +26 -0
pytorch_model.pth +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +57 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,143 @@
 ---
-license: mit
 ---

 ---
+language: id
+tags:
+- indonesian
+- named-entity-recognition
+- ner
+- indoelectra
+datasets:
+- singgalang
+metrics:
+- f1
+- precision
+- recall
+license: apache-2.0
 ---
+# IndoELECTRA NER - Singgalang Dataset
+A Named Entity Recognition (NER) model for Indonesian, built on **IndoELECTRA** and fine-tuned on the **SINGGALANG** dataset.
+## 📋 Model Description
+This model detects three entity types in Indonesian text:
+- **Person**: Names of people
+- **Place**: Names of places or locations
+- **Organisation**: Names of organisations or companies
+## 🎯 Label
+The model uses the BIO (Begin-Inside-Outside) tagging format:
+- `O`: Not an entity
+- `B-Person`, `I-Person`: Person entity
+- `B-Place`, `I-Place`: Place entity
+- `B-Organisation`, `I-Organisation`: Organisation entity
+## 🔧 Training Details
+- **Base Model**: [ChristopherA08/IndoELECTRA](https://huggingface.co/ChristopherA08/IndoELECTRA)
+- **Dataset**: SINGGALANG (oversampled)
+- **Training Strategy**: Parameter-efficient fine-tuning
+  - Classifier head + last 2 encoder layers (unfrozen)
+  - Remaining layers frozen
+- **Class Weighting**: Applied to handle class imbalance
+- **Max Sequence Length**: 128 tokens
+- **Batch Size**: 16 (with gradient accumulation steps=4, i.e. an effective batch size of 64 if 16 is the per-step size)
+- **Learning Rate**: 3e-5
+- **Epochs**: 12 (with early stopping patience=3)
+## 📊 Performance
+The model performs well on the validation set, with high F1-scores for Person, Place, and Organisation entities. No numeric scores are reported in this README.
+## 💻 Usage
+### Installation
+```bash
+pip install transformers torch
+```
+### Inference
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+# Load the model and tokenizer
+model_name = "ecaaa09/IndoELECTRA-NER-Singgalang"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForTokenClassification.from_pretrained(model_name)
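+# Function for NER prediction
+def predict_ner(sentence):
+    tokens = sentence.split()
+    # Truncate to the 128-token training length; tokenizer_config.json sets no practical limit
+    inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=128)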
+    with torch.no_grad():
+        outputs = model(**inputs)
+    predictions = torch.argmax(outputs.logits, dim=-1).squeeze().tolist()
+    results = []
+    word_ids = inputs.word_ids()
+    prev_word = None
+    for idx, word_idx in enumerate(word_ids):
+        if word_idx is None or word_idx == prev_word:
+            continue
+        label = model.config.id2label[predictions[idx]]
+        results.append((tokens[word_idx], label))
+        prev_word = word_idx
+    return results
+# Usage example
+sentence = "Joko Widodo bertemu dengan Prabowo di Jakarta"
+results = predict_ner(sentence)
+for token, label in results:
+    print(f"{token:<20} {label}")
+```
+### Output Example
+```
+Joko                 B-Person
+Widodo               I-Person
+bertemu              O
+dengan               O
+Prabowo              B-Person
+di                   O
+Jakarta              B-Place
+```
+## 👥 Team
+Major Assignment (Tugas Besar), Natural Language Processing - Institut Teknologi Sumatera
+| Name | NIM (Student ID) |
+|------|------------------|
+| Rayhan Fatih Gunawan | 122140134 |
+| Elsa Elisa Yohana Sianturi | 122140135 |
+| Nashwa Putri Laisya | 122140180 |
+| Anisa Fitriyani | 122450019 |
+| Siti Nur Aarifah | 122450006 |
+| Muhammad Nelwan Fakhri | 122140173 |
+| Raditya Erza Farandi | 122140209 |
+## 📝 Citation
+```bibtex
+@misc{indoelectra-ner-singgalang,
+  author = {Rayhan Fatih Gunawan et al.},
+  title = {IndoELECTRA NER - Singgalang Dataset},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/ecaaa09/IndoELECTRA-NER-Singgalang}}
+}
+```
+## 📄 License
+Apache 2.0
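
Note on the README's label scheme: the usage example prints one BIO tag per word, but downstream code usually wants whole entities. The following is a minimal sketch of how those tags could be merged into spans. The function name `group_entities` is hypothetical and not part of the committed README, and treating a stray `I-` tag as a span start is an assumption, not something the model card specifies.

```python
from typing import List, Tuple


def group_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge (word, BIO-label) pairs into (entity_text, entity_type) spans."""
    entities = []
    words, etype = [], None
    for word, label in tagged:
        prefix, _, name = label.partition("-")
        if prefix == "I" and words and name == etype:
            # Continuation of the currently open span.
            words.append(word)
            continue
        # Any other label closes the open span, if there is one.
        if words:
            entities.append((" ".join(words), etype))
            words, etype = [], None
        if prefix in ("B", "I"):
            # Assumption: an I- tag with no matching open span starts a new span.
            words, etype = [word], name
    if words:
        entities.append((" ".join(words), etype))
    return entities


# The (word, label) pairs from the README's "Output Example" section.
tagged = [
    ("Joko", "B-Person"), ("Widodo", "I-Person"), ("bertemu", "O"),
    ("dengan", "O"), ("Prabowo", "B-Person"), ("di", "O"), ("Jakarta", "B-Place"),
]
entities = group_entities(tagged)
for text, kind in entities:
    print(f"{kind:<14} {text}")
```

Passing the return value of the README's `predict_ner` directly to this function should work, since it yields the same `(token, label)` pairs.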

config.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "architectures": [
+    "ElectraForTokenClassification"
+  ],
+  "model_type": "electra",
+  "num_labels": 7,
+  "id2label": {
+    "0": "O",
+    "1": "B-Organisation",
+    "2": "I-Organisation",
+    "3": "B-Person",
+    "4": "I-Person",
+    "5": "B-Place",
+    "6": "I-Place"
+  },
+  "label2id": {
+    "O": 0,
+    "B-Organisation": 1,
+    "I-Organisation": 2,
+    "B-Person": 3,
+    "I-Person": 4,
+    "B-Place": 5,
+    "I-Place": 6
+  },
+  "_name_or_path": "ChristopherA08/IndoELECTRA"
+}
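
Note on the label maps: `id2label` and `label2id` need to be exact inverses and must cover `num_labels` entries, otherwise predictions get decoded to the wrong tag. Below is a small sketch of a consistency check, with the relevant fields copied from the diff above. The helper name `check_label_maps` is hypothetical. Note also that the committed config lists no architecture fields such as hidden size or layer count, so anything loading it would presumably have to take those from elsewhere.

```python
import json

# Label-related fields copied from the config.json diff above.
CONFIG_JSON = """
{
  "num_labels": 7,
  "id2label": {
    "0": "O", "1": "B-Organisation", "2": "I-Organisation",
    "3": "B-Person", "4": "I-Person", "5": "B-Place", "6": "I-Place"
  },
  "label2id": {
    "O": 0, "B-Organisation": 1, "I-Organisation": 2,
    "B-Person": 3, "I-Person": 4, "B-Place": 5, "I-Place": 6
  }
}
"""


def check_label_maps(id2label, label2id, num_labels):
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    if sorted(id2label) != list(range(num_labels)):
        problems.append("id2label keys are not exactly 0..num_labels-1")
    for idx, label in id2label.items():
        if label2id.get(label) != idx:
            problems.append(f"label2id[{label!r}] does not map back to {idx}")
    if len(label2id) != len(id2label):
        problems.append("label2id and id2label differ in size")
    # Every I- tag should have a matching B- tag of the same entity type.
    labels = set(id2label.values())
    for label in labels:
        if label.startswith("I-") and "B-" + label[2:] not in labels:
            problems.append(f"{label} has no matching B- tag")
    return problems


config = json.loads(CONFIG_JSON)
# JSON object keys are always strings, so convert them to integers first.
id2label = {int(k): v for k, v in config["id2label"].items()}
problems = check_label_maps(id2label, config["label2id"], config["num_labels"])
print(problems or "label maps are consistent")
```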

pytorch_model.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:005c48c9d56e3718ee8f4269303d63b97dd269817dba1e7a1e87d0ad8235cba0
+size 449428284
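
Note on the weights file: the three lines above are a Git LFS pointer, not the weights themselves. The sketch below parses the pointer and derives a rough parameter count, assuming the file stores mostly float32 tensors (4 bytes each), which the diff does not confirm. The helper names are hypothetical. Also note that the file is named `pytorch_model.pth`, whereas `from_pretrained` in `transformers` by default looks for names such as `model.safetensors` or `pytorch_model.bin`, so the README's loading snippet may need a manual weight-loading step.

```python
import hashlib

# Pointer contents copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:005c48c9d56e3718ee8f4269303d63b97dd269817dba1e7a1e87d0ad8235cba0
size 449428284
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Check a downloaded file against the pointer's size and SHA-256 digest."""
    hasher = hashlib.sha256()
    total = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            total += len(chunk)
    return total == pointer["size_bytes"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / 2**20
# Assumption: float32 weights, so 4 bytes per parameter.
approx_params_millions = pointer["size_bytes"] / 4 / 1e6
print(f"{size_mib:.1f} MiB, roughly {approx_params_millions:.0f}M float32 parameters")
```

`matches_pointer` is not called here because it needs the downloaded file; it is included only to show how the `oid` and `size` fields could be used.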

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,57 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "ElectraTokenizer",
+  "unk_token": "[UNK]"
+}
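
Note on `model_max_length`: the value in this file equals the float `1e30` converted to an integer, which looks like a "no limit set" placeholder, not a real sequence limit. The README states a training length of 128 tokens, so inference code probably wants to truncate to that length. A minimal sketch of choosing the limit follows. The function name `effective_max_length` and the threshold used to detect a placeholder are assumptions made for illustration.

```python
# "Max Sequence Length" from the README's training details.
README_MAX_LEN = 128
# model_max_length as committed in tokenizer_config.json.
CONFIG_MAX_LEN = 1000000000000000019884624838656


def effective_max_length(configured, trained, placeholder_threshold=10**12):
    """Pick a usable truncation length when the configured value is a placeholder."""
    if configured >= placeholder_threshold:
        # Assumption: anything this large means "unset", so use the training length.
        return trained
    return min(configured, trained)


is_placeholder = CONFIG_MAX_LEN == int(1e30)
max_len = effective_max_length(CONFIG_MAX_LEN, README_MAX_LEN)
print(f"placeholder value: {is_placeholder}, truncate to: {max_len}")
```

The resulting value could be passed as `max_length` together with `truncation=True` when calling the tokenizer.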

vocab.txt ADDED

The diff for this file is too large to render. See raw diff