ecaaa09 committed · Commit 1d6b52d · verified · 1 Parent(s): a462d1d

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,143 @@
1
  ---
2
- license: mit
3
  ---
1
  ---
2
+ language: id
3
+ tags:
4
+ - indonesian
5
+ - named-entity-recognition
6
+ - ner
7
+ - indoelectra
8
+ datasets:
9
+ - singgalang
10
+ metrics:
11
+ - f1
12
+ - precision
13
+ - recall
14
+ license: apache-2.0
15
  ---
16
+
17
+ # IndoELECTRA NER - Singgalang Dataset
18
+
19
+ A Named Entity Recognition (NER) model for Indonesian, built on **IndoELECTRA** and fine-tuned on the **SINGGALANG** dataset.
20
+
21
+ ## 📋 Model Description
22
+
23
+ The model detects three entity types in Indonesian text:
24
+ - **Person**: names of people
25
+ - **Place**: names of places/locations
26
+ - **Organisation**: names of organisations/companies
27
+
28
+ ## 🎯 Label
29
+
30
+ The model uses the BIO (Begin-Inside-Outside) tagging format:
31
+ - `O`: not an entity
32
+ - `B-Person`, `I-Person`: Person entity
33
+ - `B-Place`, `I-Place`: Place entity
34
+ - `B-Organisation`, `I-Organisation`: Organisation entity
35
+
36
+ ## 🔧 Training Details
37
+
38
+ - **Base Model**: [ChristopherA08/IndoELECTRA](https://huggingface.co/ChristopherA08/IndoELECTRA)
39
+ - **Dataset**: SINGGALANG (oversampled)
40
+ - **Training Strategy**: Parameter-efficient fine-tuning
41
+ - Classifier head + last 2 encoder layers (unfrozen)
42
+ - Remaining layers frozen
43
+ - **Class Weighting**: Applied to handle class imbalance
44
+ - **Max Sequence Length**: 128 tokens
45
+ - **Batch Size**: 16 (with gradient accumulation steps=4)
46
+ - **Learning Rate**: 3e-5
47
+ - **Epochs**: 12 (with early stopping patience=3)
48
+
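The training strategy listed above (classifier head plus the last two encoder layers trainable, everything else frozen) can be sketched as a rule over parameter names. This is a minimal sketch under stated assumptions, not the authors' training script: the helper `is_trainable` is hypothetical, the count of 12 encoder layers is an assumption about the base model, and the `electra.encoder.layer.<i>.` and `classifier.` prefixes follow the parameter naming of `ElectraForTokenClassification` in `transformers`.

```python
import re

def is_trainable(param_name: str, num_layers: int = 12, unfrozen_last: int = 2) -> bool:
    """Decide whether a parameter stays trainable (hypothetical helper).

    Assumes ELECTRA-style parameter names and a 12-layer encoder.
    """
    # The token-classification head is always trained.
    if param_name.startswith("classifier."):
        return True
    # Encoder layers: keep only the last `unfrozen_last` layers trainable.
    match = re.match(r"electra\.encoder\.layer\.(\d+)\.", param_name)
    if match:
        return int(match.group(1)) >= num_layers - unfrozen_last
    # Embeddings and anything else stay frozen.
    return False

# Illustrative parameter names; a real model exposes them via named_parameters().
example_names = [
    "electra.embeddings.word_embeddings.weight",
    "electra.encoder.layer.9.output.dense.weight",
    "electra.encoder.layer.10.attention.self.query.weight",
    "electra.encoder.layer.11.output.dense.bias",
    "classifier.weight",
]
for name in example_names:
    print(f"{name:<55} trainable={is_trainable(name)}")
```

With a loaded model, the same rule would typically be applied as `param.requires_grad = is_trainable(name)` inside a loop over `model.named_parameters()`.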
49
+ ## 📊 Performance
50
+
51
+ The model achieves good performance on the validation set, with high F1 scores for detecting Person, Place, and Organisation entities.
52
+
53
+ ## 💻 Usage
54
+
55
+ ### Installation
56
+
57
+ ```bash
58
+ pip install transformers torch
59
+ ```
60
+
61
+ ### Inference
62
+
63
+ ```python
64
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
65
+ import torch
66
+
67
+ # Load the model and tokenizer
68
+ model_name = "ecaaa09/IndoELECTRA-NER-Singgalang"
69
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
70
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
71
+
72
+ # Predict one NER label per whitespace-separated word
73
+ def predict_ner(sentence):
74
+     tokens = sentence.split()
75
+     inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
76
+
77
+     with torch.no_grad():
78
+         outputs = model(**inputs)
79
+
80
+     predictions = torch.argmax(outputs.logits, dim=-1).squeeze().tolist()
81
+
82
+     results = []
83
+     word_ids = inputs.word_ids()
84
+     prev_word = None
85
+
86
+     for idx, word_idx in enumerate(word_ids):
87
+         if word_idx is None or word_idx == prev_word:
88
+             continue
89
+         label = model.config.id2label[predictions[idx]]
90
+         results.append((tokens[word_idx], label))
91
+         prev_word = word_idx
92
+
93
+     return results
94
+
95
+ # Example usage
96
+ sentence = "Joko Widodo bertemu dengan Prabowo di Jakarta"
97
+ results = predict_ner(sentence)
98
+
99
+ for token, label in results:
100
+     print(f"{token:<20} {label}")
101
+ ```
102
+
103
+ ### Output Example
104
+
105
+ ```
106
+ Joko                 B-Person
107
+ Widodo               I-Person
108
+ bertemu              O
109
+ dengan               O
110
+ Prabowo              B-Person
111
+ di                   O
112
+ Jakarta              B-Place
113
+ ```
114
+
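Per-word BIO tags like those in the output above are usually merged into entity spans before further use. The following is a minimal sketch of that step; `group_entities` is a hypothetical helper, not part of this model or of `transformers`, and treating a stray `I-` tag as the start of a new entity is a design assumption.

```python
def group_entities(tagged_words):
    """Merge (word, BIO label) pairs into (entity text, entity type) spans."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for word, label in tagged_words:
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_words.append(word)
        elif prefix in ("B", "I"):
            # New entity (a stray I- tag is treated like B-).
            flush()
            current_words, current_type = [word], entity_type
        else:
            # "O": close any open entity.
            flush()
            current_words, current_type = [], None
    flush()
    return entities

tagged = [
    ("Joko", "B-Person"), ("Widodo", "I-Person"), ("bertemu", "O"),
    ("dengan", "O"), ("Prabowo", "B-Person"), ("di", "O"), ("Jakarta", "B-Place"),
]
print(group_entities(tagged))
# [('Joko Widodo', 'Person'), ('Prabowo', 'Person'), ('Jakarta', 'Place')]
```

The input list mirrors the `(token, label)` pairs returned by `predict_ner`, so its result can be passed in directly.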
115
+ ## 👥 Team
116
+
117
+ Natural Language Processing final project (Tugas Besar) - Institut Teknologi Sumatera
118
+
119
+ | Name | NIM (student ID) |
120
+ |------|-----|
121
+ | Rayhan Fatih Gunawan | 122140134 |
122
+ | Elsa Elisa Yohana Sianturi | 122140135 |
123
+ | Nashwa Putri Laisya | 122140180 |
124
+ | Anisa Fitriyani | 122450019 |
125
+ | Siti Nur Aarifah | 122450006 |
126
+ | Muhammad Nelwan Fakhri | 122140173 |
127
+ | Raditya Erza Farandi | 122140209 |
128
+
129
+ ## 📝 Citation
130
+
131
+ ```bibtex
132
+ @misc{indoelectra-ner-singgalang,
133
+ author = {Rayhan Fatih Gunawan and others},
134
+ title = {IndoELECTRA NER - Singgalang Dataset},
135
+ year = {2025},
136
+ publisher = {Hugging Face},
137
+ howpublished = {\url{https://huggingface.co/ecaaa09/IndoELECTRA-NER-Singgalang}}
138
+ }
139
+ ```
140
+
141
+ ## 📄 License
142
+
143
+ Apache 2.0
config.json ADDED
@@ -0,0 +1,26 @@
1
+ {
2
+ "architectures": [
3
+ "ElectraForTokenClassification"
4
+ ],
5
+ "model_type": "electra",
6
+ "num_labels": 7,
7
+ "id2label": {
8
+ "0": "O",
9
+ "1": "B-Organisation",
10
+ "2": "I-Organisation",
11
+ "3": "B-Person",
12
+ "4": "I-Person",
13
+ "5": "B-Place",
14
+ "6": "I-Place"
15
+ },
16
+ "label2id": {
17
+ "O": 0,
18
+ "B-Organisation": 1,
19
+ "I-Organisation": 2,
20
+ "B-Person": 3,
21
+ "I-Person": 4,
22
+ "B-Place": 5,
23
+ "I-Place": 6
24
+ },
25
+ "_name_or_path": "ChristopherA08/IndoELECTRA"
26
+ }
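Since `id2label` and `label2id` are written out by hand in this config, a quick consistency check can be useful. The sketch below inlines the relevant fields from the config above and verifies that the two maps are inverses and match `num_labels`; the function name `label_maps_consistent` is hypothetical.

```python
import json

# Label fields copied from the config.json added in this commit.
config_text = """
{
  "num_labels": 7,
  "id2label": {"0": "O", "1": "B-Organisation", "2": "I-Organisation",
               "3": "B-Person", "4": "I-Person", "5": "B-Place", "6": "I-Place"},
  "label2id": {"O": 0, "B-Organisation": 1, "I-Organisation": 2,
               "B-Person": 3, "I-Person": 4, "B-Place": 5, "I-Place": 6}
}
"""

def label_maps_consistent(config: dict) -> bool:
    """Check that id2label and label2id are inverse maps of size num_labels."""
    # JSON object keys are strings, so convert the ids back to integers.
    id2label = {int(k): v for k, v in config["id2label"].items()}
    label2id = config["label2id"]
    inverse = {label: idx for idx, label in id2label.items()}
    ids_contiguous = sorted(id2label) == list(range(config["num_labels"]))
    return inverse == label2id and ids_contiguous

config = json.loads(config_text)
print(label_maps_consistent(config))
```

Seven labels is what the BIO scheme predicts for three entity types: one `B-` and one `I-` tag per type, plus `O`.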
pytorch_model.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:005c48c9d56e3718ee8f4269303d63b97dd269817dba1e7a1e87d0ad8235cba0
3
+ size 449428284
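The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a small illustration, the sketch below parses the pointer and converts the size to MiB; `parse_lfs_pointer` is a hypothetical helper written for this note.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:005c48c9d56e3718ee8f4269303d63b97dd269817dba1e7a1e87d0ad8235cba0
size 449428284
"""

pointer = parse_lfs_pointer(pointer_text)
size_mib = pointer["size_bytes"] / (1024 ** 2)
print(f"{pointer['hash_algorithm']} digest, {size_mib:.1f} MiB")
```

Note that `from_pretrained` looks for `model.safetensors` or `pytorch_model.bin` by default, so a file named `pytorch_model.pth` may need to be loaded manually (for example with `torch.load` followed by `load_state_dict`). Whether this file holds a state dict or a fully pickled model is not stated in the commit.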
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "cls_token": "[CLS]",
3
+ "mask_token": "[MASK]",
4
+ "pad_token": "[PAD]",
5
+ "sep_token": "[SEP]",
6
+ "unk_token": "[UNK]"
7
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[MASK]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[SEP]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": true,
48
+ "mask_token": "[MASK]",
49
+ "model_max_length": 1000000000000000019884624838656,
50
+ "never_split": null,
51
+ "pad_token": "[PAD]",
52
+ "sep_token": "[SEP]",
53
+ "strip_accents": null,
54
+ "tokenize_chinese_chars": true,
55
+ "tokenizer_class": "ElectraTokenizer",
56
+ "unk_token": "[UNK]"
57
+ }
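The `model_max_length` value in this file equals `int(1e30)`, which appears to be the placeholder `transformers` writes when no real limit is configured. The README states that training used a maximum sequence length of 128 tokens, so the tokenizer will not truncate to that length unless asked to. The sketch below shows one way to detect the placeholder and fall back to 128; `effective_max_length` and `TRAINING_MAX_LENGTH` are hypothetical names.

```python
# Value copied from tokenizer_config.json in this commit.
configured_max_length = 1000000000000000019884624838656

# Assumption: this is the "no limit set" placeholder, numerically int(1e30).
NO_LIMIT_SENTINEL = int(1e30)

# Taken from the README's training details (max sequence length: 128 tokens).
TRAINING_MAX_LENGTH = 128

def effective_max_length(model_max_length: int, fallback: int = TRAINING_MAX_LENGTH) -> int:
    """Return a usable max length, replacing the placeholder with a fallback."""
    if model_max_length >= NO_LIMIT_SENTINEL:
        return fallback
    return model_max_length

print(effective_max_length(configured_max_length))
```

In practice this would mean passing `truncation=True, max_length=128` in the tokenizer call so that long inputs match the length used during fine-tuning.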
vocab.txt ADDED
The diff for this file is too large to render. See raw diff