asmud committed
Commit aa52c4f · verified · 1 Parent(s): f6b220e

Upload README.md with huggingface_hub

  1. README.md +226 -0
README.md ADDED
@@ -0,0 +1,226 @@
+ ---
+ language:
+ - id
+ license: cc-by-sa-3.0
+ library_name: spacy
+ tags:
+ - spacy
+ - ner
+ - named-entity-recognition
+ - indonesian
+ - token-classification
+ pipeline_tag: token-classification
+ model_type: spacy
+ datasets:
+ - universal_dependencies
+ metrics:
+ - f1
+ - precision
+ - recall
+ - accuracy
+ widget:
+ - text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."
+   example_title: "Political News"
+ - text: "Bank Central Asia (BCA) melaporkan laba bersih sebesar Rp 25,1 triliun pada tahun 2022."
+   example_title: "Financial News"
+ - text: "Universitas Indonesia terletak di Depok, Jawa Barat."
+   example_title: "Educational Institution"
+ ---
+
+ # Indonesian Named Entity Recognition (NER) Model
+
+ ## Model Description
+
+ This is a custom Indonesian Named Entity Recognition (NER) model built with spaCy v3.8. It identifies and classifies 19 types of named entities in Indonesian text.
+
+ ## Model Details
+
+ - **Model Name**: ner_spacy_indonesian
+ - **Version**: 1.1.0
+ - **Language**: Indonesian (id)
+ - **License**: CC BY-SA 3.0
+ - **Author**: Asep Muhamad
+ - **Email**: asepmuhamad@gmail.com
+ - **Website**: https://asmud.me
+ - **spaCy Version**: >=3.8.0,<3.9.0
+
+ ## Architecture
+
+ - **Pipeline Components**: NER (Named Entity Recognition), Sentence Segmentation (disabled by default)
+ - **Architecture**: TransitionBasedParser with HashEmbedCNN token-to-vector model
+ - **Token-to-Vector**: HashEmbedCNN with 96-dimensional embeddings, 4-layer depth
+ - **Hidden Width**: 64 dimensions
+ - **Training**: Trained on Universal Dependencies v2.8 datasets
+
+ ## Entity Labels
+
+ The model recognizes 19 different entity types:
+
+ | Label | Description |
+ |-------|-------------|
+ | CRD | Cardinal numbers |
+ | DAT | Dates |
+ | EVT | Events |
+ | FAC | Facilities |
+ | GPE | Geopolitical entities (countries, cities, states) |
+ | LAN | Languages |
+ | LAW | Laws |
+ | LOC | Locations |
+ | MON | Money/monetary values |
+ | NOR | Norms |
+ | ORD | Ordinal numbers |
+ | ORG | Organizations |
+ | PER | Persons |
+ | PRC | Processes |
+ | PRD | Products |
+ | QTY | Quantities |
+ | REG | Regions |
+ | TIM | Time |
+ | WOA | Works of art |
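
The label table can be mirrored as a lookup when post-processing predictions. The following is a minimal sketch, not part of the packaged model: `ENTITY_LABELS` and `group_by_label` are hypothetical names, and the sample pairs are hand-written for illustration rather than actual model output.

```python
from collections import defaultdict

# Label descriptions copied from the entity label table in this README.
ENTITY_LABELS = {
    "CRD": "Cardinal numbers",
    "DAT": "Dates",
    "EVT": "Events",
    "FAC": "Facilities",
    "GPE": "Geopolitical entities (countries, cities, states)",
    "LAN": "Languages",
    "LAW": "Laws",
    "LOC": "Locations",
    "MON": "Money/monetary values",
    "NOR": "Norms",
    "ORD": "Ordinal numbers",
    "ORG": "Organizations",
    "PER": "Persons",
    "PRC": "Processes",
    "PRD": "Products",
    "QTY": "Quantities",
    "REG": "Regions",
    "TIM": "Time",
    "WOA": "Works of art",
}


def group_by_label(pairs):
    """Group (text, label) pairs into {label: [text, ...]}.

    Raises ValueError for a label that is not in ENTITY_LABELS.
    """
    grouped = defaultdict(list)
    for text, label in pairs:
        if label not in ENTITY_LABELS:
            raise ValueError(f"unknown label: {label}")
        grouped[label].append(text)
    return dict(grouped)


# Hand-written pairs for illustration only (NOT model output).
sample = [("Joko Widodo", "PER"), ("Jakarta", "GPE"), ("Depok", "GPE")]
grouped = group_by_label(sample)
for label, texts in grouped.items():
    print(f"{label} ({ENTITY_LABELS[label]}): {texts}")
```

With a loaded pipeline, the pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.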
+
+ ## Performance
+
+ Reported evaluation scores at the token and sentence level:
+
+ - **Token Accuracy**: 98.59%
+ - **Token Precision**: 95.31%
+ - **Token Recall**: 95.72%
+ - **Token F1-Score**: 95.52%
+ - **Sentence Precision**: 90.67%
+ - **Sentence Recall**: 81.49%
+ - **Sentence F1-Score**: 85.83%
+ - **Processing Speed**: 66,612 tokens/second
+
+ *Performance metrics are based on evaluation using Universal Dependencies v2.8 datasets with spaCy's standard evaluation framework.*
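
As a sanity check on the figures above, F1 is conventionally the harmonic mean of precision and recall. The sketch below assumes that definition applies here and recomputes both F1 values from the rounded precision and recall, so agreement is only expected to within rounding. The throughput estimate assumes the reported rate holds constant; the card does not state the hardware.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Inputs are the rounded percentages reported in the Performance list.
token_f1 = f1_score(95.31, 95.72)
sentence_f1 = f1_score(90.67, 81.49)

print(f"token F1 (recomputed): {token_f1:.2f}")
print(f"sentence F1 (recomputed): {sentence_f1:.2f}")

# Rough time to process one million tokens at the reported speed.
seconds_per_million = 1_000_000 / 66_612
print(f"seconds per 1M tokens: {seconds_per_million:.1f}")
```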
+
+ ## Installation
+
+ You can install this model directly from the wheel file:
+
+ ```bash
+ pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
+ ```
+
+ Or download and install locally:
+
+ ```bash
+ pip install id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
+ ```
+
+ ## Usage
+
+ ### Basic Usage
+
+ ```python
+ import spacy
+
+ # Load the model
+ nlp = spacy.load("id_ner_spacy_indonesian")
+
+ # Process text
+ text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."
+ doc = nlp(text)
+
+ # Extract entities
+ for ent in doc.ents:
+     print(f"{ent.text:<20} {ent.label_:<10} {ent.start_char}-{ent.end_char}")
+ ```
+
+ ### Advanced Usage
+
+ ```python
+ import spacy
+ from spacy import displacy
+
+ # Load model
+ nlp = spacy.load("id_ner_spacy_indonesian")
+
+ # Process text
+ text = """
+ Bank Central Asia (BCA) melaporkan laba bersih sebesar Rp 25,1 triliun
+ pada tahun 2022. CEO BCA, Jahja Setiaatmadja, menyatakan bahwa kinerja
+ perseroan tetap solid di tengah tantangan ekonomi global.
+ """
+
+ doc = nlp(text)
+
+ # Print detailed entity information
+ for ent in doc.ents:
+     print(f"Entity: {ent.text}")
+     print(f"Label: {ent.label_}")
+     print(f"Position: {ent.start_char}-{ent.end_char}")
+     # Prints N/A unless a custom `score` extension has been registered on Span
+     print(f"Confidence: {ent._.score if hasattr(ent._, 'score') else 'N/A'}")
+     print("-" * 50)
+
+ # Visualize entities (in Jupyter notebook)
+ displacy.render(doc, style="ent", jupyter=True)
+ ```
+
+ ### Batch Processing
+
+ ```python
+ import spacy
+
+ nlp = spacy.load("id_ner_spacy_indonesian")
+
+ # Process multiple texts
+ texts = [
+     "PT Telkom Indonesia adalah perusahaan telekomunikasi terbesar di Indonesia.",
+     "Universitas Indonesia terletak di Depok, Jawa Barat.",
+     "Presiden Susilo Bambang Yudhoyono menjabat dari tahun 2004 hingga 2014."
+ ]
+
+ # Batch processing for efficiency
+ docs = list(nlp.pipe(texts))
+
+ for i, doc in enumerate(docs):
+     print(f"Text {i+1} entities:")
+     for ent in doc.ents:
+         print(f" {ent.text} ({ent.label_})")
+     print()
+ ```
+
+ ## Model Training
+
+ This model was trained using:
+
+ - **Data Source**: Universal Dependencies v2.8 (multiple language datasets including Indonesian)
+ - **Training Framework**: spaCy v3.8 (>=3.8.0,<3.9.0)
+ - **Optimization**: Adam optimizer with gradient clipping
+ - **Batch Size**: Dynamic batching (100-1000 words)
+ - **Training Steps**: 100,000 maximum steps
+ - **Dropout**: 0.1
+ - **Evaluation Frequency**: Every 1,000 steps
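
The list above gives a batch size range of 100 to 1000 words but not how the size moves through that range. One common scheme, assumed here purely for illustration, is a compounding schedule in which the size grows geometrically from the lower bound and is capped at the upper bound. In this sketch `compounding` is a hypothetical helper written from scratch, and the growth rate of 1.001 is an assumed value that does not come from this model card.

```python
from itertools import islice


def compounding(start, stop, rate):
    """Yield sizes that grow geometrically from `start`, capped at `stop`."""
    value = float(start)
    while True:
        yield min(value, float(stop))
        value *= rate


# start=100 and stop=1000 come from the stated batch size range;
# rate=1.001 is an ASSUMED growth factor, used only for illustration.
sizes = list(islice(compounding(100, 1000, 1.001), 3000))

print(f"first batch size: {sizes[0]:.0f} words")
print(f"last batch size:  {sizes[-1]:.0f} words")
```

Starting with small batches and growing them is a design choice that trades noisier early updates for faster later steps; whether this model used exactly that schedule is not stated.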
+
+ ## Limitations
+
+ - The model was trained primarily on formal Indonesian text and may perform worse on informal or colloquial Indonesian
+ - Performance may vary on domain-specific texts that are not well represented in the training data
+ - Entity boundaries may be imprecise, especially for complex compound entities
+
+ ## Citation
+
+ If you use this model in your research or applications, please cite:
+
+ ```bibtex
+ @misc{muhamad2024indonesian_ner,
+   title={Indonesian Named Entity Recognition Model},
+   author={Muhamad, Asep},
+   year={2024},
+   version={1.1.0},
+   url={https://huggingface.co/asmud/ner-spacy-indonesian},
+   license={CC BY-SA 3.0}
+ }
+ ```
+
+ ## Contact
+
+ For questions, issues, or collaborations:
+
+ - **Author**: Asep Muhamad
+ - **Email**: asepmuhamad@gmail.com
+ - **Website**: https://asmud.me
+
+ ## Acknowledgments
+
+ This model was trained using data from Universal Dependencies v2.8, contributed by Daniel Zeman, Joakim Nivre, Mitchell Abrams, and many other contributors. Special thanks to the spaCy team for providing an excellent framework for natural language processing.