Upload README.md with huggingface_hub
README.md
ADDED
---
language:
- id
license: cc-by-sa-3.0
library_name: spacy
tags:
- spacy
- ner
- named-entity-recognition
- indonesian
- token-classification
pipeline_tag: token-classification
model_type: spacy
datasets:
- universal_dependencies
metrics:
- f1
- precision
- recall
- accuracy
widget:
- text: "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."
  example_title: "Political News"
- text: "Bank Central Asia (BCA) melaporkan laba bersih sebesar Rp 25,1 triliun pada tahun 2022."
  example_title: "Financial News"
- text: "Universitas Indonesia terletak di Depok, Jawa Barat."
  example_title: "Educational Institution"
---

# Indonesian Named Entity Recognition (NER) Model

## Model Description

This is a custom Indonesian Named Entity Recognition (NER) model built with spaCy v3.8 (`>=3.8.0,<3.9.0`). It identifies and classifies 19 types of named entities in Indonesian text.

## Model Details

- **Model Name**: ner_spacy_indonesian
- **Version**: 1.1.0
- **Language**: Indonesian (id)
- **License**: CC BY-SA 3.0
- **Author**: Asep Muhamad
- **Email**: asepmuhamad@gmail.com
- **Website**: https://asmud.me
- **spaCy Version**: >=3.8.0,<3.9.0

## Architecture

- **Pipeline Components**: NER (Named Entity Recognition), Sentence Segmentation (disabled by default)
- **Architecture**: TransitionBasedParser with HashEmbedCNN token-to-vector model
- **Token-to-Vector**: HashEmbedCNN with 96-dimensional embeddings, 4-layer depth
- **Hidden Width**: 64 dimensions
- **Training**: Trained on Universal Dependencies v2.8 datasets

## Entity Labels

The model recognizes 19 entity types:

| Label | Description |
|-------|-------------|
| CRD | Cardinal numbers |
| DAT | Dates |
| EVT | Events |
| FAC | Facilities |
| GPE | Geopolitical entities (countries, cities, states) |
| LAN | Languages |
| LAW | Laws |
| LOC | Locations |
| MON | Money/monetary values |
| NOR | Norms |
| ORD | Ordinal numbers |
| ORG | Organizations |
| PER | Persons |
| PRC | Processes |
| PRD | Products |
| QTY | Quantities |
| REG | Regions |
| TIM | Time |
| WOA | Works of art |

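When post-processing results, it can help to keep the label descriptions next to the entities. The following is a minimal sketch, not part of the model package: `LABEL_DESCRIPTIONS` and `group_by_label` are hypothetical helper names, and the sample entities are hand-written stand-ins rather than real model output.

```python
from collections import defaultdict
from types import SimpleNamespace

# Label descriptions copied from the Entity Labels table.
LABEL_DESCRIPTIONS = {
    "CRD": "Cardinal numbers",
    "DAT": "Dates",
    "EVT": "Events",
    "FAC": "Facilities",
    "GPE": "Geopolitical entities (countries, cities, states)",
    "LAN": "Languages",
    "LAW": "Laws",
    "LOC": "Locations",
    "MON": "Money/monetary values",
    "NOR": "Norms",
    "ORD": "Ordinal numbers",
    "ORG": "Organizations",
    "PER": "Persons",
    "PRC": "Processes",
    "PRD": "Products",
    "QTY": "Quantities",
    "REG": "Regions",
    "TIM": "Time",
    "WOA": "Works of art",
}


def group_by_label(ents):
    """Group entity texts by label.

    `ents` is any iterable of objects exposing `.text` and `.label_`,
    for example `doc.ents` from a processed spaCy document.
    """
    grouped = defaultdict(list)
    for ent in ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Hypothetical stand-ins for spaCy Span objects, so the sketch runs
# without the model installed. The labels here are illustrative only.
sample_ents = [
    SimpleNamespace(text="Joko Widodo", label_="PER"),
    SimpleNamespace(text="Jakarta", label_="GPE"),
    SimpleNamespace(text="Depok", label_="GPE"),
]

for label, texts in group_by_label(sample_ents).items():
    print(f"{label} ({LABEL_DESCRIPTIONS[label]}): {texts}")
```

With the real model, `group_by_label(doc.ents)` should work unchanged, because spaCy spans expose the same two attributes.
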
## Performance

The model reports the following evaluation scores:

- **Token Accuracy**: 98.59%
- **Token Precision**: 95.31%
- **Token Recall**: 95.72%
- **Token F1-Score**: 95.52%
- **Sentence Precision**: 90.67%
- **Sentence Recall**: 81.49%
- **Sentence F1-Score**: 85.83%
- **Processing Speed**: 66,612 tokens/second

*Performance metrics are based on evaluation using Universal Dependencies v2.8 datasets with spaCy's standard evaluation framework.*

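As a sanity check on the numbers above, F1 is the harmonic mean of precision and recall. The sketch below recomputes both F1 values from the reported precision and recall. `f1_score` is a hypothetical helper name, and because the inputs are already rounded to two decimals, the recomputed values may differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the Performance list (percent).
token_f1 = f1_score(95.31, 95.72)
sentence_f1 = f1_score(90.67, 81.49)

print(f"Token F1:    {token_f1:.2f} (reported 95.52)")
print(f"Sentence F1: {sentence_f1:.2f} (reported 85.83)")
```
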
## Installation

You can install this model directly from the wheel file:

```bash
pip install https://huggingface.co/asmud/ner-spacy-indonesian/resolve/main/id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
```

Or download and install locally:

```bash
pip install id_ner_spacy_indonesian-1.1.0-py3-none-any.whl
```

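To confirm that the wheel was installed into the active environment before calling `spacy.load`, a quick standard-library check is enough. This is a minimal sketch: it assumes the importable package name matches the name passed to `spacy.load` in this README (`id_ner_spacy_indonesian`), and `is_installed` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed to match the name used with spacy.load() in this README.
MODEL_PACKAGE = "id_ner_spacy_indonesian"


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return find_spec(package_name) is not None


if is_installed(MODEL_PACKAGE):
    print(f"{MODEL_PACKAGE} is installed.")
else:
    print(f"{MODEL_PACKAGE} was not found; install the wheel first.")
```
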
## Usage

### Basic Usage

```python
import spacy

# Load the model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."
doc = nlp(text)

# Extract entities: text, label, and character offsets
for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_:<10} {ent.start_char}-{ent.end_char}")
```

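For downstream use it is often convenient to turn the entities into plain dictionaries that can be serialized as JSON. The sketch below shows one way to do that. `ents_to_records` is a hypothetical helper name, and `fake_doc` is a hand-built stand-in with illustrative labels so the example runs without the model; it is not real model output.

```python
import json
from types import SimpleNamespace


def ents_to_records(doc):
    """Convert `doc.ents` into a list of JSON-serializable dictionaries."""
    return [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]


text = "Presiden Joko Widodo mengunjungi Jakarta pada tanggal 15 Agustus 2023."

# Hypothetical stand-in for a processed spaCy Doc (labels are illustrative).
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(text="Joko Widodo", label_="PER", start_char=9, end_char=20),
        SimpleNamespace(text="Jakarta", label_="GPE", start_char=33, end_char=40),
    ]
)

records = ents_to_records(fake_doc)
# ensure_ascii=False keeps any non-ASCII characters readable in the output.
print(json.dumps(records, ensure_ascii=False, indent=2))
```

With the model installed, `ents_to_records(nlp(text))` should produce the same structure from real predictions.
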
### Advanced Usage

```python
import spacy
from spacy import displacy

# Load model
nlp = spacy.load("id_ner_spacy_indonesian")

# Process text
text = """
Bank Central Asia (BCA) melaporkan laba bersih sebesar Rp 25,1 triliun
pada tahun 2022. CEO BCA, Jahja Setiaatmadja, menyatakan bahwa kinerja
perseroan tetap solid di tengah tantangan ekonomi global.
"""

doc = nlp(text)

# Print detailed entity information
for ent in doc.ents:
    print(f"Entity: {ent.text}")
    print(f"Label: {ent.label_}")
    print(f"Position: {ent.start_char}-{ent.end_char}")
    # spaCy's NER does not set a confidence score by default, so this
    # prints N/A unless a custom `score` extension has been registered.
    print(f"Confidence: {ent._.score if hasattr(ent._, 'score') else 'N/A'}")
    print("-" * 50)

# Visualize entities (in Jupyter notebook)
displacy.render(doc, style="ent", jupyter=True)
```

### Batch Processing

```python
import spacy

nlp = spacy.load("id_ner_spacy_indonesian")

# Process multiple texts
texts = [
    "PT Telkom Indonesia adalah perusahaan telekomunikasi terbesar di Indonesia.",
    "Universitas Indonesia terletak di Depok, Jawa Barat.",
    "Presiden Susilo Bambang Yudhoyono menjabat dari tahun 2004 hingga 2014.",
]

# Batch processing for efficiency
docs = list(nlp.pipe(texts))

for i, doc in enumerate(docs):
    print(f"Text {i+1} entities:")
    for ent in doc.ents:
        print(f"  {ent.text} ({ent.label_})")
    print()
```

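When processing many texts, a common follow-up is to aggregate label counts across all documents. The sketch below consumes documents one at a time, so it also works with the generator returned by `nlp.pipe` without building a list first. `count_labels` is a hypothetical helper name, and `fake_docs` contains hand-written stand-ins with illustrative labels rather than real model output.

```python
from collections import Counter
from types import SimpleNamespace


def count_labels(docs):
    """Count entity labels across an iterable of processed documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ent.label_ for ent in doc.ents)
    return counts


# Hypothetical stand-ins for processed spaCy Docs (labels are illustrative).
fake_docs = [
    SimpleNamespace(ents=[SimpleNamespace(label_="ORG"), SimpleNamespace(label_="GPE")]),
    SimpleNamespace(ents=[SimpleNamespace(label_="ORG"), SimpleNamespace(label_="GPE")]),
    SimpleNamespace(ents=[SimpleNamespace(label_="PER")]),
]

for label, count in count_labels(fake_docs).most_common():
    print(f"{label}: {count}")
```

With the model installed, `count_labels(nlp.pipe(texts))` should work the same way on real predictions.
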
## Model Training

This model was trained using:

- **Data Source**: Universal Dependencies v2.8 (multiple language datasets including Indonesian)
- **Training Framework**: spaCy v3.8
- **Optimization**: Adam optimizer with gradient clipping
- **Batch Size**: Dynamic batching (100-1000 words)
- **Training Steps**: 100,000 maximum steps
- **Dropout**: 0.1
- **Evaluation Frequency**: Every 1,000 steps

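The "dynamic batching (100-1000 words)" setting can be illustrated with a compounding schedule, in which the batch size starts at the lower bound and grows by a constant factor each step until it reaches the upper bound. The sketch below is an assumption, not the author's confirmed configuration: this README only states the 100-1000 range, while the compounding form and the growth rate of 1.001 are borrowed from spaCy's default training config.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start * compound, start * compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Assumed values: the 100-1000 range is from this README; the rate of
# 1.001 is spaCy's default and is not confirmed for this model.
sizes = list(islice(compounding(100.0, 1000.0, 1.001), 5000))

print(f"Step 0 batch size (words): {sizes[0]:.1f}")
print(f"Step 1000 batch size (words): {sizes[1000]:.1f}")
print(f"Final batch size (words): {sizes[-1]:.1f}")
```

Under these assumptions, early batches are small, which makes the first updates cheap, and the batch size then grows gradually until it reaches the 1000-word cap.
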
## Limitations

- The model is trained primarily on formal Indonesian text and may perform worse on informal or colloquial Indonesian
- Performance may vary on domain-specific texts that are not well represented in the training data
- Entity boundaries may be imprecise, especially for complex compound entities

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@model{muhamad2024indonesian_ner,
  title={Indonesian Named Entity Recognition Model},
  author={Muhamad, Asep},
  year={2024},
  version={1.1.0},
  url={https://huggingface.co/asmud/ner-spacy-indonesian},
  license={CC BY-SA 3.0}
}
```

## Contact

For questions, issues, or collaborations:

- **Author**: Asep Muhamad
- **Email**: asepmuhamad@gmail.com
- **Website**: https://asmud.me

## Acknowledgments

This model was trained using data from Universal Dependencies v2.8, contributed by Daniel Zeman, Joakim Nivre, Mitchell Abrams, and many other contributors. Special thanks to the spaCy team for providing an excellent framework for natural language processing.