---
license: mit
language:
- id
pipeline_tag: token-classification
---

# BERT Base Indonesian Named Entity Recognition

This is a BERT-based model fine-tuned for Named Entity Recognition (NER) in Indonesian. The model identifies and classifies named entities such as persons, organizations, locations, and other relevant entities in Indonesian text.

## Model Details

- **Model Type**: BERT (Bidirectional Encoder Representations from Transformers)
- **Language**: Indonesian (id)
- **Task**: Token Classification / Named Entity Recognition
- **Base Model**: BERT Base
- **License**: MIT

## Intended Use

This model is intended for:

- Named Entity Recognition in Indonesian text
- Information extraction from Indonesian documents
- Text analysis and processing applications

## How to Use

### Using with Transformers

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "path/to/bert-base-indonesian-NER"  # or the Hugging Face model ID if uploaded
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Prepare input text
text = "Presiden Joko Widodo berkunjung ke Jakarta untuk bertemu dengan Gubernur Anies Baswedan."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Decode tokens and their predicted labels (first item in the batch)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[label_id] for label_id in predictions[0].tolist()]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```
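
Note that BERT tokenizers split words into WordPiece sub-tokens (marked with `##`), so the token-level output above usually needs to be merged back into whole words. A minimal sketch of that post-processing step (the `merge_wordpieces` helper is illustrative, not part of the model):

```python
def merge_wordpieces(tokens, labels):
    """Merge WordPiece sub-tokens (##...) back into words, keeping each word's first label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue continuation piece onto the previous word
        elif token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # drop special tokens
        else:
            words.append(token)
            word_labels.append(label)
    return words, word_labels

tokens = ["[CLS]", "Joko", "Wido", "##do", "ber", "##kunjung", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "I-PER", "O", "O", "O"]
print(merge_wordpieces(tokens, labels))
# → (['Joko', 'Widodo', 'berkunjung'], ['B-PER', 'I-PER', 'O'])
```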

### Using with Pipeline

```python
from transformers import pipeline

# Load the NER pipeline
ner_pipeline = pipeline("ner", model="path/to/bert-base-indonesian-NER")

# Process text
text = "PT Bank Central Asia Tbk memiliki kantor pusat di Jakarta."
results = ner_pipeline(text)

for result in results:
    print(f"Entity: {result['word']}, Label: {result['entity']}, Confidence: {result['score']:.4f}")
```

## Label Mapping

The model uses the following entity labels:

- `B-PER`: Beginning of a person name
- `I-PER`: Inside a person name
- `B-ORG`: Beginning of an organization name
- `I-ORG`: Inside an organization name
- `B-LOC`: Beginning of a location name
- `I-LOC`: Inside a location name
- `B-MISC`: Beginning of a miscellaneous entity
- `I-MISC`: Inside a miscellaneous entity
- `O`: Outside (not an entity)
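
In this BIO scheme, consecutive `B-`/`I-` tags of the same type form one entity span. A minimal sketch of how to merge token-level tags into spans (the `group_bio` helper is illustrative, not shipped with the model):

```python
def group_bio(tokens, labels):
    """Merge token-level BIO labels into (entity_text, entity_type) spans."""
    spans, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:  # close any open span before starting a new one
                spans.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current.append(token)  # continue the open span
        else:
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:  # flush a span that runs to the end of the sentence
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Joko", "Widodo", "berkunjung", "ke", "Jakarta"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(group_bio(tokens, labels))
# → [('Joko Widodo', 'PER'), ('Jakarta', 'LOC')]
```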

## Training Data

The model was trained on Indonesian text datasets containing annotated named entities. The training data includes:

- News articles
- Wikipedia pages
- Social media posts
- Government documents

## Performance

The model achieves the following metrics on the test set:

- Precision: 0.XX
- Recall: 0.XX
- F1-Score: 0.XX

## Limitations

- The model may not perform well on informal or slang-heavy Indonesian text
- Performance may vary across domains
- The model was trained on data up to a certain date and may not recognize newer entities

## Ethical Considerations

- This model should not be used for surveillance or for tracking individuals without consent
- Always consider privacy implications when processing personal data
- The model's predictions should be validated by human experts in critical applications

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@misc{bert-indonesian-ner,
  title={BERT Base Indonesian Named Entity Recognition},
  author={Your Name},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/model-id}
}
```

## Contact

For questions or issues, please contact [your contact information].