---
license: mit
language:
- id
pipeline_tag: token-classification
tags:
- token-classification
- indonesian
- bert
- ner
- named-entity-recognition
- transformers
datasets:
- custom
widget:
- text: "Presiden Joko Widodo berkunjung ke Jakarta untuk bertemu dengan Gubernur Anies Baswedan."
inference: true
---
# BERT Base Indonesian Named Entity Recognition
This is a BERT-based model fine-tuned for Named Entity Recognition (NER) in Indonesian. The model identifies and classifies named entities such as persons, organizations, locations, and other relevant entities in Indonesian text.
## Model Details
- **Model Type**: BERT (Bidirectional Encoder Representations from Transformers)
- **Language**: Indonesian (id)
- **Task**: Token Classification / Named Entity Recognition
- **Base Model**: BERT Base
- **License**: MIT
## Intended Use
This model is intended for:
- Named Entity Recognition in Indonesian text
- Information extraction from Indonesian documents
- Text analysis and processing applications
## How to Use
### Using with Transformers
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_name = "path/to/bert-base-indonesian-NER"  # or the Hugging Face model ID if uploaded
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Prepare input text
text = "Presiden Joko Widodo berkunjung ke Jakarta untuk bertemu dengan Gubernur Anies Baswedan."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Decode tokens and labels (note: these are subword tokens and include
# the special [CLS] and [SEP] tokens)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[label_id] for label_id in predictions[0].tolist()]

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```
### Using with Pipeline
```python
from transformers import pipeline

# Load the NER pipeline; aggregation_strategy="simple" groups subword
# pieces back into whole-word entities
ner_pipeline = pipeline(
    "ner",
    model="path/to/bert-base-indonesian-NER",
    aggregation_strategy="simple",
)

# Process text
text = "PT Bank Central Asia Tbk memiliki kantor pusat di Jakarta."
results = ner_pipeline(text)

for result in results:
    print(f"Entity: {result['word']}, Label: {result['entity_group']}, Confidence: {result['score']:.4f}")
```
## Label Mapping
The model uses the following entity labels:
- `B-PER`: Beginning of Person name
- `I-PER`: Inside of Person name
- `B-ORG`: Beginning of Organization name
- `I-ORG`: Inside of Organization name
- `B-LOC`: Beginning of Location name
- `I-LOC`: Inside of Location name
- `B-MISC`: Beginning of Miscellaneous entity
- `I-MISC`: Inside of Miscellaneous entity
- `O`: Outside (not an entity)
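The BIO labels above can be merged into entity spans with a few lines of plain Python. This is a minimal sketch: the tokens and labels below are illustrative, and real model output uses subword tokens that first need to be joined back into words.

```python
def merge_bio(tokens, labels):
    """Group (token, BIO-label) pairs into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag starts a new entity, closing any open one
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type continues the current entity
            current_tokens.append(token)
        else:
            # O (or a mismatched I- tag) closes the current entity
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Presiden", "Joko", "Widodo", "berkunjung", "ke", "Jakarta", "."]
labels = ["O", "B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(merge_bio(tokens, labels))
# → [('Joko Widodo', 'PER'), ('Jakarta', 'LOC')]
```

The `aggregation_strategy` option of the Transformers pipeline performs this grouping (plus subword merging) automatically.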
## Training Data
The model was trained on Indonesian text datasets containing annotated named entities. The training data includes:
- News articles
- Wikipedia pages
- Social media posts
- Government documents
## Performance
The model achieves the following performance metrics on the test set:
- Precision: 0.XX
- Recall: 0.XX
- F1-Score: 0.XX
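NER metrics of this kind are conventionally computed at the entity level, over whole spans rather than individual tokens (the `seqeval` library is the usual tool). A minimal pure-Python sketch of the same micro-averaged computation, using short hypothetical gold and predicted label sequences, is:

```python
def bio_spans(labels):
    """Extract a set of (start, end, type) entity spans from a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            if start is not None:
                spans.add((start, i, etype))
            start, etype = i, lab[2:]
        elif lab.startswith("I-") and start is not None and etype == lab[2:]:
            continue  # entity keeps going
        else:
            if start is not None:
                spans.add((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.add((start, len(labels), etype))
    return spans

def micro_prf(gold_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)   # spans predicted exactly right
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold spans
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(micro_prf(gold, pred))  # one span correct, one missed → P=1.0, R=0.5
```

A span counts as correct only if both its boundaries and its type match the gold annotation, which is why entity-level scores are typically lower than token-level accuracy.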
## Limitations
- The model may not perform well on informal or slang-heavy Indonesian text
- Performance may vary across different domains
- The model is trained on data up to a certain date and may not recognize newer entities
## Ethical Considerations
- This model should not be used for surveillance or tracking individuals without consent
- Always consider privacy implications when processing personal data
- The model's predictions should be validated by human experts for critical applications
## Citation
If you use this model in your research or applications, please cite:
```text
@misc{bert-indonesian-ner,
title={BERT Base Indonesian Named Entity Recognition},
author={Your Name},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/model-id}
}
```
## Contact
For questions or issues, please contact [your contact information].