---
license: mit
language:
- id
pipeline_tag: token-classification
tags:
- token-classification
- indonesian
- bert
- ner
- named-entity-recognition
- transformers
datasets:
- custom
widget:
- text: "Presiden Joko Widodo berkunjung ke Jakarta untuk bertemu dengan Gubernur Anies Baswedan."
inference: true
---

# BERT Base Indonesian Named Entity Recognition

This is a BERT-based model fine-tuned for Named Entity Recognition (NER) in Indonesian. It identifies and classifies named entities such as persons, organizations, and locations in Indonesian text.

---

## Model Details

- **Model Type**: BERT (Bidirectional Encoder Representations from Transformers)
- **Language**: Indonesian (id)
- **Task**: Token Classification / Named Entity Recognition
- **Base Model**: [`cahya/bert-base-indonesian-1.5G`](https://huggingface.co/cahya/bert-base-indonesian-1.5G)
- **License**: MIT

### Base Model Reference

The base model, **BERT Base Indonesian (uncased)**, was pre-trained with a masked language modeling (MLM) objective and a 32,000-token WordPiece vocabulary on:

- ~522 MB of Indonesian Wikipedia
- ~1 GB of Indonesian newspaper text

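
To illustrate what a WordPiece vocabulary means in practice: BERT-style tokenizers split each word greedily into the longest matching vocabulary pieces and mark continuation pieces with `##`. The sketch below assumes a tiny hypothetical vocabulary (`toy_vocab`), not the base model's real 32,000-entry one, and the helper name `wordpiece_split` is illustrative only.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"ber", "##kunjung", "jakarta"}

print(wordpiece_split("berkunjung", toy_vocab))  # ['ber', '##kunjung']
print(wordpiece_split("jakarta", toy_vocab))     # ['jakarta']
```

This is why the token list returned by the tokenizer can be longer than the number of words in the input sentence.
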

Full details are available on its [model card](https://huggingface.co/cahya/bert-base-indonesian-1.5G).

---

## Intended Use

This fine-tuned model is intended for:

- Named Entity Recognition in Indonesian text
- Information extraction from Indonesian documents
- Text analysis and processing applications

---

## How to Use

### Using with Transformers

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "nahiar/BERT-NER"  # replace with your Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "Presiden Joko Widodo berkunjung ke Jakarta untuk bertemu dengan Gubernur Anies Baswedan."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# The batch holds one sequence, so take index 0 to keep tokens and labels aligned
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
labels = [model.config.id2label[label_id] for label_id in predictions[0].tolist()]

print("Tokens:", tokens)
print("Labels:", labels)
```
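
The per-token output above still includes special tokens such as `[CLS]` and `[SEP]` as well as WordPiece continuation pieces prefixed with `##`. A minimal sketch of merging them into entity spans follows. It assumes the model's labels follow a BIO scheme such as `B-PER` / `I-PER` / `O`, which this card does not confirm, so check `model.config.id2label` first. The helper name `group_entities` is hypothetical, and the sample tokens and labels are hand-written for illustration, not real model output.

```python
def group_entities(tokens, labels):
    """Merge WordPiece tokens and BIO labels into (text, entity_type) pairs.

    Assumes labels look like "B-PER", "I-PER", or "O" (an assumption;
    verify against model.config.id2label).
    """
    entities = []
    current_words, current_type = [], None

    def flush():
        # Close the entity being built, if any
        nonlocal current_words, current_type
        if current_words:
            entities.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens carry no entity information
        if token.startswith("##"):
            # Continuation piece: glue it onto the previous word of the open entity
            if current_words:
                current_words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, _, entity_type = label.partition("-")
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_words.append(token)
    flush()
    return entities


# Hand-written sample in the shape produced by the snippet above
tokens = ["[CLS]", "presiden", "joko", "wi", "##dodo", "berkunjung", "ke", "jakarta", "[SEP]"]
labels = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "O"]

entities = group_entities(tokens, labels)
print(entities)  # [('joko widodo', 'PER'), ('jakarta', 'LOC')]
```

The word-level label is taken from the first piece of each word, and continuation pieces simply inherit it. This is one common convention, not necessarily the one used when this model was fine-tuned.
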