nahiar committed on
Commit
848b5ab
·
verified ·
1 Parent(s): b78be3b

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +129 -2
README.md CHANGED
@@ -1,6 +1,133 @@
1
  ---
2
  license: mit
3
  language:
4
- - id
5
  pipeline_tag: token-classification
6
- ---
1
  ---
2
  license: mit
3
  language:
4
+ - id
5
  pipeline_tag: token-classification
6
+ ---
7
+
8
+ # BERT Base Indonesian Named Entity Recognition
9
+
10
+ This is a BERT-based model fine-tuned for Named Entity Recognition (NER) in Indonesian. It identifies and classifies named entities such as persons, organizations, locations, and miscellaneous entities in Indonesian text.
11
+
12
+ ## Model Details
13
+
14
+ - **Model Type**: BERT (Bidirectional Encoder Representations from Transformers)
15
+ - **Language**: Indonesian (id)
16
+ - **Task**: Token Classification / Named Entity Recognition
17
+ - **Base Model**: BERT Base
18
+ - **License**: MIT
19
+
20
+ ## Intended Use
21
+
22
+ This model is intended for:
23
+
24
+ - Named Entity Recognition in Indonesian text
25
+ - Information extraction from Indonesian documents
26
+ - Text analysis and processing applications
27
+
28
+ ## How to Use
29
+
30
+ ### Using with Transformers
31
+
32
+ ```python
33
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
34
+ import torch
35
+
36
+ # Load model and tokenizer
37
+ model_name = "path/to/bert-base-indonesian-NER" # or Hugging Face model ID if uploaded
38
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
39
+ model = AutoModelForTokenClassification.from_pretrained(model_name)
40
+
41
+ # Prepare input text
42
+ text = "Presiden Joko Widodo berkunjung ke Jakarta untuk bertemu dengan Gubernur Anies Baswedan."
43
+ inputs = tokenizer(text, return_tensors="pt")
44
+
45
+ # Get predictions
46
+ with torch.no_grad():
47
+     outputs = model(**inputs)
48
+     predictions = torch.argmax(outputs.logits, dim=2)
49
+
50
+ # Decode predictions
51
+ predicted_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # flat list, aligned one-to-one with the labels below
52
+ predicted_labels = [model.config.id2label[label_id] for label_id in predictions[0].tolist()]
53
+
54
+ print("Tokens:", predicted_tokens)
55
+ print("Labels:", predicted_labels)
56
+ ```
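The tokens printed above are sub-word pieces rather than whole words, so a long name can be split across several tokens. The following is a minimal sketch of one way to merge them back into words, assuming the tokenizer is a standard BERT WordPiece tokenizer that marks continuation pieces with `##`. The helper `merge_wordpieces` and the sample tokens and labels are hypothetical and written by hand; they are not output from this model.

```python
from typing import List, Tuple

# Special tokens a BERT tokenizer typically adds (assumed here).
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge WordPiece tokens into words, keeping the label of the first piece."""
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # skip [CLS], [SEP], and padding
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words


# Hand-written example input (hypothetical, not real model output).
tokens = ["[CLS]", "jok", "##owi", "ke", "jakarta", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]

word_labels = merge_wordpieces(tokens, labels)
print(word_labels)
```

Keeping the first piece's label is only one possible convention; taking a majority vote over the pieces is another.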
57
+
58
+ ### Using with Pipeline
59
+
60
+ ```python
61
+ from transformers import pipeline
62
+
63
+ # Load NER pipeline
64
+ ner_pipeline = pipeline("ner", model="path/to/bert-base-indonesian-NER")
65
+
66
+ # Process text
67
+ text = "PT Bank Central Asia Tbk memiliki kantor pusat di Jakarta."
68
+ results = ner_pipeline(text)
69
+
70
+ for result in results:
71
+     print(f"Entity: {result['word']}, Label: {result['entity']}, Confidence: {result['score']:.4f}")
72
+ ```
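By default the pipeline returns one result per token. Passing `aggregation_strategy="simple"` to `pipeline(...)` groups tokens into whole entities, in which case each result carries an `entity_group` key instead of `entity`, together with `start` and `end` character offsets. The sketch below shows one way to post-process results of that shape: it drops low-confidence entities and recovers the original casing from the offsets. The helper `extract_entities`, the threshold, and the sample results (including their scores) are hypothetical and hand-written, not produced by this model.

```python
from typing import Dict, List, Tuple


def extract_entities(
    text: str, results: List[Dict], min_score: float = 0.6
) -> List[Tuple[str, str]]:
    """Return (label, surface text) pairs for results at or above min_score."""
    entities = []
    for result in results:
        if result["score"] < min_score:
            continue  # discard low-confidence predictions
        # Slice the original text so casing and spacing are preserved.
        surface = text[result["start"]:result["end"]]
        entities.append((result["entity_group"], surface))
    return entities


text = "PT Bank Central Asia Tbk memiliki kantor pusat di Jakarta."

# Hand-written results in the aggregated pipeline format (hypothetical values).
results = [
    {"entity_group": "ORG", "score": 0.98, "word": "pt bank central asia tbk", "start": 0, "end": 24},
    {"entity_group": "LOC", "score": 0.55, "word": "jakarta", "start": 50, "end": 57},
]

entities = extract_entities(text, results, min_score=0.6)
print(entities)
```

The right threshold depends on the application and should be tuned on held-out data.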
73
+
74
+ ## Label Mapping
75
+
76
+ The model uses the following entity labels:
77
+
78
+ - `B-PER`: Beginning of Person name
79
+ - `I-PER`: Inside of Person name
80
+ - `B-ORG`: Beginning of Organization name
81
+ - `I-ORG`: Inside of Organization name
82
+ - `B-LOC`: Beginning of Location name
83
+ - `I-LOC`: Inside of Location name
84
+ - `B-MISC`: Beginning of Miscellaneous entity
85
+ - `I-MISC`: Inside of Miscellaneous entity
86
+ - `O`: Outside (not an entity)
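These labels follow the BIO scheme, so a multi-word entity is a `B-` tag followed by `I-` tags of the same type. A minimal sketch of turning a label sequence into entity spans is shown here. The helper `bio_to_spans` is hypothetical, the example labels are written by hand rather than predicted by the model, and treating a stray `I-` tag as the start of a new entity is an assumption (a lenient decoding choice).

```python
from typing import List, Optional, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Convert BIO labels into (entity_type, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for i, label in enumerate(labels):
        prefix, _, etype = label.partition("-")
        if prefix == "I" and current == etype:
            continue  # still inside the current entity
        if current is not None:
            spans.append((current, start, i))  # close the open entity
            start, current = None, None
        if prefix in ("B", "I"):
            # "B-" opens an entity; a stray "I-" is also treated as an opening (assumption).
            start, current = i, etype
    if current is not None:
        spans.append((current, start, len(labels)))
    return spans


# Hand-written example (hypothetical labels, one per word).
words = ["PT", "Bank", "Central", "Asia", "Tbk", "memiliki", "kantor", "pusat", "di", "Jakarta"]
labels = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC"]

spans = bio_to_spans(labels)
for etype, begin, end in spans:
    print(etype, " ".join(words[begin:end]))
```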
87
+
88
+ ## Training Data
89
+
90
+ The model was trained on Indonesian text datasets containing annotated named entities. The training data includes:
91
+
92
+ - News articles
93
+ - Wikipedia pages
94
+ - Social media posts
95
+ - Government documents
96
+
97
+ ## Performance
98
+
99
+ The model achieves the following performance metrics on the test set:
100
+
101
+ - Precision: 0.XX
102
+ - Recall: 0.XX
103
+ - F1-Score: 0.XX
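The card does not state how these metrics are computed. One common convention for NER is entity-level scoring, where a prediction counts as correct only if both its type and its exact span match the gold annotation. The sketch below illustrates that convention under this assumption; the helper `entity_prf1` is hypothetical, and the toy spans are invented for illustration and say nothing about this model's actual performance.

```python
from typing import Set, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end)


def entity_prf1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 using exact span-and-type matching."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: two of the three predictions match a gold entity exactly.
gold = {("PER", 1, 3), ("LOC", 5, 6), ("PER", 10, 12)}
pred = {("PER", 1, 3), ("LOC", 5, 6), ("ORG", 10, 12)}

precision, recall, f1 = entity_prf1(gold, pred)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1-Score: {f1:.2f}")
```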
104
+
105
+ ## Limitations
106
+
107
+ - The model may not perform well on informal or slang-heavy Indonesian text
108
+ - Performance may vary across different domains
109
+ - The training data has a cutoff date, so the model may not recognize entities that emerged after it
110
+
111
+ ## Ethical Considerations
112
+
113
+ - This model should not be used for surveillance or tracking individuals without consent
114
+ - Always consider privacy implications when processing personal data
115
+ - The model's predictions should be validated by human experts for critical applications
116
+
117
+ ## Citation
118
+
119
+ If you use this model in your research or applications, please cite:
120
+
121
+ ```bibtex
122
+ @misc{bert-indonesian-ner,
123
+ title={BERT Base Indonesian Named Entity Recognition},
124
+ author={Your Name},
125
+ year={2024},
126
+ publisher={Hugging Face},
127
+ url={https://huggingface.co/model-id}
128
+ }
129
+ ```
130
+
131
+ ## Contact
132
+
133
+ For questions or issues, please contact [your contact information].