|
|
--- |
|
|
tags: |
|
|
- BERT |
|
|
- token-classification |
|
|
- sequence-tagger-model |
|
|
language: |
|
|
- ar |
|
|
- en |
|
|
license: mit |
|
|
datasets: |
|
|
- ACE2005 |
|
|
--- |
|
|
# Arabic NER Model |
|
|
- [Github repo](https://github.com/edchengg/GigaBERT) |
|
|
- NER BIO tagging model based on [GigaBERTv4](https://huggingface.co/lanwuwei/GigaBERT-v4-Arabic-and-English). |
|
|
- Training data: ACE2005 (English + Arabic)
|
|
- [NER tags](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-entities-guidelines-v6.6.pdf) include: PER (person), VEH (vehicle), GPE (geo-political entity), WEA (weapon), ORG (organization), LOC (location), FAC (facility)
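As a quick illustration of what BIO tagging over these seven entity types looks like, the sketch below builds the implied label space and groups BIO tags into entity spans, the way a grouped NER pipeline does. Note this is an assumption-laden toy: the released checkpoint's actual `id2label` ordering is not shown in this card, and `bio_to_spans` is a hypothetical helper, not part of the model.

```python
# The seven ACE entity types listed above; label order is illustrative only.
ENTITY_TYPES = ["PER", "VEH", "GPE", "WEA", "ORG", "LOC", "FAC"]
# BIO scheme: "O" plus a B-/I- pair per type -> 15 labels.
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]

def bio_to_spans(tokens, tags):
    """Group per-token BIO tags into (entity_type, text) spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new entity starts
            if current:
                spans.append(tuple(current))
            current = [tag[2:], tok]
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1] += " " + tok       # continue the open entity
        else:                             # "O" or an inconsistent I- tag
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans

print(bio_to_spans(
    ["Protests", "across", "the", "US", "after", "Supreme", "Court"],
    ["O", "O", "O", "B-GPE", "O", "B-ORG", "I-ORG"],
))
# [('GPE', 'US'), ('ORG', 'Supreme Court')]
```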
|
|
|
|
|
## Hyperparameters |
|
|
- learning_rate=2e-5 |
|
|
- num_train_epochs=10 |
|
|
- weight_decay=0.01 |
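To make the hyperparameters above concrete, here is a small sketch of how they behave under the default Hugging Face fine-tuning setup (AdamW with a linear learning-rate decay). The steps-per-epoch figure is hypothetical; the real value depends on batch size and dataset size, neither of which is stated in this card.

```python
# Listed hyperparameters
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 10
WEIGHT_DECAY = 0.01

STEPS_PER_EPOCH = 500  # hypothetical, for illustration only
total_steps = NUM_TRAIN_EPOCHS * STEPS_PER_EPOCH

def linear_lr(step: int) -> float:
    """Default HF scheduler: decay the learning rate linearly to zero."""
    return LEARNING_RATE * max(0.0, 1 - step / total_steps)

def weight_decay_factor(lr: float) -> float:
    """AdamW applies decoupled weight decay: each weight is shrunk by
    lr * weight_decay per step, independently of the gradient."""
    return 1 - lr * WEIGHT_DECAY

print(linear_lr(0))                 # 2e-5 at the first step
print(linear_lr(total_steps // 2))  # 1e-5 halfway through training
```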
|
|
|
|
|
## ACE2005 Evaluation results (F1) |
|
|
| | Arabic | English |
|:----:|:-----------:|:----:|
| F1 | 89.4 | 88.8 |
|
|
|
|
|
## How to use |
|
|
```python |
|
|
>>> from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer |
|
|
|
|
|
>>> ner_model = AutoModelForTokenClassification.from_pretrained("ychenNLP/arabic-ner-ace") |
|
|
>>> ner_tokenizer = AutoTokenizer.from_pretrained("ychenNLP/arabic-ner-ace") |
|
|
>>> ner_pip = pipeline("ner", model=ner_model, tokenizer=ner_tokenizer, aggregation_strategy="simple")
|
|
|
|
|
>>> output = ner_pip('Protests break out across the US after Supreme Court overturns.') |
|
|
>>> print(output) |
|
|
[{'entity_group': 'GPE', 'score': 0.9979881, 'word': 'us', 'start': 30, 'end': 32}, {'entity_group': 'ORG', 'score': 0.99898684, 'word': 'supreme court', 'start': 39, 'end': 52}] |
|
|
|
|
|
>>> # Arabic input, roughly: "Turkish Justice Minister Bekir Bozdag said Ankara wants 12 suspects from Finland and 21 from Sweden"
>>> output = ner_pip('قال وزير العدل التركي بكير بوزداغ إن أنقرة تريد 12 مشتبهاً بهم من فنلندا و 21 من السويد')
|
|
>>> print(output) |
|
|
[{'entity_group': 'PER', 'score': 0.9996214, 'word': 'وزير', 'start': 4, 'end': 8}, {'entity_group': 'ORG', 'score': 0.9952383, 'word': 'العدل', 'start': 9, 'end': 14}, {'entity_group': 'GPE', 'score': 0.9996675, 'word': 'التركي', 'start': 15, 'end': 21}, {'entity_group': 'PER', 'score': 0.9978992, 'word': 'بكير بوزداغ', 'start': 22, 'end': 33}, {'entity_group': 'GPE', 'score': 0.9997154, 'word': 'انقرة', 'start': 37, 'end': 42}, {'entity_group': 'PER', 'score': 0.9946885, 'word': 'مشتبها بهم', 'start': 51, 'end': 62}, {'entity_group': 'GPE', 'score': 0.99967396, 'word': 'فنلندا', 'start': 66, 'end': 72}, {'entity_group': 'PER', 'score': 0.99694425, 'word': '21', 'start': 75, 'end': 77}, {'entity_group': 'GPE', 'score': 0.99963355, 'word': 'السويد', 'start': 81, 'end': 87}] |
|
|
``` |
|
|
|
|
|
### BibTeX entry and citation info |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{lan2020gigabert, |
|
|
author = {Lan, Wuwei and Chen, Yang and Xu, Wei and Ritter, Alan}, |
|
|
title = {Giga{BERT}: Zero-shot Transfer Learning from {E}nglish to {A}rabic}, |
|
|
    booktitle = {Proceedings of The 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
|
|
year = {2020} |
|
|
} |
|
|
``` |
|
|
|