---
language:
- bn
metrics:
- f1
pipeline_tag: token-classification
---
# Bangla-Person-Name-Extractor

This repository contains the implementation of a Bangla Person Name Extractor model that extracts person-name entities from a given sentence. We approached it as a token classification task, i.e. tagging each token as part of a person's name or not. We leveraged the [BanglaBERT](https://github.com/csebuetnlp/banglabert) model, fine-tuning it for binary token classification on a custom-prepared dataset. The model is deployed on the Hugging Face Hub for easier access and use.

# Datasets

We used two datasets to train and evaluate our pipeline.
1. [Bengali-NER/annotated data at master · Rifat1493/Bengali-NER](https://github.com/Rifat1493/Bengali-NER/tree/master/annotated%20data)
2. [banglakit/bengali-ner-data](https://raw.githubusercontent.com/banglakit/bengali-ner-data/master/main.jsonl)

The annotation formats of the two datasets were quite different, so we preprocessed both before merging them. Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/prepare-dataset.ipynb) to see how the dataset is prepared.

# Training and Evaluation

We treated this problem as a token classification task, so fine-tuning BanglaBERT was a natural fit. [BanglaBERT](https://huggingface.co/csebuetnlp/banglabert) is an [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) discriminator model pretrained with the Replaced Token Detection (RTD) objective. Models fine-tuned from this checkpoint achieve state-of-the-art results on many Bengali NLP tasks.
We mainly fine-tuned two checkpoints of BanglaBERT:
1. [BanglaBERT](https://huggingface.co/csebuetnlp/banglabert)
2. [BanglaBERT small](https://huggingface.co/csebuetnlp/banglabert_small)

BanglaBERT performed better than BanglaBERT small (83% vs. 79% F1 score on the test set).
Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Training%20Notebook%20%3A%20Person%20Name%20Extractor%20using%20BanglaBERT.ipynb) to see the training process.

**Quantitative results**

Please refer to [this notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Inference%20and%20Evaluation%20Notebook.ipynb) to see the evaluation process.
<br></br>


# How to use it?

[This notebook](https://github.com/MBMMurad/Bangla-Person-Name-Extractor/blob/main/Inference_template.ipynb) contains an inference template for a single sentence.
<br></br>
You can also run inference directly with the following code snippet; just change the sentence.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification  # !pip install transformers==4.30.2
from normalizer import normalize  # !pip install git+https://github.com/csebuetnlp/normalizer
import torch  # !pip install torch
import numpy as np  # !pip install numpy==1.23.5

model = AutoModelForTokenClassification.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")
tokenizer = AutoTokenizer.from_pretrained("MBMMurad/BanglaBERT_Person_Name_Extractor")

def inference_fn(sentence):
    sentence = normalize(sentence)
    tokens = tokenizer.tokenize(sentence)
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    outputs = model(inputs).logits
    # Drop the [CLS] and [SEP] positions, then keep tokens predicted as class 1 (person name).
    predictions = torch.argmax(outputs[0], axis=1)[1:-1].numpy()
    idxs = np.where(predictions == 1)
    return np.array(tokens)[idxs]

sentence = "আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।"
pred = inference_fn(sentence)
print(f"Input Sentence : {sentence}")
print(f"Person Name Entities : {pred}")

sentence = "ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'"
pred = inference_fn(sentence)
print(f"Input Sentence : {sentence}")
print(f"Person Name Entities : {pred}")

sentence = "দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।"
pred = inference_fn(sentence)
print(f"Input Sentence : {sentence}")
print(f"Person Name Entities : {pred}")
```

**Output:**
```
Input Sentence : আব্দুর রহিম নামের কাস্টমারকে একশ টাকা বাকি দিলাম।
Person Name Entities : ['আব্দুর' 'রহিম']

Input Sentence : ইঞ্জিনিয়ার্স ইনস্টিটিউশন চট্টগ্রামের সাবেক সভাপতি প্রকৌশলী দেলোয়ার হোসেন মজুমদার প্রথম আলোকে বলেন, 'সংকট নিরসনে বর্তমান খালগুলোকে পূর্ণ প্রবাহে ফিরিয়ে আনার পাশাপাশি নতুন তিনটি খাল খনন জরুরি।'
Person Name Entities : ['দেলোয়ার' 'হোসেন' 'মজুমদার']

Input Sentence : দলীয় নেতারা তাঁর বাসভবনে যেতে চাইলে আটক হন।
Person Name Entities : []
```
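Note that the extractor returns WordPiece subword tokens, with continuation pieces prefixed by `##`. If you want whole words rather than pieces, a minimal, hypothetical post-processing sketch looks like this:

```python
def merge_subwords(tokens):
    """Merge WordPiece tokens, joining '##'-prefixed continuations onto the previous piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # strip '##' and append to the preceding piece
        else:
            words.append(tok)
    return words

print(merge_subwords(["আব", "##্দুর", "রহ", "##িম"]))  # ['আব্দুর', 'রহিম']
```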