---
library_name: transformers
base_model: HooshvareLab/bert-base-parsbert-uncased
tags:
- ner
- persian
- fine-tuned
- transformers
model-index:
- name: bert-finetuned-ner
  results: []
---
|
|
|
|
|
# bert-finetuned-ner |
|
|
|
|
|
This model is a **fine-tuned Persian BERT model** based on [HooshvareLab/bert-base-parsbert-uncased](https://huggingface.co/HooshvareLab/bert-base-parsbert-uncased) for **Named Entity Recognition (NER)**. It has been trained to identify entities such as persons, organizations, locations, and products in Persian text. |
|
|
|
|
|
## Model Description |
|
|
|
|
|
`bert-finetuned-ner` is designed for token-level classification in Persian. The model uses **ParsBERT**, a BERT variant pretrained on a large Persian corpus, as its base and is fine-tuned on the `Amir13/wnut2017-persian` dataset. It predicts an entity label for each token in the input text, supporting tasks such as text analysis, information extraction, and question-answering pipelines.
|
|
|
|
|
## Intended Uses & Limitations |
|
|
|
|
|
### Intended Uses |
|
|
|
|
|
- Named Entity Recognition (NER) in Persian text.
- Information extraction for Persian-language NLP pipelines.
- Academic research or industrial projects requiring entity tagging.
|
|
|
|
|
### Limitations |
|
|
|
|
|
- Performance depends heavily on the coverage and quality of the training data; entities not represented in the dataset may not be recognized.
- The model may misclassify rare or out-of-vocabulary words.
- For critical applications, manual verification of predictions is recommended.
- Trained on formal text; performance on dialects or colloquial Persian may vary.
|
|
|
|
|
## Training and Evaluation Data |
|
|
|
|
|
- **Dataset:** [Amir13/wnut2017-persian](https://huggingface.co/datasets/Amir13/wnut2017-persian)
- Annotated entity types: persons, organizations, locations, creative works, and products.
- Tokenization is handled by the ParsBERT tokenizer (`HooshvareLab/bert-base-parsbert-uncased`).
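
Because the ParsBERT tokenizer splits words into subword pieces, word-level NER labels must be realigned to tokens before training. The sketch below shows the standard alignment scheme used for token-classification fine-tuning (special tokens and non-first subword pieces get `-100` so the loss ignoreses them); the `word_ids` list mimics the output of `tokenizer(...).word_ids()` and the label values are illustrative, not the dataset's actual label ids.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Map word-level NER labels onto subword tokens.

    Special tokens ([CLS]/[SEP], word_id None) and non-first subword
    pieces receive `ignore_index` so they are excluded from the loss.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first piece keeps the label
        else:
            aligned.append(ignore_index)      # later pieces are ignored
        previous = wid
    return aligned

# Example: 3 words, the second split into two subword pieces.
word_ids = [None, 0, 1, 1, 2, None]        # [CLS] w0 w1a w1b w2 [SEP]
word_labels = [0, 3, 0]                    # e.g. O, B-location, O
print(align_labels(word_ids, word_labels)) # [-100, 0, 3, -100, 0, -100]
```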
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Training Hyperparameters |
|
|
|
|
|
- Learning rate: 2e-5
- Train batch size: 8
- Evaluation batch size: 8
- Optimizer: `AdamW` (betas=(0.9, 0.999), epsilon=1e-8)
- Learning rate scheduler: linear
- Number of epochs: 10
- Seed: 42
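
The hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the exact training script; the `output_dir` name is a placeholder.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-finetuned-ner",   # placeholder output directory
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```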
|
|
|
|
|
### Training Environment |
|
|
|
|
|
- Framework: Transformers 4.56.1
- PyTorch 2.8.0+cu126
- Datasets 4.0.0
- Tokenizers 0.22.0
|
|
|
|
|
### Training Results |
|
|
|
|
|
- The model reached a training loss of ~0.13 (averaged over all epochs).
- Accuracy, precision, recall, and F1 should be computed on a held-out test set for a full evaluation.
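
For the held-out evaluation recommended above, NER is usually scored at the entity level over BIO tag sequences. Libraries such as `seqeval` implement this; the following is a minimal self-contained sketch of the same idea (it treats a stray `I-` tag without a preceding `B-` as outside, which full implementations handle more carefully).

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from BIO tags."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes the last span
        boundary = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != etype)
        )
        if boundary:
            if start is not None:
                spans.add((etype, start, i))   # end index is exclusive
            if tag.startswith("B-"):
                start, etype = i, tag[2:]      # open a new span
            else:
                start, etype = None, None
    return spans

def f1_score(true_tags, pred_tags):
    """Entity-level F1: a span counts only if type and boundaries match."""
    gold, pred = extract_spans(true_tags), extract_spans(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = ["B-location", "I-location", "O", "B-person"]
pred = ["B-location", "I-location", "O", "O"]
print(round(f1_score(gold, pred), 2))  # 0.67 (precision 1.0, recall 0.5)
```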
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_path = "path_to_saved_model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

text = "سلام. در تهران زندگی میکنم."  # "Hello. I live in Tehran."
results = ner_pipeline(text)
print(results)
```