---
language:
- ko
- en
- es
- pt
tags:
- token-classification
- named-entity-recognition
- multilingual
- transformers
license: mit
pipeline_tag: token-classification
datasets:
- wikiann
model-index:
- name: kaidol-ner-multilingual
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: WikiAnn (en, ko, es, pt)
      type: wikiann
    metrics:
    - name: F1
      type: f1
      value: 0.74
base_model:
- Davlan/xlm-roberta-base-ner-hrl
---

# KAIdol NER Multilingual Model

This is a multilingual NER (Named Entity Recognition) model developed as part of the **KAIdol Project**.
It is based on [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl), fine-tuned on the [WikiAnn](https://huggingface.co/datasets/wikiann) dataset for **Korean (ko)**, **English (en)**, **Spanish (es)**, and **Portuguese (pt)**.

## Model Details

- **Base model**: `Davlan/xlm-roberta-base-ner-hrl`
- **NER tags**:
  - `PER`: Person
  - `ORG`: Organization
  - `LOC`: Location
- **Tokenizer**: `AutoTokenizer` from the base model
- **Max sequence length**: 128 tokens

## Training Configuration

| Parameter     | Value                            |
|---------------|----------------------------------|
| Epochs        | 5                                |
| Batch size    | 16                               |
| Optimizer     | AdamW                            |
| Learning rate | 5e-5                             |
| Loss          | Cross-entropy with class weights |
| Dataset       | WikiAnn (en, ko, es, pt)         |
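
The class-weighted loss in the table can be illustrated with a small, dependency-free sketch. It mirrors the weight semantics of `torch.nn.CrossEntropyLoss(weight=...)` (each example's loss is scaled by the weight of its true class, and the mean is normalized by the sum of the applied weights); the logits and weights below are made up for illustration, not taken from this model:

```python
import math

def weighted_cross_entropy(logits, targets, weights):
    """Mean cross-entropy where each example's loss is scaled by the
    weight of its true class, normalized by the sum of applied weights
    (the convention torch.nn.CrossEntropyLoss(weight=...) follows)."""
    losses, applied = [], []
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(weights[y] * (log_z - row[y]))
        applied.append(weights[y])
    return sum(losses) / sum(applied)

# Hypothetical logits for two tokens over three classes,
# with the rarer third class up-weighted.
logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 2.5]]
targets = [0, 2]
loss = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0, 3.0])
```

Note that scaling all weights by the same constant leaves the loss unchanged, since the normalizer scales with them.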
## Performance Summary

| Language   | F1-macro | PER F1 | ORG F1 | LOC F1 |
|------------|----------|--------|--------|--------|
| English    | 0.74     | 0.84   | 0.63   | 0.76   |
| Korean     | 0.43     | 0.46   | 0.30   | 0.52   |
| Spanish    | TBD      | TBD    | TBD    | TBD    |
| Portuguese | TBD      | TBD    | TBD    | TBD    |

> Performance on `es` and `pt` will be updated after evaluation. Korean performance is limited by tokenization issues in WikiAnn.

## Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("developer-lunark/kaidol-ner-multilingual")
tokenizer = AutoTokenizer.from_pretrained("developer-lunark/kaidol-ner-multilingual")

inputs = tokenizer("Barack Obama nació en Hawái.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map each token's highest-scoring logit to its label name
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
labels = [model.config.id2label[i] for i in predictions]
```

## Label Mapping

```python
{
    'O': 0,
    'B-PER': 1,
    'I-PER': 2,
    'B-ORG': 3,
    'I-ORG': 4,
    'B-LOC': 5,
    'I-LOC': 6
}
```
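
As a rough illustration (not part of the model card's code), per-token BIO labels like those above can be grouped into entity spans with a few lines of Python; `bio_to_spans` is a hypothetical helper name:

```python
def bio_to_spans(tokens, labels):
    """Group (token, BIO-label) pairs into (entity_type, text) spans.
    A span starts at B-X, continues over matching I-X labels, and is
    flushed on O, a new B-, or a non-matching I- label."""
    spans, current, ctype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == ctype:
            current.append(tok)
        else:
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

spans = bio_to_spans(
    ["Barack", "Obama", "was", "born", "in", "Hawaii", "."],
    ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"],
)
# → [("PER", "Barack Obama"), ("LOC", "Hawaii")]
```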
## License

MIT License

## Contact

Developed by the KAIdol Project Team.

For questions or collaborations, contact: `developer-lunark`