---
language:
- ko
- en
- es
- pt
tags:
- token-classification
- named-entity-recognition
- multilingual
- transformers
license: mit
pipeline_tag: token-classification
datasets:
- wikiann
model-index:
- name: kaidol-ner-multilingual
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: WikiAnn (en, ko, es, pt)
      type: wikiann
    metrics:
    - name: F1
      type: f1
      value: 0.74
base_model:
- Davlan/xlm-roberta-base-ner-hrl
---

# ๐ŸŒ KAIdol NER Multilingual Model

This is a multilingual NER (Named Entity Recognition) model developed as part of the **KAIdol Project**.  
It is based on [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl), fine-tuned on the [WikiAnn](https://huggingface.co/datasets/wikiann) dataset for **Korean (ko)**, **English (en)**, **Spanish (es)**, and **Portuguese (pt)**.

## 🧠 Model Details

- **Base model**: `Davlan/xlm-roberta-base-ner-hrl`
- **NER Tags**:  
  - `PER`: Person  
  - `ORG`: Organization  
  - `LOC`: Location  
- **Tokenizer**: AutoTokenizer from base model  
- **Max length**: 128 tokens
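
As a minimal sketch of how inputs fit these settings (the sentence and keyword arguments below are illustrative, not taken from the training code), the base model's tokenizer can truncate and pad to the 128-token limit:

```python
from transformers import AutoTokenizer

# AutoTokenizer resolves to the XLM-RoBERTa tokenizer shipped with the base model
tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-base-ner-hrl")

enc = tokenizer(
    "Seoul is the capital of South Korea.",  # illustrative sentence
    max_length=128,                          # matches the model's max length
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 128])
```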

## 📊 Training Configuration

| Parameter         | Value     |
|------------------|-----------|
| Epochs           | 5         |
| Batch Size       | 16        |
| Optimizer        | AdamW     |
| Learning Rate    | 5e-5      |
| Loss             | CrossEntropy with class weights |
| Dataset          | WikiAnn (en, ko, es, pt) |
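
The class weights themselves are not published here; as a hedged illustration of the loss setup, a weighted `CrossEntropyLoss` in PyTorch looks like this (the weight values below are placeholders, not the trained ones):

```python
import torch
import torch.nn as nn

# Placeholder per-class weights in label-id order
# (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC)
class_weights = torch.tensor([0.5, 1.0, 1.0, 1.5, 1.5, 1.0, 1.0])

# ignore_index=-100 skips sub-word positions masked out during label alignment
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# Applied per token: flatten (batch, seq_len, 7) logits and (batch, seq_len) labels
# loss = loss_fn(logits.view(-1, 7), labels.view(-1))
```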

## ✅ Performance Summary

| Language | F1-macro | PER F1 | ORG F1 | LOC F1 |
|----------|----------|--------|--------|--------|
| English  | 0.74     | 0.84   | 0.63   | 0.76   |
| Korean   | 0.43     | 0.46   | 0.30   | 0.52   |
| Spanish  | TBD      | TBD    | TBD    | TBD    |
| Portuguese | TBD    | TBD    | TBD    | TBD    |

> Performance on `es` and `pt` will be updated after evaluation. Korean performance is limited due to tokenization issues in WikiAnn.

## 🚀 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("developer-lunark/kaidol-ner-multilingual")
tokenizer = AutoTokenizer.from_pretrained("developer-lunark/kaidol-ner-multilingual")

# Tokenize and run the model; logits have shape (batch, seq_len, num_labels)
tokens = tokenizer("Barack Obama nació en Hawái.", return_tensors="pt")
output = model(**tokens)

# Pick the highest-scoring label for each token
predicted_ids = output.logits.argmax(dim=-1)[0]
labels = [model.config.id2label[i.item()] for i in predicted_ids]
```
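
Alternatively, a `transformers` `pipeline` handles sub-word aggregation for you. This is a minimal sketch using the standard pipeline API; the exact output fields depend on your `transformers` version:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="developer-lunark/kaidol-ner-multilingual",
    aggregation_strategy="simple",  # merge sub-word pieces into entity spans
)

print(ner("Barack Obama nació en Hawái."))
# Expected shape of the result (values illustrative):
# [{'entity_group': 'PER', 'word': 'Barack Obama', ...},
#  {'entity_group': 'LOC', 'word': 'Hawái', ...}]
```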

## 🧾 Label Mapping

```python
{
  'O': 0,
  'B-PER': 1,
  'I-PER': 2,
  'B-ORG': 3,
  'I-ORG': 4,
  'B-LOC': 5,
  'I-LOC': 6
}
```
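
For decoding raw logits yourself, the inverse mapping can be derived directly (a trivial sketch; the fine-tuned checkpoint also exposes it as `model.config.id2label`):

```python
label2id = {
    "O": 0, "B-PER": 1, "I-PER": 2,
    "B-ORG": 3, "I-ORG": 4, "B-LOC": 5, "I-LOC": 6,
}
id2label = {v: k for k, v in label2id.items()}  # {0: 'O', 1: 'B-PER', ...}
```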

## 🔐 License

MIT License

## 📬 Contact

Developed by the KAIdol Project Team.

For questions or collaborations, contact: `developer-lunark`