---
language:
- ko
- en
- es
- pt
tags:
- token-classification
- named-entity-recognition
- multilingual
- transformers
license: mit
pipeline_tag: token-classification
datasets:
- wikiann
model-index:
- name: kaidol-ner-multilingual
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: WikiAnn (en, ko, es, pt)
      type: wikiann
    metrics:
    - name: F1
      type: f1
      value: 0.74
base_model:
- Davlan/xlm-roberta-base-ner-hrl
---

# ๐ŸŒ KAIdol NER Multilingual Model

This is a multilingual NER (Named Entity Recognition) model developed as part of the **KAIdol Project**.  
It is based on [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl), fine-tuned on the [WikiAnn](https://huggingface.co/datasets/wikiann) dataset for **Korean (ko)**, **English (en)**, **Spanish (es)**, and **Portuguese (pt)**.

## 🧠 Model Details

- **Base model**: `Davlan/xlm-roberta-base-ner-hrl`
- **NER Tags**:  
  - `PER`: Person  
  - `ORG`: Organization  
  - `LOC`: Location  
- **Tokenizer**: AutoTokenizer from base model  
- **Max length**: 128 tokens
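
As a minimal sketch of how inputs fit these settings (the sentence and keyword arguments below are illustrative, not taken from the training code), the base model's tokenizer can truncate and pad to the 128-token limit:

```python
from transformers import AutoTokenizer

# AutoTokenizer resolves to the XLM-RoBERTa tokenizer shipped with the base model
tokenizer = AutoTokenizer.from_pretrained("Davlan/xlm-roberta-base-ner-hrl")

enc = tokenizer(
    "Seoul is the capital of South Korea.",  # illustrative sentence
    max_length=128,                          # matches the model's max length
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
print(enc["input_ids"].shape)  # torch.Size([1, 128])
```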

## 📊 Training Configuration

| Parameter         | Value     |
|------------------|-----------|
| Epochs           | 5         |
| Batch Size       | 16        |
| Optimizer        | AdamW     |
| Learning Rate    | 5e-5      |
| Loss             | CrossEntropy with class weights |
| Dataset          | WikiAnn (en, ko, es, pt) |
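
The class weights themselves are not published here; as a hedged illustration of the loss setup, a weighted `CrossEntropyLoss` in PyTorch looks like this (the weight values below are placeholders, not the trained ones):

```python
import torch
import torch.nn as nn

# Placeholder per-class weights in label-id order
# (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC)
class_weights = torch.tensor([0.5, 1.0, 1.0, 1.5, 1.5, 1.0, 1.0])

# ignore_index=-100 skips sub-word positions masked out during label alignment
loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

# Applied per token: flatten (batch, seq_len, 7) logits and (batch, seq_len) labels
# loss = loss_fn(logits.view(-1, 7), labels.view(-1))
```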

## ✅ Performance Summary

| Language | F1-macro | PER F1 | ORG F1 | LOC F1 |
|----------|----------|--------|--------|--------|
| English  | 0.74     | 0.84   | 0.63   | 0.76   |
| Korean   | 0.43     | 0.46   | 0.30   | 0.52   |
| Spanish  | TBD      | TBD    | TBD    | TBD    |
| Portuguese | TBD    | TBD    | TBD    | TBD    |

> Performance on `es` and `pt` will be updated after evaluation. Korean performance is limited due to tokenization issues in WikiAnn.

## 🚀 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("developer-lunark/kaidol-ner-multilingual")
tokenizer = AutoTokenizer.from_pretrained("developer-lunark/kaidol-ner-multilingual")

# Tokenize and run the model; logits have shape (batch, seq_len, num_labels)
tokens = tokenizer("Barack Obama nació en Hawái.", return_tensors="pt")
output = model(**tokens)

# Pick the highest-scoring label for each token
predicted_ids = output.logits.argmax(dim=-1)[0]
labels = [model.config.id2label[i.item()] for i in predicted_ids]
```
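
Alternatively, a `transformers` `pipeline` handles sub-word aggregation for you. This is a minimal sketch using the standard pipeline API; the exact output fields depend on your `transformers` version:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="developer-lunark/kaidol-ner-multilingual",
    aggregation_strategy="simple",  # merge sub-word pieces into entity spans
)

print(ner("Barack Obama nació en Hawái."))
# Expected shape of the result (values illustrative):
# [{'entity_group': 'PER', 'word': 'Barack Obama', ...},
#  {'entity_group': 'LOC', 'word': 'Hawái', ...}]
```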

## 🧾 Label Mapping

```python
{
  'O': 0,
  'B-PER': 1,
  'I-PER': 2,
  'B-ORG': 3,
  'I-ORG': 4,
  'B-LOC': 5,
  'I-LOC': 6
}
```
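
For decoding raw logits yourself, the inverse mapping can be derived directly (a trivial sketch; the fine-tuned checkpoint also exposes it as `model.config.id2label`):

```python
label2id = {
    "O": 0, "B-PER": 1, "I-PER": 2,
    "B-ORG": 3, "I-ORG": 4, "B-LOC": 5, "I-LOC": 6,
}
id2label = {v: k for k, v in label2id.items()}  # {0: 'O', 1: 'B-PER', ...}
```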

## 🔐 License

MIT License

## 📬 Contact

Developed by the KAIdol Project Team.

For questions or collaborations, contact: `developer-lunark`