Commit 9e5a732 by developer-lunark (verified) · 1 parent: 4c1d6b3

Create README.md

  1. README.md +97 -0
README.md ADDED
@@ -0,0 +1,97 @@
---
language: [ko, en, es, pt]
tags:
  - token-classification
  - named-entity-recognition
  - multilingual
license: mit
datasets:
  - wikiann
model-index:
  - name: kaidol-ner-multilingual
    results:
      - task:
          name: Named Entity Recognition
          type: token-classification
        dataset:
          name: WikiAnn (en, ko, es, pt)
          type: wikiann
        metrics:
          - name: F1
            type: f1
            value: 0.74
---

# 🌐 KAIdol NER Multilingual Model

This is a multilingual named entity recognition (NER) model developed as part of the **KAIdol Project**.
It is based on [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl) and fine-tuned on the [WikiAnn](https://huggingface.co/datasets/wikiann) dataset for **Korean (ko)**, **English (en)**, **Spanish (es)**, and **Portuguese (pt)**.

## 🧠 Model Details

- **Base model**: `Davlan/xlm-roberta-base-ner-hrl`
- **NER Tags**:
  - `PER`: Person
  - `ORG`: Organization
  - `LOC`: Location
- **Tokenizer**: `AutoTokenizer` from base model
- **Max length**: 128 tokens

## 📊 Training Configuration

| Parameter     | Value                           |
|---------------|---------------------------------|
| Epochs        | 5                               |
| Batch Size    | 16                              |
| Optimizer     | AdamW                           |
| Learning Rate | 5e-5                            |
| Loss          | CrossEntropy with class weights |
| Dataset       | WikiAnn (en, ko, es, pt)        |

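The configuration lists a class-weighted cross-entropy loss but does not say how the weights were chosen. As a rough sketch only, assuming inverse-frequency weights (an assumption, not something this README states), the loss over a batch of tokens could look like the following. The helper names `inverse_frequency_weights` and `weighted_cross_entropy` and the toy label list are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Assumed scheme: weight_c = total / (num_classes * count_c)."""
    counts = Counter(labels)
    total = len(labels)
    # Classes that never occur get weight 0.0 so they cannot contribute.
    return [total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]

def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of -log softmax(logits)[target], normalised by the sum of weights."""
    loss_sum = 0.0
    weight_sum = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp for one token.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        loss_sum += weights[target] * (log_z - row[target])
        weight_sum += weights[target]
    return loss_sum / weight_sum

# Toy example: class 0 ('O') dominates, so it receives the smallest weight.
toy_labels = [0, 0, 0, 0, 0, 0, 1, 2]
weights = inverse_frequency_weights(toy_labels, num_classes=3)
```

Dividing by the sum of the target weights matches the default mean reduction of `torch.nn.CrossEntropyLoss(weight=...)`, which is one common way to apply such weights in practice.
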
## ✅ Performance Summary

| Language   | F1-macro | PER F1 | ORG F1 | LOC F1 |
|------------|----------|--------|--------|--------|
| English    | 0.74     | 0.84   | 0.63   | 0.76   |
| Korean     | 0.43     | 0.46   | 0.30   | 0.52   |
| Spanish    | TBD      | TBD    | TBD    | TBD    |
| Portuguese | TBD      | TBD    | TBD    | TBD    |

> Scores for `es` and `pt` will be added once evaluation is complete. Korean performance is limited by tokenization issues in WikiAnn.

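The F1-macro column is consistent with an unweighted mean of the three per-entity scores. Assuming that is how it was computed (the averaging method is not stated explicitly here), a short check reproduces the reported values; `macro_f1` is a hypothetical helper name.

```python
def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1.values()) / len(per_class_f1)

# Per-entity F1 scores copied from the performance table.
reported = {
    "English": {"PER": 0.84, "ORG": 0.63, "LOC": 0.76},
    "Korean": {"PER": 0.46, "ORG": 0.30, "LOC": 0.52},
}

macro = {lang: round(macro_f1(scores), 2) for lang, scores in reported.items()}
print(macro)  # {'English': 0.74, 'Korean': 0.43}
```
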
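## 🚀 Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "developer-lunark/kaidol-ner-multilingual"
model = AutoModelForTokenClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Tokenize one sentence, truncating to the 128-token maximum length.
tokens = tokenizer("Barack Obama nació en Hawái.", return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():  # inference only, no gradients needed
    output = model(**tokens)
# output.logits has shape (batch, sequence_length, num_labels); pick the best label id per token.
predicted_ids = output.logits.argmax(dim=-1)[0].tolist()
```
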
## 🧾 Label Mapping

```python
{
    'O': 0,
    'B-PER': 1,
    'I-PER': 2,
    'B-ORG': 3,
    'I-ORG': 4,
    'B-LOC': 5,
    'I-LOC': 6
}
```

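To turn label ids back into entities, the mapping can be inverted and consecutive `B-`/`I-` tags grouped into spans. The sketch below is one possible way to do this, not the project's own post-processing: `ID2LABEL` and `decode_entities` are hypothetical names, and the label ids are hand-written for illustration rather than real model output. It also assumes one label per word; predictions from the model are per subword token and would need to be aligned to words first.

```python
ID2LABEL = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}

def decode_entities(words, label_ids):
    """Group word-level BIO label ids into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []

    def flush():
        # Close the open span, if any, and reset the accumulator.
        nonlocal current_type, current_words
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
        current_type, current_words = None, []

    for word, label_id in zip(words, label_ids):
        label = ID2LABEL[label_id]
        if label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        # A B- tag, or an I- tag that does not continue the open span, starts a new entity.
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_words.append(word)
    flush()
    return entities

words = ["Barack", "Obama", "nació", "en", "Hawái", "."]
label_ids = [1, 2, 0, 0, 5, 0]  # hand-written labels for illustration, not model output
print(decode_entities(words, label_ids))  # [('PER', 'Barack Obama'), ('LOC', 'Hawái')]
```
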
## 🔐 License

MIT License

## 📬 Contact

Developed by the KAIdol Project Team.

For questions or collaborations, contact: `developer-lunark`