---
language: [ko, en, es, pt]
tags:
- token-classification
- named-entity-recognition
- multilingual
license: mit
datasets:
- wikiann
model-index:
- name: kaidol-ner-multilingual
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: WikiAnn (en, ko, es, pt)
      type: wikiann
    metrics:
    - name: F1
      type: f1
      value: 0.74
---

# 🌐 KAIdol NER Multilingual Model

This is a multilingual NER (Named Entity Recognition) model developed as part of the **KAIdol Project**.
It is based on [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl), fine-tuned on the [WikiAnn](https://huggingface.co/datasets/wikiann) dataset for **Korean (ko)**, **English (en)**, **Spanish (es)**, and **Portuguese (pt)**.

## 🧠 Model Details

- **Base model**: `Davlan/xlm-roberta-base-ner-hrl`
- **NER Tags**:
  - `PER`: Person
  - `ORG`: Organization
  - `LOC`: Location
- **Tokenizer**: `AutoTokenizer` from the base model
- **Max length**: 128 tokens

## 📊 Training Configuration

| Parameter     | Value                            |
|---------------|----------------------------------|
| Epochs        | 5                                |
| Batch size    | 16                               |
| Optimizer     | AdamW                            |
| Learning rate | 5e-5                             |
| Loss          | Cross-entropy with class weights |
| Dataset       | WikiAnn (en, ko, es, pt)         |

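The training script is not included in this card; a common way to implement the class-weighted cross-entropy row above is to override `Trainer.compute_loss`, roughly as sketched below (the weight values are illustrative placeholders, not the ones actually used):

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import Trainer

# Hypothetical per-class weights in label-id order
# (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC).
CLASS_WEIGHTS = torch.tensor([0.1, 1.0, 1.0, 1.5, 1.5, 1.0, 1.0])

class WeightedTrainer(Trainer):
    """Trainer whose loss is class-weighted cross-entropy."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = CrossEntropyLoss(
            weight=CLASS_WEIGHTS.to(outputs.logits.device),
            ignore_index=-100,  # skip padding / special-token positions
        )
        # Flatten (batch, seq_len, num_labels) -> (batch * seq_len, num_labels)
        loss = loss_fct(
            outputs.logits.view(-1, model.config.num_labels),
            labels.view(-1),
        )
        return (loss, outputs) if return_outputs else loss
```
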
## ✅ Performance Summary

| Language   | F1-macro | PER F1 | ORG F1 | LOC F1 |
|------------|----------|--------|--------|--------|
| English    | 0.74     | 0.84   | 0.63   | 0.76   |
| Korean     | 0.43     | 0.46   | 0.30   | 0.52   |
| Spanish    | TBD      | TBD    | TBD    | TBD    |
| Portuguese | TBD      | TBD    | TBD    | TBD    |

> Scores for `es` and `pt` will be added once evaluation is complete. Korean performance is limited by tokenization issues in the WikiAnn data.

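The evaluation script is likewise not published here; entity-level F1 of the kind reported above is typically computed with `seqeval` over gold and predicted tag sequences, along these lines (the sequences shown are illustrative placeholders):

```python
from seqeval.metrics import classification_report

# Gold and predicted BIO tag sequences; in practice these come from
# running the model over the WikiAnn test split for each language.
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-LOC"]]

# Entity-level precision/recall/F1 per class, plus macro averages
print(classification_report(y_true, y_pred, digits=2))
```
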
## 🚀 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("developer-lunark/kaidol-ner-multilingual")
tokenizer = AutoTokenizer.from_pretrained("developer-lunark/kaidol-ner-multilingual")

# Tokenize a sample sentence, truncating to the 128-token training length
tokens = tokenizer("Barack Obama nació en Hawái.", truncation=True, max_length=128, return_tensors="pt")
output = model(**tokens)

# Decode the highest-scoring label for each token
predictions = output.logits.argmax(dim=-1)[0]
labels = [model.config.id2label[i.item()] for i in predictions]
```

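For quick experiments, the higher-level `pipeline` API can wrap the same checkpoint and merge subword pieces back into entity spans:

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="developer-lunark/kaidol-ner-multilingual",
    aggregation_strategy="simple",  # group B-/I- pieces into whole entities
)
print(ner("Barack Obama nació en Hawái."))
# e.g. [{'entity_group': 'PER', 'word': 'Barack Obama', ...},
#       {'entity_group': 'LOC', 'word': 'Hawái', ...}]
```
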
## 🧾 Label Mapping

```python
{
    'O': 0,
    'B-PER': 1,
    'I-PER': 2,
    'B-ORG': 3,
    'I-ORG': 4,
    'B-LOC': 5,
    'I-LOC': 6
}
```

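This mapping also ships with the checkpoint itself, so it does not need to be hard-coded; assuming the config's `label2id`/`id2label` fields are populated as above, it can be inspected directly:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("developer-lunark/kaidol-ner-multilingual")
print(config.label2id)  # should match the mapping above
print(config.id2label)  # inverse mapping, used to decode predictions
```
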
## 🔐 License

MIT License

## 📬 Contact

Developed by the KAIdol project team.

For questions or collaborations, contact: `developer-lunark`