---
language:
- ko
- en
- es
- pt
tags:
- token-classification
- named-entity-recognition
- multilingual
- transformers
license: mit
pipeline_tag: token-classification
datasets:
- wikiann
model-index:
- name: kaidol-ner-multilingual
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: WikiAnn (en, ko, es, pt)
      type: wikiann
    metrics:
    - name: F1
      type: f1
      value: 0.74
base_model:
- Davlan/xlm-roberta-base-ner-hrl
---

# 🌍 KAIdol NER Multilingual Model

This is a multilingual NER (Named Entity Recognition) model developed as part of the **KAIdol Project**. It is based on [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl) and fine-tuned on the [WikiAnn](https://huggingface.co/datasets/wikiann) dataset for **Korean (ko)**, **English (en)**, **Spanish (es)**, and **Portuguese (pt)**.

## 🧠 Model Details

- **Base model**: `Davlan/xlm-roberta-base-ner-hrl`
- **NER tags**:
  - `PER`: Person
  - `ORG`: Organization
  - `LOC`: Location
- **Tokenizer**: `AutoTokenizer` from the base model
- **Max length**: 128 tokens

## 📊 Training Configuration

| Parameter     | Value                            |
|---------------|----------------------------------|
| Epochs        | 5                                |
| Batch Size    | 16                               |
| Optimizer     | AdamW                            |
| Learning Rate | 5e-5                             |
| Loss          | Cross-entropy with class weights |
| Dataset       | WikiAnn (en, ko, es, pt)         |

## ✅ Performance Summary

| Language   | F1-macro | PER F1 | ORG F1 | LOC F1 |
|------------|----------|--------|--------|--------|
| English    | 0.74     | 0.84   | 0.63   | 0.76   |
| Korean     | 0.43     | 0.46   | 0.30   | 0.52   |
| Spanish    | TBD      | TBD    | TBD    | TBD    |
| Portuguese | TBD      | TBD    | TBD    | TBD    |

> Scores for `es` and `pt` will be added once evaluation is complete. Korean performance is limited by tokenization issues in the WikiAnn data.

## 🚀 Usage Example

The snippet below loads the model, runs a sample sentence, and maps each token's prediction back to its NER tag. For merged entity spans rather than per-token tags, see the pipeline sketch at the end of this card.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("developer-lunark/kaidol-ner-multilingual")
model = AutoModelForTokenClassification.from_pretrained("developer-lunark/kaidol-ner-multilingual")

# Run a sample sentence and map each token's prediction back to its NER tag
inputs = tokenizer("Barack Obama nació en Hawái.", return_tensors="pt")
predictions = model(**inputs).logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print([(tok, model.config.id2label[p.item()]) for tok, p in zip(tokens, predictions)])
```

## 🧾 Label Mapping

```python
{
    'O': 0,
    'B-PER': 1,
    'I-PER': 2,
    'B-ORG': 3,
    'I-ORG': 4,
    'B-LOC': 5,
    'I-LOC': 6
}
```

## 🔐 License

MIT License

## 📬 Contact

Developed by the KAIdol Project Team. For questions or collaborations, contact: `developer-lunark`
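
## 🔎 Grouped Entities with `pipeline`

For span-level output (full entity mentions with scores and character offsets), the Transformers `pipeline` API can aggregate the per-token predictions shown in the usage example above. This is a minimal sketch, assuming the checkpoint is published on the Hub under `developer-lunark/kaidol-ner-multilingual` as referenced in this card.

```python
from transformers import pipeline

# Token-classification pipeline; "simple" aggregation merges B-/I- pieces into entity spans
ner = pipeline(
    "token-classification",
    model="developer-lunark/kaidol-ner-multilingual",
    aggregation_strategy="simple",
)

# Each result includes the entity group (PER/ORG/LOC), a score, and character offsets
print(ner("Barack Obama nació en Hawái."))
```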