---
language:
- ko
- en
- es
- pt
tags:
- token-classification
- named-entity-recognition
- multilingual
- transformers
license: mit
pipeline_tag: token-classification
datasets:
- wikiann
model-index:
- name: kaidol-ner-multilingual
  results:
  - task:
      name: Named Entity Recognition
      type: token-classification
    dataset:
      name: WikiAnn (en, ko, es, pt)
      type: wikiann
    metrics:
    - name: F1
      type: f1
      value: 0.74
base_model:
- Davlan/xlm-roberta-base-ner-hrl
---

# KAIdol NER Multilingual Model

This is a multilingual NER (Named Entity Recognition) model developed as part of the **KAIdol Project**.
It is based on [`Davlan/xlm-roberta-base-ner-hrl`](https://huggingface.co/Davlan/xlm-roberta-base-ner-hrl), fine-tuned on the [WikiAnn](https://huggingface.co/datasets/wikiann) dataset for **Korean (ko)**, **English (en)**, **Spanish (es)**, and **Portuguese (pt)**.

## Model Details

- **Base model**: `Davlan/xlm-roberta-base-ner-hrl`
- **NER tags**:
  - `PER`: Person
  - `ORG`: Organization
  - `LOC`: Location
- **Tokenizer**: `AutoTokenizer` from the base model
- **Max sequence length**: 128 tokens

## Training Configuration

| Parameter     | Value                            |
|---------------|----------------------------------|
| Epochs        | 5                                |
| Batch size    | 16                               |
| Optimizer     | AdamW                            |
| Learning rate | 5e-5                             |
| Loss          | Cross-entropy with class weights |
| Dataset       | WikiAnn (en, ko, es, pt)         |
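
The class-weighted loss in the table can be illustrated with a small, dependency-free sketch. It mirrors the weight semantics of `torch.nn.CrossEntropyLoss(weight=...)` (each example's loss is scaled by the weight of its true class, and the mean is normalized by the sum of the applied weights); the logits and weights below are made up for illustration, not taken from this model:

```python
import math

def weighted_cross_entropy(logits, targets, weights):
    """Mean cross-entropy where each example's loss is scaled by the
    weight of its true class, normalized by the sum of applied weights
    (the convention torch.nn.CrossEntropyLoss(weight=...) follows)."""
    losses, applied = [], []
    for row, y in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(weights[y] * (log_z - row[y]))
        applied.append(weights[y])
    return sum(losses) / sum(applied)

# Hypothetical logits for two tokens over three classes,
# with the rarer third class up-weighted.
logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 2.5]]
targets = [0, 2]
loss = weighted_cross_entropy(logits, targets, weights=[1.0, 1.0, 3.0])
```

Note that scaling all weights by the same constant leaves the loss unchanged, since the normalizer scales with them.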
## Performance Summary

| Language   | F1-macro | PER F1 | ORG F1 | LOC F1 |
|------------|----------|--------|--------|--------|
| English    | 0.74     | 0.84   | 0.63   | 0.76   |
| Korean     | 0.43     | 0.46   | 0.30   | 0.52   |
| Spanish    | TBD      | TBD    | TBD    | TBD    |
| Portuguese | TBD      | TBD    | TBD    | TBD    |

> Performance on `es` and `pt` will be updated after evaluation. Korean performance is limited by tokenization issues in WikiAnn.

## Usage Example

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("developer-lunark/kaidol-ner-multilingual")
tokenizer = AutoTokenizer.from_pretrained("developer-lunark/kaidol-ner-multilingual")

inputs = tokenizer("Barack Obama nació en Hawái.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map each token's highest-scoring logit to its label name
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
labels = [model.config.id2label[i] for i in predictions]
```

## Label Mapping

```python
{
    'O': 0,
    'B-PER': 1,
    'I-PER': 2,
    'B-ORG': 3,
    'I-ORG': 4,
    'B-LOC': 5,
    'I-LOC': 6
}
```
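
As a rough illustration (not part of the model card's code), per-token BIO labels like those above can be grouped into entity spans with a few lines of Python; `bio_to_spans` is a hypothetical helper name:

```python
def bio_to_spans(tokens, labels):
    """Group (token, BIO-label) pairs into (entity_type, text) spans.
    A span starts at B-X, continues over matching I-X labels, and is
    flushed on O, a new B-, or a non-matching I- label."""
    spans, current, ctype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == ctype:
            current.append(tok)
        else:
            if current:
                spans.append((ctype, " ".join(current)))
            current, ctype = [], None
    if current:
        spans.append((ctype, " ".join(current)))
    return spans

spans = bio_to_spans(
    ["Barack", "Obama", "was", "born", "in", "Hawaii", "."],
    ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"],
)
# → [("PER", "Barack Obama"), ("LOC", "Hawaii")]
```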
## License

MIT License

## Contact

Developed by the KAIdol Project Team.

For questions or collaborations, contact: `developer-lunark`