---
language:
- lv
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- token-classification
- ner
- latvian
- deberta
base_model: AiLab-IMCS-UL/lv-deberta-base
metrics:
- precision
- recall
- f1
---

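# Latvian named entity recognition (NER)

A token-classification model for named entity recognition in Latvian, based on `AiLab-IMCS-UL/lv-deberta-base`.

## Dataset
The model was trained on the [FullStack dataset](https://github.com/LUMII-AILab/FullStack).

## Results
Precision, recall, and F1 on the test split, in percent: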

| Label         | Precision | Recall | F1 Score |
|---------------|----------:|-------:|---------:|
| Micro Avg     |      87.2 |   87.9 |     87.6 |
| Macro Avg     |      76.6 |   73.1 |     73.8 |
| GPE           |      93.2 |   93.2 |     93.2 |
| entity        |      50.0 |   55.2 |     52.5 |
| event         |      72.0 |   81.8 |     76.6 |
| location      |      81.5 |   78.6 |     80.0 |
| money         |      60.0 |   25.0 |     35.3 |
| organization  |      87.2 |   89.2 |     88.2 |
| person        |      96.5 |   98.4 |     97.4 |
| product       |      75.0 |   58.1 |     65.5 |
| time          |      73.8 |   78.3 |     75.9 |
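
The macro averages in the table are the unweighted means of the nine per-label rows, so labels with weak scores (such as `money` and `entity`) count as much as the strong ones. As a quick consistency check, the sketch below copies the per-label values from the table and recomputes the macro row. The `scores` dictionary and the `macro_average` helper are illustrative names, not part of the model's code.

```python
# Per-label (precision, recall, F1) values copied from the results table.
scores = {
    "GPE": (93.2, 93.2, 93.2),
    "entity": (50.0, 55.2, 52.5),
    "event": (72.0, 81.8, 76.6),
    "location": (81.5, 78.6, 80.0),
    "money": (60.0, 25.0, 35.3),
    "organization": (87.2, 89.2, 88.2),
    "person": (96.5, 98.4, 97.4),
    "product": (75.0, 58.1, 65.5),
    "time": (73.8, 78.3, 75.9),
}


def macro_average(rows):
    """Return the unweighted mean of each column, rounded to one decimal."""
    n = len(rows)
    return tuple(round(sum(col) / n, 1) for col in zip(*rows))


macro_p, macro_r, macro_f1 = macro_average(list(scores.values()))
print(macro_p, macro_r, macro_f1)  # 76.6 73.1 73.8
```

The micro averages cannot be recomputed this way, because they depend on the per-label entity counts, which the table does not list.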

## Usage
```python
import re

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


class NER:
    def __init__(self, model_name='AiLab-IMCS-UL/lv-ner-v1', max_length=1024):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name).eval()
        self.id2label = self.model.config.id2label
        self.max_length = max_length

    def predict(self, text):
        # Pre-tokenize into words and single non-space symbols, keeping character offsets.
        pretokenized = list(re.finditer(r'\w+|\S', text))
        if not pretokenized:
            return []

        # Input longer than max_length is truncated; words past the limit get no label.
        enc = self.tokenizer(
            [m.group(0) for m in pretokenized],
            is_split_into_words=True,
            return_tensors='pt',
            truncation=True,
            max_length=self.max_length,
        )
        word_ids = enc.word_ids(0)

        with torch.no_grad():
            preds = self.model(**enc).logits.argmax(-1)[0].tolist()

        offsets = [(m.start(), m.end()) for m in pretokenized]
        ents, cur, prev = [], None, None

        for pred, wid in zip(preds, word_ids):
            # Skip special tokens and use only the first subword of each word.
            if wid is None or wid == prev:
                prev = wid
                continue
            prev = wid

            start, end = offsets[wid]
            raw_label = self.id2label[pred]
            if raw_label == 'O':
                # An outside tag closes the current entity, if any.
                if cur:
                    ents.append(cur)
                cur = None
                continue

            # Split 'B-person' into ('B', 'person'); treat a bare label as a new entity.
            prefix, label = raw_label.split('-', 1) if '-' in raw_label else ('B', raw_label)
            if prefix == 'B' or not cur or cur['label'] != label:
                # Start a new entity.
                if cur:
                    ents.append(cur)
                cur = {'start': start, 'end': end, 'label': label}
            else:
                # Continue the current entity.
                cur['end'] = end

        if cur:
            ents.append(cur)

        # Attach the surface text of each entity using its character offsets.
        for ent in ents:
            ent['text'] = text[ent['start']:ent['end']]

        return ents


m = NER()
print(m.predict('Jānis Bērziņš strādā Latvijas uzņēmumā SIA Mia.'))
```
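
To follow what `predict` does without downloading the model, the sketch below runs the same pre-tokenization regex and span-merging logic on hand-written word-level tags. The tags in `tags` are made up for illustration and are not actual model output. `merge_spans` is a hypothetical helper name, and the `B-`/`I-` prefix scheme is an assumption based on how the usage code parses labels.

```python
import re


def merge_spans(text, tags):
    """Merge word-level B-/I-/O tags into character-offset entity spans."""
    # Same pre-tokenization as in the usage code: words or single symbols.
    words = list(re.finditer(r"\w+|\S", text))
    ents, cur = [], None
    for m, tag in zip(words, tags):
        if tag == "O":
            # An outside tag closes the current entity, if any.
            if cur:
                ents.append(cur)
            cur = None
            continue
        prefix, label = tag.split("-", 1) if "-" in tag else ("B", tag)
        if prefix == "B" or not cur or cur["label"] != label:
            # Start a new entity.
            if cur:
                ents.append(cur)
            cur = {"start": m.start(), "end": m.end(), "label": label}
        else:
            # Extend the current entity to the end of this word.
            cur["end"] = m.end()
    if cur:
        ents.append(cur)
    for ent in ents:
        ent["text"] = text[ent["start"]:ent["end"]]
    return ents


text = "Jānis Bērziņš strādā Latvijas uzņēmumā SIA Mia."
# Hand-written tags (an assumption, not model output): seven words plus the final period.
tags = ["B-person", "I-person", "O", "B-GPE", "O",
        "B-organization", "I-organization", "O"]
spans = merge_spans(text, tags)
for ent in spans:
    print(ent["label"], repr(ent["text"]))
```

Each returned dictionary has the same keys as the output of `predict`: `start`, `end`, `label`, and `text`.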