---
language:
  - lv
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
  - token-classification
  - ner
  - latvian
  - deberta
base_model: AiLab-IMCS-UL/lv-deberta-base
metrics:
  - precision
  - recall
  - f1
---

# Latvian named entity recognition (NER)

## Dataset
Trained on the [FullStack dataset](https://github.com/LUMII-AILab/FullStack).

## Results
Results on the test split:

| Label         | Precision | Recall | F1 Score |
|---------------|----------:|-------:|---------:|
| **Micro Avg** | 87.2 | 87.9 | 87.6 |
| **Macro Avg** | 76.6 | 73.1 | 73.8 |
| GPE           | 93.2 | 93.2 | 93.2 |
| entity        | 50.0 | 55.2 | 52.5 |
| event         | 72.0 | 81.8 | 76.6 |
| location      | 81.5 | 78.6 | 80.0 |
| money         | 60.0 | 25.0 | 35.3 |
| organization  | 87.2 | 89.2 | 88.2 |
| person        | 96.5 | 98.4 | 97.4 |
| product       | 75.0 | 58.1 | 65.5 |
| time          | 73.8 | 78.3 | 75.9 |

## Usage
```python
import re

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer


class NER:
    def __init__(self, model_name='AiLab-IMCS-UL/lv-ner-v1', max_length=1024):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name).eval()
        self.id2label = self.model.config.id2label
        self.max_length = max_length

    def predict(self, text):
        pretokenized = list(re.finditer(r'\w+|\S', text))
        if not pretokenized:
            return []

        enc = self.tokenizer([m.group(0) for m in pretokenized], is_split_into_words=True, return_tensors='pt', truncation=True, max_length=self.max_length)
        word_ids = enc.word_ids(0)

        with torch.no_grad():
            preds = self.model(**enc).logits.argmax(-1)[0].tolist()

        offsets = [(m.start(), m.end()) for m in pretokenized]
        ents, cur, prev = [], None, None

        for pred, wid in zip(preds, word_ids):
            if wid is None or wid == prev:
                prev = wid
                continue
            prev = wid

            start, end = offsets[wid]
            raw_label = self.id2label[pred]
            if raw_label == 'O':
                if cur:
                    ents.append(cur)
                    cur = None
                continue

            prefix, label = raw_label.split('-', 1) if '-' in raw_label else ('B', raw_label)
            if prefix == 'B' or not cur or cur['label'] != label:
                if cur:
                    ents.append(cur)
                cur = {'start': start, 'end': end, 'label': label}
            else:
                cur['end'] = end

        if cur:
            ents.append(cur)

        for ent in ents:
            ent['text'] = text[ent['start']:ent['end']]

        return ents

m = NER()
print(m.predict('Jānis Bērziņš strādā Latvijas uzņēmumā SIA Mia.'))
```