Est-RoBERTa for Quantifier Extraction (Estonian)

This model is a fine-tuned version of EMBEDDIA/est-roberta, trained on a custom dataset to extract quantifier constructions from Estonian text (e.g., "kari koeri" 'a pack of dogs', "hunnik raamatuid" 'a pile of books').

It performs token classification using the BIO labeling scheme with the following labels:

  • O: Outside
  • B-QUANT: Beginning of a quantifier expression
  • I-QUANT: Inside a quantifier expression
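To make the scheme concrete, here is a minimal sketch of how a BIO-labelled sequence could be decoded into quantifier spans. The helper name `bio_to_spans` is hypothetical, and the word-level labels in the example are hand-assigned for illustration; they are assumptions, not output of this model.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO labels into (start, end_exclusive, text) spans.

    Hypothetical helper for illustration. A stray I-QUANT with no open
    span is treated as the start of a new span.
    """
    spans = []
    start = None  # index where the currently open span began
    for i, label in enumerate(labels):
        if label == "B-QUANT" or (label == "I-QUANT" and start is None):
            if start is not None:
                # A new B-QUANT closes the span that was still open
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:
        # Close a span that runs to the end of the sequence
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Hand-labelled example (assumed labels, not model output)
words = ["Arsti", "juures", "tuli", "tükk", "aega", "oodata", "."]
tags = ["O", "O", "O", "B-QUANT", "I-QUANT", "O", "O"]
print(bio_to_spans(words, tags))  # [(3, 5, 'tükk aega')]
```

Treating a stray I-QUANT as a span start is one possible design choice; a stricter decoder could discard such tokens instead.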

📊 Training and Evaluation

Epochs: 12

Batch size: 8

Test set: 159 positive cases, 1000 negative cases

Precision: 87.05%

Recall: 94.53%

F1-score: 90.64%

Accuracy: 99.88%
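As a quick consistency check, the reported F1-score can be recomputed from the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of the two; the card does not state whether the metrics are computed per token or per expression.

```python
# Reported values from this card, written as fractions
precision = 0.8705
recall = 0.9453

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # F1 = 90.64%
```

The result agrees with the reported 90.64%.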

๐Ÿ›๏ธ Funding This work was supported by the Estonian Research Council grant (PRG 1978). Uurimistรถรถd on finantseerinud Eesti Teadusagentuur (PRG 1978).

๐Ÿ” Example Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("ahtokiil/est-roberta-quant-extraction_EKI")
tokenizer = AutoTokenizer.from_pretrained("ahtokiil/est-roberta-quant-extraction_EKI")

sentence = "Arsti juures tuli tükk aega oodata."
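# Tokenize the sentence and run inference without tracking gradients
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Pick the highest-scoring label for each token
predictions = torch.argmax(outputs.logits, dim=-1)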

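# Map token IDs back to subword tokens and label IDs to label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

# Each pair is (subword token, predicted label); special tokens are included
print(list(zip(tokens, labels)))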
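
The printed pairs are at subword level and include special tokens. One common way to get one label per word is to keep the label of each word's first subword. Fast tokenizers in transformers expose the subword-to-word mapping through `inputs.word_ids()`; the sketch below assumes that mapping is available for this tokenizer. The helper `first_subword_labels` is hypothetical, and the stand-in values are hand-written for illustration, not real model output.

```python
from typing import List, Optional


def first_subword_labels(word_ids: List[Optional[int]], labels: List[str]) -> List[str]:
    """Return one label per word, taken from the word's first subword.

    word_ids uses None for special tokens, as BatchEncoding.word_ids() does.
    """
    word_labels = []
    previous = None
    for word_id, label in zip(word_ids, labels):
        if word_id is None:      # special token, skip it
            continue
        if word_id != previous:  # first subword of a new word
            word_labels.append(label)
            previous = word_id
    return word_labels


# Stand-in values (assumed): three words, the second split into two subwords
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = ["O", "B-QUANT", "I-QUANT", "I-QUANT", "O", "O"]
print(first_subword_labels(example_word_ids, example_labels))
# ['B-QUANT', 'I-QUANT', 'O']
```

With the objects from the example above, the call would be `first_subword_labels(inputs.word_ids(0), labels)`, provided the loaded tokenizer is a fast tokenizer.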