---
language: et
tags:
- estonian
- token-classification
- quantifier-extraction
- roberta
- transformers
license: mit
datasets:
- custom
metrics:
- precision
- recall
- f1
- accuracy
---
# Est-RoBERTa for Quantifier Extraction (Estonian)
This model is a fine-tuned version of [`EMBEDDIA/est-roberta`](https://huggingface.co/EMBEDDIA/est-roberta) on a custom dataset for extracting **quantifier constructions** (e.g., "kari koeri", "hunnik raamatuid") in Estonian text.
It performs **token classification** using the BIO labeling scheme with the following labels:
- `O`: Outside
- `B-QUANT`: Beginning of a quantifier expression
- `I-QUANT`: Inside a quantifier expression
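
To turn a BIO label sequence into quantifier phrases, consecutive `B-QUANT`/`I-QUANT` tags need to be grouped into spans. The following is a minimal sketch of that grouping, not part of the released model: the helper `bio_to_spans` is a hypothetical name, and the labels in the example are hand-written for illustration rather than actual model output.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, one per QUANT span."""
    spans = []
    start = None  # index where the currently open span began, if any
    for i, label in enumerate(labels):
        if label == "B-QUANT":
            # A new B- tag closes any open span and opens a new one.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I-QUANT":
            # Lenient choice (an assumption): a stray I- tag opens a span.
            if start is None:
                start = i
        else:
            # "O" (or any other label) closes the open span.
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Hand-labelled illustration, NOT model output.
words = ["Arsti", "juures", "tuli", "tükk", "aega", "oodata", "."]
labels = ["O", "O", "O", "B-QUANT", "I-QUANT", "O", "O"]

spans = bio_to_spans(labels)
phrases = [" ".join(words[s:e]) for s, e in spans]
print(phrases)  # ['tükk aega']
```

The model itself labels subword tokens, so its predictions would first have to be merged to word level before a grouping like this yields readable phrases.
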
## 📊 Training and Evaluation
- Epochs: 12
- Batch size: 8
- Test set: 159 positive cases, 1000 negative cases
- Precision: 87.05%
- Recall: 94.53%
- F1-score: 90.64%
- Accuracy: 99.88%
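
The reported F1-score is consistent with the reported precision and recall. As a quick check, the sketch below recomputes it as their harmonic mean; the function name `f1_score` is chosen here for illustration and is not taken from the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above, expressed as fractions.
precision, recall = 0.8705, 0.9453
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")  # F1 = 90.64%
```
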
## 🏛️ Funding
This work was supported by Estonian Research Council grant PRG 1978.
## 🔍 Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "ahtokiil/est-roberta-quant-extraction_EKI"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

sentence = "Arsti juures tuli tükk aega oodata."
inputs = tokenizer(sentence, return_tensors="pt")

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each subword token
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]
print(list(zip(tokens, labels)))
```