---
language: et
tags:
- estonian
- token-classification
- quantifier-extraction
- roberta
- transformers
license: mit
datasets:
- custom
metrics:
- precision
- recall
- f1
- accuracy
---

# Est-RoBERTa for Quantifier Extraction (Estonian)

This model is [`EMBEDDIA/est-roberta`](https://huggingface.co/EMBEDDIA/est-roberta) fine-tuned on a custom dataset to extract **quantifier constructions** from Estonian text, e.g. "kari koeri" ("a pack of dogs") or "hunnik raamatuid" ("a pile of books").

It performs **token classification** using the BIO labeling scheme, with the following labels:

- `O`: Outside any quantifier expression
- `B-QUANT`: Beginning of a quantifier expression
- `I-QUANT`: Inside a quantifier expression

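As a rough illustration of how the BIO labels combine into expressions, the sketch below groups per-word labels into quantifier spans. The helper name `extract_quant_spans` is hypothetical and the labels in the example are written by hand, not produced by the model. The model itself labels subword tokens, which would first need to be merged into words.

```python
from typing import List, Tuple


def extract_quant_spans(words: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Group BIO labels into (start, end, text) spans; `end` is exclusive.

    Hypothetical helper for illustration; not part of the model or of transformers.
    """
    spans = []
    start = None  # index where the currently open span began, if any
    for i, label in enumerate(labels):
        if label == "B-QUANT":
            # A new expression begins; close the previous one if still open.
            if start is not None:
                spans.append((start, i, " ".join(words[start:i])))
            start = i
        elif label == "I-QUANT":
            # Lenient choice (an assumption): a stray I-QUANT opens a span.
            if start is None:
                start = i
        else:
            # "O" closes any open span.
            if start is not None:
                spans.append((start, i, " ".join(words[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(words), " ".join(words[start:])))
    return spans


# Hand-written labels for illustration only (not actual model output).
words = ["Arsti", "juures", "tuli", "tükk", "aega", "oodata", "."]
labels = ["O", "O", "O", "B-QUANT", "I-QUANT", "O", "O"]
print(extract_quant_spans(words, labels))  # [(3, 5, 'tükk aega')]
```
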
## 📊 Training and Evaluation

- Epochs: 12
- Batch size: 8
- Test set: 159 positive cases, 1000 negative cases
- Precision: 87.05%
- Recall: 94.53%
- F1-score: 90.64%
- Accuracy: 99.88%

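The reported F1-score is consistent with the precision and recall figures: it is their harmonic mean, which the short check below reproduces using only the numbers listed here. Accuracy is computed differently and is likely dominated by the many `O` tokens, which would explain why it is so much higher than F1.

```python
# Values copied from the evaluation figures in this model card.
precision = 0.8705
recall = 0.9453

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # F1 = 90.64%
```
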
## 🏛️ Funding

This work was supported by the Estonian Research Council grant (PRG 1978).

Uurimistööd on finantseerinud Eesti Teadusagentuur (PRG 1978).

## 🔍 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "ahtokiil/est-roberta-quant-extraction_EKI"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# English: "We had to wait quite a while at the doctor's."
sentence = "Arsti juures tuli tükk aega oodata."
inputs = tokenizer(sentence, return_tensors="pt")

# Gradients are not needed for prediction
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each (subword) token
predictions = torch.argmax(outputs.logits, dim=-1)

# Map token ids and label ids back to readable strings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

print(list(zip(tokens, labels)))
```