---
language: et
tags:
- estonian
- token-classification
- quantifier-extraction
- roberta
- transformers
license: mit
datasets:
- custom
metrics:
- precision
- recall
- f1
- accuracy
---

# Est-RoBERTa for Quantifier Extraction (Estonian)

This model is a fine-tuned version of [`EMBEDDIA/est-roberta`](https://huggingface.co/EMBEDDIA/est-roberta) on a custom dataset for extracting **quantifier constructions** (e.g., "kari koeri", "hunnik raamatuid") from Estonian text.

It performs **token classification** using the BIO labeling scheme with the following labels:

- `O`: Outside a quantifier expression
- `B-QUANT`: Beginning of a quantifier expression
- `I-QUANT`: Inside a quantifier expression

## 📊 Training and Evaluation

- Epochs: 12
- Batch size: 8
- Test set: 159 positive cases, 1000 negative cases
- Precision: 87.05%
- Recall: 94.53%
- F1-score: 90.64%
- Accuracy: 99.88%

## 🏛️ Funding

This work was supported by the Estonian Research Council grant (PRG 1978).

## 🔍 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model = AutoModelForTokenClassification.from_pretrained("ahtokiil/est-roberta-quant-extraction_EKI")
tokenizer = AutoTokenizer.from_pretrained("ahtokiil/est-roberta-quant-extraction_EKI")

sentence = "Arsti juures tuli tükk aega oodata."
inputs = tokenizer(sentence, return_tensors="pt")

# Inference only: no gradients needed
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.argmax(outputs.logits, dim=2)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

print(list(zip(tokens, labels)))
```
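The example above prints one BIO label per token; to recover full quantifier expressions, the `B-QUANT`/`I-QUANT` runs must be merged into spans. Below is a minimal post-processing sketch in plain Python (no model required). The `merge_bio_spans` helper and the hard-coded example labels are illustrative, not part of the model's API.

```python
def merge_bio_spans(tokens, labels):
    """Group BIO-labeled tokens into (start, end, token_list) spans.

    `tokens` and `labels` are parallel lists, as produced by the
    token-classification example above. Each returned span covers one
    B-QUANT token and any I-QUANT tokens that follow it.
    """
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B-QUANT":
            if start is not None:  # close a span that ran up to here
                spans.append((start, i, tokens[start:i]))
            start = i
        elif label == "I-QUANT" and start is not None:
            continue  # still inside the current span
        else:
            if start is not None:
                spans.append((start, i, tokens[start:i]))
                start = None
    if start is not None:  # span reaching the end of the sentence
        spans.append((start, len(tokens), tokens[start:]))
    return spans

# Hypothetical model output for the example sentence:
tokens = ["Arsti", "juures", "tuli", "tükk", "aega", "oodata", "."]
labels = ["O", "O", "O", "B-QUANT", "I-QUANT", "O", "O"]
print(merge_bio_spans(tokens, labels))  # → [(3, 5, ['tükk', 'aega'])]
```

Note that the model's tokenizer produces subword pieces, so in practice you may also want to merge subwords back into words before applying this step.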