---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---

# BRAGD: Constrained Multi-Label POS Tagging for Faroese

BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. It predicts a **73-dimensional binary feature vector** for each token, covering word class, subcategory, gender, number, case, article, proper-noun status, degree, declension, mood, voice, tense, person, and definiteness.

This Hugging Face repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, along with the decoding files `constraint_mask.json` and `tag_mappings.json`. The repository is published as a Transformers/XLM-RoBERTa safetensors model under `Setur/BRAGD`.

## Model Details

- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags

## Performance

In the accompanying paper, the constrained multi-label BRAGD model achieves:

- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on **OOD-BRAGD** out-of-domain data

These numbers describe the evaluated research setup reported in the paper, not this release model, which was trained on the combined data.

## Training Data

The model is based on the BRAGD annotation scheme for Faroese.

### Sosialurin-BRAGD

- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**

### OOD-BRAGD

- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data

The release model in this repository was trained on **both** datasets.
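To make the output format concrete, here is a minimal sketch of how a token's 73-dimensional binary vector is laid out as one-hot feature groups. It is illustrative only: the group boundaries are taken from the Label Structure section of this card (only the first few groups are shown), and the specific feature indices chosen in the example are hypothetical, not drawn from the actual tag inventory.

```python
import numpy as np

# Group boundaries (start, end inclusive) from the Label Structure section;
# only the first few of the fourteen groups are listed here.
GROUPS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
}

def encode(active):
    """Build a 73-dim binary vector from {group_name: index within group}."""
    vec = np.zeros(73, dtype=int)
    for group, idx in active.items():
        start, end = GROUPS[group]
        assert start + idx <= end, f"index {idx} out of range for {group}"
        vec[start + idx] = 1  # one-hot within the group's interval
    return vec

# Hypothetical token: word class 0, gender slot 3, first number and case slots.
vec = encode({"word_class": 0, "gender": 3, "number": 0, "case": 0})
print(vec.sum())  # one active dimension per encoded group → 4
```

The model's decoder works in the opposite direction: it argmaxes within each valid group of the 73 logits and maps the resulting binary vector back to a BRAGD tag, as shown in the full example below.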
## Label Structure

The 73 output dimensions are organized as follows:

- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness

## Using the Model

This model predicts **feature vectors**, not directly formatted BRAGD tags. To get the final BRAGD tag and readable features, you should:

1. run the model,
2. select the most likely word class,
3. activate only the valid feature groups for that word class using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.

### Install requirements

```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```

### Python example

```python
import json

import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaForTokenClassification, XLMRobertaTokenizerFast

model_name = "Setur/BRAGD"
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()

# Download decoding assets
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")

with open(constraint_mask_path, "r", encoding="utf-8") as f:
    raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}

with open(tag_mappings_path, "r", encoding="utf-8") as f:
    raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}

WORD_CLASS_NAMES = {
    0: "Noun",
    1: "Adjective",
    2: "Pronoun",
    3: "Number",
    4: "Verb",
    5: "Participle",
    6: "Adverb",
    7: "Conjunction",
    8: "Foreign",
    9: "Unanalyzed",
    10: "Abbreviation",
    11: "Web",
    12: "Punctuation",
    13: "Symbol",
    14: "Article",
}
INTERVAL_NAMES = {
    (15, 29): "subcategory",
    (30, 33): "gender",
    (34, 36): "number",
    (37, 41): "case",
    (42, 43): "article",
    (44, 45): "proper_noun",
    (46, 50): "degree",
    (51, 53): "declension",
    (54, 60): "mood",
    (61, 63): "voice",
    (64, 66): "tense",
    (67, 70): "person",
    (71, 72): "definiteness",
}

# One column letter per output dimension, grouped by feature interval.
FEATURE_COLUMNS = [
    "S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",  # 0-14 word class
    "D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",  # 15-29 subcategory
    "M", "F", "N", "g",                                                         # 30-33 gender
    "S", "P", "n",                                                              # 34-36 number
    "N", "A", "D", "G", "c",                                                    # 37-41 case
    "A", "a",                                                                   # 42-43 article
    "P", "r",                                                                   # 44-45 proper noun
    "P", "C", "S", "A", "d",                                                    # 46-50 degree
    "S", "W", "e",                                                              # 51-53 declension
    "I", "M", "N", "S", "P", "E", "U",                                          # 54-60 mood
    "A", "M", "v",                                                              # 61-63 voice
    "P", "A", "t",                                                              # 64-66 tense
    "1", "2", "3", "p",                                                         # 67-70 person
    "D", "I",                                                                   # 71-72 definiteness
]

def decode_token(logits):
    """Decode one token's 73 logits into (BRAGD tag, readable features)."""
    pred = np.zeros(logits.shape[0], dtype=int)

    # Predict the word class (dimensions 0-14).
    wc = int(np.argmax(logits[:15]))
    pred[wc] = 1

    # Activate only the feature groups that are valid for this word class.
    for start, end in constraint_mask.get(wc, []):
        group = logits[start:end + 1]
        pred[start + int(np.argmax(group))] = 1

    tag = features_to_tag.get(tuple(pred.tolist()))
    features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
    for (start, end), name in INTERVAL_NAMES.items():
        group = pred[start:end + 1]
        active = np.where(group == 1)[0]
        if len(active) == 1:
            features[name] = FEATURE_COLUMNS[start + active[0]]
    return tag, features

text = "Hetta er eitt føroyskt dømi"
words = text.split()
enc = tokenizer(
    [words],
    is_split_into_words=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    logits = model(**enc).logits[0]

# Decode each word using the logits of its first sub-token.
word_ids = enc.word_ids(batch_index=0)
seen = set()
for i, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    tag, features = decode_token(logits[i].cpu().numpy())
    print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```

### Example output

```text
Hetta           PDNpSN     {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
er              VNAPS3     {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
eitt            RNSNI      {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
føroyskt        APSNSN     {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
dømi            SNSNar     {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
```

## Files in this Repository

This model repository contains model and decoding files, including:

- `model.safetensors`
- `config.json`
- tokenizer files
- `constraint_mask.json`
- `tag_mappings.json`

## Further Resources

For full training code, data preparation, and paper-related experiments, see the GitHub repository: `https://github.com/Maltoknidepilin/BRAGD.git`

## Citation

```bibtex
@inproceedings{simonsen2026bragd,
  title     = {{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
  author    = {Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
  booktitle = {Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
  year      = {2026}
}
```

## Authors

Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson

## License

This repository is marked as **CC BY 4.0** on Hugging Face.