---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---

# BRAGD: Constrained Multi-Label POS Tagging for Faroese

BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. For each token it predicts a **73-dimensional binary feature vector** covering word class, subcategory, gender, number, case, article, proper-noun status, degree, declension, mood, voice, tense, person, and definiteness.

This repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, published as a Transformers safetensors model under `Setur/BRAGD`, along with the decoding files `constraint_mask.json` and `tag_mappings.json`.

## Model Details

- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags

## Performance

In the accompanying paper, the constrained multi-label BRAGD model achieves:

- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on the **OOD-BRAGD** out-of-domain data

Note that these figures were obtained in the paper's experimental setup; the released checkpoint in this repository, which was trained on the combined data, was not evaluated under that protocol.
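A minimal sketch of how such a metric can be computed, assuming "composite tag accuracy" means exact match of the full predicted BRAGD tag string against the gold tag (this is an assumed interpretation, not the paper's evaluation code):

```python
# Hypothetical sketch: composite tag accuracy as exact match over
# full BRAGD tag strings (assumed interpretation of the metric).
def composite_tag_accuracy(pred_tags, gold_tags):
    """Fraction of tokens whose entire composite tag matches gold."""
    assert len(pred_tags) == len(gold_tags) and gold_tags
    return sum(p == g for p, g in zip(pred_tags, gold_tags)) / len(gold_tags)

# Toy example with made-up predictions over four tokens:
gold = ["PDNpSN", "VNAPS3", "RNSNI", "APSNSN"]
pred = ["PDNpSN", "VNAPS3", "RNSNI", "SNSNar"]
print(composite_tag_accuracy(pred, gold))  # → 0.75
```

Because a single wrong morphological feature changes the whole composite tag, this metric is stricter than per-feature accuracy.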

## Training Data

The model is based on the BRAGD annotation scheme for Faroese.

### Sosialurin-BRAGD
- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**

### OOD-BRAGD
- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data

The release model in this repository was trained on **both** datasets.

## Label Structure

The 73 output dimensions are organized as follows:

- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness
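A quick sanity check, written from the table above, confirms that the feature groups are contiguous and tile all 73 dimensions exactly:

```python
# Feature-group layout from the table above: word class occupies
# dimensions 0-14, followed by the thirteen morphological groups.
INTERVALS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
    "article": (42, 43),
    "proper_noun": (44, 45),
    "degree": (46, 50),
    "declension": (51, 53),
    "mood": (54, 60),
    "voice": (61, 63),
    "tense": (64, 66),
    "person": (67, 70),
    "definiteness": (71, 72),
}

# The intervals are in order, contiguous, and cover all 73 dimensions.
covered = [d for start, end in INTERVALS.values() for d in range(start, end + 1)]
assert covered == list(range(73))
print(len(INTERVALS), "groups covering", len(covered), "dimensions")
# → 14 groups covering 73 dimensions
```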

## Using the Model

The model predicts **feature vectors**, not BRAGD tag strings. To obtain the final BRAGD tag and readable features:

1. run the model,
2. select the most likely word class,
3. activate only the feature groups valid for that word class using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.
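Steps 2 and 3 can be illustrated on a toy problem with a hypothetical two-class mask (the real `constraint_mask.json` covers all 15 word classes and 73 dimensions):

```python
import numpy as np

# Toy setup (hypothetical, for illustration only): 2 word classes in
# dims 0-1, one feature group in dims 2-4; class 1 licenses no groups.
toy_mask = {0: [(2, 4)], 1: []}

logits = np.array([0.2, 1.5, 0.9, -0.3, 0.1])
pred = np.zeros(5, dtype=int)

wc = int(np.argmax(logits[:2]))     # step 2: most likely word class
pred[wc] = 1
for start, end in toy_mask[wc]:     # step 3: only licensed feature groups
    pred[start + int(np.argmax(logits[start:end + 1]))] = 1

print(pred.tolist())  # → [0, 1, 0, 0, 0]
```

Because class 1 licenses no feature groups, the high logit in the feature group (0.9 at dimension 2) is correctly suppressed; this is what keeps predicted feature combinations linguistically valid.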

### Install requirements

```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```

### Python example

```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification

model_name = "Setur/BRAGD"

tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()

# Download decoding assets
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")

with open(constraint_mask_path, "r", encoding="utf-8") as f:
    raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}

with open(tag_mappings_path, "r", encoding="utf-8") as f:
    raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}

WORD_CLASS_NAMES = {
    0: "Noun",
    1: "Adjective",
    2: "Pronoun",
    3: "Number",
    4: "Verb",
    5: "Participle",
    6: "Adverb",
    7: "Conjunction",
    8: "Foreign",
    9: "Unanalyzed",
    10: "Abbreviation",
    11: "Web",
    12: "Punctuation",
    13: "Symbol",
    14: "Article",
}

INTERVAL_NAMES = {
    (15, 29): "subcategory",
    (30, 33): "gender",
    (34, 36): "number",
    (37, 41): "case",
    (42, 43): "article",
    (44, 45): "proper_noun",
    (46, 50): "degree",
    (51, 53): "declension",
    (54, 60): "mood",
    (61, 63): "voice",
    (64, 66): "tense",
    (67, 70): "person",
    (71, 72): "definiteness",
}

FEATURE_COLUMNS = [
    # word class (0-14)
    "S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",
    # subcategory (15-29)
    "D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",
    # gender (30-33)
    "M", "F", "N", "g",
    # number (34-36)
    "S", "P", "n",
    # case (37-41)
    "N", "A", "D", "G", "c",
    # article (42-43)
    "A", "a",
    # proper noun (44-45)
    "P", "r",
    # degree (46-50)
    "P", "C", "S", "A", "d",
    # declension (51-53)
    "S", "W", "e",
    # mood (54-60)
    "I", "M", "N", "S", "P", "E", "U",
    # voice (61-63)
    "A", "M", "v",
    # tense (64-66)
    "P", "A", "t",
    # person (67-70)
    "1", "2", "3", "p",
    # definiteness (71-72)
    "D", "I",
]

def decode_token(logits):
    pred = np.zeros(logits.shape[0], dtype=int)

    # predict word class
    wc = int(np.argmax(logits[:15]))
    pred[wc] = 1

    # predict only valid feature groups for this word class
    for start, end in constraint_mask.get(wc, []):
        group = logits[start:end + 1]
        pred[start + int(np.argmax(group))] = 1

    tag = features_to_tag.get(tuple(pred.tolist()), None)

    features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
    for (start, end), name in INTERVAL_NAMES.items():
        group = pred[start:end + 1]
        active = np.where(group == 1)[0]
        if len(active) == 1:
            features[name] = FEATURE_COLUMNS[start + active[0]]

    return tag, features

text = "Hetta er eitt føroyskt dømi"
words = text.split()

enc = tokenizer(
    [words],
    is_split_into_words=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    logits = model(**enc).logits[0]

word_ids = enc.word_ids(batch_index=0)
seen = set()

# Decode only the first subword of each word
for i, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)

    tag, features = decode_token(logits[i].cpu().numpy())
    print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```
218
+
219
+ ### Example output
220
+
221
+ ```text
222
+ Hetta PDNpSN {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
223
+ er VNAPS3 {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
224
+ eitt RNSNI {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
225
+ føroyskt APSNSN {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
226
+ dømi SNSNar {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
227
+ ```
228
+
229
+ ## Files in this Repository
230
+
231
+ This model repository contains model and decoding files, including:
232
+
233
+ - `model.safetensors`
234
+ - `config.json`
235
+ - tokenizer files
236
+ - `constraint_mask.json`
237
+ - `tag_mappings.json` :contentReference[oaicite:2]{index=2}

## Further Resources

For full training code, data preparation, and paper-related experiments, see the GitHub repository:

`https://github.com/Maltoknidepilin/BRAGD.git`

## Citation

```bibtex
@inproceedings{simonsen2026bragd,
  title={{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
  author={Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
  booktitle={Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
  year={2026}
}
```

## Authors

Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson

## License

This model and repository are released under **CC BY 4.0**.