---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---

# BRAGD: Constrained Multi-Label POS Tagging for Faroese

BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. For each token it predicts a **73-dimensional binary feature vector** covering word class, subcategory, gender, number, case, article, proper-noun status, degree, declension, mood, voice, tense, person, and definiteness.

This Hugging Face repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, along with the decoding files `constraint_mask.json` and `tag_mappings.json`. The repository is published as a Transformers/XLM-RoBERTa safetensors model under `Setur/BRAGD`.

## Model Details

- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags

## Performance

In the accompanying paper, the constrained multi-label BRAGD model achieves:

- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on **OOD-BRAGD** out-of-domain evaluation data

These figures come from the experimental setup evaluated in the paper, not from the release model in this repository, which was trained on the combined data.

## Training Data

The model is based on the BRAGD annotation scheme for Faroese.

### Sosialurin-BRAGD

- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**

### OOD-BRAGD

- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data

The release model in this repository was trained on **both** datasets.

## Label Structure

The 73 output dimensions are organized as follows:

- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness
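
As a quick sanity check, these ranges can be written down and verified to tile all 73 dimensions exactly once. The group boundaries below are copied verbatim from the list above; the snippet itself is only a consistency sketch, not part of the release files:

```python
# Feature-group layout of the 73-dimensional output vector (end inclusive).
GROUPS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
    "article": (42, 43),
    "proper_noun": (44, 45),
    "degree": (46, 50),
    "declension": (51, 53),
    "mood": (54, 60),
    "voice": (61, 63),
    "tense": (64, 66),
    "person": (67, 70),
    "definiteness": (71, 72),
}

# Every dimension 0..72 is covered by exactly one group.
covered = [i for start, end in GROUPS.values() for i in range(start, end + 1)]
assert sorted(covered) == list(range(73))
```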

## Using the Model

The model predicts raw **feature vectors**, not formatted BRAGD tags. To obtain the final BRAGD tag and human-readable features:

1. run the model,
2. select the most likely word class,
3. activate only the feature groups that are valid for that word class, using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.
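
These four steps can be illustrated on synthetic values. The `toy_mask` below is made up for illustration only; the real mask and tag table ship as `constraint_mask.json` and `tag_mappings.json`:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=73)              # step 1: model output for one token

pred = np.zeros(73, dtype=int)
wc = int(np.argmax(logits[:15]))          # step 2: most likely word class
pred[wc] = 1

# step 3: a toy mask saying this word class carries gender and number only
toy_mask = {wc: [(30, 33), (34, 36)]}
for start, end in toy_mask[wc]:
    pred[start + int(np.argmax(logits[start:end + 1]))] = 1

# step 4: this binary vector would now be looked up in tag_mappings.json
key = tuple(pred.tolist())
assert sum(key) == 3                      # one word class + two feature groups
```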

### Install requirements

```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```

### Python example

```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification

model_name = "Setur/BRAGD"

tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()

# Download the decoding assets shipped alongside the checkpoint
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")

with open(constraint_mask_path, "r", encoding="utf-8") as f:
    raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}

with open(tag_mappings_path, "r", encoding="utf-8") as f:
    raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}

WORD_CLASS_NAMES = {
    0: "Noun",
    1: "Adjective",
    2: "Pronoun",
    3: "Number",
    4: "Verb",
    5: "Participle",
    6: "Adverb",
    7: "Conjunction",
    8: "Foreign",
    9: "Unanalyzed",
    10: "Abbreviation",
    11: "Web",
    12: "Punctuation",
    13: "Symbol",
    14: "Article",
}

INTERVAL_NAMES = {
    (15, 29): "subcategory",
    (30, 33): "gender",
    (34, 36): "number",
    (37, 41): "case",
    (42, 43): "article",
    (44, 45): "proper_noun",
    (46, 50): "degree",
    (51, 53): "declension",
    (54, 60): "mood",
    (61, 63): "voice",
    (64, 66): "tense",
    (67, 70): "person",
    (71, 72): "definiteness",
}

FEATURE_COLUMNS = [
    # word class (0-14)
    "S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",
    # subcategory (15-29)
    "D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",
    # gender (30-33)
    "M", "F", "N", "g",
    # number (34-36)
    "S", "P", "n",
    # case (37-41)
    "N", "A", "D", "G", "c",
    # article (42-43)
    "A", "a",
    # proper noun (44-45)
    "P", "r",
    # degree (46-50)
    "P", "C", "S", "A", "d",
    # declension (51-53)
    "S", "W", "e",
    # mood (54-60)
    "I", "M", "N", "S", "P", "E", "U",
    # voice (61-63)
    "A", "M", "v",
    # tense (64-66)
    "P", "A", "t",
    # person (67-70)
    "1", "2", "3", "p",
    # definiteness (71-72)
    "D", "I",
]

def decode_token(logits):
    pred = np.zeros(logits.shape[0], dtype=int)

    # predict the word class (dimensions 0-14)
    wc = int(np.argmax(logits[:15]))
    pred[wc] = 1

    # predict only the feature groups that are valid for this word class
    for start, end in constraint_mask.get(wc, []):
        group = logits[start:end + 1]
        pred[start + int(np.argmax(group))] = 1

    tag = features_to_tag.get(tuple(pred.tolist()))

    features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
    for (start, end), name in INTERVAL_NAMES.items():
        group = pred[start:end + 1]
        active = np.where(group == 1)[0]
        if len(active) == 1:
            features[name] = FEATURE_COLUMNS[start + int(active[0])]

    return tag, features

text = "Hetta er eitt føroyskt dømi"
words = text.split()

enc = tokenizer(
    [words],
    is_split_into_words=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    logits = model(**enc).logits[0]

word_ids = enc.word_ids(batch_index=0)
seen = set()

# tag each word using the logits of its first subword token
for i, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)

    tag, features = decode_token(logits[i].cpu().numpy())
    print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```

### Example output

```text
Hetta           PDNpSN     {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
er              VNAPS3     {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
eitt            RNSNI      {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
føroyskt        APSNSN     {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
dømi            SNSNar     {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
```
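
The first-subword selection in the decoding loop above can be factored into a small reusable helper. This is a hypothetical refactoring sketch, not part of the repository; `first_subword_indices` takes the list returned by `enc.word_ids(...)` (one entry per subword token, `None` for special tokens):

```python
def first_subword_indices(word_ids):
    """Return (word_index, token_index) pairs, one per word,
    keeping only the first subword token of each word."""
    seen = set()
    pairs = []
    for i, wid in enumerate(word_ids):
        if wid is None or wid in seen:
            continue
        seen.add(wid)
        pairs.append((wid, i))
    return pairs

# e.g. special token, two subwords of word 0, word 1, special token:
print(first_subword_indices([None, 0, 0, 1, None]))
# [(0, 1), (1, 3)]
```

Each returned `token_index` can then be passed straight to `decode_token(logits[token_index].cpu().numpy())`.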

## Files in this Repository

This model repository contains the model and decoding files, including:

- `model.safetensors`
- `config.json`
- tokenizer files
- `constraint_mask.json`
- `tag_mappings.json`

## Further Resources

For the full training code, data preparation, and paper-related experiments, see the GitHub repository:

`https://github.com/Maltoknidepilin/BRAGD.git`

## Citation

```bibtex
@inproceedings{simonsen2026bragd,
  title={{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
  author={Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
  booktitle={Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
  year={2026}
}
```

## Authors

Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson

## License

This repository is marked as **CC BY 4.0** on Hugging Face.