---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---
# BRAGD: Constrained Multi-Label POS Tagging for Faroese
BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. It predicts a **73-dimensional binary feature vector** for each token, covering word class, subcategory, gender, number, case, article, proper noun status, degree, declension, mood, voice, tense, person, and definiteness.
This Hugging Face repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, along with the decoding files `constraint_mask.json` and `tag_mappings.json`. The repository is currently published as a Transformers/XLM-RoBERTa safetensors model under `Setur/BRAGD`.
## Model Details
- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags
## Performance
In the accompanying paper, the constrained multi-label BRAGD model achieves:
- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on **OOD-BRAGD** out-of-domain data
These figures describe the evaluation setup reported in the paper; they do not directly apply to the released checkpoint, which was trained on the combined data.
## Training Data
The model is based on the BRAGD annotation scheme for Faroese.
### Sosialurin-BRAGD
- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**
### OOD-BRAGD
- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data
The release model in this repository was trained on **both** datasets.
## Label Structure
The 73 output dimensions are organized as follows:
- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness
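As a quick sanity check, the interval widths above can be tallied to confirm that the groups tile all 73 output dimensions (bounds copied from the list above, both ends inclusive):

```python
# Feature-group intervals as listed above (inclusive bounds).
GROUPS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
    "article": (42, 43),
    "proper_noun": (44, 45),
    "degree": (46, 50),
    "declension": (51, 53),
    "mood": (54, 60),
    "voice": (61, 63),
    "tense": (64, 66),
    "person": (67, 70),
    "definiteness": (71, 72),
}

total = sum(end - start + 1 for start, end in GROUPS.values())
assert total == 73  # the groups exactly cover the 73 output labels
```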
## Using the Model
This model predicts raw **feature vectors** rather than formatted BRAGD tags. To obtain the final BRAGD tag and human-readable features, you should:
1. run the model,
2. select the most likely word class,
3. activate only the valid feature groups for that word class using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.
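Steps 2–4 can be sketched in isolation on a random logits vector. Note that the mask entry below is hypothetical and for illustration only; the real per-class intervals ship in `constraint_mask.json`:

```python
import numpy as np

# Hypothetical mask entry for the Verb class (index 4); the released
# constraint_mask.json defines the actual valid intervals per word class.
TOY_MASK = {4: [(34, 36), (54, 60), (61, 63), (64, 66), (67, 70)]}

def constrained_decode(logits, mask):
    pred = np.zeros(logits.shape[0], dtype=int)
    wc = int(np.argmax(logits[:15]))           # step 2: pick the word class
    pred[wc] = 1
    for start, end in mask.get(wc, []):        # step 3: valid groups only,
        pred[start + int(np.argmax(logits[start:end + 1]))] = 1
    return wc, pred                            # step 4 maps pred to a tag

rng = np.random.default_rng(0)
logits = rng.normal(size=73)
logits[4] += 10.0                              # nudge the demo toward Verb
wc, pred = constrained_decode(logits, TOY_MASK)
assert wc == 4
assert pred.sum() == 1 + len(TOY_MASK[4])      # one active bit per group
```

The full pipeline below applies the same logic with the real mask and maps the resulting vector to a BRAGD tag.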
### Install requirements
```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```
### Python example
```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification
model_name = "Setur/BRAGD"
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()
# Download decoding assets
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")
with open(constraint_mask_path, "r", encoding="utf-8") as f:
raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}
with open(tag_mappings_path, "r", encoding="utf-8") as f:
raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}
WORD_CLASS_NAMES = {
0: "Noun",
1: "Adjective",
2: "Pronoun",
3: "Number",
4: "Verb",
5: "Participle",
6: "Adverb",
7: "Conjunction",
8: "Foreign",
9: "Unanalyzed",
10: "Abbreviation",
11: "Web",
12: "Punctuation",
13: "Symbol",
14: "Article",
}
INTERVAL_NAMES = {
(15, 29): "subcategory",
(30, 33): "gender",
(34, 36): "number",
(37, 41): "case",
(42, 43): "article",
(44, 45): "proper_noun",
(46, 50): "degree",
(51, 53): "declension",
(54, 60): "mood",
(61, 63): "voice",
(64, 66): "tense",
(67, 70): "person",
(71, 72): "definiteness",
}
FEATURE_COLUMNS = [
"S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",
"D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",
"M", "F", "N", "g",
"S", "P", "n",
"N", "A", "D", "G", "c",
"A", "a",
"P", "r",
"P", "C", "S", "A", "d",
"S", "W", "e",
"I", "M", "N", "S", "P", "E", "U",
"A", "M", "v",
"P", "A", "t",
"1", "2", "3", "p",
"D", "I",
]
def decode_token(logits):
pred = np.zeros(logits.shape[0], dtype=int)
# predict word class
wc = int(np.argmax(logits[:15]))
pred[wc] = 1
# predict only valid feature groups for this word class
for start, end in constraint_mask.get(wc, []):
group = logits[start:end+1]
pred[start + int(np.argmax(group))] = 1
tag = features_to_tag.get(tuple(pred.tolist()), None)
features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
for (start, end), name in INTERVAL_NAMES.items():
group = pred[start:end+1]
active = np.where(group == 1)[0]
if len(active) == 1:
features[name] = FEATURE_COLUMNS[start + active[0]]
return tag, features
text = "Hetta er eitt føroyskt dømi"
words = text.split()
enc = tokenizer(
[words],
is_split_into_words=True,
return_tensors="pt",
padding=True,
truncation=True,
)
with torch.no_grad():
logits = model(**enc).logits[0]
word_ids = enc.word_ids(batch_index=0)
seen = set()
for i, word_id in enumerate(word_ids):
if word_id is None or word_id in seen:
continue
seen.add(word_id)
tag, features = decode_token(logits[i].cpu().numpy())
print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```
### Example output
```text
Hetta PDNpSN {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
er VNAPS3 {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
eitt RNSNI {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
føroyskt APSNSN {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
dømi SNSNar {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
```
## Files in this Repository
This repository contains the model weights and decoding files, including:
- `model.safetensors`
- `config.json`
- tokenizer files
- `constraint_mask.json`
- `tag_mappings.json`
## Further Resources
For full training code, data preparation, and paper-related experiments, see the GitHub repository:
`https://github.com/Maltoknidepilin/BRAGD.git`
## Citation
```bibtex
@inproceedings{simonsen2026bragd,
title={{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
author={Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
booktitle={Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
year={2026}
}
```
## Authors
Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson
## License
This repository is marked as **CC BY 4.0** on Hugging Face.