---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---
# BRAGD: Constrained Multi-Label POS Tagging for Faroese
BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. It predicts a **73-dimensional binary feature vector** for each token, covering word class, subcategory, gender, number, case, article, proper noun status, degree, declension, mood, voice, tense, person, and definiteness.
This Hugging Face repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, along with the decoding files `constraint_mask.json` and `tag_mappings.json`. The repository is currently published as a Transformers/XLM-RoBERTa safetensors model under `Setur/BRAGD`.
## Model Details
- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags
## Performance
In the accompanying paper, the constrained multi-label BRAGD model achieves:
- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on **OOD-BRAGD** out-of-domain data
These figures describe the evaluation setup reported in the paper; they do not directly apply to the released checkpoint, which was trained on the combined data.
## Training Data
The model is based on the BRAGD annotation scheme for Faroese.
### Sosialurin-BRAGD
- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**
### OOD-BRAGD
- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data
The release model in this repository was trained on **both** datasets.
## Label Structure
The 73 output dimensions are organized as follows:
- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness
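As a quick sanity check, the interval widths above can be tallied to confirm that the groups tile all 73 output dimensions (bounds copied from the list above, both ends inclusive):

```python
# Feature-group intervals as listed above (inclusive bounds).
GROUPS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
    "article": (42, 43),
    "proper_noun": (44, 45),
    "degree": (46, 50),
    "declension": (51, 53),
    "mood": (54, 60),
    "voice": (61, 63),
    "tense": (64, 66),
    "person": (67, 70),
    "definiteness": (71, 72),
}

total = sum(end - start + 1 for start, end in GROUPS.values())
assert total == 73  # the groups exactly cover the 73 output labels
```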
## Using the Model
This model predicts raw **feature vectors** rather than formatted BRAGD tags. To obtain the final BRAGD tag and human-readable features, you should:
1. run the model,
2. select the most likely word class,
3. activate only the valid feature groups for that word class using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.
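Steps 2–4 can be sketched in isolation on a random logits vector. Note that the mask entry below is hypothetical and for illustration only; the real per-class intervals ship in `constraint_mask.json`:

```python
import numpy as np

# Hypothetical mask entry for the Verb class (index 4); the released
# constraint_mask.json defines the actual valid intervals per word class.
TOY_MASK = {4: [(34, 36), (54, 60), (61, 63), (64, 66), (67, 70)]}

def constrained_decode(logits, mask):
    pred = np.zeros(logits.shape[0], dtype=int)
    wc = int(np.argmax(logits[:15]))           # step 2: pick the word class
    pred[wc] = 1
    for start, end in mask.get(wc, []):        # step 3: valid groups only,
        pred[start + int(np.argmax(logits[start:end + 1]))] = 1
    return wc, pred                            # step 4 maps pred to a tag

rng = np.random.default_rng(0)
logits = rng.normal(size=73)
logits[4] += 10.0                              # nudge the demo toward Verb
wc, pred = constrained_decode(logits, TOY_MASK)
assert wc == 4
assert pred.sum() == 1 + len(TOY_MASK[4])      # one active bit per group
```

The full pipeline below applies the same logic with the real mask and maps the resulting vector to a BRAGD tag.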
### Install requirements
```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```
### Python example
```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification
model_name = "Setur/BRAGD"
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()
# Download decoding assets
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")
with open(constraint_mask_path, "r", encoding="utf-8") as f:
raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}
with open(tag_mappings_path, "r", encoding="utf-8") as f:
raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}
WORD_CLASS_NAMES = {
0: "Noun",
1: "Adjective",
2: "Pronoun",
3: "Number",
4: "Verb",
5: "Participle",
6: "Adverb",
7: "Conjunction",
8: "Foreign",
9: "Unanalyzed",
10: "Abbreviation",
11: "Web",
12: "Punctuation",
13: "Symbol",
14: "Article",
}
INTERVAL_NAMES = {
(15, 29): "subcategory",
(30, 33): "gender",
(34, 36): "number",
(37, 41): "case",
(42, 43): "article",
(44, 45): "proper_noun",
(46, 50): "degree",
(51, 53): "declension",
(54, 60): "mood",
(61, 63): "voice",
(64, 66): "tense",
(67, 70): "person",
(71, 72): "definiteness",
}
FEATURE_COLUMNS = [
"S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",
"D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",
"M", "F", "N", "g",
"S", "P", "n",
"N", "A", "D", "G", "c",
"A", "a",
"P", "r",
"P", "C", "S", "A", "d",
"S", "W", "e",
"I", "M", "N", "S", "P", "E", "U",
"A", "M", "v",
"P", "A", "t",
"1", "2", "3", "p",
"D", "I",
]
def decode_token(logits):
pred = np.zeros(logits.shape[0], dtype=int)
# predict word class
wc = int(np.argmax(logits[:15]))
pred[wc] = 1
# predict only valid feature groups for this word class
for start, end in constraint_mask.get(wc, []):
group = logits[start:end+1]
pred[start + int(np.argmax(group))] = 1
tag = features_to_tag.get(tuple(pred.tolist()), None)
features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
for (start, end), name in INTERVAL_NAMES.items():
group = pred[start:end+1]
active = np.where(group == 1)[0]
if len(active) == 1:
features[name] = FEATURE_COLUMNS[start + active[0]]
return tag, features
text = "Hetta er eitt føroyskt dømi"
words = text.split()
enc = tokenizer(
[words],
is_split_into_words=True,
return_tensors="pt",
padding=True,
truncation=True,
)
with torch.no_grad():
logits = model(**enc).logits[0]
word_ids = enc.word_ids(batch_index=0)
seen = set()
for i, word_id in enumerate(word_ids):
if word_id is None or word_id in seen:
continue
seen.add(word_id)
tag, features = decode_token(logits[i].cpu().numpy())
print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```
### Example output
```text
Hetta PDNpSN {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
er VNAPS3 {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
eitt RNSNI {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
føroyskt APSNSN {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
dømi SNSNar {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
```
## Files in this Repository
This repository contains the model weights and decoding files, including:
- `model.safetensors`
- `config.json`
- tokenizer files
- `constraint_mask.json`
- `tag_mappings.json`
## Further Resources
For full training code, data preparation, and paper-related experiments, see the GitHub repository:
`https://github.com/Maltoknidepilin/BRAGD.git`
## Citation
```bibtex
@inproceedings{simonsen2026bragd,
title={{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
author={Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
booktitle={Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
year={2026}
}
```
## Authors
Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson
## License
This repository is marked as **CC BY 4.0** on Hugging Face.