---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---
# BRAGD: Constrained Multi-Label POS Tagging for Faroese
BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. It predicts a **73-dimensional binary feature vector** for each token, covering word class, subcategory, gender, number, case, article, proper noun status, degree, declension, mood, voice, tense, person, and definiteness.
This Hugging Face repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, along with the decoding files `constraint_mask.json` and `tag_mappings.json`. The repository is currently published as a Transformers/XLM-RoBERTa safetensors model under `Setur/BRAGD`.
## Model Details
- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags
## Performance
In the accompanying paper, the constrained multi-label BRAGD model achieves:
- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on **OOD-BRAGD** out-of-domain data
These figures come from the cross-validation setup reported in the paper; they do not directly describe the released checkpoint in this repository, which was trained on the combined data.
## Training Data
The model is based on the BRAGD annotation scheme for Faroese.
### Sosialurin-BRAGD
- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**
### OOD-BRAGD
- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data
The release model in this repository was trained on **both** datasets.
## Label Structure
The 73 output dimensions are organized as follows:
- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness
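The interval layout above can be sketched in code. The snippet below is illustrative only: the group names and the `split_vector` helper are not part of the released decoding files, which instead ship the intervals in `constraint_mask.json`.

```python
import numpy as np

# Feature-group intervals (inclusive), mirroring the table above.
# These names are illustrative, not read from the model's config.
GROUPS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
    "article": (42, 43),
    "proper_noun": (44, 45),
    "degree": (46, 50),
    "declension": (51, 53),
    "mood": (54, 60),
    "voice": (61, 63),
    "tense": (64, 66),
    "person": (67, 70),
    "definiteness": (71, 72),
}

def split_vector(vec):
    """Return the active absolute index (if any) for each feature group."""
    out = {}
    for name, (start, end) in GROUPS.items():
        hits = np.flatnonzero(vec[start:end + 1])
        if hits.size == 1:
            out[name] = start + int(hits[0])
    return out

vec = np.zeros(73, dtype=int)
vec[4] = 1    # a bit in the word-class interval
vec[55] = 1   # a bit in the mood interval
print(split_vector(vec))  # → {'word_class': 4, 'mood': 55}
```

Note that the intervals tile the full 0–72 range, so every output dimension belongs to exactly one group.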
## Using the Model
This model predicts raw **feature vectors**, not formatted BRAGD tags. To obtain the final BRAGD tag and human-readable features, you should:
1. run the model,
2. select the most likely word class,
3. activate only the valid feature groups for that word class using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.
### Install requirements
```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```
### Python example
```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification
model_name = "Setur/BRAGD"
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()
# Download decoding assets
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")
with open(constraint_mask_path, "r", encoding="utf-8") as f:
raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}
with open(tag_mappings_path, "r", encoding="utf-8") as f:
raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}
WORD_CLASS_NAMES = {
0: "Noun",
1: "Adjective",
2: "Pronoun",
3: "Number",
4: "Verb",
5: "Participle",
6: "Adverb",
7: "Conjunction",
8: "Foreign",
9: "Unanalyzed",
10: "Abbreviation",
11: "Web",
12: "Punctuation",
13: "Symbol",
14: "Article",
}
INTERVAL_NAMES = {
(15, 29): "subcategory",
(30, 33): "gender",
(34, 36): "number",
(37, 41): "case",
(42, 43): "article",
(44, 45): "proper_noun",
(46, 50): "degree",
(51, 53): "declension",
(54, 60): "mood",
(61, 63): "voice",
(64, 66): "tense",
(67, 70): "person",
(71, 72): "definiteness",
}
FEATURE_COLUMNS = [
    "S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",  # 0-14: word class
    "D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",  # 15-29: subcategory
    "M", "F", "N", "g",                                                         # 30-33: gender
    "S", "P", "n",                                                              # 34-36: number
    "N", "A", "D", "G", "c",                                                    # 37-41: case
    "A", "a",                                                                   # 42-43: article
    "P", "r",                                                                   # 44-45: proper noun
    "P", "C", "S", "A", "d",                                                    # 46-50: degree
    "S", "W", "e",                                                              # 51-53: declension
    "I", "M", "N", "S", "P", "E", "U",                                          # 54-60: mood
    "A", "M", "v",                                                              # 61-63: voice
    "P", "A", "t",                                                              # 64-66: tense
    "1", "2", "3", "p",                                                         # 67-70: person
    "D", "I",                                                                   # 71-72: definiteness
]
def decode_token(logits):
pred = np.zeros(logits.shape[0], dtype=int)
# predict word class
wc = int(np.argmax(logits[:15]))
pred[wc] = 1
# predict only valid feature groups for this word class
for start, end in constraint_mask.get(wc, []):
group = logits[start:end+1]
pred[start + int(np.argmax(group))] = 1
tag = features_to_tag.get(tuple(pred.tolist()), None)
features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
for (start, end), name in INTERVAL_NAMES.items():
group = pred[start:end+1]
active = np.where(group == 1)[0]
if len(active) == 1:
features[name] = FEATURE_COLUMNS[start + active[0]]
return tag, features
text = "Hetta er eitt føroyskt dømi"
words = text.split()
enc = tokenizer(
[words],
is_split_into_words=True,
return_tensors="pt",
padding=True,
truncation=True,
)
with torch.no_grad():
logits = model(**enc).logits[0]
word_ids = enc.word_ids(batch_index=0)
seen = set()
for i, word_id in enumerate(word_ids):
    # Skip special tokens; keep only the first subword of each word
    if word_id is None or word_id in seen:
        continue
seen.add(word_id)
tag, features = decode_token(logits[i].cpu().numpy())
print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```
### Example output
```text
Hetta PDNpSN {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
er VNAPS3 {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
eitt RNSNI {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
føroyskt APSNSN {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
dømi SNSNar {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
```
## Files in this Repository
This model repository contains model and decoding files, including:
- `model.safetensors`
- `config.json`
- tokenizer files
- `constraint_mask.json`
- `tag_mappings.json`
## Further Resources
For full training code, data preparation, and paper-related experiments, see the GitHub repository:
`https://github.com/Maltoknidepilin/BRAGD.git`
## Citation
```bibtex
@inproceedings{simonsen2026bragd,
title={{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
author={Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
booktitle={Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
year={2026}
}
```
## Authors
Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson
## License
This repository is released under the **CC BY 4.0** license.