---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---
# BRAGD: Constrained Multi-Label POS Tagging for Faroese
BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. It predicts a **73-dimensional binary feature vector** for each token, covering word class, subcategory, gender, number, case, article, proper noun status, degree, declension, mood, voice, tense, person, and definiteness.
This Hugging Face repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, along with the decoding files `constraint_mask.json` and `tag_mappings.json`. The repository is currently published as a Transformers/XLM-RoBERTa safetensors model under `Setur/BRAGD`.
## Model Details
- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags
## Performance
In the accompanying paper, the constrained multi-label BRAGD model achieves:
- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on **OOD-BRAGD** out-of-domain data
These figures come from the cross-validation setup reported in the paper; they do not directly describe the released checkpoint in this repository, which was trained on the combined data.
## Training Data
The model is based on the BRAGD annotation scheme for Faroese.
### Sosialurin-BRAGD
- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**
### OOD-BRAGD
- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data
The release model in this repository was trained on **both** datasets.
## Label Structure
The 73 output dimensions are organized as follows:
- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness
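The interval layout above can be sketched in code. The snippet below is illustrative only: the group names and the `split_vector` helper are not part of the released decoding files, which instead ship the intervals in `constraint_mask.json`.

```python
import numpy as np

# Feature-group intervals (inclusive), mirroring the table above.
# These names are illustrative, not read from the model's config.
GROUPS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
    "article": (42, 43),
    "proper_noun": (44, 45),
    "degree": (46, 50),
    "declension": (51, 53),
    "mood": (54, 60),
    "voice": (61, 63),
    "tense": (64, 66),
    "person": (67, 70),
    "definiteness": (71, 72),
}

def split_vector(vec):
    """Return the active absolute index (if any) for each feature group."""
    out = {}
    for name, (start, end) in GROUPS.items():
        hits = np.flatnonzero(vec[start:end + 1])
        if hits.size == 1:
            out[name] = start + int(hits[0])
    return out

vec = np.zeros(73, dtype=int)
vec[4] = 1    # a bit in the word-class interval
vec[55] = 1   # a bit in the mood interval
print(split_vector(vec))  # → {'word_class': 4, 'mood': 55}
```

Note that the intervals tile the full 0–72 range, so every output dimension belongs to exactly one group.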
## Using the Model
This model predicts raw **feature vectors**, not formatted BRAGD tags. To obtain the final BRAGD tag and human-readable features, you should:
1. run the model,
2. select the most likely word class,
3. activate only the valid feature groups for that word class using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.
### Install requirements
```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```
### Python example
```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification
model_name = "Setur/BRAGD"
tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()
# Download decoding assets
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")
with open(constraint_mask_path, "r", encoding="utf-8") as f:
raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}
with open(tag_mappings_path, "r", encoding="utf-8") as f:
raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}
WORD_CLASS_NAMES = {
0: "Noun",
1: "Adjective",
2: "Pronoun",
3: "Number",
4: "Verb",
5: "Participle",
6: "Adverb",
7: "Conjunction",
8: "Foreign",
9: "Unanalyzed",
10: "Abbreviation",
11: "Web",
12: "Punctuation",
13: "Symbol",
14: "Article",
}
INTERVAL_NAMES = {
(15, 29): "subcategory",
(30, 33): "gender",
(34, 36): "number",
(37, 41): "case",
(42, 43): "article",
(44, 45): "proper_noun",
(46, 50): "degree",
(51, 53): "declension",
(54, 60): "mood",
(61, 63): "voice",
(64, 66): "tense",
(67, 70): "person",
(71, 72): "definiteness",
}
FEATURE_COLUMNS = [
    "S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",  # 0-14: word class
    "D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",  # 15-29: subcategory
    "M", "F", "N", "g",                                                         # 30-33: gender
    "S", "P", "n",                                                              # 34-36: number
    "N", "A", "D", "G", "c",                                                    # 37-41: case
    "A", "a",                                                                   # 42-43: article
    "P", "r",                                                                   # 44-45: proper noun
    "P", "C", "S", "A", "d",                                                    # 46-50: degree
    "S", "W", "e",                                                              # 51-53: declension
    "I", "M", "N", "S", "P", "E", "U",                                          # 54-60: mood
    "A", "M", "v",                                                              # 61-63: voice
    "P", "A", "t",                                                              # 64-66: tense
    "1", "2", "3", "p",                                                         # 67-70: person
    "D", "I",                                                                   # 71-72: definiteness
]
def decode_token(logits):
pred = np.zeros(logits.shape[0], dtype=int)
# predict word class
wc = int(np.argmax(logits[:15]))
pred[wc] = 1
# predict only valid feature groups for this word class
for start, end in constraint_mask.get(wc, []):
group = logits[start:end+1]
pred[start + int(np.argmax(group))] = 1
tag = features_to_tag.get(tuple(pred.tolist()), None)
features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
for (start, end), name in INTERVAL_NAMES.items():
group = pred[start:end+1]
active = np.where(group == 1)[0]
if len(active) == 1:
features[name] = FEATURE_COLUMNS[start + active[0]]
return tag, features
text = "Hetta er eitt føroyskt dømi"
words = text.split()
enc = tokenizer(
[words],
is_split_into_words=True,
return_tensors="pt",
padding=True,
truncation=True,
)
with torch.no_grad():
logits = model(**enc).logits[0]
word_ids = enc.word_ids(batch_index=0)
seen = set()
for i, word_id in enumerate(word_ids):
    # Skip special tokens; keep only the first subword of each word
    if word_id is None or word_id in seen:
        continue
seen.add(word_id)
tag, features = decode_token(logits[i].cpu().numpy())
print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```
### Example output
```text
Hetta PDNpSN {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
er VNAPS3 {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
eitt RNSNI {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
føroyskt APSNSN {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
dømi SNSNar {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
```
## Files in this Repository
This model repository contains model and decoding files, including:
- `model.safetensors`
- `config.json`
- tokenizer files
- `constraint_mask.json`
- `tag_mappings.json`
## Further Resources
For full training code, data preparation, and paper-related experiments, see the GitHub repository:
`https://github.com/Maltoknidepilin/BRAGD.git`
## Citation
```bibtex
@inproceedings{simonsen2026bragd,
title={{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
author={Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
booktitle={Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
year={2026}
}
```
## Authors
Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson
## License
This repository is released under the **CC BY 4.0** license.