---
language:
- fo
license: cc-by-4.0
library_name: transformers
pipeline_tag: token-classification
tags:
- faroese
- pos-tagging
- morphology
- xlm-roberta
- token-classification
- lrec-coling-2026
base_model: vesteinn/ScandiBERT
model_creator: Setur
---

# BRAGD: Constrained Multi-Label POS Tagging for Faroese

BRAGD is a Faroese POS and morphological tagging model based on ScandiBERT. For each token it predicts a **73-dimensional binary feature vector** covering word class, subcategory, gender, number, case, article, proper-noun status, degree, declension, mood, voice, tense, person, and definiteness.

This repository contains a fine-tuned `XLMRobertaForTokenClassification` checkpoint with **73 output labels**, published as a Transformers safetensors model under `Setur/BRAGD`, along with the decoding files `constraint_mask.json` and `tag_mappings.json`.

## Model Details

- **Model name:** BRAGD
- **Repository:** `Setur/BRAGD`
- **Architecture:** `XLMRobertaForTokenClassification`
- **Base model:** `vesteinn/ScandiBERT`
- **Task:** Faroese POS + morphological tagging
- **Output format:** 73 binary features per token, decoded into BRAGD tags

## Performance

In the accompanying paper, the constrained multi-label BRAGD model achieves:

- **97.5% composite tag accuracy** on the **Sosialurin-BRAGD** corpus (10-fold cross-validation)
- **96.2% composite tag accuracy** on the **OOD-BRAGD** out-of-domain data

Note that these figures were obtained in the paper's experimental setup; the released checkpoint in this repository, which was trained on the combined data, was not evaluated under that protocol.
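A minimal sketch of how such a metric can be computed, assuming "composite tag accuracy" means exact match of the full predicted BRAGD tag string against the gold tag (this is an assumed interpretation, not the paper's evaluation code):

```python
# Hypothetical sketch: composite tag accuracy as exact match over
# full BRAGD tag strings (assumed interpretation of the metric).
def composite_tag_accuracy(pred_tags, gold_tags):
    """Fraction of tokens whose entire composite tag matches gold."""
    assert len(pred_tags) == len(gold_tags) and gold_tags
    return sum(p == g for p, g in zip(pred_tags, gold_tags)) / len(gold_tags)

# Toy example with made-up predictions over four tokens:
gold = ["PDNpSN", "VNAPS3", "RNSNI", "APSNSN"]
pred = ["PDNpSN", "VNAPS3", "RNSNI", "SNSNar"]
print(composite_tag_accuracy(pred, gold))  # → 0.75
```

Because a single wrong morphological feature changes the whole composite tag, this metric is stricter than per-feature accuracy.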

## Training Data

The model is based on the BRAGD annotation scheme for Faroese.

### Sosialurin-BRAGD
- **6,099 sentences**
- about **123k tokens**
- **651 unique tags**
- each tag decomposed into **73 binary features**

### OOD-BRAGD
- **500 sentences**
- mixed-genre out-of-domain Faroese evaluation data

The release model in this repository was trained on **both** datasets.

## Label Structure

The 73 output dimensions are organized as follows:

- **0–14:** Word class
- **15–29:** Subcategory
- **30–33:** Gender
- **34–36:** Number
- **37–41:** Case
- **42–43:** Article
- **44–45:** Proper noun
- **46–50:** Degree
- **51–53:** Declension
- **54–60:** Mood
- **61–63:** Voice
- **64–66:** Tense
- **67–70:** Person
- **71–72:** Definiteness
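A quick sanity check, written from the table above, confirms that the feature groups are contiguous and tile all 73 dimensions exactly:

```python
# Feature-group layout from the table above: word class occupies
# dimensions 0-14, followed by the thirteen morphological groups.
INTERVALS = {
    "word_class": (0, 14),
    "subcategory": (15, 29),
    "gender": (30, 33),
    "number": (34, 36),
    "case": (37, 41),
    "article": (42, 43),
    "proper_noun": (44, 45),
    "degree": (46, 50),
    "declension": (51, 53),
    "mood": (54, 60),
    "voice": (61, 63),
    "tense": (64, 66),
    "person": (67, 70),
    "definiteness": (71, 72),
}

# The intervals are in order, contiguous, and cover all 73 dimensions.
covered = [d for start, end in INTERVALS.values() for d in range(start, end + 1)]
assert covered == list(range(73))
print(len(INTERVALS), "groups covering", len(covered), "dimensions")
# → 14 groups covering 73 dimensions
```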

## Using the Model

The model predicts **feature vectors**, not BRAGD tag strings. To obtain the final BRAGD tag and readable features:

1. run the model,
2. select the most likely word class,
3. activate only the feature groups valid for that word class using `constraint_mask.json`,
4. map the resulting feature vector back to a BRAGD tag using `tag_mappings.json`.
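Steps 2 and 3 can be illustrated on a toy problem with a hypothetical two-class mask (the real `constraint_mask.json` covers all 15 word classes and 73 dimensions):

```python
import numpy as np

# Toy setup (hypothetical, for illustration only): 2 word classes in
# dims 0-1, one feature group in dims 2-4; class 1 licenses no groups.
toy_mask = {0: [(2, 4)], 1: []}

logits = np.array([0.2, 1.5, 0.9, -0.3, 0.1])
pred = np.zeros(5, dtype=int)

wc = int(np.argmax(logits[:2]))     # step 2: most likely word class
pred[wc] = 1
for start, end in toy_mask[wc]:     # step 3: only licensed feature groups
    pred[start + int(np.argmax(logits[start:end + 1]))] = 1

print(pred.tolist())  # → [0, 1, 0, 0, 0]
```

Because class 1 licenses no feature groups, the high logit in the feature group (0.9 at dimension 2) is correctly suppressed; this is what keeps predicted feature combinations linguistically valid.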

### Install requirements

```bash
pip install numpy torch "transformers==4.57.1" sentencepiece huggingface_hub
```

### Python example

```python
import json
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification

model_name = "Setur/BRAGD"

tokenizer = XLMRobertaTokenizerFast.from_pretrained(model_name)
model = XLMRobertaForTokenClassification.from_pretrained(model_name)
model.eval()

# Download decoding assets
constraint_mask_path = hf_hub_download(model_name, "constraint_mask.json")
tag_mappings_path = hf_hub_download(model_name, "tag_mappings.json")

with open(constraint_mask_path, "r", encoding="utf-8") as f:
    raw_mask = json.load(f)
constraint_mask = {int(k): [tuple(x) for x in v] for k, v in raw_mask.items()}

with open(tag_mappings_path, "r", encoding="utf-8") as f:
    raw_map = json.load(f)
features_to_tag = {tuple(map(int, k.split(","))): v for k, v in raw_map.items()}

WORD_CLASS_NAMES = {
    0: "Noun",
    1: "Adjective",
    2: "Pronoun",
    3: "Number",
    4: "Verb",
    5: "Participle",
    6: "Adverb",
    7: "Conjunction",
    8: "Foreign",
    9: "Unanalyzed",
    10: "Abbreviation",
    11: "Web",
    12: "Punctuation",
    13: "Symbol",
    14: "Article",
}

INTERVAL_NAMES = {
    (15, 29): "subcategory",
    (30, 33): "gender",
    (34, 36): "number",
    (37, 41): "case",
    (42, 43): "article",
    (44, 45): "proper_noun",
    (46, 50): "degree",
    (51, 53): "declension",
    (54, 60): "mood",
    (61, 63): "voice",
    (64, 66): "tense",
    (67, 70): "person",
    (71, 72): "definiteness",
}

FEATURE_COLUMNS = [
    # word class (0-14)
    "S", "A", "P", "N", "V", "L", "D", "C", "F", "X", "T", "W", "K", "M", "R",
    # subcategory (15-29)
    "D", "B", "E", "I", "P", "Q", "N", "G", "R", "X", "S", "C", "O", "T", "s",
    # gender (30-33)
    "M", "F", "N", "g",
    # number (34-36)
    "S", "P", "n",
    # case (37-41)
    "N", "A", "D", "G", "c",
    # article (42-43)
    "A", "a",
    # proper noun (44-45)
    "P", "r",
    # degree (46-50)
    "P", "C", "S", "A", "d",
    # declension (51-53)
    "S", "W", "e",
    # mood (54-60)
    "I", "M", "N", "S", "P", "E", "U",
    # voice (61-63)
    "A", "M", "v",
    # tense (64-66)
    "P", "A", "t",
    # person (67-70)
    "1", "2", "3", "p",
    # definiteness (71-72)
    "D", "I",
]

def decode_token(logits):
    pred = np.zeros(logits.shape[0], dtype=int)

    # predict word class
    wc = int(np.argmax(logits[:15]))
    pred[wc] = 1

    # predict only valid feature groups for this word class
    for start, end in constraint_mask.get(wc, []):
        group = logits[start:end + 1]
        pred[start + int(np.argmax(group))] = 1

    tag = features_to_tag.get(tuple(pred.tolist()), None)

    features = {"word_class": WORD_CLASS_NAMES.get(wc, str(wc))}
    for (start, end), name in INTERVAL_NAMES.items():
        group = pred[start:end + 1]
        active = np.where(group == 1)[0]
        if len(active) == 1:
            features[name] = FEATURE_COLUMNS[start + active[0]]

    return tag, features

text = "Hetta er eitt føroyskt dømi"
words = text.split()

enc = tokenizer(
    [words],
    is_split_into_words=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    logits = model(**enc).logits[0]

word_ids = enc.word_ids(batch_index=0)
seen = set()

# Decode only the first subword of each word
for i, word_id in enumerate(word_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)

    tag, features = decode_token(logits[i].cpu().numpy())
    print(f"{words[word_id]:15s} {str(tag):10s} {features}")
```
218
+
219
+ ### Example output
220
+
221
+ ```text
222
+ Hetta PDNpSN {'word_class': 'Pronoun', 'subcategory': 'D', 'gender': 'N', 'number': 'S', 'case': 'N', 'person': 'p'}
223
+ er VNAPS3 {'word_class': 'Verb', 'number': 'S', 'mood': 'N', 'voice': 'A', 'tense': 'P', 'person': '3'}
224
+ eitt RNSNI {'word_class': 'Article', 'gender': 'N', 'number': 'S', 'case': 'N', 'definiteness': 'I'}
225
+ føroyskt APSNSN {'word_class': 'Adjective', 'gender': 'N', 'number': 'S', 'case': 'N', 'degree': 'P', 'declension': 'S'}
226
+ dømi SNSNar {'word_class': 'Noun', 'gender': 'N', 'number': 'S', 'case': 'N', 'article': 'a', 'proper_noun': 'r'}
227
+ ```
228
+
229
+ ## Files in this Repository
230
+
231
+ This model repository contains model and decoding files, including:
232
+
233
+ - `model.safetensors`
234
+ - `config.json`
235
+ - tokenizer files
236
+ - `constraint_mask.json`
237
+ - `tag_mappings.json` :contentReference[oaicite:2]{index=2}

## Further Resources

For full training code, data preparation, and paper-related experiments, see the GitHub repository:

`https://github.com/Maltoknidepilin/BRAGD.git`

## Citation

```bibtex
@inproceedings{simonsen2026bragd,
  title={{BRAGD}: Constrained Multi-Label {POS} Tagging for {F}aroese},
  author={Simonsen, Annika and Scalvini, Barbara and Johannesen, Uni and Debess, Iben Nyholm and Einarsson, Hafsteinn and Sn{\ae}bjarnarson, V{\'e}steinn},
  booktitle={Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
  year={2026}
}
```

## Authors

Annika Simonsen, Barbara Scalvini, Uni Johannesen, Iben Nyholm Debess, Hafsteinn Einarsson, and Vésteinn Snæbjarnarson

## License

This model and repository are released under **CC BY 4.0**.