---
language:
- lb
license: cc-by-sa-4.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- modernbert
- encoder
- luxembourgish
- multilingual
- masked-language-modeling
---

# LTZ E1 (mini)

A ModernBERT-based masked language model pretrained on Luxembourgish, following the Ettin recipe (see https://huggingface.co/jhu-clsp/ettin-encoder-68m).

## Model Details

- **Architecture:** ModernBERT (encoder)
- **Size:** mini
- **Vocabulary:** 50,368 tokens (BPE, GPTNeoXTokenizerFast)
- **Context length:** 1,024 tokens
- **Language:** Luxembourgish (`lb`/`ltz`)
- **License:** CC BY-SA 4.0
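
Inputs longer than the 1,024-token context need to be truncated or split before they reach the model. The following is a minimal sketch of one way to split a flat list of token ids into windows that fit. The helper name `chunk_token_ids` and the `reserved` parameter are hypothetical, and the default of one reserved slot assumes a single special token (`[CLS]`) is added per window; adjust it if your tokenizer adds more.

```python
MAX_CONTEXT = 1024  # context length stated in the model details


def chunk_token_ids(token_ids, max_len=MAX_CONTEXT, reserved=1):
    """Split a flat list of token ids into windows that fit the context.

    `reserved` leaves room for special tokens added per window
    (assumption: one [CLS] token).
    """
    window = max_len - reserved
    if window <= 0:
        raise ValueError("max_len must be larger than reserved")
    # Non-overlapping windows; the last one may be shorter.
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


# Dummy ids stand in for a tokenized document of 2,500 tokens.
chunks = chunk_token_ids(list(range(2500)))
print([len(c) for c in chunks])  # [1023, 1023, 454]
```

Non-overlapping windows are the simplest choice; for tasks where context across a boundary matters, an overlapping stride may work better.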

## Usage

Requires `transformers>=4.48.0`.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("instilux/ltz-e1-mini")
model = AutoModelForMaskedLM.from_pretrained("instilux/ltz-e1-mini")

inputs = tokenizer("Wéi spéit [MASK] et?", return_tensors="pt")
# Column index of the [MASK] token in the (batch, sequence) input tensor.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# Top 5 vocabulary entries at the masked position; scores are raw logits.
top_tokens = outputs.logits[0, mask_pos].topk(5)
for token_id, score in zip(top_tokens.indices[0], top_tokens.values[0]):
    token = tokenizer.decode(token_id)
    print(f"{token:15s} {score:.3f}")
```

## Tokenizer Notes

The tokenizer is BPE-based (`GPTNeoXTokenizerFast`) with BERT-style special tokens (`[CLS]`, `[SEP]`, `[MASK]`, `[PAD]`). A `[CLS]` token is prepended automatically (`add_bos_token: true`).
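
To confirm the automatic `[CLS]` prefix in your own environment, a small check along these lines may help. This is a sketch, not part of the model's tooling: `starts_with_cls` and `demo` are hypothetical helper names, and the check relies only on the tokenizer being callable and exposing `cls_token_id`.

```python
def starts_with_cls(tokenizer, text):
    """Return True if the encoded text begins with the tokenizer's [CLS] id."""
    input_ids = tokenizer(text)["input_ids"]
    return len(input_ids) > 0 and input_ids[0] == tokenizer.cls_token_id


def demo():
    # Imported here so the helper above stays usable without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("instilux/ltz-e1-mini")
    text = "Wéi spéit [MASK] et?"
    print(starts_with_cls(tokenizer, text))
    # Show the token strings to see which special tokens were added.
    print(tokenizer.convert_ids_to_tokens(tokenizer(text)["input_ids"]))
```

Calling `demo()` downloads the tokenizer from the Hub and prints the result of the check together with the token strings.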

## Citation

If you use this model in your work, please cite the following paper (preprint, accepted to ACL 2026 Findings):

@misc{plum2026ltzglueluxembourgishgenerallanguage,
  title={ltzGLUE: Luxembourgish General Language Understanding Evaluation}, 
  author={Alistair Plum and Felicia Körner and Anne-Marie Lutgen and Laura Bernardy and Fred Philippy and Emilia Milano and Nils Rehlinger and Cédric Lothritz and Tharindu Ranasinghe and Barbara Plank and Christoph Purschke},
  year={2026},
  eprint={2604.17976},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.17976}, 
}