fkoerner committed on
Commit
119f490
·
verified ·
1 Parent(s): 66ce7c6

Update README.md

  1. README.md +57 -3
README.md CHANGED
@@ -1,3 +1,57 @@
- ---
- license: cc-by-sa-4.0
- ---
+ ---
+ language:
+ - lb
+ license: cc-by-sa-4.0
+ library_name: transformers
+ pipeline_tag: fill-mask
+ tags:
+ - modernbert
+ - encoder
+ - luxembourgish
+ - multilingual
+ - masked-language-modeling
+ ---
+
+ # LTZ Encoder (mini)
+
+ A ModernBERT-based masked language model pretrained on Luxembourgish, following the [Ettin recipe](https://huggingface.co/jhu-clsp/ettin-encoder-68m).
+
+ ## Model Details
+
+ - **Architecture:** ModernBERT (encoder)
+ - **Size:** mini
+ - **Vocabulary:** 50,368 tokens (BPE, GPTNeoXTokenizerFast)
+ - **Context length:** 1,024 tokens
+ - **Language:** Luxembourgish (`lb`/`ltz`)
+ - **License:** CC BY-SA 4.0
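
Inputs longer than the 1,024-token context length have to be truncated or split before they reach the model. The following is a minimal sketch of one way to split a list of token IDs into windows. `chunk_token_ids` and its `stride` overlap parameter are illustrative names, not part of this repository, and the sketch ignores special tokens, which would need to be re-added to each window.

```python
def chunk_token_ids(token_ids, max_length=1024, stride=0):
    """Split token IDs into windows of at most max_length, overlapping by stride.

    Hypothetical helper for illustration; max_length defaults to the
    1,024-token context length stated in the model details.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride  # how far each window advances
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window has reached the end of the sequence.
        if start + max_length >= len(token_ids):
            break
    return chunks
```

When plain truncation is acceptable, calling the tokenizer with `truncation=True, max_length=1024` is usually simpler.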
+
+ ## Usage
+
+ Requires `transformers>=4.48.0`.
+
+ ```python
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("instilux/ltz-e1-mini")
+ model = AutoModelForMaskedLM.from_pretrained("instilux/ltz-e1-mini")
+
+ inputs = tokenizer("Wéi spéit [MASK] et?", return_tensors="pt")
+ # Column indices of every [MASK] token in the (single) input sequence
+ mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # Top-5 candidates per mask position; the loop prints those for the first mask
+ top_tokens = outputs.logits[0, mask_pos].topk(5)
+ for token_id, score in zip(top_tokens.indices[0], top_tokens.values[0]):
+     token = tokenizer.decode(token_id)
+     print(f"{token:15s} {score:.3f}")  # score is a raw logit
+ ```
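
The scores printed by the usage snippet are raw logits rather than probabilities. As a minimal sketch, a softmax over the logits for a mask position converts them into a probability distribution. The pure-Python `softmax` helper below is illustrative only and is not part of the repository.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    if not logits:
        return []
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With PyTorch tensors, the same normalisation is available as `torch.softmax(outputs.logits[0, mask_pos], dim=-1)`. It should be applied over the full vocabulary before taking `topk`, since a softmax over only the top five logits would overstate their probabilities.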
+
+ ## Tokenizer Notes
+
+ The tokenizer is BPE-based (`GPTNeoXTokenizerFast`) with BERT-style special tokens (`[CLS]`, `[SEP]`, `[MASK]`, `[PAD]`). A `[CLS]` token is prepended automatically (`add_bos_token: true`).
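
One way to confirm the automatic `[CLS]` prefix is to inspect the encoded IDs directly. The sketch below assumes the loaded tokenizer exposes the standard `cls_token_id` attribute and returns `input_ids` as a plain list when called without `return_tensors`. `starts_with_cls` and `load_and_check` are hypothetical helper names, not part of this repository.

```python
def starts_with_cls(tokenizer, text):
    """Return True if the first encoded ID is the tokenizer's [CLS] ID."""
    input_ids = tokenizer(text)["input_ids"]
    return len(input_ids) > 0 and input_ids[0] == tokenizer.cls_token_id


def load_and_check(model_id="instilux/ltz-e1-mini", text="Wéi spéit ass et?"):
    """Load the tokenizer from the Hub and run the check (needs network access)."""
    # Imported lazily so starts_with_cls stays usable without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return starts_with_cls(tokenizer, text)
```

Calling `load_and_check()` should return `True` if the tokenizer behaves as described in the notes above.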
+
+ ## Citation
+
+ A paper describing this model will be published soon. In the meantime, please cite this repository if you use this model in your work.