Update README.md
README.md CHANGED
@@ -1,3 +1,57 @@
---
language:
- lb
license: cc-by-sa-4.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- modernbert
- encoder
- luxembourgish
- multilingual
- masked-language-modeling
---

# LTZ E1 (base)

A ModernBERT-based masked language model pretrained on Luxembourgish, following the Ettin recipe (see https://huggingface.co/jhu-clsp/ettin-encoder-150m).

## Model Details

- **Architecture:** ModernBERT (encoder)
- **Size:** base
- **Vocabulary:** 50,368 tokens (BPE, GPTNeoXTokenizerFast)
- **Context length:** 1,024 tokens
- **Language:** Luxembourgish (`lb`/`ltz`)
- **License:** CC BY-SA 4.0

## Usage

Requires `transformers>=4.48.0`.

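
The version requirement can be checked before loading the model. The following is a minimal sketch using only the standard library; the helper names `release_tuple` and `meets_minimum` are hypothetical, and the comparison ignores pre-release suffixes such as `rc1` (`packaging.version.Version` handles those properly if the `packaging` package is available).

```python
import re
from importlib.metadata import PackageNotFoundError, version

MINIMUM = "4.48.0"  # taken from the requirement stated in this README


def release_tuple(version_string):
    """Extract the leading numeric release parts, e.g. '4.50.0.dev0' -> (4, 50, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    # A missing patch component (as in '5.0') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed, minimum=MINIMUM):
    """Return True if the installed release is at least the minimum release."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("transformers")
except PackageNotFoundError:
    print("transformers is not installed")
else:
    status = "OK" if meets_minimum(installed) else f"too old, need >= {MINIMUM}"
    print(f"transformers {installed}: {status}")
```
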
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("instilux/ltz-e1-base")
model = AutoModelForMaskedLM.from_pretrained("instilux/ltz-e1-base")

# Tokenize a sentence containing one [MASK] and find the mask's position.
inputs = tokenizer("Wéi spéit [MASK] et?", return_tensors="pt")
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Print the five highest-scoring tokens with their raw logits.
top_tokens = outputs.logits[0, mask_pos].topk(5)
for token_id, score in zip(top_tokens.indices[0], top_tokens.values[0]):
    token = tokenizer.decode(token_id)
    print(f"{token:15s} {score:.3f}")
```

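
The scores printed by the usage example are raw logits, not probabilities. To report probabilities, apply a softmax over the full vocabulary before taking the top entries. The sketch below shows the idea in plain Python on made-up logits for a toy four-token vocabulary; `softmax` and `top_k` are illustrative helpers, not part of this repository.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits for a toy 4-token vocabulary (not real model output).
toy_logits = [2.0, 1.0, 0.1, -1.0]
toy_probs = softmax(toy_logits)

for index, prob in top_k(toy_probs, 2):
    print(f"token {index}: {prob:.3f}")
```

With PyTorch tensors, the same normalisation is available as `torch.softmax(logits, dim=-1)`. Normalising over the whole vocabulary first matters: a softmax over only the top five logits would yield probabilities relative to those five tokens rather than to every candidate.
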
## Tokenizer Notes

The tokenizer is BPE-based (`GPTNeoXTokenizerFast`) with BERT-style special tokens (`[CLS]`, `[SEP]`, `[MASK]`, `[PAD]`). A `[CLS]` token is prepended automatically (`add_bos_token: true`).

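
Because `[CLS]` is prepended, every token of the input text sits one position later in `input_ids` than in the text itself, which is one reason to locate `[MASK]` by its token ID rather than by counting words. A minimal sketch with hypothetical token IDs follows; `CLS_ID`, `MASK_ID`, and the sentence IDs are placeholders rather than the model's real values, and the sketch does not model any tokens the tokenizer may add at the end of the sequence.

```python
# Placeholder IDs for illustration only; the real values come from
# tokenizer.cls_token_id and tokenizer.mask_token_id.
CLS_ID = 1
MASK_ID = 4


def find_mask_positions(input_ids, mask_id):
    """Return every index at which the mask token ID occurs."""
    return [index for index, token_id in enumerate(input_ids) if token_id == mask_id]


# Hypothetical IDs for a four-token sentence whose third token is masked.
sentence_ids = [101, 102, MASK_ID, 103]
# Prepending [CLS] moves every original token one position to the right.
model_input_ids = [CLS_ID] + sentence_ids

print(find_mask_positions(sentence_ids, MASK_ID))
print(find_mask_positions(model_input_ids, MASK_ID))
```
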
## Citation

A paper describing this model will be published soon. In the meantime, please cite this repository if you use this model in your work.