---
language:
- cs
- sl
- sk
- pt
- pl
- 'no'
- it
- hr
- fr
- en
- da
- de
- sv
license: cc-by-4.0
tags:
- pretraining
---
# mELECTRA (Multilingual ELECTRA)
mELECTRA is an [ELECTRA](https://arxiv.org/abs/2003.10555)-based model pretrained on a diverse multilingual corpus. It supports multiple languages: **Swedish (SV), Slovenian (SL), Slovak (SK), Portuguese (PT), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DA), German (DE), and Czech (CS)**. The model can be fine-tuned for various NLP tasks such as text classification, named entity recognition, and token-level prediction.
This model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), allowing commercial use.
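Fine-tuning follows the standard `transformers` pattern of attaching a task head to the pretrained encoder. A minimal sketch is shown below; it builds a randomly initialized ELECTRA-Small-sized classification model so it runs without downloading weights. The `vocab_size` and `num_labels` values here are placeholders, not the actual mELECTRA settings — for real fine-tuning, load the published checkpoint with `ElectraForSequenceClassification.from_pretrained("AILabTUL/mELECTRA", num_labels=...)` instead.

```python
import torch
from transformers import ElectraConfig, ElectraForSequenceClassification

# ELECTRA-Small dimensions; vocab_size and num_labels are illustrative placeholders
config = ElectraConfig(
    vocab_size=32000,
    embedding_size=128,
    hidden_size=256,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=1024,
    num_labels=3,
)
model = ElectraForSequenceClassification(config)
model.eval()

# A dummy batch: one sequence of 16 token ids
input_ids = torch.randint(0, config.vocab_size, (1, 16))
with torch.no_grad():
    logits = model(input_ids=input_ids).logits

print(logits.shape)  # one logit per label: torch.Size([1, 3])
```

From here, training is the usual loop (or `Trainer`) over tokenized batches with a cross-entropy loss on the logits.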
---
## Model Details
- **Architecture:** ELECTRA-Small
- **Languages Supported:** Swedish, Slovenian, Slovak, Portuguese, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
- **Pretraining Data:** Multilingual corpus (news articles, Wikipedia, and web texts)
- **Vocabulary:** SentencePiece-based tokenizer (`m.model`)
---
## Tokenization with SentencePiece
mELECTRA uses a **SentencePiece tokenizer** and requires a SentencePiece model file (`m.model`) for correct tokenization. Ensure that you properly load and use this tokenizer to maintain compatibility with the model.
### Example: Tokenization
#### Using HuggingFace AutoTokenizer (Recommended)
```python
from transformers import AutoTokenizer
# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/mELECTRA")
# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./mELECTRA")
# Tokenize input text
sentence = "This is a multilingual model supporting multiple languages."
tokens = tokenizer.tokenize(sentence)
ids = tokenizer.encode(sentence)
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
```
#### Using SentencePiece directly
```python
import sentencepiece as spm
# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")
# Tokenize input text (note: input should be lowercase)
sentence = "this is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)
print(tokens)
```
---
## Citation
This model was published as part of the research paper:
**"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"**
```
@inproceedings{polacek-2025-study,
title = "Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream",
author = "Polacek, Martin",
editor = "Velichkov, Boris and
Nikolova-Koleva, Ivelina and
Slavcheva, Milena",
booktitle = "Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing",
month = sep,
year = "2025",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2025.ranlp-stud.5/",
pages = "37--43",
doi = "10.26615/issn.2603-2821.2025_005"
}
```
---
## Related Models
- **Czech-Slovak**: [AILabTUL/BiELECTRA-czech-slovak](https://huggingface.co/AILabTUL/BiELECTRA-czech-slovak)
- **Norwegian-Swedish**: [AILabTUL/BiELECTRA-norwegian-swedish](https://huggingface.co/AILabTUL/BiELECTRA-norwegian-swedish)