mELECTRA (Multilingual ELECTRA)
mELECTRA is an ELECTRA-based model pretrained on a diverse multilingual corpus. It supports multiple languages, including Swedish (SV), Slovenian (SL), Slovak (SK), Portuguese (PT), Spanish (ES), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DA), German (DE), and Czech (CS). The model can be fine-tuned for various NLP tasks such as text classification, named entity recognition, and masked token prediction.
This model is released under the CC BY 4.0 license, allowing commercial use.
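ELECTRA checkpoints like this one are pretrained with replaced-token detection rather than plain masked language modeling: a small generator corrupts some input tokens, and the discriminator learns to label each position as original or replaced. The toy sketch below illustrates only how the corruption and labels are constructed (the vocabulary, sentence, and corruption rate are made up; this is not the model's actual pretraining code):

```python
import random

# Toy illustration of ELECTRA-style replaced-token detection:
# corrupt ~30% of positions and record, for each position, whether
# the token is original (0) or replaced (1). The discriminator is
# trained to predict these labels.
random.seed(0)

vocab = ["the", "model", "supports", "many", "languages", "cat", "blue"]
tokens = ["the", "model", "supports", "many", "languages"]

corrupted, labels = [], []
for tok in tokens:
    if random.random() < 0.3:
        # Replace the token with a different one from the vocabulary.
        corrupted.append(random.choice([w for w in vocab if w != tok]))
        labels.append(1)  # replaced
    else:
        corrupted.append(tok)
        labels.append(0)  # original

print(corrupted)
print(labels)
```

Because every position gets a label (not just the masked ones), this objective gives the discriminator a denser training signal, which is why ELECTRA-Small models are comparatively compute-efficient to pretrain.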
Model Details
- Architecture: ELECTRA-Small
- Languages Supported: Swedish, Slovenian, Slovak, Portuguese, Spanish, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
- Pretraining Data: Multilingual corpus (news articles, Wikipedia, and web texts)
- Vocabulary: SentencePiece-based tokenizer (m.model)
Tokenization with SentencePiece
mELECTRA uses a SentencePiece tokenizer and requires a SentencePiece model file (m.model) for correct tokenization. Ensure that you properly load and use this tokenizer to maintain compatibility with the model.
Example: Tokenization
Using HuggingFace AutoTokenizer (Recommended)
from transformers import AutoTokenizer
# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/mELECTRA")
# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./mELECTRA")
# Tokenize input text
sentence = "This is a multilingual model supporting multiple languages."
tokens = tokenizer.tokenize(sentence)
ids = tokenizer.encode(sentence)
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
Using SentencePiece directly
import sentencepiece as spm
# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")
# Tokenize input text (note: input should be lowercase)
sentence = "this is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)
print(tokens)
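SentencePiece marks word-initial pieces with the metasymbol "▁" (U+2581), which is what makes detokenization lossless: decoding is just concatenating pieces and turning the marker back into spaces. A minimal sketch of that convention (the pieces below are illustrative, not taken from the real m.model vocabulary):

```python
def pieces_to_text(pieces):
    # SentencePiece prefixes each word-initial piece with "▁" (U+2581);
    # detokenization is concatenation plus replacing the marker with a space.
    return "".join(pieces).replace("\u2581", " ").strip()

pieces = ["\u2581this", "\u2581is", "\u2581multi", "lingual"]
print(pieces_to_text(pieces))  # this is multilingual
```

Note how "multilingual" is split into two pieces but only the first carries the marker, so the word is reassembled without a spurious space.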
Citation
This model was published as part of the research paper:
"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"
@inproceedings{polacek-2025-study,
title = "Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream",
author = "Polacek, Martin",
editor = "Velichkov, Boris and
Nikolova-Koleva, Ivelina and
Slavcheva, Milena",
booktitle = "Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing",
month = sep,
year = "2025",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2025.ranlp-stud.5/",
pages = "37--43",
doi = "10.26615/issn.2603-2821.2025_005"
}
Related Models
- Czech-Slovak: AILabTUL/BiELECTRA-czech-slovak
- Norwegian-Swedish: AILabTUL/BiELECTRA-norwegian-swedish