mELECTRA (Multilingual ELECTRA)

mELECTRA is an ELECTRA-based model pretrained on a diverse multilingual corpus. It supports multiple languages, including Swedish (SV), Slovenian (SL), Slovak (SK), Portuguese (PT), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DA), German (DE), and Czech (CS). The model can be fine-tuned for a range of NLP tasks, such as text classification, named entity recognition, and masked token prediction.

This model is released under the CC BY 4.0 license, which permits commercial use with attribution.


Model Details

  • Architecture: ELECTRA-Small
  • Parameters: ~14.3M (F32 weights)
  • Languages Supported: Swedish, Slovenian, Slovak, Portuguese, Spanish, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
  • Pretraining Data: Multilingual corpus (news articles, Wikipedia, and web texts)
  • Vocabulary: SentencePiece-based tokenizer (m.model)
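
Because the encoder is a standard ELECTRA model, fine-tuning follows the usual transformers pattern. The sketch below uses a randomly initialized ELECTRA-style classifier with a toy configuration so it runs without downloading any weights; all sizes and the binary-label setup are illustrative, not the settings of mELECTRA itself. For the real model you would instead load AutoModelForSequenceClassification.from_pretrained("AILabTUL/mELECTRA", num_labels=...).

```python
import torch
from transformers import ElectraConfig, ElectraForSequenceClassification

# Toy configuration so the sketch runs offline; the real checkpoint would be
# loaded from the Hub with from_pretrained("AILabTUL/mELECTRA", ...)
config = ElectraConfig(
    vocab_size=1000,
    embedding_size=32,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=2,  # e.g. binary text classification
)
model = ElectraForSequenceClassification(config)

# One fake batch: a sequence of 12 token IDs with a single label
input_ids = torch.randint(0, config.vocab_size, (1, 12))
labels = torch.tensor([1])

outputs = model(input_ids=input_ids, labels=labels)
print(outputs.logits.shape)  # torch.Size([1, 2])
outputs.loss.backward()      # the loss is differentiable, so a standard
                             # optimizer step completes one fine-tuning update
```

The same pattern applies to the other tasks mentioned above by swapping the head class, e.g. ElectraForTokenClassification for named entity recognition.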

Tokenization with SentencePiece

mELECTRA uses a SentencePiece tokenizer and requires the SentencePiece model file (m.model) for correct tokenization. Load this tokenizer as shown below; a different tokenizer would produce token IDs the model was never trained on.

Example: Tokenization

Using HuggingFace AutoTokenizer (Recommended)

from transformers import AutoTokenizer

# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/mELECTRA")

# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./mELECTRA")

# Tokenize input text
sentence = "This is a multilingual model supporting multiple languages."
tokens = tokenizer.tokenize(sentence)
ids = tokenizer.encode(sentence)

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")

# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")

Using SentencePiece directly

import sentencepiece as spm

# Load the SentencePiece model shipped with mELECTRA
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Tokenize input text (note: input should be lowercase)
sentence = "this is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)  # subword pieces
ids = sp.encode(sentence, out_type=int)     # token IDs

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")

# Decode IDs back to text
print(f"Decoded: {sp.decode(ids)}")

Citation

This model was published as part of the research paper:

"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"

@inproceedings{polacek-2025-study,
    title = "Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream",
    author = "Polacek, Martin",
    editor = "Velichkov, Boris  and
      Nikolova-Koleva, Ivelina  and
      Slavcheva, Milena",
    booktitle = "Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing",
    month = sep,
    year = "2025",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2025.ranlp-stud.5/",
    pages = "37--43",
    doi = "10.26615/issn.2603-2821.2025_005"
}
