mpolacek committed · Commit ae99253 · verified · 1 Parent(s): aec6fb6

Update README.md

Files changed (1): README.md (+36, −1)
README.md CHANGED
@@ -15,4 +15,39 @@ language:
license: cc-by-4.0
tags:
- pretraining
---

# mELECTRA (Multilingual ELECTRA)

mELECTRA is an [ELECTRA](https://arxiv.org/abs/2003.10555)-based model pretrained on a diverse multilingual corpus. It supports multiple languages, including **Swedish (SE), Slovenian (SL), Slovak (SK), Portuguese (PT), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DK), German (DE), and Czech (CZ)**. The model can be fine-tuned for various NLP tasks such as text classification, named entity recognition, and masked token prediction.
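As a rough sketch of what such fine-tuning could look like, the snippet below builds a small, randomly initialized ELECTRA classifier from a config with the Hugging Face `transformers` library. The config sizes, label count, and vocabulary size are illustrative assumptions, not the released mELECTRA checkpoint; in practice you would load the actual pretrained weights instead.

```python
# Illustrative sketch only: a tiny, randomly initialized ELECTRA classifier.
# All config values below are assumptions for demonstration, not mELECTRA's.
import torch
from transformers import ElectraConfig, ElectraForSequenceClassification

config = ElectraConfig(
    vocab_size=1000,       # placeholder; the real vocabulary comes from m.model
    embedding_size=32,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=3,          # e.g. a 3-way text classification task
)
model = ElectraForSequenceClassification(config)

# A fake batch of token ids (in practice these come from the tokenizer)
input_ids = torch.randint(0, config.vocab_size, (2, 16))
logits = model(input_ids=input_ids).logits
print(logits.shape)  # torch.Size([2, 3]): one score per label for each input
```

Swapping in the real checkpoint and tokenizer would leave the classification head and forward pass unchanged; only the weights and vocabulary differ.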

This model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), allowing commercial use. If you encounter any issues, please visit our [GitHub repository](https://github.com/your-repo/mELECTRA).

---

## Model Details

- **Architecture:** ELECTRA-Small
- **Languages Supported:** Swedish, Slovenian, Slovak, Portuguese, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
- **Pretraining Data:** Multilingual corpus (news articles, Wikipedia, and web texts)
- **Vocabulary:** SentencePiece-based tokenizer (`m.model`)

---

## Tokenization with SentencePiece

mELECTRA uses a **SentencePiece tokenizer** and requires the SentencePiece model file (`m.model`) for correct tokenization. Make sure you load and use this tokenizer so that your inputs remain compatible with the model.

### Example: Tokenization

```python
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Tokenize input text into subword pieces
sentence = "This is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)
print(tokens)
```