license: cc-by-4.0
tags:
- pretraining
---
# mELECTRA (Multilingual ELECTRA)

mELECTRA is an [ELECTRA](https://arxiv.org/abs/2003.10555)-based model pretrained on a diverse multilingual corpus. It supports **Swedish (SE), Slovenian (SL), Slovak (SK), Portuguese (PT), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DK), German (DE), and Czech (CZ)**, and can be fine-tuned for NLP tasks such as text classification, named entity recognition, and masked token prediction.

This model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), allowing commercial use. If you encounter any issues, please visit our [GitHub repository](https://github.com/your-repo/mELECTRA).

---

## Model Details

- **Architecture:** ELECTRA-Small
- **Languages Supported:** Swedish, Slovenian, Slovak, Portuguese, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
- **Pretraining Data:** Multilingual corpus (news articles, Wikipedia, and web texts)
- **Vocabulary:** SentencePiece-based tokenizer (`m.model`)
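
Since ELECTRA-Small is a standard architecture, the details above can be sketched as a Hugging Face `transformers` configuration. This is an illustrative sketch only: the dimensions below are the published ELECTRA-Small defaults (embedding size 128, hidden size 256, 12 layers, 4 attention heads), not values confirmed for this checkpoint, and the vocabulary size is a placeholder.

```python
from transformers import ElectraConfig, ElectraModel

# Illustrative only: standard ELECTRA-Small hyperparameters; the true
# mELECTRA values (especially vocab_size) may differ from these.
config = ElectraConfig(
    vocab_size=32000,      # placeholder; use the size of the m.model vocabulary
    embedding_size=128,
    hidden_size=256,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = ElectraModel(config)

# Rough parameter count of the randomly initialised encoder
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```

For fine-tuning, the same config could be passed to a task head such as `ElectraForSequenceClassification` instead of the bare `ElectraModel`.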

---

## Tokenization with SentencePiece

mELECTRA uses a **SentencePiece tokenizer** and requires its SentencePiece model file (`m.model`) for correct tokenization. Load this file when tokenizing so that inputs match the vocabulary the model was pretrained with.

### Example: Tokenization

```python
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Tokenize input text
sentence = "This is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)
print(tokens)
```