mpolacek committed · Commit ae99253 · verified · 1 Parent(s): aec6fb6

Update README.md

Files changed (1): README.md (+36, −1)
README.md CHANGED
@@ -15,4 +15,39 @@ language:
license: cc-by-4.0
tags:
- pretraining
---

# mELECTRA (Multilingual ELECTRA)

mELECTRA is an [ELECTRA](https://arxiv.org/abs/2003.10555)-based model pretrained on a diverse multilingual corpus. It supports multiple languages, including **Swedish (SE), Slovenian (SL), Slovak (SK), Portuguese (PT), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DK), German (DE), and Czech (CZ)**. The model can be fine-tuned for various NLP tasks such as text classification, named entity recognition, and masked token prediction.
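As a rough sketch of what such fine-tuning could look like, the snippet below builds a small, randomly initialized ELECTRA classifier from a config with the Hugging Face `transformers` library. The config sizes, label count, and vocabulary size are illustrative assumptions, not the released mELECTRA checkpoint; in practice you would load the actual pretrained weights instead.

```python
# Illustrative sketch only: a tiny, randomly initialized ELECTRA classifier.
# All config values below are assumptions for demonstration, not mELECTRA's.
import torch
from transformers import ElectraConfig, ElectraForSequenceClassification

config = ElectraConfig(
    vocab_size=1000,       # placeholder; the real vocabulary comes from m.model
    embedding_size=32,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=3,          # e.g. a 3-way text classification task
)
model = ElectraForSequenceClassification(config)

# A fake batch of token ids (in practice these come from the tokenizer)
input_ids = torch.randint(0, config.vocab_size, (2, 16))
logits = model(input_ids=input_ids).logits
print(logits.shape)  # torch.Size([2, 3]): one score per label for each input
```

Swapping in the real checkpoint and tokenizer would leave the classification head and forward pass unchanged; only the weights and vocabulary differ.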

This model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), allowing commercial use. If you encounter any issues, please visit our [GitHub repository](https://github.com/your-repo/mELECTRA).

---

## Model Details

- **Architecture:** ELECTRA-Small
- **Languages Supported:** Swedish, Slovenian, Slovak, Portuguese, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
- **Pretraining Data:** Multilingual corpus (news articles, Wikipedia, and web texts)
- **Vocabulary:** SentencePiece-based tokenizer (`m.model`)

---

## Tokenization with SentencePiece

mELECTRA uses a **SentencePiece tokenizer** and requires the SentencePiece model file (`m.model`) for correct tokenization. Make sure you load and use this tokenizer so that your inputs remain compatible with the model.

### Example: Tokenization

```python
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Tokenize input text into subword pieces
sentence = "This is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)
print(tokens)
```