---
language:
- cs
- sl
- sk
- pt
- pl
- 'no'
- it
- hr
- fr
- en
- da
- de
- sv
license: cc-by-4.0
tags:
- pretraining
---

# mELECTRA (Multilingual ELECTRA)

mELECTRA is an [ELECTRA](https://arxiv.org/abs/2003.10555)-based model pretrained on a diverse multilingual corpus. It supports multiple languages, including **Swedish (SV), Slovenian (SL), Slovak (SK), Portuguese (PT), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DA), German (DE), and Czech (CS)**. The model can be fine-tuned for various NLP tasks such as text classification, named entity recognition, and masked token prediction.

This model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), allowing commercial use. 

---

## Model Details

- **Architecture:** ELECTRA-Small
- **Languages Supported:** Swedish, Slovenian, Slovak, Portuguese, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
- **Pretraining Data:** Multilingual corpus (news articles, Wikipedia, and web texts)
- **Vocabulary:** SentencePiece-based tokenizer (`m.model`)

---

## Tokenization with SentencePiece

mELECTRA uses a **SentencePiece tokenizer** and requires a SentencePiece model file (`m.model`) for correct tokenization. Ensure that you properly load and use this tokenizer to maintain compatibility with the model.

### Example: Tokenization

#### Using HuggingFace AutoTokenizer (Recommended)

```python
from transformers import AutoTokenizer

# Load the tokenizer directly from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/mELECTRA")

# Or load from local directory
# tokenizer = AutoTokenizer.from_pretrained("./mELECTRA")

# Tokenize input text
sentence = "This is a multilingual model supporting multiple languages."
tokens = tokenizer.tokenize(sentence)
ids = tokenizer.encode(sentence)

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")

# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
```

#### Using SentencePiece directly

```python
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Tokenize input text (note: input should be lowercase)
sentence = "this is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)
print(tokens)
```
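As mentioned above, the pretrained checkpoint can be fine-tuned for downstream tasks such as text classification. A minimal sketch of attaching a sequence-classification head (the head weights are randomly initialized by `transformers` and must be trained on your labeled data; `num_labels=2` is an illustrative choice, not part of the released model):

```python
from transformers import AutoTokenizer, ElectraForSequenceClassification

# Load the pretrained ELECTRA backbone with a fresh classification head.
# A warning about newly initialized weights is expected here.
model = ElectraForSequenceClassification.from_pretrained(
    "AILabTUL/mELECTRA", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/mELECTRA")

# Run a forward pass; logits have shape (batch_size, num_labels).
inputs = tokenizer("this is a multilingual model.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)
```

From here, fine-tuning proceeds as with any `transformers` classification model, e.g. via the `Trainer` API or a standard PyTorch training loop.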

---

## Citation

This model was published as part of the research paper:

**"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"**  

```
@inproceedings{polacek-2025-study,
    title = "Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream",
    author = "Polacek, Martin",
    editor = "Velichkov, Boris  and
      Nikolova-Koleva, Ivelina  and
      Slavcheva, Milena",
    booktitle = "Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing",
    month = sep,
    year = "2025",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2025.ranlp-stud.5/",
    pages = "37--43",
    doi = "10.26615/issn.2603-2821.2025_005"
}

```
---

## Related Models

- **Czech-Slovak**: [AILabTUL/BiELECTRA-czech-slovak](https://huggingface.co/AILabTUL/BiELECTRA-czech-slovak)
- **Norwegian-Swedish**: [AILabTUL/BiELECTRA-norwegian-swedish](https://huggingface.co/AILabTUL/BiELECTRA-norwegian-swedish)