---
language:
- cs
- sl
- sk
- pt
- pl
- 'no'
- it
- hr
- fr
- en
- da
- de
- sv
license: cc-by-4.0
tags:
- pretraining
---

# mELECTRA (Multilingual ELECTRA)

mELECTRA is an [ELECTRA](https://arxiv.org/abs/2003.10555)-based model pretrained on a diverse multilingual corpus. It supports multiple languages, including **Swedish (SV), Slovenian (SL), Slovak (SK), Portuguese (PT), Polish (PL), Norwegian (NO), Italian (IT), Croatian (HR), French (FR), English (EN), Danish (DA), German (DE), and Czech (CS)**. The model can be fine-tuned for downstream NLP tasks such as text classification and named entity recognition.

This model is released under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/), allowing commercial use.

---

## Model Details

- **Architecture:** ELECTRA-Small
- **Languages Supported:** Swedish, Slovenian, Slovak, Portuguese, Polish, Norwegian, Italian, Croatian, French, English, Danish, German, Czech
- **Pretraining Data:** Multilingual corpus (news articles, Wikipedia, and web texts)
- **Vocabulary:** SentencePiece-based tokenizer (`m.model`)

---
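
ELECTRA is pretrained as a discriminator that classifies each input token as original or replaced. As an illustrative, hedged sketch (it assumes the `AILabTUL/mELECTRA` Hub repo ships the discriminator head together with the encoder weights), the replaced-token scores can be inspected like this:

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

# Assumption: the published checkpoint includes the discriminator head
model_name = "AILabTUL/mELECTRA"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = ElectraForPreTraining.from_pretrained(model_name)

sentence = "this is a multilingual model supporting multiple languages."
inputs = tokenizer(sentence, return_tensors="pt")

# One logit per token; sigmoid > 0.5 means "predicted as replaced"
with torch.no_grad():
    logits = model(**inputs).logits
predictions = (torch.sigmoid(logits) > 0.5).long()
print(predictions)
```

For real text with no corrupted tokens, most predictions should be 0 (original); this is only a sanity check of the architecture, not a downstream task.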

## Tokenization with SentencePiece

mELECTRA uses a **SentencePiece tokenizer** and requires a SentencePiece model file (`m.model`) for correct tokenization. Ensure that you properly load and use this tokenizer to maintain compatibility with the model.

### Example: Tokenization

#### Using HuggingFace AutoTokenizer (Recommended)

```python
from transformers import AutoTokenizer

# Load the tokenizer directly from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("AILabTUL/mELECTRA")

# Or load from a local directory
# tokenizer = AutoTokenizer.from_pretrained("./mELECTRA")

# Tokenize input text
sentence = "This is a multilingual model supporting multiple languages."
tokens = tokenizer.tokenize(sentence)
ids = tokenizer.encode(sentence)

print(f"Tokens: {tokens}")
print(f"IDs: {ids}")

# Decode back to text
decoded = tokenizer.decode(ids)
print(f"Decoded: {decoded}")
```

#### Using SentencePiece directly

```python
import sentencepiece as spm

# Load the SentencePiece model
sp = spm.SentencePieceProcessor()
sp.load("m.model")

# Tokenize input text (note: input should be lowercase)
sentence = "this is a multilingual model supporting multiple languages."
tokens = sp.encode(sentence, out_type=str)
print(tokens)
```
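
With the tokenizer in place, the checkpoint can be loaded for fine-tuning through the standard `transformers` API. The sketch below is a minimal starting point, not the authors' training recipe; it assumes the `AILabTUL/mELECTRA` Hub repo also hosts the model weights, and the hypothetical `num_labels=2` should be replaced to fit your task. The classification head is freshly initialized, so the model must be fine-tuned before its predictions are meaningful:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: the Hub repo exposes ELECTRA encoder weights loadable via AutoModel
model_name = "AILabTUL/mELECTRA"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels=2 is a placeholder; the classification head is randomly initialized
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Forward pass on a small multilingual batch
inputs = tokenizer(
    ["this is great.", "dette er fint."],
    padding=True,
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, num_labels)
```

From here, the model can be trained with the `Trainer` API or a plain PyTorch loop on labeled data.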

---

## Citation

This model was published as part of the research paper:

**"Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream"**

```bibtex
@inproceedings{polacek-2025-study,
    title = "Study on Automatic Punctuation Restoration in Bilingual Broadcast Stream",
    author = "Polacek, Martin",
    editor = "Velichkov, Boris and
      Nikolova-Koleva, Ivelina and
      Slavcheva, Milena",
    booktitle = "Proceedings of the 9th Student Research Workshop associated with the International Conference Recent Advances in Natural Language Processing",
    month = sep,
    year = "2025",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2025.ranlp-stud.5/",
    pages = "37--43",
    doi = "10.26615/issn.2603-2821.2025_005"
}
```

---

## Related Models

- **Czech-Slovak**: [AILabTUL/BiELECTRA-czech-slovak](https://huggingface.co/AILabTUL/BiELECTRA-czech-slovak)
- **Norwegian-Swedish**: [AILabTUL/BiELECTRA-norwegian-swedish](https://huggingface.co/AILabTUL/BiELECTRA-norwegian-swedish)