Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP
Paper
•
2408.04303
•
Published
•
22
This model is a ModernBert model aligned to Polish using Trans-tokenization technique on 24GB paraller sentences corpus from OpenSubtitles.
To be honest I have no idea if it works.
If it does please cite:
@misc {aleksander_obuchowski_2025,
author = { {Aleksander Obuchowski} },
title = { ModernBERT-pl (Revision 477647b) },
year = 2025,
url = { https://huggingface.co/AleksanderObuchowski/ModernBERT-pl },
doi = { 10.57967/hf/5052 },
publisher = { Hugging Face }
}
Base model
answerdotai/ModernBERT-base