Micka-Gen3
Author: Semantika Research
Model Description
Micka-Gen3 is a specialized language model based on the Microsoft RetNet architecture, fine-tuned for Retrieval-Augmented Generation (RAG) in the Slovenian cultural heritage domain. It uses RetNet's efficient retention mechanism and is intended to be used as a baseline and in combination with the GaMS series of models.
A standalone series of models based on the GaMS model will also be released.
Training Data
The model was trained from scratch on:
- GigaFida corpus (Slovenian)
- Slovenian Wikipedia
- Random subset of 10,000 English Wikipedia articles
The model underwent 20 epochs of training on the above datasets.
Finetuning
The final stage involved fine-tuning on 10,000 culturally relevant samples prepared specifically for the PoVeJMo project, focusing on cultural heritage content.
Tokenizer
This model uses the following tokenizer:
- Tokenizer: klokedm/micka-32768
The tokenizer shares the same foundational training data, with additional cultural heritage samples included for domain specificity.
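A minimal sketch of loading the tokenizer, assuming the `klokedm/micka-32768` repository is published in a format that `transformers.AutoTokenizer` can read (the card does not confirm this). The helper names `load_micka_tokenizer` and `count_tokens` are illustrative and not part of any released package.

```python
TOKENIZER_ID = "klokedm/micka-32768"  # repository name taken from this card


def load_micka_tokenizer(tokenizer_id: str = TOKENIZER_ID):
    """Load the tokenizer from the Hugging Face Hub.

    Assumption: the repository is compatible with AutoTokenizer.
    """
    # Imported lazily so the helpers can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(tokenizer_id)


def count_tokens(tokenizer, text: str) -> int:
    """Return the number of token ids produced for `text`.

    Works with any object exposing an `encode(text)` method; a Hugging Face
    tokenizer may include special tokens in this count.
    """
    return len(tokenizer.encode(text))
```

With network access and `transformers` installed, `count_tokens(load_micka_tokenizer(), "Ljubljana je glavno mesto Slovenije.")` would report how many tokens the sentence occupies, which can help when budgeting retrieved context.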
Architecture
Micka-Gen3 is based on the Microsoft RetNet architecture and consists of the following layers:
- 10 decoder layers, each including:
  - Retention layers (q_proj, k_proj, v_proj, g_proj, out_proj)
  - Feed-forward layers (linear1, linear2)
- Embedding layer (embedding.weight)
- Output projection layers (out.weight, out.bias)
The architecture is optimized for long-context document retrieval and generation tasks in combination with large generative AI models.
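To illustrate why a retention layer suits long contexts, the sketch below implements a simplified single-head retention in two equivalent forms: a parallel form that scores every pair of positions, and a recurrent form that carries a fixed-size state from token to token. It is not the Micka-Gen3 implementation. It omits the position rotation, per-head decay rates, normalization, and the gate (g_proj) of the full RetNet design, and the decay value `gamma` and both function names are assumptions chosen for illustration.

```python
def retention_parallel(q, k, v, gamma):
    """Parallel form: out[t] = sum over m <= t of gamma**(t-m) * (q[t] . k[m]) * v[m]."""
    d_v = len(v[0])
    out = []
    for t in range(len(q)):
        row = [0.0] * d_v
        for m in range(t + 1):  # causal: only positions m <= t contribute
            score = sum(a * b for a, b in zip(q[t], k[m])) * gamma ** (t - m)
            for j in range(d_v):
                row[j] += score * v[m][j]
        out.append(row)
    return out


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: state = gamma * state + outer(k[t], v[t]); out[t] = q[t] @ state."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed size, independent of sequence length
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        state = [
            [gamma * state[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        out.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out
```

Both functions return the same values for the same inputs. The recurrent form needs only the `d_k x d_v` state at each step, so per-token memory during generation does not grow with the length of the context.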
Usage
Designed specifically for Retrieval-Augmented Generation (RAG), Micka-Gen3 performs well in:
- Generating contextually accurate responses from cultural heritage texts.
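As a rough illustration of the RAG setting, the sketch below assembles retrieved passages and a question into a single prompt string. The prompt layout, the `max_chars` budget, and the function name `build_rag_prompt` are assumptions made for this example. The card does not specify the prompt format Micka-Gen3 was fine-tuned on, so the template should be adapted once that format is known.

```python
def build_rag_prompt(question, passages, max_chars=2000):
    """Join retrieved passages and a question into one prompt.

    Passages are added in order until the character budget would be exceeded.
    The "Context / Question / Answer" layout is a hypothetical template.
    """
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before overflowing the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return f"Context:\n{context}\n\nQuestion: {question.strip()}\nAnswer:"
```

The passages would typically come from a separate retriever over a cultural heritage collection. The returned string is then tokenized and passed to the model for generation.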
Funding
The development of the Micka Tokenizer was partially funded by the PoVeJMo project, which aims to develop large language models for the Slovenian language.
The project PoVeJMo is cofinanced by:

License
This tokenizer is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). This license allows for sharing and adaptation, provided appropriate credit is given and any derivatives are distributed under the same license.
Citation
Please cite the following if you use Micka-Gen3:
@misc{micka-gen3,
author = {Semantika Research},
title = {Micka-Gen3: A RetNet-based Slovenian Language Model for RAG tasks},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/klokedm/micka-gen3}
}
Contact
For more information, please contact: