| # Micka-Gen3 | |
| **Author**: [Semantika Research](https://semantika.eu) | |
| ## Model Description | |
| **Micka-Gen3** is a specialized language model based on the [Microsoft RetNet](https://github.com/microsoft/unilm/tree/master/retnet) architecture, fine-tuned for Retrieval-Augmented Generation (RAG) in the Slovenian cultural heritage domain. | |
| It relies on RetNet's efficient retention mechanism and is intended to serve as a baseline and to be used in combination with the [GaMS](https://huggingface.co/cjvt/GaMS-9B-Instruct) series of models. | |
| A standalone series of models based on GaMS will also be released. | |
| ## Training Data | |
| The model was trained from scratch on: | |
| - **GigaFida corpus** (Slovenian) | |
| - **Slovenian Wikipedia** | |
| - **Random subset of 10,000 English Wikipedia articles** | |
| The model underwent **20 epochs** of training on the above datasets. | |
| ### Fine-tuning | |
| The final stage involved fine-tuning on **10,000 culturally relevant samples** prepared specifically for the **PoVeJMo project**, focusing on cultural heritage content. | |
| ## Tokenizer | |
| This model uses the following tokenizer: | |
| - **Tokenizer**: [klokedm/micka-32768](https://huggingface.co/klokedm/micka-32768) | |
| The tokenizer was trained on the same foundational data as the model, with additional cultural heritage samples included for domain specificity. | |
| ## Architecture | |
| Micka-Gen3 is based on the **Microsoft RetNet** architecture and consists of the following layers: | |
| - **10 decoder layers**, each including: | |
|   - Retention layers (`q_proj`, `k_proj`, `v_proj`, `g_proj`, `out_proj`) | |
|   - Feed-forward layers (`linear1`, `linear2`) | |
| - Embedding layer (`embedding.weight`) | |
| - Output projection layer (`out.weight`, `out.bias`) | |
| The architecture is optimized for long-context document retrieval and generation tasks in combination with large Generative AI models. | |
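To illustrate what the retention layers compute, here is a minimal single-head sketch in plain Python. It follows the general retention formulation from the RetNet paper, not the actual Micka-Gen3 code. The toy sizes, the decay value `gamma=0.9` and the helper names (`project`, `random_matrix`) are assumptions made for illustration. The gating (`g_proj`), output projection (`out_proj`), multiple heads, normalization and positional rotation are all omitted. The sketch shows that the parallel (training-style) and recurrent (inference-style) forms produce the same outputs.

```python
import random

def project(x, w):
    """Multiply a (seq_len x d_in) matrix by a (d_in x d_out) weight matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def retention_parallel(q, k, v, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma^(n-m) * (q[n] . k[m]) * v[m]."""
    out = []
    for n, q_n in enumerate(q):
        row = [0.0] * len(v[0])
        for m in range(n + 1):  # causal: only positions up to n contribute
            weight = gamma ** (n - m) * sum(a * b for a, b in zip(q_n, k[m]))
            row = [r + weight * x for r, x in zip(row, v[m])]
        out.append(row)
    return out

def retention_recurrent(q, k, v, gamma):
    """Recurrent form: state = gamma * state + outer(k[n], v[n]); out[n] = q[n] @ state."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        state = [[gamma * state[i][j] + k_n[i] * v_n[j] for j in range(d_v)]
                 for i in range(d_k)]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out

def random_matrix(rows, cols, rng):
    """Hypothetical helper: a matrix of uniform random values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model = 5, 4  # toy sizes, not the real model dimensions
x = random_matrix(seq_len, d_model, rng)
q_proj, k_proj, v_proj = (random_matrix(d_model, d_model, rng) for _ in range(3))

q, k, v = project(x, q_proj), project(x, k_proj), project(x, v_proj)
parallel = retention_parallel(q, k, v, gamma=0.9)
recurrent = retention_recurrent(q, k, v, gamma=0.9)
max_diff = max(abs(a - b) for p, r in zip(parallel, recurrent) for a, b in zip(p, r))
print(f"max difference between the two forms: {max_diff:.2e}")
```

The recurrent form keeps a fixed-size state regardless of sequence length, which is what makes retention attractive for long-context inference.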
| ## Usage | |
| Micka-Gen3 is designed specifically for Retrieval-Augmented Generation (RAG): generating contextually accurate responses grounded in cultural heritage texts. | |
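As a rough illustration of the RAG setting, the sketch below assembles retrieved passages and a question into a single prompt string. The helper `build_rag_prompt`, the prompt template and the character budget are hypothetical. None of them is a format documented for Micka-Gen3, so adapt them to whatever retriever and generation interface you actually use.

```python
def build_rag_prompt(question: str, passages: list[str], max_chars: int = 2000) -> str:
    """Join retrieved passages and a question into one prompt string.

    Hypothetical helper: the template and the character budget are
    assumptions, not a format documented for Micka-Gen3.
    """
    context_lines = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        context_lines.append(entry)
        used += len(entry)
    context = "\n".join(context_lines)
    return f"Context:\n{context}\n\nQuestion: {question.strip()}\nAnswer:"


# Example with placeholder passages standing in for retrieved documents.
prompt = build_rag_prompt(
    "Which passage mentions a museum?",
    ["Placeholder passage about a museum.", "Placeholder passage about a castle."],
)
print(prompt)
```

Numbering the passages makes it easier to trace a generated answer back to the text that supports it.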
| ## Funding | |
| The development of the Micka Tokenizer was partially funded by the [PoVeJMo project](https://povejmo.si/), which aims to develop large language models for the Slovenian language. | |
| The PoVeJMo project is co-financed by: | |
|  | |
|  | |
|  | |
| ## License | |
| This model is licensed under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/). This license allows for sharing and adaptation, provided appropriate credit is given and any derivatives are distributed under the same license. | |
| ## Citation | |
| Please cite the following if you use **Micka-Gen3**: | |
| ``` | |
| @misc{micka-gen3, | |
| author = {Semantika Research}, | |
| title = {Micka-Gen3: A RetNet-based Slovenian Language Model for RAG tasks}, | |
| year = {2024}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/klokedm/micka-gen3} | |
| } | |
| ``` | |
| ## Contact | |
| For more information, please contact: | |
| - [Semantika Research](https://semantika.eu) | |
| - [Hugging Face Repository](https://huggingface.co/klokedm/micka-gen3) | |