SemantikaEU/Micka-gen3

klokedm committed on Apr 5, 2025

Commit 0fe518b (verified) · 1 parent: 07db0a6

Update README.md

Files changed (1)

README.md +81 -3

@@ -1,3 +1,81 @@
----
-license: cc-by-sa-4.0
----

+# Micka-Gen3
+**Author**: [Semantika Research](https://semantika.eu)
+## Model Description
+**Micka-Gen3** is a specialized language model based on the [Microsoft RetNet](https://github.com/microsoft/unilm/tree/master/retnet) architecture, fine-tuned for Retrieval-Augmented Generation (RAG) in the Slovenian cultural heritage domain.
+It uses RetNet's efficient retention mechanism and is intended to serve as a baseline and to be used in combination with the [GaMS](https://huggingface.co/cjvt/GaMS-9B-Instruct) series of models.
+A standalone series of models based on GaMS will also be released.
+## Training Data
+The model was trained from scratch on:
+- **GigaFida corpus** (Slovenian)
+- **Slovenian Wikipedia**
+- **A random subset of 10,000 English Wikipedia articles**
+The model underwent **20 epochs** of training on the above datasets.
+### Fine-tuning
+The final stage involved fine-tuning on **10,000 culturally relevant samples** prepared specifically for the **PoVeJMo project**, focusing on cultural heritage content.
+## Tokenizer
+This model uses the following tokenizer:
+- **Tokenizer**: [klokedm/micka-32768](https://huggingface.co/klokedm/micka-32768)
+The tokenizer was trained on the same foundational data as the model, with additional cultural heritage samples included for domain specificity.
+## Architecture
+Micka-Gen3 is based on the **Microsoft RetNet** architecture and consists of the following layers:
+- **10 decoder layers**, each including:
+  - Retention layers (`q_proj`, `k_proj`, `v_proj`, `g_proj`, `out_proj`)
+  - Feed-forward layers (`linear1`, `linear2`)
+- Embedding layer (`embedding.weight`)
+- Output projection layer (`out.weight`, `out.bias`)
+The architecture is optimized for long-context document retrieval and generation tasks in combination with large generative AI models.
+## Usage
+Micka-Gen3 is designed specifically for Retrieval-Augmented Generation (RAG) and performs well in:
+- Generating contextually accurate responses from cultural heritage texts.
+## Funding
+The development of Micka-Gen3 was partially funded by the [PoVeJMo project](https://povejmo.si/), which aims to develop large language models for the Slovenian language.
+The PoVeJMo project is co-financed by:
+![ARIS](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/ARISLogoSlo_small.jpg)
+![NOO](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/NOO_2023_logotip-transparent_povejmo.png)
+![NextGenerationEU](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/Financira_Evropska_unija_2023_logotip-transparent_povejmo.png)
+## License
+This model is licensed under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/). The license allows sharing and adaptation, provided that appropriate credit is given and any derivatives are distributed under the same license.
+## Citation
+Please cite the following if you use **Micka-Gen3**:
+```
+@misc{micka-gen3,
+  author = {Semantika Research},
+  title = {Micka-Gen3: A RetNet-based Slovenian Language Model for RAG tasks},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/klokedm/micka-gen3}
+}
+```
+## Contact
+For more information, please contact:
+- [Semantika Research](https://semantika.eu)
+- [Hugging Face Repository](https://huggingface.co/klokedm/micka-gen3)
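
The retention layers listed in the Architecture section follow the RetNet design, in which the same output can be computed either in parallel over the whole sequence or recurrently, token by token. The sketch below is not part of the committed README and is not the model's actual code. It is a minimal single-head illustration in plain Python that assumes no positional rotation, no gating (`g_proj`), and no normalization. The function names and the decay value `gamma = 0.5` are hypothetical choices made for the example.

```python
def retention_parallel(q, k, v, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma**(n - m) * (q[n] . k[m]) * v[m]."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                row[j] += score * v[m][j]
        out.append(row)
    return out


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, then out_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state, independent of sequence length
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy inputs (hypothetical): 2 tokens, key dimension 2, value dimension 2.
q = [[1.0, 0.0], [1.0, 1.0]]
k = [[1.0, 0.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
gamma = 0.5

parallel_out = retention_parallel(q, k, v, gamma)
recurrent_out = retention_recurrent(q, k, v, gamma)
print(parallel_out)
print(recurrent_out)
```

Both forms yield the same values on this input. The recurrent form keeps only a fixed-size state per head, which is why inference cost per token does not grow with context length.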