# Micka-Gen3
**Author**: [Semantika Research](https://semantika.eu)
## Model Description
**Micka-Gen3** is a specialized language model based on the [Microsoft RetNet](https://github.com/microsoft/unilm/tree/master/retnet) architecture, fine-tuned for Retrieval-Augmented Generation (RAG) in the Slovenian cultural heritage domain.
It leverages an efficient retention mechanism and is intended to serve as a baseline and to be used in combination with the [GaMS](https://huggingface.co/cjvt/GaMS-9B-Instruct) series of models.
A standalone series of models, based on the GaMS model, will also be released.
## Training Data
The model was trained from scratch on:
- **GigaFida corpus** (Slovenian)
- **Slovenian Wikipedia**
- **A random subset of 10,000 English Wikipedia articles**
The model underwent **20 epochs** of training on the above datasets.
### Finetuning
The final stage involved finetuning on **10,000 culturally relevant samples** prepared specifically for the **Povejmo Project**, focusing on cultural heritage content.
## Tokenizer
This model uses the following tokenizer:
- **Tokenizer**: [klokedm/micka-32768](https://huggingface.co/klokedm/micka-32768)
The tokenizer shares the same foundational training data, with additional cultural heritage samples included for domain specificity.
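As a quick illustration, the snippet below loads the tokenizer from the Hub. It is a minimal sketch assuming the repository ships standard Hugging Face tokenizer files loadable via `AutoTokenizer`; the sample sentence is made up.

```python
# Minimal sketch: load the Micka tokenizer and inspect its output.
# Assumes the repo provides standard Hugging Face tokenizer files;
# the example sentence is illustrative only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klokedm/micka-32768")

text = "Kulturna dediščina Slovenije"  # "Cultural heritage of Slovenia"
ids = tokenizer(text)["input_ids"]
print(ids)                                   # token ids
print(tokenizer.convert_ids_to_tokens(ids))  # corresponding subword tokens
```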
## Architecture
Micka-Gen3 is based on the **Microsoft RetNet** architecture and comprises the following layers:
- **10 decoder layers**, each including:
  - Retention layers (`q_proj`, `k_proj`, `v_proj`, `g_proj`, `out_proj`)
  - Feed-forward layers (`linear1`, `linear2`)
- Embedding layer (`embedding.weight`)
- Output projection layers (`out.weight`, `out.bias`)
The architecture is optimized for long-context document retrieval and generation tasks in combination with large generative AI models; a simplified sketch of one decoder layer follows.
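The sketch below is a single-head PyTorch illustration of one such decoder layer in retention's parallel form. The projection and feed-forward names (`q_proj`, `k_proj`, `v_proj`, `g_proj`, `out_proj`, `linear1`, `linear2`) follow the list above; the hidden size, decay constant, and normalization placement are illustrative assumptions, not the released configuration.

```python
# Illustrative single-head retention decoder block (parallel form).
# Dimensions and the decay rate are assumptions for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleRetentionBlock(nn.Module):
    def __init__(self, d_model: int = 512, d_ffn: int = 2048, decay: float = 0.96875):
        super().__init__()
        # Retention projections, named as in the layer list above.
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.g_proj = nn.Linear(d_model, d_model, bias=False)  # gating branch
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        # Position-wise feed-forward layers.
        self.linear1 = nn.Linear(d_model, d_ffn)
        self.linear2 = nn.Linear(d_ffn, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.decay = decay

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        residual = x
        h = self.norm1(x)
        q, k, v = self.q_proj(h), self.k_proj(h), self.v_proj(h)
        # Causal decay matrix D[n, m] = decay**(n - m) for m <= n, else 0.
        idx = torch.arange(x.size(1), device=x.device)
        dist = idx.unsqueeze(1) - idx.unsqueeze(0)  # n - m
        d_mask = (self.decay ** dist.clamp(min=0)) * (dist >= 0)
        # Parallel-form retention: (Q K^T / sqrt(d) * D) V.
        scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5 * d_mask
        ret = scores @ v
        # Swish gate over the retention output, then output projection.
        gated = F.silu(self.g_proj(h)) * ret
        x = residual + self.out_proj(gated)
        # Feed-forward sub-layer with residual connection.
        x = x + self.linear2(F.gelu(self.linear1(self.norm2(x))))
        return x

block = SimpleRetentionBlock()
y = block(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

The decay matrix `D` is what allows retention to be evaluated either in this parallel form or as a recurrence with constant state per step, which is the property that makes the architecture attractive for long-context retrieval.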
## Usage
Designed specifically for Retrieval-Augmented Generation (RAG), Micka-Gen3 performs well in:
- Generating contextually accurate responses from cultural heritage texts, as sketched below.
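A minimal end-to-end sketch of that usage pattern follows. It assumes the checkpoint can be loaded through `transformers` with `trust_remote_code=True` (RetNet is not a built-in `transformers` architecture) and that a retriever has already selected the relevant passages; the prompt layout, generation settings, and example passage are assumptions, not a documented API of this repository.

```python
# Sketch of a RAG call pattern; loading path and prompt format are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klokedm/micka-32768")
model = AutoModelForCausalLM.from_pretrained(
    "klokedm/micka-gen3", trust_remote_code=True
)

# Passages returned by your retriever (hypothetical example content).
passages = [
    "Predjamski grad je srednjeveški grad, zgrajen v skalni steni pri Postojni.",
]
question = "Kje stoji Predjamski grad?"  # "Where does Predjama Castle stand?"

prompt = "\n\n".join(passages) + f"\n\nVprašanje: {question}\nOdgovor:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```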
## Funding
The development of Micka-Gen3 was partially funded by the [PoVeJMo project](https://povejmo.si/), which aims to develop large language models for the Slovenian language.
The PoVeJMo project is co-financed by:
![ARIS](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/ARISLogoSlo_small.jpg)
![NOO](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/NOO_2023_logotip-transparent_povejmo.png)
![NextGenerationEU](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/Financira_Evropska_unija_2023_logotip-transparent_povejmo.png)
## License
This model is licensed under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/). This license allows for sharing and adaptation, provided appropriate credit is given and any derivatives are distributed under the same license.
## Citation
Please cite the following if you use **Micka-Gen3**:
```bibtex
@misc{micka-gen3,
  author    = {Semantika Research},
  title     = {Micka-Gen3: A RetNet-based Slovenian Language Model for RAG tasks},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/klokedm/micka-gen3}
}
```
## Contact
For more information, please contact:
- [Semantika Research](https://semantika.eu)
- [Hugging Face Repository](https://huggingface.co/klokedm/micka-gen3)