Micka-Gen3

Author: Semantika Research

Model Description

Micka-Gen3 is a specialized language model based on the Microsoft RetNet architecture, fine-tuned for Retrieval-Augmented Generation (RAG) in the Slovenian cultural heritage domain. It uses RetNet's retention mechanism and is intended to serve as a baseline and to be used in combination with the GaMS series of models.

A standalone series of models based on the GaMS model will also be released.

Training Data

The model was trained from scratch on:

  • GigaFida corpus (Slovenian)
  • Slovenian Wikipedia
  • Random subset of 10,000 English Wikipedia articles

The model was trained for 20 epochs on the datasets above.

Finetuning

The final stage was finetuning on 10,000 culturally relevant samples prepared specifically for the PoVeJMo project, with a focus on cultural heritage content.

Tokenizer

This model uses the following tokenizer:

The tokenizer shares the same foundational training data, with additional cultural heritage samples included for domain specificity.

Architecture

Micka-Gen3 is based on the Microsoft RetNet architecture and consists of the following layers:

  • 10 decoder layers, each including:
    • Retention layers (q_proj, k_proj, v_proj, g_proj, out_proj)
    • Feed-forward layers (linear1, linear2)
  • Embedding layer (embedding.weight)
  • Output projection layers (out.weight, out.bias)

The architecture is optimized for long-context document retrieval and generation tasks, used in combination with large generative AI models.
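
To illustrate what the retention layers compute, the following is a minimal sketch of single-head retention as formulated for RetNet, written in pure Python. It is not the Micka-Gen3 implementation: it assumes the queries, keys, and values have already been produced by projections such as q_proj, k_proj, and v_proj, and it omits gating (g_proj), the output projection, position encoding, and normalization. The decay value gamma and the toy inputs are hypothetical. The sketch shows that the parallel form (convenient for training) and the recurrent form (a fixed-size state per step, convenient for long inputs) give the same outputs.

```python
def retention_parallel(Q, K, V, gamma):
    """Parallel form: o_n = sum over m <= n of gamma**(n - m) * (q_n . k_m) * v_m."""
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        o = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            score = sum(q * k for q, k in zip(Q[n], K[m])) * gamma ** (n - m)
            for j in range(d_v):
                o[j] += score * V[m][j]
        out.append(o)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # state size does not grow with sequence length
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Toy inputs (hypothetical): 3 tokens, key/query dimension 2, value dimension 2.
Q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
K = [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

parallel = retention_parallel(Q, K, V, gamma=0.9)
recurrent = retention_recurrent(Q, K, V, gamma=0.9)
```

Comparing `parallel` and `recurrent` element by element should show agreement up to floating-point rounding, which is the property that lets a RetNet-style model train in parallel and decode with constant memory per step.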

Usage

Designed specifically for Retrieval-Augmented Generation (RAG), Micka-Gen3 performs well in:

  • Generating contextually accurate responses from Cultural Heritage Texts.
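
As a rough illustration of the RAG setting described above, the sketch below retrieves the most relevant passage for a question and assembles a prompt from it. Everything here is an assumption for illustration: the word-overlap scoring stands in for whatever retriever is actually used, the prompt template is hypothetical and not a documented format for Micka-Gen3, and the passages are sample sentences rather than project data. Model loading and generation are left out because this card does not describe an inference API.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, passages, top_k=1):
    """Rank passages by word overlap with the query (a stand-in for a real retriever)."""
    query_tokens = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(query_tokens & tokenize(p)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages):
    """Join retrieved passages and the question into one prompt (hypothetical template)."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Sample passages (illustrative only).
passages = [
    "France Prešeren wrote the poem Zdravljica.",
    "Ljubljana Castle stands on a hill above the city centre.",
    "Lipica is known for its Lipizzan horses.",
]
query = "Who wrote Zdravljica?"

top_passages = retrieve(query, passages, top_k=1)
prompt = build_prompt(query, top_passages)
```

The resulting `prompt` string would then be passed to the generating model, so that the answer is conditioned on the retrieved cultural heritage text rather than on the model's parameters alone.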

Funding

The development of the Micka Tokenizer was partially funded by the PoVeJMo project, which aims to develop large language models for the Slovenian language. The PoVeJMo project is co-financed by ARIS, NOO, and NextGenerationEU.

License

This tokenizer is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). This license allows for sharing and adaptation, provided appropriate credit is given and any derivatives are distributed under the same license.

Citation

Please cite the following if you use Micka-Gen3:

@misc{micka-gen3,
  author = {Semantika Research},
  title = {Micka-Gen3: A RetNet-based Slovenian Language Model for RAG tasks},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/klokedm/micka-gen3}
}

Contact

For more information, please contact:
