klokedm committed · Commit 0fe518b · verified · 1 Parent(s): 07db0a6

Update README.md

README.md CHANGED
@@ -1,3 +1,81 @@
- ---
- license: cc-by-sa-4.0
- ---
# Micka-Gen3

**Author**: [Semantika Research](https://semantika.eu)

## Model Description

**Micka-Gen3** is a specialized language model based on the [Microsoft RetNet](https://github.com/microsoft/unilm/tree/master/retnet) architecture, fine-tuned for Retrieval-Augmented Generation (RAG) in the Slovenian cultural heritage domain.
It leverages RetNet's efficient retention mechanism and is intended as a baseline, used in combination with the [GaMS](https://huggingface.co/cjvt/GaMS-9B-Instruct) series of models.

A standalone series of models, based on the GaMS model, will also be released.
11
+
12
+ ## Training Data
13
+
14
+ The model was trained from scratch on:
15
+ - **GigaFida corpus** (Slovenian)
16
+ - **Slovenian Wikipedia**
17
+ - **Random subset of 10,000 English Wikipedia articles**
18
+
19
+ The model underwent **20 epochs** of training on the above datasets.

### Finetuning

The final stage involved finetuning on **10,000 culturally relevant samples** prepared specifically for the **PoVeJMo project**, focusing on cultural heritage content.

## Tokenizer

This model uses the following tokenizer:
- **Tokenizer**: [klokedm/micka-32768](https://huggingface.co/klokedm/micka-32768)

The tokenizer shares the same foundational training data, with additional cultural heritage samples included for domain specificity.

## Architecture

Micka-Gen3 is based on the **Microsoft RetNet** architecture with the following layers:

- **10 decoder layers**, each including:
  - Retention layers (q_proj, k_proj, v_proj, g_proj, out_proj)
  - Feed-forward layers (linear1, linear2)
- Embedding layer (`embedding.weight`)
- Output projection layers (`out.weight`, `out.bias`)

The architecture is optimized for long-context document retrieval and generation tasks in combination with large generative AI models.
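RetNet's retention mechanism admits equivalent parallel (training-time) and recurrent (inference-time) forms, which is what makes the architecture attractive for long-context work. A minimal single-head sketch in NumPy, illustrative only and not the model's actual implementation (the decay `gamma` and dimensions are arbitrary choices here):

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    # Parallel form: O = (Q K^T * D) V, where D[n, m] = gamma^(n-m)
    # for n >= m and 0 otherwise (a causal decay mask).
    T = Q.shape[0]
    n = np.arange(T)
    D = np.where(n[:, None] >= n[None, :],
                 gamma ** (n[:, None] - n[None, :]), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    # Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n ;  o_n = q_n S_n.
    # Constant memory per step, regardless of sequence length.
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))  # seq_len=5, head dim=4
O_par = retention_parallel(Q, K, V, gamma=0.9)
O_rec = retention_recurrent(Q, K, V, gamma=0.9)
print(np.allclose(O_par, O_rec))  # True: both forms agree
```

The recurrent form is what gives RetNet O(1) per-token inference cost, in contrast to a standard attention KV cache that grows with context length.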

## Usage

Designed specifically for Retrieval-Augmented Generation (RAG), Micka-Gen3 performs well at generating contextually accurate responses from cultural heritage texts.
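In a typical RAG flow, retrieved heritage passages are prepended to the prompt before generation. A toy sketch of that flow, assuming a hypothetical word-overlap retriever and an invented example corpus (a real deployment would use a proper retrieval index, and the resulting prompt would be passed to the model):

```python
def retrieve(query, corpus, top_k=2):
    # Toy lexical retriever: rank passages by word overlap with the query.
    q_words = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda p: len(q_words & set(p.lower().split())),
                    reverse=True)
    return scored[:top_k]

def build_prompt(query, passages):
    # Prepend retrieved context to the question for the generator.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Predjama Castle is a Renaissance castle built within a cave mouth in Slovenia.",
    "The Skofja Loka Passion Play is the oldest preserved dramatic text in Slovenian.",
    "Lake Bled features a church on a small natural island.",
]
query = "Which castle in Slovenia is built in a cave?"
prompt = build_prompt(query, retrieve("castle cave Slovenia", corpus))
print(prompt)
```

The generation step itself is omitted, since it depends on the RetNet runtime used to serve the model.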

## Funding

The development of Micka-Gen3 was partially funded by the [PoVeJMo project](https://povejmo.si/), which aims to develop large language models for the Slovenian language.
The PoVeJMo project is co-financed by:

![ARIS](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/ARISLogoSlo_small.jpg)
![NOO](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/NOO_2023_logotip-transparent_povejmo.png)
![NextGenerationEU](https://www.cjvt.si/povejmo/wp-content/uploads/sites/28/2023/11/Financira_Evropska_unija_2023_logotip-transparent_povejmo.png)

## License

This model is licensed under the [Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/). This license allows sharing and adaptation, provided appropriate credit is given and any derivatives are distributed under the same license.

## Citation

Please cite the following if you use **Micka-Gen3**:

```bibtex
@misc{micka-gen3,
  author = {Semantika Research},
  title = {Micka-Gen3: A RetNet-based Slovenian Language Model for RAG tasks},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/klokedm/micka-gen3}
}
```

## Contact

For more information, please contact:
- [Semantika Research](https://semantika.eu)
- [Hugging Face Repository](https://huggingface.co/klokedm/micka-gen3)