File size: 5,642 Bytes
2509f6f
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
---
license: gemma
language:
  - sl
  - en
base_model:
  - cjvt/GaMS3-12B
datasets:
  - TPreview/SloLexQA
pipeline_tag: text-generation
---

# GaMS3-12B-Instruct-Lex

**GaMS3-12B-Instruct-Lex** is a Slovene-focused instruction-tuned large language model for lexical question answering.
It is based on **GaMS3** and has been further adapted through:

1. **Continual Pretraining (CPT)** on a large, automatically constructed Slovene lexical corpus
2. **Instruction fine-tuning** on a diverse mix of Slovene QA, writing, and code-feedback datasets
3. **Task-specific fine-tuning** on **SloLexQA**

The resulting model is optimized for **lexical reasoning, grammatical explanation, and Slovene question answering**, both with and without additional context.

---

## Model Lineage

* **Base model**: `google/gemma-3-12b-pt`
* **Intermediate model**: GaMS3 (non-instruction-tuned)
* **Final model**: GaMS3-12B-Instruct-Lex (instruction-tuned + QA-specialized)

The starting point was the **non-finetuned GaMS3 model**, which itself is part of the GaMS (Generative Model for Slovene) family developed at the University of Ljubljana.

---

## Continual Pretraining (Lexical CPT)

Before instruction tuning, the model underwent **continual pretraining** on a **Lexical LLM Pretraining Corpus** constructed from high-quality Slovene lexical resources.

### Lexical CPT Corpus Overview

* **Entries**: 356,294 single-lexeme entries
* **Format**: Markdown (`corpus.txt`)
* **Primary source**: Digital Dictionary Database (DDD)

Each entry contains:

* Lemma
* Full morphological paradigm
* Word senses and definitions
* Usage examples
* Collocations
* Synonyms (sense-grouped)

### Integrated Linguistic Resources

* Digital Dictionary Database (DDD)
* WSD dataset (Slovene word sense disambiguation)
* sloWnet
* Bridge Dictionary
* Synonyms Dictionary

### CPT Objectives

This CPT stage aims to improve:

* Morphological awareness
* Sense disambiguation
* Lexical semantic reasoning
* Definition and usage explanation

---

## General Instruction Fine-Tuning

After lexical CPT, the model was instruction-tuned on a diverse set of general Slovene datasets:

| Dataset                | Language | Train  | Eval  | Description                                                 |
| ---------------------- | -------- | ------ | ----- | ----------------------------------------------------------- |
| GaMS-Instruct-ClosedQA | Slovene  | 10,825 | 1,202 | Closed-domain QA generated with GPT-4o and Gemini-2.0-Flash |
| GaMS-Instruct-OpenQA   | Slovene  | 28,704 | 3,189 | Open-domain QA across 400+ micro-topics                     |
| GaMS-Instruct-Writing  | Slovene  | 9,056  | 1,006 | Writing and composition tasks                               |
| GaMS-Instruct-DH-1.0   | Slovene  | 9,135  | 1,015 | Digital humanities dataset (CLARIN.SI)                      |
| SlCode                 | Slovene  | 10,000 | 1,000 | Code feedback and explanation dataset                       |

These datasets provide strong **instruction-following**, **reasoning**, and **explanatory** behavior in Slovene.

---

## SloLexQA Fine-Tuning

The final specialization stage fine-tuned the model on:

* **Dataset**: `hf:TPreview/SloLexQA`
* **Total QA pairs**: 16,508
* **Train / Test split**: 90.1% / 9.9%
* **Context-aware questions**: ~4.6%

### SloLexQA Highlights

* Linguistically grounded Slovene QA
* Mixture of:

  * Automatically generated grammatical questions
  * Real-world linguistic advisory questions
  * Forum-sourced language usage questions
* Covers:

  * Definitions
  * Usage examples
  * Grammar and morphology
  * Lexical semantics

This step significantly improves **precision, stylistic consistency, and robustness** on Slovene QA tasks.

---

## Intended Use

### Primary Use Cases

* Slovene question answering (with or without context)
* Grammar and morphology explanations
* Lexical and semantic analysis
* Linguistic research and education
* Instruction-following applications in Slovene

### Secondary Use Cases

* Writing assistance
* Language learning tools
* Dataset annotation support
* NLP research and prototyping

---

## Limitations

* The model may produce **hallucinated facts**, especially outside linguistic domains.
* Performance is strongest in **Slovene**; other languages are secondary.
* Long-context reasoning depends on prompt quality and task complexity.
* Subtle pragmatic or stylistic nuances may still require human validation.

---

## Ethical Considerations

* Training data was filtered to reduce personally identifiable information (PII).
* Linguistic and web data may still encode **socio-cultural biases**.
* Users are encouraged to:

  * Perform downstream bias evaluation
  * Apply content filtering where appropriate
  * Follow responsible AI deployment practices

Use of this model is subject to the **Gemma license** and its prohibited use policies.

---

## Acknowledgments

This model builds upon the **GaMS3** research effort developed at the **University of Ljubljana, Faculty of Computer and Information Science**, within the PoVeJMo and SloLLaMai research programs.

We thank all contributors to:

* Slovene linguistic resources
* Dataset construction and annotation
* Infrastructure support (EuroHPC, NVIDIA DGX Cloud)

---

## Citation

If you use this model in academic work, please cite both:

* The **GaMS3 model**
* The **SloLexQA dataset**

```bibtex
@misc{slollexqa_gams3,
  title={GaMS3-12B-Instruct-Lex: Lexically Enhanced Slovene QA Model},
  author={University of Ljubljana, Faculty of Computer and Information Science},
  year={2026},
  url={https://huggingface.co/TPreview/SloLexQA}
}
```