File size: 5,642 Bytes
2509f6f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 | ---
license: gemma
language:
- sl
- en
base_model:
- cjvt/GaMS3-12B
datasets:
- TPreview/SloLexQA
pipeline_tag: text-generation
---
# GaMS3-12B-Instruct-Lex
**GaMS3-12B-Instruct-Lex** is a Slovene-focused instruction-tuned large language model for lexical question answering.
It is based on **GaMS3** and has been further adapted through:
1. **Continual Pretraining (CPT)** on a large, automatically constructed Slovene lexical corpus
2. **Instruction fine-tuning** on a diverse mix of Slovene QA, writing, and code-feedback datasets
3. **Task-specific fine-tuning** on **SloLexQA**
The resulting model is optimized for **lexical reasoning, grammatical explanation, and Slovene question answering**, both with and without additional context.
---
## Model Lineage
* **Base model**: `google/gemma-3-12b-pt`
* **Intermediate model**: GaMS3 (non-instruction-tuned)
* **Final model**: GaMS3-12B-Instruct-Lex (instruction-tuned + QA-specialized)
The starting point was the **non-finetuned GaMS3 model**, which itself is part of the GaMS (Generative Model for Slovene) family developed at the University of Ljubljana.
---
## Continual Pretraining (Lexical CPT)
Before instruction tuning, the model underwent **continual pretraining** on a **Lexical LLM Pretraining Corpus** constructed from high-quality Slovene lexical resources.
### Lexical CPT Corpus Overview
* **Entries**: 356,294 single-lexeme entries
* **Format**: Markdown (`corpus.txt`)
* **Primary source**: Digital Dictionary Database (DDD)
Each entry contains:
* Lemma
* Full morphological paradigm
* Word senses and definitions
* Usage examples
* Collocations
* Synonyms (sense-grouped)
### Integrated Linguistic Resources
* Digital Dictionary Database (DDD)
* WSD dataset (Slovene word sense disambiguation)
* sloWnet
* Bridge Dictionary
* Synonyms Dictionary
### CPT Objectives
This CPT stage aims to improve:
* Morphological awareness
* Sense disambiguation
* Lexical semantic reasoning
* Definition and usage explanation
---
## General Instruction Fine-Tuning
After lexical CPT, the model was instruction-tuned on a diverse set of general Slovene datasets:
| Dataset | Language | Train | Eval | Description |
| ---------------------- | -------- | ------ | ----- | ----------------------------------------------------------- |
| GaMS-Instruct-ClosedQA | Slovene | 10,825 | 1,202 | Closed-domain QA generated with GPT-4o and Gemini-2.0-Flash |
| GaMS-Instruct-OpenQA | Slovene | 28,704 | 3,189 | Open-domain QA across 400+ micro-topics |
| GaMS-Instruct-Writing | Slovene | 9,056 | 1,006 | Writing and composition tasks |
| GaMS-Instruct-DH-1.0 | Slovene | 9,135 | 1,015 | Digital humanities dataset (CLARIN.SI) |
| SlCode | Slovene | 10,000 | 1,000 | Code feedback and explanation dataset |
These datasets provide strong **instruction-following**, **reasoning**, and **explanatory** behavior in Slovene.
---
## SloLexQA Fine-Tuning
The final specialization stage fine-tuned the model on:
* **Dataset**: `hf:TPreview/SloLexQA`
* **Total QA pairs**: 16,508
* **Train / Test split**: 90.1% / 9.9%
* **Context-aware questions**: ~4.6%
### SloLexQA Highlights
* Linguistically grounded Slovene QA
* Mixture of:
* Automatically generated grammatical questions
* Real-world linguistic advisory questions
* Forum-sourced language usage questions
* Covers:
* Definitions
* Usage examples
* Grammar and morphology
* Lexical semantics
This step significantly improves **precision, stylistic consistency, and robustness** on Slovene QA tasks.
---
## Intended Use
### Primary Use Cases
* Slovene question answering (with or without context)
* Grammar and morphology explanations
* Lexical and semantic analysis
* Linguistic research and education
* Instruction-following applications in Slovene
### Secondary Use Cases
* Writing assistance
* Language learning tools
* Dataset annotation support
* NLP research and prototyping
---
## Limitations
* The model may produce **hallucinated facts**, especially outside linguistic domains.
* Performance is strongest in **Slovene**; other languages are secondary.
* Long-context reasoning depends on prompt quality and task complexity.
* Subtle pragmatic or stylistic nuances may still require human validation.
---
## Ethical Considerations
* Training data was filtered to reduce personally identifiable information (PII).
* Linguistic and web data may still encode **socio-cultural biases**.
* Users are encouraged to:
* Perform downstream bias evaluation
* Apply content filtering where appropriate
* Follow responsible AI deployment practices
Use of this model is subject to the **Gemma license** and its prohibited use policies.
---
## Acknowledgments
This model builds upon the **GaMS3** research effort developed at the **University of Ljubljana, Faculty of Computer and Information Science**, within the PoVeJMo and SloLLaMai research programs.
We thank all contributors to:
* Slovene linguistic resources
* Dataset construction and annotation
* Infrastructure support (EuroHPC, NVIDIA DGX Cloud)
---
## Citation
If you use this model in academic work, please cite both:
* The **GaMS3 model**
* The **SloLexQA dataset**
```bibtex
@misc{slollexqa_gams3,
title={GaMS3-12B-Instruct-Lex: Lexically Enhanced Slovene QA Model},
author={University of Ljubljana, Faculty of Computer and Information Science},
year={2026},
url={https://huggingface.co/TPreview/SloLexQA}
}
``` |