File size: 5,642 Bytes

2509f6f

---
license: gemma
language:
  - sl
  - en
base_model:
  - cjvt/GaMS3-12B
datasets:
  - TPreview/SloLexQA
pipeline_tag: text-generation
---

# GaMS3-12B-Instruct-Lex

**GaMS3-12B-Instruct-Lex** is a Slovene-focused instruction-tuned large language model for lexical question answering.
It is based on **GaMS3** and has been further adapted through:

1. **Continual Pretraining (CPT)** on a large, automatically constructed Slovene lexical corpus
2. **Instruction fine-tuning** on a diverse mix of Slovene QA, writing, and code-feedback datasets
3. **Task-specific fine-tuning** on **SloLexQA**

The resulting model is optimized for **lexical reasoning, grammatical explanation, and Slovene question answering**, both with and without additional context.

---

## Model Lineage

* **Base model**: `google/gemma-3-12b-pt`
* **Intermediate model**: GaMS3 (non-instruction-tuned)
* **Final model**: GaMS3-12B-Instruct-Lex (instruction-tuned + QA-specialized)

The starting point was the **non-finetuned GaMS3 model**, which itself is part of the GaMS (Generative Model for Slovene) family developed at the University of Ljubljana.

---

## Continual Pretraining (Lexical CPT)

Before instruction tuning, the model underwent **continual pretraining** on a **Lexical LLM Pretraining Corpus** constructed from high-quality Slovene lexical resources.

### Lexical CPT Corpus Overview

* **Entries**: 356,294 single-lexeme entries
* **Format**: Markdown (`corpus.txt`)
* **Primary source**: Digital Dictionary Database (DDD)

Each entry contains:

* Lemma
* Full morphological paradigm
* Word senses and definitions
* Usage examples
* Collocations
* Synonyms (sense-grouped)

### Integrated Linguistic Resources

* Digital Dictionary Database (DDD)
* WSD dataset (Slovene word sense disambiguation)
* sloWnet
* Bridge Dictionary
* Synonyms Dictionary

### CPT Objectives

This CPT stage aims to improve:

* Morphological awareness
* Sense disambiguation
* Lexical semantic reasoning
* Definition and usage explanation

---

## General Instruction Fine-Tuning

After lexical CPT, the model was instruction-tuned on a diverse set of general Slovene datasets:

| Dataset                | Language | Train  | Eval  | Description                                                 |
| ---------------------- | -------- | ------ | ----- | ----------------------------------------------------------- |
| GaMS-Instruct-ClosedQA | Slovene  | 10,825 | 1,202 | Closed-domain QA generated with GPT-4o and Gemini-2.0-Flash |
| GaMS-Instruct-OpenQA   | Slovene  | 28,704 | 3,189 | Open-domain QA across 400+ micro-topics                     |
| GaMS-Instruct-Writing  | Slovene  | 9,056  | 1,006 | Writing and composition tasks                               |
| GaMS-Instruct-DH-1.0   | Slovene  | 9,135  | 1,015 | Digital humanities dataset (CLARIN.SI)                      |
| SlCode                 | Slovene  | 10,000 | 1,000 | Code feedback and explanation dataset                       |

These datasets provide strong **instruction-following**, **reasoning**, and **explanatory** behavior in Slovene.

---

## SloLexQA Fine-Tuning

The final specialization stage fine-tuned the model on:

* **Dataset**: `hf:TPreview/SloLexQA`
* **Total QA pairs**: 16,508
* **Train / Test split**: 90.1% / 9.9%
* **Context-aware questions**: ~4.6%

### SloLexQA Highlights

* Linguistically grounded Slovene QA
* Mixture of:

  * Automatically generated grammatical questions
  * Real-world linguistic advisory questions
  * Forum-sourced language usage questions
* Covers:

  * Definitions
  * Usage examples
  * Grammar and morphology
  * Lexical semantics

This step significantly improves **precision, stylistic consistency, and robustness** on Slovene QA tasks.

---

## Intended Use

### Primary Use Cases

* Slovene question answering (with or without context)
* Grammar and morphology explanations
* Lexical and semantic analysis
* Linguistic research and education
* Instruction-following applications in Slovene

### Secondary Use Cases

* Writing assistance
* Language learning tools
* Dataset annotation support
* NLP research and prototyping

---

## Limitations

* The model may produce **hallucinated facts**, especially outside linguistic domains.
* Performance is strongest in **Slovene**; other languages are secondary.
* Long-context reasoning depends on prompt quality and task complexity.
* Subtle pragmatic or stylistic nuances may still require human validation.

---

## Ethical Considerations

* Training data was filtered to reduce personally identifiable information (PII).
* Linguistic and web data may still encode **socio-cultural biases**.
* Users are encouraged to:

  * Perform downstream bias evaluation
  * Apply content filtering where appropriate
  * Follow responsible AI deployment practices

Use of this model is subject to the **Gemma license** and its prohibited use policies.

---

## Acknowledgments

This model builds upon the **GaMS3** research effort developed at the **University of Ljubljana, Faculty of Computer and Information Science**, within the PoVeJMo and SloLLaMai research programs.

We thank all contributors to:

* Slovene linguistic resources
* Dataset construction and annotation
* Infrastructure support (EuroHPC, NVIDIA DGX Cloud)

---

## Citation

If you use this model in academic work, please cite both:

* The **GaMS3 model**
* The **SloLexQA dataset**

```bibtex
@misc{slollexqa_gams3,
  title={GaMS3-12B-Instruct-Lex: Lexically Enhanced Slovene QA Model},
  author={University of Ljubljana, Faculty of Computer and Information Science},
  year={2026},
  url={https://huggingface.co/TPreview/SloLexQA}
}
```