tknez's picture
Added model files
2509f6f verified
---
license: gemma
language:
- sl
- en
base_model:
- cjvt/GaMS3-12B
datasets:
- TPreview/SloLexQA
pipeline_tag: text-generation
---
# GaMS3-12B-Instruct-Lex
**GaMS3-12B-Instruct-Lex** is a Slovene-focused instruction-tuned large language model for lexical question answering.
It is based on **GaMS3** and has been further adapted through:
1. **Continual Pretraining (CPT)** on a large, automatically constructed Slovene lexical corpus
2. **Instruction fine-tuning** on a diverse mix of Slovene QA, writing, and code-feedback datasets
3. **Task-specific fine-tuning** on **SloLexQA**
The resulting model is optimized for **lexical reasoning, grammatical explanation, and Slovene question answering**, both with and without additional context.
---
## Model Lineage
* **Base model**: `google/gemma-3-12b-pt`
* **Intermediate model**: GaMS3 (non-instruction-tuned)
* **Final model**: GaMS3-12B-Instruct-Lex (instruction-tuned + QA-specialized)
The starting point was the **non-finetuned GaMS3 model**, which itself is part of the GaMS (Generative Model for Slovene) family developed at the University of Ljubljana.
---
## Continual Pretraining (Lexical CPT)
Before instruction tuning, the model underwent **continual pretraining** on a **Lexical LLM Pretraining Corpus** constructed from high-quality Slovene lexical resources.
### Lexical CPT Corpus Overview
* **Entries**: 356,294 single-lexeme entries
* **Format**: Markdown (`corpus.txt`)
* **Primary source**: Digital Dictionary Database (DDD)
Each entry contains:
* Lemma
* Full morphological paradigm
* Word senses and definitions
* Usage examples
* Collocations
* Synonyms (sense-grouped)
### Integrated Linguistic Resources
* Digital Dictionary Database (DDD)
* WSD dataset (Slovene word sense disambiguation)
* sloWnet
* Bridge Dictionary
* Synonyms Dictionary
### CPT Objectives
This CPT stage aims to improve:
* Morphological awareness
* Sense disambiguation
* Lexical semantic reasoning
* Definition and usage explanation
---
## General Instruction Fine-Tuning
After lexical CPT, the model was instruction-tuned on a diverse set of general Slovene datasets:
| Dataset | Language | Train | Eval | Description |
| ---------------------- | -------- | ------ | ----- | ----------------------------------------------------------- |
| GaMS-Instruct-ClosedQA | Slovene | 10,825 | 1,202 | Closed-domain QA generated with GPT-4o and Gemini-2.0-Flash |
| GaMS-Instruct-OpenQA | Slovene | 28,704 | 3,189 | Open-domain QA across 400+ micro-topics |
| GaMS-Instruct-Writing | Slovene | 9,056 | 1,006 | Writing and composition tasks |
| GaMS-Instruct-DH-1.0 | Slovene | 9,135 | 1,015 | Digital humanities dataset (CLARIN.SI) |
| SlCode | Slovene | 10,000 | 1,000 | Code feedback and explanation dataset |
These datasets provide strong **instruction-following**, **reasoning**, and **explanatory** behavior in Slovene.
---
## SloLexQA Fine-Tuning
The final specialization stage fine-tuned the model on:
* **Dataset**: `hf:TPreview/SloLexQA`
* **Total QA pairs**: 16,508
* **Train / Test split**: 90.1% / 9.9%
* **Context-aware questions**: ~4.6%
### SloLexQA Highlights
* Linguistically grounded Slovene QA
* Mixture of:
* Automatically generated grammatical questions
* Real-world linguistic advisory questions
* Forum-sourced language usage questions
* Covers:
* Definitions
* Usage examples
* Grammar and morphology
* Lexical semantics
This step significantly improves **precision, stylistic consistency, and robustness** on Slovene QA tasks.
---
## Intended Use
### Primary Use Cases
* Slovene question answering (with or without context)
* Grammar and morphology explanations
* Lexical and semantic analysis
* Linguistic research and education
* Instruction-following applications in Slovene
### Secondary Use Cases
* Writing assistance
* Language learning tools
* Dataset annotation support
* NLP research and prototyping
---
## Limitations
* The model may produce **hallucinated facts**, especially outside linguistic domains.
* Performance is strongest in **Slovene**; other languages are secondary.
* Long-context reasoning depends on prompt quality and task complexity.
* Subtle pragmatic or stylistic nuances may still require human validation.
---
## Ethical Considerations
* Training data was filtered to reduce personally identifiable information (PII).
* Linguistic and web data may still encode **socio-cultural biases**.
* Users are encouraged to:
* Perform downstream bias evaluation
* Apply content filtering where appropriate
* Follow responsible AI deployment practices
Use of this model is subject to the **Gemma license** and its prohibited use policies.
---
## Acknowledgments
This model builds upon the **GaMS3** research effort developed at the **University of Ljubljana, Faculty of Computer and Information Science**, within the PoVeJMo and SloLLaMai research programs.
We thank all contributors to:
* Slovene linguistic resources
* Dataset construction and annotation
* Infrastructure support (EuroHPC, NVIDIA DGX Cloud)
---
## Citation
If you use this model in academic work, please cite both:
* The **GaMS3 model**
* The **SloLexQA dataset**
```bibtex
@misc{slollexqa_gams3,
title={GaMS3-12B-Instruct-Lex: Lexically Enhanced Slovene QA Model},
author={University of Ljubljana, Faculty of Computer and Information Science},
year={2026},
url={https://huggingface.co/TPreview/SloLexQA}
}
```