---
license: gemma
language:
- sl
- en
base_model:
- cjvt/GaMS3-12B
datasets:
- TPreview/SloLexQA
pipeline_tag: text-generation
---

# GaMS3-12B-Instruct-Lex

**GaMS3-12B-Instruct-Lex** is a Slovene-focused, instruction-tuned large language model for lexical question answering.
It is based on **GaMS3** and has been further adapted through:

1. **Continual Pretraining (CPT)** on a large, automatically constructed Slovene lexical corpus
2. **Instruction fine-tuning** on a diverse mix of Slovene QA, writing, and code-feedback datasets
3. **Task-specific fine-tuning** on **SloLexQA**

The resulting model is optimized for **lexical reasoning, grammatical explanation, and Slovene question answering**, both with and without additional context.

---
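As a usage illustration, the model can be queried through the Hugging Face `transformers` pipeline. This is a minimal sketch: the repository id `cjvt/GaMS3-12B-Instruct-Lex` and the Gemma-style chat template are assumptions, so verify both against the published tokenizer before relying on them.

```python
# Minimal inference sketch. The repo id and the Gemma-style prompt template
# below are ASSUMPTIONS -- check the published model card and tokenizer.

def build_prompt(question: str, context: str = "") -> str:
    """Wrap a Slovene question (optionally with context) in a single-turn,
    Gemma-style chat prompt."""
    user = f"{context}\n\n{question}" if context else question
    return f"<start_of_turn>user\n{user}<end_of_turn>\n<start_of_turn>model\n"

def generate(question: str, context: str = "",
             model_id: str = "cjvt/GaMS3-12B-Instruct-Lex") -> str:
    """Run one generation. Downloads the 12B model, so this needs a large GPU;
    nothing heavy runs at import time."""
    from transformers import pipeline  # pip install transformers accelerate
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    out = generator(build_prompt(question, context),
                    max_new_tokens=256, return_full_text=False)
    return out[0]["generated_text"]

# Example call (requires the model weights):
#   generate("Kaj pomeni beseda 'kolokacija'?")
```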

## Model Lineage

* **Base model**: `google/gemma-3-12b-pt`
* **Intermediate model**: GaMS3 (non-instruction-tuned)
* **Final model**: GaMS3-12B-Instruct-Lex (instruction-tuned + QA-specialized)

The starting point was the **non-instruction-tuned GaMS3 model**, which is itself part of the GaMS (Generative Model for Slovene) family developed at the University of Ljubljana.

---

## Continual Pretraining (Lexical CPT)

Before instruction tuning, the model underwent **continual pretraining** on a **Lexical LLM Pretraining Corpus** constructed from high-quality Slovene lexical resources.

### Lexical CPT Corpus Overview

* **Entries**: 356,294 single-lexeme entries
* **Format**: Markdown (`corpus.txt`)
* **Primary source**: Digital Dictionary Database (DDD)

Each entry contains:

* Lemma
* Full morphological paradigm
* Word senses and definitions
* Usage examples
* Collocations
* Synonyms (sense-grouped)

### Integrated Linguistic Resources

* Digital Dictionary Database (DDD)
* WSD dataset (Slovene word sense disambiguation)
* sloWnet
* Bridge Dictionary
* Synonyms Dictionary
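To make the entry structure concrete, here is a small sketch that renders one entry of this shape as Markdown. The exact layout of `corpus.txt` is not specified in this card, so the headings and field names below are illustrative assumptions.

```python
# Illustrative only: the real corpus.txt layout is an ASSUMPTION, but the
# fields mirror the entry contents listed above.

def render_entry(lemma: str, paradigm: list, senses: list) -> str:
    """Render one lexical entry (lemma, paradigm, senses) as Markdown."""
    lines = [f"# {lemma}", "", "## Paradigm"]
    lines += [f"- {form}" for form in paradigm]
    for i, sense in enumerate(senses, start=1):
        lines += ["", f"## Sense {i}", sense["definition"]]
        lines += [f"- Example: {ex}" for ex in sense.get("examples", [])]
        lines += [f"- Collocation: {c}" for c in sense.get("collocations", [])]
        if sense.get("synonyms"):
            lines.append("- Synonyms: " + ", ".join(sense["synonyms"]))
    return "\n".join(lines)

entry = render_entry(
    "miza",                            # "table"
    ["miza", "mize", "mizi", "mizo"],  # partial paradigm, for brevity
    [{"definition": "kos pohištva z ravno ploskvijo",
      "examples": ["Sedimo za mizo."],
      "collocations": ["lesena miza"],
      "synonyms": []}],
)
```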

### CPT Objectives

This CPT stage aims to improve:

* Morphological awareness
* Sense disambiguation
* Lexical semantic reasoning
* Definition and usage explanation

---

## General Instruction Fine-Tuning

After lexical CPT, the model was instruction-tuned on a diverse set of general Slovene datasets:

| Dataset | Language | Train | Eval | Description |
| --- | --- | --- | --- | --- |
| GaMS-Instruct-ClosedQA | Slovene | 10,825 | 1,202 | Closed-domain QA generated with GPT-4o and Gemini-2.0-Flash |
| GaMS-Instruct-OpenQA | Slovene | 28,704 | 3,189 | Open-domain QA across 400+ micro-topics |
| GaMS-Instruct-Writing | Slovene | 9,056 | 1,006 | Writing and composition tasks |
| GaMS-Instruct-DH-1.0 | Slovene | 9,135 | 1,015 | Digital humanities dataset (CLARIN.SI) |
| SlCode | Slovene | 10,000 | 1,000 | Code feedback and explanation dataset |

These datasets provide strong **instruction-following**, **reasoning**, and **explanatory** behavior in Slovene.

---
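For readers reproducing a similar mixture, size-proportional sampling weights fall directly out of the train counts in the table. The counts are verbatim; the proportional mixing itself is an assumption, not the published training recipe.

```python
# Sketch: sampling weights proportional to each dataset's train size.
# Counts are copied from the table above; proportional mixing is an
# ASSUMPTION, not the documented recipe.

train_sizes = {
    "GaMS-Instruct-ClosedQA": 10_825,
    "GaMS-Instruct-OpenQA": 28_704,
    "GaMS-Instruct-Writing": 9_056,
    "GaMS-Instruct-DH-1.0": 9_135,
    "SlCode": 10_000,
}
total = sum(train_sizes.values())  # 67,720 training examples across the five sets
weights = {name: n / total for name, n in train_sizes.items()}
```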

## SloLexQA Fine-Tuning

The final specialization stage fine-tuned the model on:

* **Dataset**: `hf:TPreview/SloLexQA`
* **Total QA pairs**: 16,508
* **Train / Test split**: 90.1% / 9.9%
* **Context-aware questions**: ~4.6%
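For orientation, the stated percentages translate into approximate example counts (rounded; exact split sizes are not given in this card):

```python
# Back-of-the-envelope counts implied by the figures above (rounded).
total_pairs = 16_508
train_pairs = round(total_pairs * 0.901)    # ~14,874 training pairs
test_pairs = total_pairs - train_pairs      # ~1,634 test pairs
context_aware = round(total_pairs * 0.046)  # ~759 questions carry extra context
```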

### SloLexQA Highlights

* Linguistically grounded Slovene QA
* Mixture of:
  * Automatically generated grammatical questions
  * Real-world linguistic advisory questions
  * Forum-sourced language usage questions
* Covers:
  * Definitions
  * Usage examples
  * Grammar and morphology
  * Lexical semantics

This step significantly improves **precision, stylistic consistency, and robustness** on Slovene QA tasks.
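A record from such a dataset could be mapped to chat-format fine-tuning messages roughly as follows. The field names `question`, `context`, and `answer` are assumptions about the SloLexQA schema, not its documented column names.

```python
# Sketch: turning one SloLexQA-style record into a chat-format training
# example. Field names are ASSUMED, not the dataset's documented schema.

def to_messages(record: dict) -> list:
    question = record["question"]
    if record.get("context"):  # only ~4.6% of questions carry extra context
        question = f"{record['context']}\n\n{question}"
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": record["answer"]},
    ]

msgs = to_messages({
    "question": "Kako sklanjamo samostalnik 'otrok'?",
    "answer": "Samostalnik 'otrok' se sklanja takole: ...",  # illustrative
})
```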

---

## Intended Use

### Primary Use Cases

* Slovene question answering (with or without context)
* Grammar and morphology explanations
* Lexical and semantic analysis
* Linguistic research and education
* Instruction-following applications in Slovene

### Secondary Use Cases

* Writing assistance
* Language learning tools
* Dataset annotation support
* NLP research and prototyping

---

## Limitations

* The model may produce **hallucinated facts**, especially outside linguistic domains.
* Performance is strongest in **Slovene**; other languages are secondary.
* Long-context reasoning depends on prompt quality and task complexity.
* Subtle pragmatic or stylistic nuances may still require human validation.

---

## Ethical Considerations

* Training data was filtered to reduce personally identifiable information (PII).
* Linguistic and web data may still encode **socio-cultural biases**.
* Users are encouraged to:
  * Perform downstream bias evaluation
  * Apply content filtering where appropriate
  * Follow responsible AI deployment practices

Use of this model is subject to the **Gemma license** and its prohibited-use policies.

---

## Acknowledgments

This model builds upon the **GaMS3** research effort developed at the **University of Ljubljana, Faculty of Computer and Information Science**, within the PoVeJMo and SloLLaMai research programs.

We thank all contributors to:

* Slovene linguistic resources
* Dataset construction and annotation
* Infrastructure support (EuroHPC, NVIDIA DGX Cloud)

---

## Citation

If you use this model in academic work, please cite both:

* The **GaMS3 model**
* The **SloLexQA dataset**

```bibtex
@misc{slollexqa_gams3,
  title={GaMS3-12B-Instruct-Lex: Lexically Enhanced Slovene QA Model},
  author={University of Ljubljana, Faculty of Computer and Information Science},
  year={2026},
  url={https://huggingface.co/TPreview/SloLexQA}
}
```