---
license: gemma
language:
- sl
- en
base_model:
- cjvt/GaMS3-12B
datasets:
- TPreview/SloLexQA
pipeline_tag: text-generation
---

# GaMS3-12B-Instruct-Lex

**GaMS3-12B-Instruct-Lex** is a Slovene-focused, instruction-tuned large language model for lexical question answering. It is based on **GaMS3** and has been further adapted through:

1. **Continual Pretraining (CPT)** on a large, automatically constructed Slovene lexical corpus
2. **Instruction fine-tuning** on a diverse mix of Slovene QA, writing, and code-feedback datasets
3. **Task-specific fine-tuning** on **SloLexQA**

The resulting model is optimized for **lexical reasoning, grammatical explanation, and Slovene question answering**, both with and without additional context.

---

## Model Lineage

* **Base model**: `google/gemma-3-12b-pt`
* **Intermediate model**: GaMS3 (non-instruction-tuned)
* **Final model**: GaMS3-12B-Instruct-Lex (instruction-tuned + QA-specialized)

The starting point was the **non-instruction-tuned GaMS3 model**, which is part of the GaMS (Generative Model for Slovene) family developed at the University of Ljubljana.

---

## Continual Pretraining (Lexical CPT)

Before instruction tuning, the model underwent **continual pretraining** on a **Lexical LLM Pretraining Corpus** constructed from high-quality Slovene lexical resources.

### Lexical CPT Corpus Overview

* **Entries**: 356,294 single-lexeme entries
* **Format**: Markdown (`corpus.txt`)
* **Primary source**: Digital Dictionary Database (DDD)

Each entry contains:

* Lemma
* Full morphological paradigm
* Word senses and definitions
* Usage examples
* Collocations
* Synonyms (sense-grouped)

### Integrated Linguistic Resources

* Digital Dictionary Database (DDD)
* WSD dataset (Slovene word sense disambiguation)
* sloWnet
* Bridge Dictionary
* Synonyms Dictionary

### CPT Objectives

This CPT stage aims to improve:

* Morphological awareness
* Sense disambiguation
* Lexical semantic reasoning
* Definition and usage explanation

---

## General Instruction Fine-Tuning

After lexical CPT, the model was instruction-tuned on a diverse set of general Slovene datasets:

| Dataset                | Language | Train  | Eval  | Description                                                 |
| ---------------------- | -------- | ------ | ----- | ----------------------------------------------------------- |
| GaMS-Instruct-ClosedQA | Slovene  | 10,825 | 1,202 | Closed-domain QA generated with GPT-4o and Gemini-2.0-Flash |
| GaMS-Instruct-OpenQA   | Slovene  | 28,704 | 3,189 | Open-domain QA across 400+ micro-topics                     |
| GaMS-Instruct-Writing  | Slovene  | 9,056  | 1,006 | Writing and composition tasks                               |
| GaMS-Instruct-DH-1.0   | Slovene  | 9,135  | 1,015 | Digital humanities dataset (CLARIN.SI)                      |
| SlCode                 | Slovene  | 10,000 | 1,000 | Code feedback and explanation dataset                       |

These datasets provide strong **instruction-following**, **reasoning**, and **explanatory** behavior in Slovene.

---

## SloLexQA Fine-Tuning

The final specialization stage fine-tuned the model on:

* **Dataset**: `hf:TPreview/SloLexQA`
* **Total QA pairs**: 16,508
* **Train / Test split**: 90.1% / 9.9%
* **Context-aware questions**: ~4.6%

### SloLexQA Highlights

* Linguistically grounded Slovene QA
* Mixture of:
  * Automatically generated grammatical questions
  * Real-world linguistic advisory questions
  * Forum-sourced language usage questions
* Covers:
  * Definitions
  * Usage examples
  * Grammar and morphology
  * Lexical semantics

This step significantly improves **precision, stylistic consistency, and robustness** on Slovene QA tasks.
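---

## Usage

A minimal inference sketch with 🤗 Transformers is shown below. The repository id `TPreview/GaMS3-12B-Instruct-Lex`, the Gemma-style chat template, and the hardware settings (`bfloat16`, a GPU with roughly 24 GB or more of memory) are assumptions based on this card's naming and the Gemma 3 base, not confirmed details; adjust them to the actual published checkpoint.

```python
# Minimal usage sketch. Assumptions: the checkpoint is published as
# "TPreview/GaMS3-12B-Instruct-Lex" (hypothetical repo id) and ships
# a Gemma-style chat template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TPreview/GaMS3-12B-Instruct-Lex"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 12B parameters; assumes ample GPU memory
    device_map="auto",
)

# A Slovene lexical question:
# "What does the word 'obzorje' mean? Give a usage example."
messages = [
    {"role": "user", "content": "Kaj pomeni beseda 'obzorje'? Navedi primer rabe."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For context-aware questions (the ~4.6% SloLexQA subset described above), prepend the context passage to the user message; the model is trained to answer both with and without such context.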
---

## Intended Use

### Primary Use Cases

* Slovene question answering (with or without context)
* Grammar and morphology explanations
* Lexical and semantic analysis
* Linguistic research and education
* Instruction-following applications in Slovene

### Secondary Use Cases

* Writing assistance
* Language learning tools
* Dataset annotation support
* NLP research and prototyping

---

## Limitations

* The model may produce **hallucinated facts**, especially outside linguistic domains.
* Performance is strongest in **Slovene**; other languages are secondary.
* Long-context reasoning depends on prompt quality and task complexity.
* Subtle pragmatic or stylistic nuances may still require human validation.

---

## Ethical Considerations

* Training data was filtered to reduce personally identifiable information (PII).
* Linguistic and web data may still encode **socio-cultural biases**.
* Users are encouraged to:
  * Perform downstream bias evaluation
  * Apply content filtering where appropriate
  * Follow responsible AI deployment practices

Use of this model is subject to the **Gemma license** and its prohibited use policies.

---

## Acknowledgments

This model builds upon the **GaMS3** research effort developed at the **University of Ljubljana, Faculty of Computer and Information Science**, within the PoVeJMo and SloLLaMai research programs.

We thank all contributors to:

* Slovene linguistic resources
* Dataset construction and annotation
* Infrastructure support (EuroHPC, NVIDIA DGX Cloud)

---

## Citation

If you use this model in academic work, please cite both:

* The **GaMS3 model**
* The **SloLexQA dataset**

```bibtex
@misc{slollexqa_gams3,
  title={GaMS3-12B-Instruct-Lex: Lexically Enhanced Slovene QA Model},
  author={University of Ljubljana, Faculty of Computer and Information Science},
  year={2026},
  url={https://huggingface.co/TPreview/SloLexQA}
}
```