Added model files

2509f6f verified 3 months ago

5.64 kB

	---
	license: gemma
	language:
	- sl
	- en
	base_model:
	- cjvt/GaMS3-12B
	datasets:
	- TPreview/SloLexQA
	pipeline_tag: text-generation
	---

	# GaMS3-12B-Instruct-Lex

	GaMS3-12B-Instruct-Lex is a Slovene-focused instruction-tuned large language model for lexical question answering.
	It is based on GaMS3 and has been further adapted through:

	1. Continual Pretraining (CPT) on a large, automatically constructed Slovene lexical corpus
	2. Instruction fine-tuning on a diverse mix of Slovene QA, writing, and code-feedback datasets
	3. Task-specific fine-tuning on SloLexQA

	The resulting model is optimized for lexical reasoning, grammatical explanation, and Slovene question answering, both with and without additional context.

	---

	## Model Lineage

	* Base model: `google/gemma-3-12b-pt`
	* Intermediate model: GaMS3 (non-instruction-tuned)
	* Final model: GaMS3-12B-Instruct-Lex (instruction-tuned + QA-specialized)

	The starting point was the non-finetuned GaMS3 model, which itself is part of the GaMS (Generative Model for Slovene) family developed at the University of Ljubljana.

	---

	## Continual Pretraining (Lexical CPT)

	Before instruction tuning, the model underwent continual pretraining on a Lexical LLM Pretraining Corpus constructed from high-quality Slovene lexical resources.

	### Lexical CPT Corpus Overview

	* Entries: 356,294 single-lexeme entries
	* Format: Markdown (`corpus.txt`)
	* Primary source: Digital Dictionary Database (DDD)

	Each entry contains:

	* Lemma
	* Full morphological paradigm
	* Word senses and definitions
	* Usage examples
	* Collocations
	* Synonyms (sense-grouped)

	### Integrated Linguistic Resources

	* Digital Dictionary Database (DDD)
	* WSD dataset (Slovene word sense disambiguation)
	* sloWnet
	* Bridge Dictionary
	* Synonyms Dictionary

	### CPT Objectives

	This CPT stage aims to improve:

	* Morphological awareness
	* Sense disambiguation
	* Lexical semantic reasoning
	* Definition and usage explanation

	---

	## General Instruction Fine-Tuning

	After lexical CPT, the model was instruction-tuned on a diverse set of general Slovene datasets:

	\| Dataset \| Language \| Train \| Eval \| Description \|
	\| ---------------------- \| -------- \| ------ \| ----- \| ----------------------------------------------------------- \|
	\| GaMS-Instruct-ClosedQA \| Slovene \| 10,825 \| 1,202 \| Closed-domain QA generated with GPT-4o and Gemini-2.0-Flash \|
	\| GaMS-Instruct-OpenQA \| Slovene \| 28,704 \| 3,189 \| Open-domain QA across 400+ micro-topics \|
	\| GaMS-Instruct-Writing \| Slovene \| 9,056 \| 1,006 \| Writing and composition tasks \|
	\| GaMS-Instruct-DH-1.0 \| Slovene \| 9,135 \| 1,015 \| Digital humanities dataset (CLARIN.SI) \|
	\| SlCode \| Slovene \| 10,000 \| 1,000 \| Code feedback and explanation dataset \|

	These datasets provide strong instruction-following, reasoning, and explanatory behavior in Slovene.

	---

	## SloLexQA Fine-Tuning

	The final specialization stage fine-tuned the model on:

	* Dataset: `hf:TPreview/SloLexQA`
	* Total QA pairs: 16,508
	* Train / Test split: 90.1% / 9.9%
	* Context-aware questions: ~4.6%

	### SloLexQA Highlights

	* Linguistically grounded Slovene QA
	* Mixture of:

	* Automatically generated grammatical questions
	* Real-world linguistic advisory questions
	* Forum-sourced language usage questions
	* Covers:

	* Definitions
	* Usage examples
	* Grammar and morphology
	* Lexical semantics

	This step significantly improves precision, stylistic consistency, and robustness on Slovene QA tasks.

	---

	## Intended Use

	### Primary Use Cases

	* Slovene question answering (with or without context)
	* Grammar and morphology explanations
	* Lexical and semantic analysis
	* Linguistic research and education
	* Instruction-following applications in Slovene

	### Secondary Use Cases

	* Writing assistance
	* Language learning tools
	* Dataset annotation support
	* NLP research and prototyping

	---

	## Limitations

	* The model may produce hallucinated facts, especially outside linguistic domains.
	* Performance is strongest in Slovene; other languages are secondary.
	* Long-context reasoning depends on prompt quality and task complexity.
	* Subtle pragmatic or stylistic nuances may still require human validation.

	---

	## Ethical Considerations

	* Training data was filtered to reduce personally identifiable information (PII).
	* Linguistic and web data may still encode socio-cultural biases.
	* Users are encouraged to:

	* Perform downstream bias evaluation
	* Apply content filtering where appropriate
	* Follow responsible AI deployment practices

	Use of this model is subject to the Gemma license and its prohibited use policies.

	---

	## Acknowledgments

	This model builds upon the GaMS3 research effort developed at the University of Ljubljana, Faculty of Computer and Information Science, within the PoVeJMo and SloLLaMai research programs.

	We thank all contributors to:

	* Slovene linguistic resources
	* Dataset construction and annotation
	* Infrastructure support (EuroHPC, NVIDIA DGX Cloud)

	---

	## Citation

	If you use this model in academic work, please cite both:

	* The GaMS3 model
	* The SloLexQA dataset

	```bibtex
	@misc{slollexqa_gams3,
	title={GaMS3-12B-Instruct-Lex: Lexically Enhanced Slovene QA Model},
	author={University of Ljubljana, Faculty of Computer and Information Science},
	year={2026},
	url={https://huggingface.co/TPreview/SloLexQA}
	}
	```