# Model Card for GaMS-9B-Instruct-Lex
GaMS-9B-Instruct-Lex is a specialized instruction-tuned model for Slovene lexicographic tasks. It is based on cjvt/GaMS-9B-Instruct and has been further fine-tuned on a large set of lexicographic question–answer pairs.
The goal of this model is to improve the performance of large language models on Slovene lexicographic queries, such as definitions, synonyms, word forms, collocations, sense distinctions, and contextual usage.
## Acknowledgment
The model was developed as part of the LLM4DH project and within the PoVeJMo research program.
The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU.
We thank everyone who contributed to data collection and preparation, especially participants of the promptathon workshop and contributors of Slovene lexicographic resources.
## Basic information
- Developed by: University of Ljubljana, Faculty of Computer and Information Science
- Project: LLM4DH
- Languages: Slovene (primary), English, Croatian, Bosnian, Serbian (secondary)
- Base model: cjvt/GaMS-9B-Instruct
- License: Gemma
## Lexicographic Fine-tuning Data
To adapt GaMS-9B-Instruct to the lexicographic domain, we prepared a specialized QA corpus derived from Slovene lexical resources:
- Digitalna slovarska baza (DSB)
- Bridge dictionary
- Word sense disambiguation dataset
- Slovene synonym dictionary
- Open Slovene WordNet (OSWN)
Data sources for QA generation included:
- Automatically generated questions and answers (word forms, definitions, synonyms, collocations, context-based sense analysis).
- ssj500k corpus (for sentence-level word analysis questions).
- Promptathon workshop contributions (25 manually created questions).
- Slovene Language Advisory Service (3,698 unique real-world questions and answers).
- GPT-4.1 was used to paraphrase and diversify auto-generated QA pairs, ensuring variety and naturalness.
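For supervised instruction tuning, each QA pair has to be rendered in the model's chat format. The following is a minimal sketch of such a conversion; the field names `question`/`answer`, the helper function, and the example pair are illustrative assumptions, not the project's actual preprocessing pipeline:

```python
# Hedged sketch: turn raw QA pairs into chat-style training examples.
# Field names "question"/"answer" are assumed, not taken from the project.

def qa_to_chat(pair: dict) -> list[dict]:
    """Convert one QA pair into a two-turn conversation."""
    return [
        {"role": "user", "content": pair["question"]},
        {"role": "assistant", "content": pair["answer"]},
    ]

# Hypothetical example pair ("What does the word 'drevo' (tree) mean?")
qa_pairs = [
    {
        "question": "Kaj pomeni beseda 'drevo'?",
        "answer": "Drevo je velika lesnata rastlina z deblom in krošnjo.",
    },
]

dataset = [{"messages": qa_to_chat(p)} for p in qa_pairs]
print(dataset[0]["messages"][0]["role"])  # -> user
```

Examples in this `messages` layout can then be consumed directly by chat-template-aware fine-tuning tooling.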
## Training
The model was fine-tuned on the lexicographic QA dataset using supervised instruction tuning. This adaptation enables GaMS-9B-Instruct-Lex to provide more precise and consistent answers to lexicographic and linguistic questions in Slovene.
## Usage
You can load and run the model with Hugging Face `transformers`:

```python
from transformers import pipeline

model_id = "cjvt/GaMS-9B-Instruct-Lex"
pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
)

# Example usage ("What does the word 'drevo' (tree) mean?")
message = [{"role": "user", "content": "Kaj pomeni beseda 'drevo'?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```
## Intended Use
- Answering lexicographic questions about Slovene words.
- Supporting research in computational lexicography and digital dictionaries.
- Assisting language learners and linguists with Slovene word meanings, synonyms, collocations, and usage.
## Limitations
- The model is optimized for Slovene lexicographic tasks; performance may vary in other domains or languages.
- Automatically generated training data may contain occasional noise or inconsistencies.
- The model may still produce incorrect or incomplete answers, especially for rare or highly context-dependent queries.