# Model Card for GaMS-9B-Instruct-Lex
GaMS-9B-Instruct-Lex is a specialized instruction-tuned model for Slovene lexicographic tasks. It is based on cjvt/GaMS-9B-Instruct and has been further fine-tuned on a large set of lexicographic question–answer pairs.
The goal of this model is to improve the performance of large language models on Slovene lexicographic queries, such as definitions, synonyms, word forms, collocations, sense distinctions, and contextual usage.
## Acknowledgment
The model was developed as part of the LLM4DH project and within the PoVeJMo research program.
The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU.
We thank everyone who contributed to data collection and preparation, especially participants of the promptathon workshop and contributors of Slovene lexicographic resources.
## Basic information
- Developed by: University of Ljubljana, Faculty of Computer and Information Science
- Project: LLM4DH
- Languages: Slovene (primary), English, Croatian, Bosnian, Serbian (secondary)
- Base model: cjvt/GaMS-9B-Instruct
- License: Gemma
## Lexicographic Fine-tuning Data
To adapt GaMS-9B-Instruct to the lexicographic domain, we prepared a specialized QA corpus derived from Slovene lexical resources:
- Digitalna slovarska baza (DSB)
- Bridge dictionary
- Word sense disambiguation dataset
- Slovene synonym dictionary
- Open Slovene WordNet (OSWN)
Data sources for QA generation included:
- Automatically generated questions and answers (word forms, definitions, synonyms, collocations, context-based sense analysis).
- ssj500k corpus (for sentence-level word analysis questions).
- Promptathon workshop contributions (25 manually created questions).
- Slovene Language Advisory Service (3,698 unique real-world questions and answers).
- GPT-4.1 was used to paraphrase and diversify auto-generated QA pairs, ensuring variety and naturalness.
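For supervised instruction tuning, each QA pair has to be rendered in the model's chat format. The following is a minimal sketch of such a conversion; the field names `question`/`answer`, the helper function, and the example pair are illustrative assumptions, not the project's actual preprocessing pipeline:

```python
# Hedged sketch: turn raw QA pairs into chat-style training examples.
# Field names "question"/"answer" are assumed, not taken from the project.

def qa_to_chat(pair: dict) -> list[dict]:
    """Convert one QA pair into a two-turn conversation."""
    return [
        {"role": "user", "content": pair["question"]},
        {"role": "assistant", "content": pair["answer"]},
    ]

# Hypothetical example pair ("What does the word 'drevo' (tree) mean?")
qa_pairs = [
    {
        "question": "Kaj pomeni beseda 'drevo'?",
        "answer": "Drevo je velika lesnata rastlina z deblom in krošnjo.",
    },
]

dataset = [{"messages": qa_to_chat(p)} for p in qa_pairs]
print(dataset[0]["messages"][0]["role"])  # -> user
```

Examples in this `messages` layout can then be consumed directly by chat-template-aware fine-tuning tooling.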
## Training
The model was fine-tuned on the lexicographic QA dataset using supervised instruction tuning. This adaptation enables GaMS-9B-Instruct-Lex to provide more precise and consistent answers to lexicographic and linguistic questions in Slovene.
## Usage
You can load and run the model with Hugging Face `transformers`:

```python
from transformers import pipeline

model_id = "cjvt/GaMS-9B-Instruct-Lex"
pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
)

# Example usage ("What does the word 'drevo' (tree) mean?")
message = [{"role": "user", "content": "Kaj pomeni beseda 'drevo'?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```
## Intended Use
- Answering lexicographic questions about Slovene words.
- Supporting research in computational lexicography and digital dictionaries.
- Assisting language learners and linguists with Slovene word meanings, synonyms, collocations, and usage.
## Limitations
- The model is optimized for Slovene lexicographic tasks; performance may vary in other domains or languages.
- Automatically generated training data may contain occasional noise or inconsistencies.
- The model may still produce incorrect or incomplete answers, especially for rare or highly context-dependent queries.