---
license: gemma
language:
- sl
- en
- hr
- sr
- bs
base_model:
- cjvt/GaMS-9B-Instruct
pipeline_tag: text-generation
---

# Model Card for GaMS-9B-Instruct-Lex

GaMS-9B-Instruct-Lex is an instruction-tuned model specialized for Slovene lexicographic tasks.
It is based on [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct) and has been further fine-tuned on a large set of lexicographic question–answer pairs.

The goal of this model is to improve the performance of large language models on Slovene lexicographic queries, such as definitions, synonyms, word forms, collocations, sense distinctions, and contextual usage.

---

## Acknowledgment

The model was developed as part of the **LLM4DH** project and within the [PoVeJMo](https://www.cjvt.si/povejmo/en/project/) research program.

The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU.

We thank everyone who contributed to data collection and preparation, especially the participants of the promptathon workshop and the contributors of Slovene lexicographic resources.

---

## Basic information

* **Developed by:** University of Ljubljana, Faculty of Computer and Information Science
* **Project:** LLM4DH
* **Languages:** Slovene (primary); English, Croatian, Bosnian, Serbian (secondary)
* **Base model:** [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct)
* **License:** [Gemma](https://ai.google.dev/gemma/terms)

---

## Lexicographic Fine-tuning Data

To adapt GaMS-9B-Instruct to the lexicographic domain, we prepared a **specialized QA corpus** derived from Slovene lexical resources:

* **Digitalna slovarska baza (DSB)**
* **Bridge dictionary**
* **Word sense disambiguation dataset**
* **Slovene synonym dictionary**
* **Open Slovene WordNet (OSWN)**

Data sources for QA generation included:

* Automatically generated questions and answers (word forms, definitions, synonyms, collocations, context-based sense analysis).
* **ssj500k corpus** (for sentence-level word analysis questions).
* **Promptathon workshop contributions** (25 manually created questions).
* **Slovene Language Advisory Service** (3,698 unique real-world questions and answers).
* GPT-4.1 was used to paraphrase and diversify auto-generated QA pairs, ensuring variety and naturalness.
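The released record format of this corpus is not shown here; as a purely hypothetical illustration of how one auto-generated QA pair could be stored as a chat-style record for supervised instruction tuning (the field names `source` and `messages` are assumptions, not the project's actual schema):

```python
# Hypothetical chat-style QA record for supervised instruction tuning.
# Field names ("source", "messages") are illustrative assumptions,
# not the project's published data schema.
qa_record = {
    "source": "synonym_dictionary",
    "messages": [
        {"role": "user", "content": "Naštej sopomenke besede 'hiša'."},
        {"role": "assistant", "content": "Sopomenke besede 'hiša' so: dom, domovanje, bivališče."},
    ],
}

# Minimal sanity check of the record structure
assert qa_record["messages"][0]["role"] == "user"
print(qa_record["messages"][-1]["content"])
```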

---

## Training

The model was fine-tuned on the lexicographic QA dataset using supervised instruction tuning.
This adaptation enables GaMS-9B-Instruct-Lex to provide more precise and consistent answers to lexicographic and linguistic questions in Slovene.

---

## Usage

You can load and run the model with Hugging Face `transformers`:

```python
from transformers import pipeline

model_id = "cjvt/GaMS-9B-Instruct-Lex"

pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto"
)

# Example usage
message = [{"role": "user", "content": "Kaj pomeni beseda 'drevo'?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```
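The final `print` above indexes into the chat-format output of the `text-generation` pipeline, where `generated_text` holds the full message list and the last entry is the assistant's reply. As a small sketch assuming that output shape (the helper name `last_reply` and the mocked reply text are illustrative, not produced by the model):

```python
def last_reply(pipeline_output):
    """Return the assistant's final reply from a chat-format
    text-generation pipeline output for a single input."""
    return pipeline_output[0]["generated_text"][-1]["content"]

# Mocked output in the shape described above, for illustration only
mock_output = [{
    "generated_text": [
        {"role": "user", "content": "Kaj pomeni beseda 'drevo'?"},
        {"role": "assistant", "content": "Drevo je lesnata rastlina z deblom in krošnjo."},
    ]
}]
print(last_reply(mock_output))
```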

---

## Intended Use

* Answering lexicographic questions about Slovene words.
* Supporting research in computational lexicography and digital dictionaries.
* Assisting language learners and linguists with Slovene word meanings, synonyms, collocations, and usage.

---

## Limitations

* The model is optimized for **Slovene lexicographic tasks**; performance may vary in other domains or languages.
* Automatically generated training data may contain occasional noise or inconsistencies.
* The model may still produce incorrect or incomplete answers, especially for rare or highly context-dependent queries.