| --- |
| license: gemma |
| language: |
| - sl |
| - en |
| - hr |
| - sr |
| - bs |
| base_model: |
| - cjvt/GaMS-9B-Instruct |
| pipeline_tag: text-generation |
| --- |
| |
| # Model Card for GaMS-9B-Instruct-Lex |
|
|
| GaMS-9B-Instruct-Lex is a specialized instruction-tuned model for Slovene lexicographic tasks. |
| It is based on [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct) and has been further fine-tuned on a large set of lexicographic question–answer pairs. |
|
|
| The goal of this model is to improve the performance of large language models for Slovene lexicographic queries, such as definitions, synonyms, word forms, collocations, sense distinctions, and contextual usage. |
|
|
| --- |
|
|
| ## Acknowledgment |
|
|
| The model was developed as part of the **LLM4DH** project and within the [PoVeJMo](https://www.cjvt.si/povejmo/en/project/) research program. |
|
|
| The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU. |
|
|
| We thank everyone who contributed to data collection and preparation, especially the participants of the promptathon workshop and the contributors to Slovene lexicographic resources. |
|
|
| --- |
|
|
| ## Basic Information |
|
|
| * **Developed by:** University of Ljubljana, Faculty of Computer and Information Science |
| * **Project:** LLM4DH |
| * **Languages:** Slovene (primary), English, Croatian, Bosnian, Serbian (secondary) |
| * **Base model:** [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct) |
| * **License:** [Gemma](https://ai.google.dev/gemma/terms) |
|
|
| --- |
|
|
| ## Lexicographic Fine-tuning Data |
|
|
| To adapt GaMS-9B-Instruct to the lexicographic domain, we prepared a **specialized QA corpus** derived from Slovene lexical resources: |
|
|
| * **Digitalna slovarska baza (DSB)** |
| * **Bridge dictionary** |
| * **Word sense disambiguation dataset** |
| * **Slovene synonym dictionary** |
| * **Open Slovene WordNet (OSWN)** |
|
|
| Data sources for QA generation included: |
|
|
| * Automatically generated questions and answers (word forms, definitions, synonyms, collocations, context-based sense analysis). |
| * **ssj500k corpus** (for sentence-level word analysis questions). |
| * **Promptathon workshop contributions** (25 manually created questions). |
| * **Slovene Language Advisory Service** (3,698 unique real-world questions and answers). |
| * **GPT-4.1 paraphrases** of the auto-generated QA pairs, used to make the questions and answers more varied and natural. |
|
|
| --- |
|
|
| ## Training |
|
|
| The model was fine-tuned on the lexicographic QA dataset using supervised instruction tuning. |
| This adaptation enables GaMS-9B-Instruct-Lex to answer lexicographic and linguistic questions in Slovene more precisely and consistently than the base model. |
|
|
| --- |
|
|
| ## Usage |
|
|
| You can load and run the model with Hugging Face `transformers`: |
|
|
| ```python |
| from transformers import pipeline |
| |
| model_id = "cjvt/GaMS-9B-Instruct-Lex" |
| |
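| # Build a chat-style text-generation pipeline; device_map="auto" places the weights on the available devices. |
| pline = pipeline( |
|     "text-generation", |
|     model=model_id, |
|     device_map="auto", |
| ) |
| |
| # Example usage: ask for the meaning of the Slovene word "drevo". |
| message = [{"role": "user", "content": "Kaj pomeni beseda 'drevo'?"}] |
| response = pline(message, max_new_tokens=512) |
| # The last entry of the returned conversation is the assistant's reply. |
| print("Model's response:", response[0]["generated_text"][-1]["content"]) |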
| ``` |
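
If you ask several questions in a row, it may be convenient to wrap the call pattern from the example above in small helpers. The following is only a minimal sketch: the names `build_messages`, `extract_answer`, and `ask` are hypothetical and not part of `transformers` or of this model, and the code assumes the pipeline returns the conversation in the same structure as in the example above.

```python
from typing import Any, Dict, List

Message = Dict[str, str]


def build_messages(question: str) -> List[Message]:
    """Wrap one question in the single-turn chat format used in the example above."""
    return [{"role": "user", "content": question}]


def extract_answer(response: List[Dict[str, Any]]) -> str:
    """Return the assistant's reply, i.e. the last message of the generated conversation.

    Assumption: the response has the shape shown in the example above,
    response[0]["generated_text"] being a list of chat messages.
    """
    return response[0]["generated_text"][-1]["content"]


def ask(pline: Any, question: str, max_new_tokens: int = 512) -> str:
    """Send one question through an already constructed text-generation pipeline."""
    response = pline(build_messages(question), max_new_tokens=max_new_tokens)
    return extract_answer(response)
```

With the `pline` object created above, `ask(pline, "Kaj pomeni beseda 'drevo'?")` would return only the text of the model's answer.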
|
|
| --- |
|
|
| ## Intended Use |
|
|
| * Answering lexicographic questions about Slovene words. |
| * Supporting research in computational lexicography and digital dictionaries. |
| * Assisting language learners and linguists with Slovene word meanings, synonyms, collocations, and usage. |
|
|
| --- |
|
|
| ## Limitations |
|
|
| * The model is optimized for **Slovene lexicographic tasks**; performance may be weaker in other domains or languages. |
| * Automatically generated training data may contain occasional noise or inconsistencies. |
| * The model may still produce incorrect or incomplete answers, especially for rare or highly context-dependent queries. |
|
|