---
license: gemma
language:
- sl
- en
- hr
- sr
- bs
base_model:
- cjvt/GaMS-9B-Instruct
pipeline_tag: text-generation
---

# Model Card for GaMS-9B-Instruct-Lex

GaMS-9B-Instruct-Lex is a specialized instruction-tuned model for Slovene lexicographic tasks. It is based on [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct) and has been further fine-tuned on a large set of lexicographic question–answer pairs.

The goal of this model is to improve the performance of large language models on Slovene lexicographic queries, such as definitions, synonyms, word forms, collocations, sense distinctions, and contextual usage.

---

## Acknowledgment

The model was developed as part of the **LLM4DH** project and within the [PoVeJMo](https://www.cjvt.si/povejmo/en/project/) research program. The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU.

We thank everyone who contributed to data collection and preparation, especially the participants of the promptathon workshop and the contributors of Slovene lexicographic resources.

---

## Basic information

* **Developed by:** University of Ljubljana, Faculty of Computer and Information Science
* **Project:** LLM4DH
* **Languages:** Slovene (primary); English, Croatian, Bosnian, Serbian (secondary)
* **Base model:** [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct)
* **License:** [Gemma](https://ai.google.dev/gemma/terms)

---

## Lexicographic Fine-tuning Data

To adapt GaMS-9B-Instruct to the lexicographic domain, we prepared a **specialized QA corpus** derived from Slovene lexical resources:

* **Digitalna slovarska baza (DSB)**
* **Bridge dictionary**
* **Word sense disambiguation dataset**
* **Slovene synonym dictionary**
* **Open Slovene WordNet (OSWN)**

Data sources for QA generation included:

* Automatically generated questions and answers (word forms, definitions, synonyms, collocations, context-based sense analysis).
* **ssj500k corpus** (sentence-level word analysis questions).
* **Promptathon workshop contributions** (25 manually created questions).
* **Slovene Language Advisory Service** (3,698 unique real-world questions and answers).
* GPT-4.1 was used to paraphrase and diversify the automatically generated QA pairs, improving their variety and naturalness.
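As a purely illustrative sketch (the field names and example content below are assumptions, not the released corpus format), a QA pair drawn from these sources could be converted into a chat-style instruction-tuning example like this:

```python
# Hypothetical sketch: wrapping one lexicographic QA pair as a user/assistant
# message list for instruction tuning. The "question"/"answer" field names and
# the example content are illustrative assumptions, not the actual corpus schema.

def qa_to_chat_example(qa: dict) -> list[dict]:
    """Convert one QA pair into a chat-format training example."""
    return [
        {"role": "user", "content": qa["question"]},
        {"role": "assistant", "content": qa["answer"]},
    ]

pair = {
    "question": "Kaj pomeni beseda 'drevo'?",  # "What does the word 'tree' mean?"
    "answer": "Drevo je lesnata rastlina z deblom in krošnjo.",  # "A tree is a woody plant with a trunk and a crown."
}
example = qa_to_chat_example(pair)
```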

---

## Training

The model was fine-tuned on the lexicographic QA dataset using supervised instruction tuning. This adaptation enables GaMS-9B-Instruct-Lex to provide more precise and consistent answers to lexicographic and linguistic questions in Slovene.
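In supervised instruction tuning, the loss is typically computed only on the answer tokens, with prompt positions masked out of the labels. A minimal sketch of that masking, using made-up token ids (the exact tokenization and training setup for this model are not specified here):

```python
# Minimal sketch of the label masking commonly used in supervised instruction
# tuning: the prompt positions are set to -100 (the ignore index of PyTorch's
# cross-entropy loss), so the loss is computed only on the answer tokens.
# Token ids below are made up for illustration.

IGNORE_INDEX = -100

def build_labels(prompt_ids: list[int], answer_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and answer ids; mask prompt positions in the labels."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels

# Toy example with fake token ids: three prompt tokens, three answer tokens.
input_ids, labels = build_labels([101, 7, 42], [9, 13, 102])
```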

---

## Usage

You can load and run the model with Hugging Face `transformers`:

```python
from transformers import pipeline

model_id = "cjvt/GaMS-9B-Instruct-Lex"

pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto"
)

# Example usage
message = [{"role": "user", "content": "Kaj pomeni beseda 'drevo'?"}]  # "What does the word 'tree' mean?"
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```

---

## Intended Use

* Answering lexicographic questions about Slovene words.
* Supporting research in computational lexicography and digital dictionaries.
* Assisting language learners and linguists with Slovene word meanings, synonyms, collocations, and usage.

---

## Limitations

* The model is optimized for **Slovene lexicographic tasks**; performance may vary in other domains or languages.
* Automatically generated training data may contain occasional noise or inconsistencies.
* The model may still produce incorrect or incomplete answers, especially for rare or highly context-dependent queries.