---
license: gemma
language:
- sl
- en
- hr
- sr
- bs
base_model:
- cjvt/GaMS-9B-Instruct
pipeline_tag: text-generation
---

# Model Card for GaMS-9B-Instruct-Lex

GaMS-9B-Instruct-Lex is an instruction-tuned model specialized for Slovene lexicographic tasks.
It is based on [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct) and has been further fine-tuned on a large set of lexicographic question–answer pairs.

The goal of this model is to improve the performance of large language models on Slovene lexicographic queries, such as definitions, synonyms, word forms, collocations, sense distinctions, and contextual usage.

---

## Acknowledgment

The model was developed as part of the **LLM4DH** project and within the [PoVeJMo](https://www.cjvt.si/povejmo/en/project/) research program.

The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU.

We thank everyone who contributed to data collection and preparation, especially the participants of the promptathon workshop and the contributors of Slovene lexicographic resources.

---

## Basic information

* **Developed by:** University of Ljubljana, Faculty of Computer and Information Science
* **Project:** LLM4DH
* **Languages:** Slovene (primary); English, Croatian, Bosnian, Serbian (secondary)
* **Base model:** [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct)
* **License:** [Gemma](https://ai.google.dev/gemma/terms)

---

## Lexicographic Fine-tuning Data

To adapt GaMS-9B-Instruct to the lexicographic domain, we prepared a **specialized QA corpus** derived from Slovene lexical resources:

* **Digitalna slovarska baza (DSB)**
* **Bridge dictionary**
* **Word sense disambiguation dataset**
* **Slovene synonym dictionary**
* **Open Slovene WordNet (OSWN)**

Data sources for QA generation included:

* Automatically generated questions and answers (word forms, definitions, synonyms, collocations, context-based sense analysis).
* **ssj500k corpus** (for sentence-level word analysis questions).
* **Promptathon workshop contributions** (25 manually created questions).
* **Slovene Language Advisory Service** (3,698 unique real-world questions and answers).
* GPT-4.1 was used to paraphrase and diversify auto-generated QA pairs, ensuring variety and naturalness.
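The released record format of this corpus is not shown here; as a purely hypothetical illustration of how one auto-generated QA pair could be stored as a chat-style record for supervised instruction tuning (the field names `source` and `messages` are assumptions, not the project's actual schema):

```python
# Hypothetical chat-style QA record for supervised instruction tuning.
# Field names ("source", "messages") are illustrative assumptions,
# not the project's published data schema.
qa_record = {
    "source": "synonym_dictionary",
    "messages": [
        {"role": "user", "content": "Naštej sopomenke besede 'hiša'."},
        {"role": "assistant", "content": "Sopomenke besede 'hiša' so: dom, domovanje, bivališče."},
    ],
}

# Minimal sanity check of the record structure
assert qa_record["messages"][0]["role"] == "user"
print(qa_record["messages"][-1]["content"])
```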

---

## Training

The model was fine-tuned on the lexicographic QA dataset using supervised instruction tuning.
This adaptation enables GaMS-9B-Instruct-Lex to provide more precise and consistent answers to lexicographic and linguistic questions in Slovene.

---

## Usage

You can load and run the model with Hugging Face `transformers`:

```python
from transformers import pipeline

model_id = "cjvt/GaMS-9B-Instruct-Lex"

pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto"
)

# Example usage
message = [{"role": "user", "content": "Kaj pomeni beseda 'drevo'?"}]
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```
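The final `print` above indexes into the chat-format output of the `text-generation` pipeline, where `generated_text` holds the full message list and the last entry is the assistant's reply. As a small sketch assuming that output shape (the helper name `last_reply` and the mocked reply text are illustrative, not produced by the model):

```python
def last_reply(pipeline_output):
    """Return the assistant's final reply from a chat-format
    text-generation pipeline output for a single input."""
    return pipeline_output[0]["generated_text"][-1]["content"]

# Mocked output in the shape described above, for illustration only
mock_output = [{
    "generated_text": [
        {"role": "user", "content": "Kaj pomeni beseda 'drevo'?"},
        {"role": "assistant", "content": "Drevo je lesnata rastlina z deblom in krošnjo."},
    ]
}]
print(last_reply(mock_output))
```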

---

## Intended Use

* Answering lexicographic questions about Slovene words.
* Supporting research in computational lexicography and digital dictionaries.
* Assisting language learners and linguists with Slovene word meanings, synonyms, collocations, and usage.

---

## Limitations

* The model is optimized for **Slovene lexicographic tasks**; performance may vary in other domains or languages.
* Automatically generated training data may contain occasional noise or inconsistencies.
* The model may still produce incorrect or incomplete answers, especially for rare or highly context-dependent queries.