---
license: gemma
language:
- sl
- en
- hr
- sr
- bs
base_model:
- cjvt/GaMS-9B-Instruct
pipeline_tag: text-generation
---

# Model Card for GaMS-9B-Instruct-Lex

GaMS-9B-Instruct-Lex is a specialized instruction-tuned model for Slovene lexicographic tasks. It is based on [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct) and has been further fine-tuned on a large set of lexicographic question–answer pairs.

The goal of this model is to improve the performance of large language models on Slovene lexicographic queries, such as definitions, synonyms, word forms, collocations, sense distinctions, and contextual usage.

---

## Acknowledgment

The model was developed as part of the **LLM4DH** project and within the [PoVeJMo](https://www.cjvt.si/povejmo/en/project/) research program. The program is funded within the Recovery and Resilience Plan by the Slovenian Research and Innovation Agency (ARIS) and NextGenerationEU.

We thank everyone who contributed to data collection and preparation, especially the participants of the promptathon workshop and the contributors of Slovene lexicographic resources.

---

## Basic information

* **Developed by:** University of Ljubljana, Faculty of Computer and Information Science
* **Project:** LLM4DH
* **Languages:** Slovene (primary); English, Croatian, Bosnian, Serbian (secondary)
* **Base model:** [cjvt/GaMS-9B-Instruct](https://huggingface.co/cjvt/GaMS-9B-Instruct)
* **License:** [Gemma](https://ai.google.dev/gemma/terms)

---

## Lexicographic Fine-tuning Data

To adapt GaMS-9B-Instruct to the lexicographic domain, we prepared a **specialized QA corpus** derived from Slovene lexical resources:

* **Digitalna slovarska baza (DSB)**
* **Bridge dictionary**
* **Word sense disambiguation dataset**
* **Slovene synonym dictionary**
* **Open Slovene WordNet (OSWN)**

Data sources for QA generation included:

* Automatically generated questions and answers (word forms, definitions, synonyms, collocations, context-based sense analysis).
* **ssj500k corpus** (sentence-level word analysis questions).
* **Promptathon workshop contributions** (25 manually created questions).
* **Slovene Language Advisory Service** (3,698 unique real-world questions and answers).
* GPT-4.1 was used to paraphrase and diversify the automatically generated QA pairs, improving their variety and naturalness.
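As a purely illustrative sketch (the field names and example content below are assumptions, not the released corpus format), a QA pair drawn from these sources could be converted into a chat-style instruction-tuning example like this:

```python
# Hypothetical sketch: wrapping one lexicographic QA pair as a user/assistant
# message list for instruction tuning. The "question"/"answer" field names and
# the example content are illustrative assumptions, not the actual corpus schema.

def qa_to_chat_example(qa: dict) -> list[dict]:
    """Convert one QA pair into a chat-format training example."""
    return [
        {"role": "user", "content": qa["question"]},
        {"role": "assistant", "content": qa["answer"]},
    ]

pair = {
    "question": "Kaj pomeni beseda 'drevo'?",  # "What does the word 'tree' mean?"
    "answer": "Drevo je lesnata rastlina z deblom in krošnjo.",  # "A tree is a woody plant with a trunk and a crown."
}
example = qa_to_chat_example(pair)
```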

---

## Training

The model was fine-tuned on the lexicographic QA dataset using supervised instruction tuning. This adaptation enables GaMS-9B-Instruct-Lex to provide more precise and consistent answers to lexicographic and linguistic questions in Slovene.
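In supervised instruction tuning, the loss is typically computed only on the answer tokens, with prompt positions masked out of the labels. A minimal sketch of that masking, using made-up token ids (the exact tokenization and training setup for this model are not specified here):

```python
# Minimal sketch of the label masking commonly used in supervised instruction
# tuning: the prompt positions are set to -100 (the ignore index of PyTorch's
# cross-entropy loss), so the loss is computed only on the answer tokens.
# Token ids below are made up for illustration.

IGNORE_INDEX = -100

def build_labels(prompt_ids: list[int], answer_ids: list[int]) -> tuple[list[int], list[int]]:
    """Concatenate prompt and answer ids; mask prompt positions in the labels."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels

# Toy example with fake token ids: three prompt tokens, three answer tokens.
input_ids, labels = build_labels([101, 7, 42], [9, 13, 102])
```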

---

## Usage

You can load and run the model with Hugging Face `transformers`:

```python
from transformers import pipeline

model_id = "cjvt/GaMS-9B-Instruct-Lex"

pline = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto"
)

# Example usage
message = [{"role": "user", "content": "Kaj pomeni beseda 'drevo'?"}]  # "What does the word 'tree' mean?"
response = pline(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```

---

## Intended Use

* Answering lexicographic questions about Slovene words.
* Supporting research in computational lexicography and digital dictionaries.
* Assisting language learners and linguists with Slovene word meanings, synonyms, collocations, and usage.

---

## Limitations

* The model is optimized for **Slovene lexicographic tasks**; performance may vary in other domains or languages.
* Automatically generated training data may contain occasional noise or inconsistencies.
* The model may still produce incorrect or incomplete answers, especially for rare or highly context-dependent queries.