	---
	language:
	- kk
	- ru
	- en
	base_model:
	- OpenGVLab/InternVL3_5-4B
	pipeline_tag: image-text-to-text
	---
	[Қазақша](#кіріспе)     [English](#introduction)

	# Qolda
	[![GitHub](https://img.shields.io/badge/GitHub-Qolda--deployment-blue?logo=github)](https://github.com/IS2AI/Qolda-deployment)
	[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

	## Introduction
	Built on top of InternVL3.5 and Qwen3, Qolda is a small vision-language model designed to operate in Kazakh, Russian, and English. The model has 4.3B parameters and comprises the InternViT-300M vision encoder and MLP Projector components from [InternVL3.5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B), along with the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) language model. Model training was performed using the [InternVL framework](https://github.com/OpenGVLab/InternVL) 💙

	The name "Qolda" reflects both its design and purpose in Kazakh: "in hand" (қолда) for its compact accessibility, and "to support" (қолдау) for its assistive nature.

	## Evaluation Results
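Evaluation was conducted separately for the text-only and vision-language modalities. On the Kazakh benchmarks below, Qolda improves substantially over its base models: in Think mode, the average rises from 57.73 to 71.64 on text tasks and from 42.58 to 60.44 on vision-language tasks. Performance on Russian and English remains comparable to the base models.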

	### Text Benchmarks
	![Model performance comparison on language benchmarks](assets/eval-results-text.png)
	Performance comparison on language tasks including MMLU, Winogrande, HellaSwag, ARC, GSM8K, and DROP.

	Note: The comparison below presents Qolda's performance against Qwen3-4B on Kazakh language benchmarks only. Evaluation results for additional models and performance on Russian and English will be added later.

| Model | Mode | Avg | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|-------|------|-----|------|------------|-----------|-----|-------|------|
| Qwen3-4B | Direct | 52.00 | 42.43 | 56.88 | 42.04 | 64.77 | 73.62 | 32.27 |
| Qwen3-4B | Think | 57.73 | 52.98 | 51.27 | 41.86 | 79.65 | 64.82 | 55.81 |
| Qolda | Direct | 58.77 | 46.55 | 56.37 | 55.75 | 73.62 | 63.50 | 56.84 |
| Qolda | Think | 71.64 | 64.56 | 70.54 | 57.70 | 89.99 | 79.47 | 67.59 |
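
As a quick sanity check on the table above, the `Avg` column matches the unweighted mean of the six benchmark scores. The short sketch below recomputes it from the values in the table; the dictionary layout and the names `text_scores`, `mean_score`, and `text_avg` are illustrative only.

```python
# Scores copied from the Kazakh text-benchmark table above, in column order:
# MMLU, Winogrande, HellaSwag, ARC, GSM8K, DROP.
text_scores = {
    ("Qwen3-4B", "Direct"): [42.43, 56.88, 42.04, 64.77, 73.62, 32.27],
    ("Qwen3-4B", "Think"): [52.98, 51.27, 41.86, 79.65, 64.82, 55.81],
    ("Qolda", "Direct"): [46.55, 56.37, 55.75, 73.62, 63.50, 56.84],
    ("Qolda", "Think"): [64.56, 70.54, 57.70, 89.99, 79.47, 67.59],
}


def mean_score(scores):
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# Recompute the Avg column for every (model, mode) row.
text_avg = {key: mean_score(vals) for key, vals in text_scores.items()}

for (model, mode), avg in text_avg.items():
    print(f"{model:9s} {mode:6s} {avg:.2f}")
```

Read this way, the Think-mode gap between Qolda and Qwen3-4B is 13.91 points on average.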

	### Vision Benchmarks
	![Model performance comparison on vision-language benchmarks](assets/eval-results-vision.png)
	Performance comparison on vision-language tasks including AI2D, MMStar, RealWorldQA, and KazakhOCR.

	Note: The comparison below presents Qolda's performance against InternVL3.5-4B on Kazakh vision-language benchmarks only. Evaluation results for additional models and performance on Russian and English will be added later.

| Model | Mode | Avg | AI2D | MMStar | RealWorldQA | KazakhOCR |
|-------|------|-----|------|--------|-------------|-----------|
| InternVL3.5-4B | Direct | 42.23 | 52.33 | 47.47 | 38.32 | 30.81 |
| InternVL3.5-4B | Think | 42.58 | 51.42 | 49.33 | 38.74 | 30.81 |
| Qolda | Direct | 59.39 | 66.06 | 55.47 | 54.97 | 61.06 |
| Qolda | Think | 60.44 | 67.62 | 56.53 | 57.07 | 60.54 |
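
To see where the improvement comes from, the per-benchmark differences can be read off the table above. The sketch below computes Qolda minus InternVL3.5-4B for each mode; the names `vision_scores` and `gains` are illustrative and not part of any released tooling.

```python
# Scores copied from the Kazakh vision-benchmark table above.
benchmarks = ["AI2D", "MMStar", "RealWorldQA", "KazakhOCR"]
vision_scores = {
    ("InternVL3.5-4B", "Direct"): [52.33, 47.47, 38.32, 30.81],
    ("InternVL3.5-4B", "Think"): [51.42, 49.33, 38.74, 30.81],
    ("Qolda", "Direct"): [66.06, 55.47, 54.97, 61.06],
    ("Qolda", "Think"): [67.62, 56.53, 57.07, 60.54],
}


def gains(mode):
    """Per-benchmark difference (Qolda minus InternVL3.5-4B) for one mode."""
    base = vision_scores[("InternVL3.5-4B", mode)]
    ours = vision_scores[("Qolda", mode)]
    return {name: round(o - b, 2) for name, o, b in zip(benchmarks, ours, base)}


for mode in ("Direct", "Think"):
    print(mode, gains(mode))
```

The largest difference is on KazakhOCR (about 30 points in both modes), while MMStar shows the smallest gain.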

	## Model Usage
	To run inference with Transformers, please follow the [guidelines](https://huggingface.co/OpenGVLab/InternVL3_5-4B#inference-with-transformers) from InternVL.

	Alternatively, to run the model via an OpenAI-compatible server, you can use lmdeploy:
```bash
# Quote the requirement so the shell does not treat ">" as a redirection
pip install "lmdeploy>=0.9.1"

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch
```

Note: Unlike the original InternVL3.5, this model requires the `enable_thinking` parameter to be set explicitly in the `extra_body` of your API calls. Even when thinking is enabled, the model may produce an empty thinking section, depending on the complexity of the task.
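
A minimal sketch of how the mode switch could be wired into a client, assuming the server reads `enable_thinking` from `extra_body` as the note describes. The helper name `sampling_kwargs` is hypothetical, and the sampling values simply mirror the ones used in the API example in this README; they are not tuned recommendations for either mode.

```python
def sampling_kwargs(enable_thinking: bool, max_tokens: int = 8192) -> dict:
    """Build keyword arguments for client.chat.completions.create().

    Hypothetical helper: it always sets enable_thinking explicitly,
    because this model requires the flag to be present in extra_body.
    """
    return {
        "max_tokens": max_tokens,
        "temperature": 0.6,
        "top_p": 0.95,
        # Fields outside the standard OpenAI schema are sent via extra_body.
        "extra_body": {"top_k": 20, "enable_thinking": enable_thinking},
    }


think_kwargs = sampling_kwargs(enable_thinking=True)
direct_kwargs = sampling_kwargs(enable_thinking=False)
print(direct_kwargs["extra_body"])
```

The resulting dictionary can be unpacked into the call, for example `client.chat.completions.create(model=..., messages=..., **direct_kwargs)`. Setting the flag to `False` presumably corresponds to the Direct rows in the evaluation tables.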

	Then, make a standard API call:

```python
import base64
from openai import OpenAI

# Point the client at the lmdeploy server started above.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    # Use the first model reported by the server.
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                # Kazakh prompt: "Describe the given diagram."
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    # Non-standard parameters go in extra_body; enable_thinking must be set explicitly.
    extra_body={
        "top_k": 20,
        "enable_thinking": True
    },
)

print(response.choices[0].message.content)
```
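
Because the thinking section can come back empty, it may help to separate it from the final answer before displaying the response. The sketch below assumes the reasoning is wrapped in `<think>…</think>` tags inside `message.content`, as in Qwen3-style chat templates. This README does not confirm the output format, so the tag names, the `split_thinking` helper, and the sample string are all assumptions.

```python
import re

# Assumed delimiter format: a single <think> ... </think> span in the reply.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is '' if the span is absent or empty."""
    match = THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    thinking = match.group(1).strip()
    # Drop the matched span and keep everything around it as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return thinking, answer


# Made-up reply with an empty thinking section, for illustration only.
sample = "<think>\n\n</think>\n\nThe chart compares two models."
thinking, answer = split_thinking(sample)
print(repr(thinking), "|", answer)
```

With a live server, the same helper would be applied to `response.choices[0].message.content`. If the server is configured to return reasoning in a separate field, this step is unnecessary.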

	## License
	This model is licensed under the Apache License 2.0.


	## Кіріспе
	InternVL3.5 және Qwen3 негізінде жасалған Qolda — қазақ, орыс және ағылшын тілдерінде жұмыс істеуге арналған шағын көру-тілдік моделі (vision-language model). Модель 4,3 млрд параметрге ие және [InternVL3.5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B) моделінің InternViT-300M көру энкодері мен MLP проектор компоненттерін, сондай-ақ [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) тілдік моделін қамтиды. Модельді оқыту [InternVL фреймворкі](https://github.com/OpenGVLab/InternVL) көмегімен жүзеге асырылды 💙

	"Qolda" атауы модельдің дизайны мен мақсатын қазақ тіліндегі қолда сөзінің қос мағынасы арқылы көрсетеді. Біріншісі, шағын әрі қолжетімді болуы үшін "қолда" cөзі арқылы және екіншісі, көмекші табиғаты үшін, "қолдау" мағынасы арқылы.

	## Бағалау нәтижелері
	Мәтіндік және көру-тілдік модальділіктер үшін бағалау бөлек жүргізілді. Qolda орыс және ағылшын тілдеріндегі өзінің бастапқы деңгейін сақтай отырып, қазақ тіліндегі өнімділігін айтарлықтай жақсартты.

	### Мәтіндік бенчмарктар
	![Тілдік бенчмарктардағы модель өнімділігін салыстыру](assets/eval-results-text.png)
	MMLU, Winogrande, HellaSwag, ARC, GSM8K және DROP сияқты тілдік тапсырмалардағы өнімділікті салыстыру.

	Ескерту: Төмендегі кестедегі Qolda және Qwen3-4B модельдерінің салыстырылуы тек қазақ тіліндегі бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

| Model | Mode | Avg | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|-------|------|-----|------|------------|-----------|-----|-------|------|
| Qwen3-4B | Direct | 52.00 | 42.43 | 56.88 | 42.04 | 64.77 | 73.62 | 32.27 |
| Qwen3-4B | Think | 57.73 | 52.98 | 51.27 | 41.86 | 79.65 | 64.82 | 55.81 |
| Qolda | Direct | 58.77 | 46.55 | 56.37 | 55.75 | 73.62 | 63.50 | 56.84 |
| Qolda | Think | 71.64 | 64.56 | 70.54 | 57.70 | 89.99 | 79.47 | 67.59 |

	### Көру бенчмарктары
	![Көру-тілдік бенчмарктарындағы модель өнімділігін салыстыру](assets/eval-results-vision.png)
	AI2D, MMStar, RealWorldQA және KazakhOCR сияқты көру-тілдік тапсырмаларындағы өнімділікті салыстыру.

	Ескерту: Төмендегі кестедегі Qolda және InternVL3.5-4B модельдерінің салыстырылуы тек қазақ тіліндегі көру-тілдік бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

| Model | Mode | Avg | AI2D | MMStar | RealWorldQA | KazakhOCR |
|-------|------|-----|------|--------|-------------|-----------|
| InternVL3.5-4B | Direct | 42.23 | 52.33 | 47.47 | 38.32 | 30.81 |
| InternVL3.5-4B | Think | 42.58 | 51.42 | 49.33 | 38.74 | 30.81 |
| Qolda | Direct | 59.39 | 66.06 | 55.47 | 54.97 | 61.06 |
| Qolda | Think | 60.44 | 67.62 | 56.53 | 57.07 | 60.54 |

	## Модельді қолдану
	Transformers арқылы инференсті іске қосу үшін InternVL ұсынған [нұсқаулықтарды](https://huggingface.co/OpenGVLab/InternVL3_5-4B#inference-with-transformers) орындаңыз.

	Немесе, модельді OpenAI-үйлесімді сервер арқылы іске қосу үшін lmdeploy құралын пайдалануға болады:
```bash
# Quote the requirement so the shell does not treat ">" as a redirection
pip install "lmdeploy>=0.9.1"

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch
```

	Ескерту: Qolda-ның түпнұсқалық InternVL3.5-тен айырмашылығы, бұл модель API call жасаған кезде `extra_body` бөлігінде `enable_thinking` параметрінің нақты орнатылуын талап етеді. Тапсырманың күрделілігіне байланысты бос thinking жауабы қайтарылуы мүмкін.

	Содан соң, стандартты API call жасаңыз:

```python
import base64
from openai import OpenAI

# Point the client at the lmdeploy server started above.
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    # Use the first model reported by the server.
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    # Non-standard parameters go in extra_body; enable_thinking must be set explicitly.
    extra_body={
        "top_k": 20,
        "enable_thinking": True
    },
)

print(response.choices[0].message.content)
```

	## Лицензия
	Бұл модель Apache License 2.0 бойынша лицензияланған.