---
language:
- kk
- ru
- en
base_model:
- OpenGVLab/InternVL3_5-4B
pipeline_tag: image-text-to-text
---
[Қазақша](#кіріспе)&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[English](#introduction)

# Qolda
[![GitHub](https://img.shields.io/badge/GitHub-Qolda--deployment-blue?logo=github)](https://github.com/IS2AI/Qolda-deployment)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

## Introduction
Built on top of InternVL3.5 and Qwen3, **Qolda** is a small vision-language model designed to operate in Kazakh, Russian, and English. The model has 4.3B parameters and comprises the InternViT-300M vision encoder and MLP Projector components from [InternVL3.5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B), along with the [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) language model. Model training was performed using the [InternVL framework](https://github.com/OpenGVLab/InternVL) 💙

The name "Qolda" draws on two Kazakh words that reflect the model's design and purpose: қолда ("in hand"), for its compact, accessible size, and қолдау ("to support"), for its assistive role.

## Evaluation Results
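Text-only and vision-language capabilities were evaluated separately. On the Kazakh benchmarks reported below, Qolda in Think mode averages 71.64 on text tasks (versus 57.73 for Qwen3-4B) and 60.44 on vision-language tasks (versus 42.58 for InternVL3.5-4B), while maintaining comparable performance on Russian and English.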

### Text Benchmarks
![Model performance comparison on language benchmarks](assets/eval-results-text.png)
*Performance comparison on language tasks including MMLU, Winogrande, HellaSwag, ARC, GSM8K, and DROP.*

**Note:** The comparison below presents Qolda's performance against Qwen3-4B on **Kazakh** language benchmarks only. Evaluation results for additional models and performance on Russian and English will be added later.

| Model | Mode | Avg | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|-------|------|-----|------|------------|-----------|-----|-------|------|
| Qwen3-4B | Direct | 52.00 | 42.43 | 56.88 | 42.04 | 64.77 | 73.62 | 32.27 |
| Qwen3-4B | Think | 57.73 | 52.98 | 51.27 | 41.86 | 79.65 | 64.82 | 55.81 |
| Qolda | Direct | 58.77 | 46.55 | 56.37 | 55.75 | 73.62 | 63.50 | 56.84 |
| Qolda | Think | **71.64** | **64.56** | **70.54** | **57.70** | **89.99** | **79.47** | **67.59** |
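
The source does not state how the Avg column is computed. The values are consistent with an unweighted mean of the six benchmark scores, which the short check below reproduces. The dictionary layout and the `mean_score` helper are illustrative names, not part of any released evaluation code.

```python
# Kazakh text-benchmark scores copied from the table above, in column order:
# MMLU, Winogrande, HellaSwag, ARC, GSM8K, DROP
text_scores = {
    ("Qwen3-4B", "Direct"): [42.43, 56.88, 42.04, 64.77, 73.62, 32.27],
    ("Qwen3-4B", "Think"): [52.98, 51.27, 41.86, 79.65, 64.82, 55.81],
    ("Qolda", "Direct"): [46.55, 56.37, 55.75, 73.62, 63.50, 56.84],
    ("Qolda", "Think"): [64.56, 70.54, 57.70, 89.99, 79.47, 67.59],
}


def mean_score(scores):
    """Unweighted arithmetic mean, rounded to two decimals like the table."""
    return round(sum(scores) / len(scores), 2)


# Assumption: Avg is a plain mean over the six benchmarks (not weighted by size).
averages = {key: mean_score(vals) for key, vals in text_scores.items()}

for (model, mode), avg in averages.items():
    print(f"{model:9s} {mode:7s} {avg:.2f}")
```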

### Vision Benchmarks
![Model performance comparison on vision-language benchmarks](assets/eval-results-vision.png)
*Performance comparison on vision-language tasks including AI2D, MMStar, RealWorldQA, and KazakhOCR.*

**Note:** The comparison below presents Qolda's performance against InternVL3.5-4B on **Kazakh** vision-language benchmarks only. Evaluation results for additional models and performance on Russian and English will be added later.

| Model | Mode | Avg | AI2D | MMStar | RealWorldQA | KazakhOCR |
|-------|------|--------|--------|----------|---------------|-------------|
| InternVL3.5-4B | Direct | 42.23 | 52.33 | 47.47 | 38.32 | 30.81 |
| InternVL3.5-4B | Think | 42.58 | 51.42 | 49.33 | 38.74 | 30.81 |
| Qolda | Direct | 59.39 | 66.06 | 55.47 | 54.97 | **61.06** |
| Qolda | Think | **60.44** | **67.62** | **56.53** | **57.07** | 60.54 |
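
To see where the gains over InternVL3.5-4B are concentrated, the per-benchmark differences can be computed directly from the table. This is a minimal sketch; `vision_scores` and `gains` are hypothetical names introduced here for illustration.

```python
# Kazakh vision-language scores copied from the table above.
benchmarks = ["AI2D", "MMStar", "RealWorldQA", "KazakhOCR"]
vision_scores = {
    ("InternVL3.5-4B", "Direct"): [52.33, 47.47, 38.32, 30.81],
    ("InternVL3.5-4B", "Think"): [51.42, 49.33, 38.74, 30.81],
    ("Qolda", "Direct"): [66.06, 55.47, 54.97, 61.06],
    ("Qolda", "Think"): [67.62, 56.53, 57.07, 60.54],
}


def gains(mode):
    """Qolda score minus InternVL3.5-4B score, per benchmark, in the same mode."""
    base = vision_scores[("InternVL3.5-4B", mode)]
    ours = vision_scores[("Qolda", mode)]
    return {name: round(o - b, 2) for name, o, b in zip(benchmarks, ours, base)}


for mode in ("Direct", "Think"):
    print(mode, gains(mode))
```

In both modes the largest difference is on KazakhOCR (about 30 points) and the smallest is on MMStar.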

## Model Usage
To run inference with Transformers, please follow the [guidelines](https://huggingface.co/OpenGVLab/InternVL3_5-4B#inference-with-transformers) from InternVL.

Alternatively, to run the model via an OpenAI-compatible server, you can use lmdeploy:
```bash
# Quote the requirement so the shell does not treat ">" as an output redirect.
pip install "lmdeploy>=0.9.1"

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch
```

**Note:** Unlike the original InternVL3.5, this model requires the `enable_thinking` parameter to be set explicitly in the `extra_body` of each API call. Even when thinking is enabled, the thinking portion of the response may be empty, depending on the complexity of the task.

Then, make a standard API call:

```python
import base64
from openai import OpenAI

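# Point the client at the lmdeploy server started above (port 23333).
client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')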

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                # Kazakh prompt: "Give a description of the given diagram."
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "enable_thinking": True  # must be set explicitly (True or False)
    },
)

print(response.choices[0].message.content)
```
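
Because `enable_thinking` has to be set on every call and the thinking portion can come back empty, it may help to wrap both concerns in small helpers. The sketch below rests on two assumptions that the source does not confirm: that the server returns the reasoning inline as a Qwen3-style `<think>...</think>` block inside `message.content`, and that the same sampling settings are appropriate with thinking disabled. `build_request_options` and `split_thinking` are hypothetical names introduced here.

```python
import re


def build_request_options(enable_thinking: bool) -> dict:
    """Sampling options mirroring the call above, with thinking set explicitly."""
    return {
        "max_tokens": 8192,
        "temperature": 0.6,
        "top_p": 0.95,
        "extra_body": {"top_k": 20, "enable_thinking": enable_thinking},
    }


# Assumption: reasoning arrives inline as <think>...</think> in the content string.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(content: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is "" when the block is absent or empty."""
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    thinking = match.group(1).strip()
    # Remove the think block and keep whatever surrounds it as the answer.
    answer = (content[:match.start()] + content[match.end():]).strip()
    return thinking, answer
```

With these in place, a call would look like `client.chat.completions.create(model=..., messages=..., **build_request_options(False))`, followed by `thinking, answer = split_thinking(response.choices[0].message.content)`. If the server is configured to return reasoning in a separate field instead, `split_thinking` simply returns an empty thinking string and the unchanged answer.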

## License
This model is licensed under the Apache License 2.0.


## Кіріспе
InternVL3.5 және Qwen3 негізінде жасалған **Qolda** — қазақ, орыс және ағылшын тілдерінде жұмыс істеуге арналған шағын көру-тілдік моделі (vision-language model). Модель 4,3 млрд параметрге ие және [InternVL3.5-4B](https://huggingface.co/OpenGVLab/InternVL3_5-4B) моделінің InternViT-300M көру энкодері мен MLP проектор компоненттерін, сондай-ақ [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) тілдік моделін қамтиды. Модельді оқыту [InternVL фреймворкі](https://github.com/OpenGVLab/InternVL) көмегімен жүзеге асырылды 💙

"Qolda" атауы модельдің дизайны мен мақсатын қазақ тіліндегі қолда сөзінің қос мағынасы арқылы көрсетеді. Біріншісі, шағын әрі қолжетімді болуы үшін "қолда" cөзі арқылы және екіншісі, көмекші табиғаты үшін, "қолдау" мағынасы арқылы.

## Бағалау нәтижелері
Мәтіндік және көру-тілдік модальділіктер үшін бағалау бөлек жүргізілді. Qolda орыс және ағылшын тілдеріндегі өзінің бастапқы деңгейін сақтай отырып, қазақ тіліндегі өнімділігін айтарлықтай жақсартты.

### Мәтіндік бенчмарктар
![Тілдік бенчмарктардағы модель өнімділігін салыстыру](assets/eval-results-text.png)
*MMLU, Winogrande, HellaSwag, ARC, GSM8K және DROP сияқты тілдік тапсырмалардағы өнімділікті салыстыру.*

**Ескерту:** Төмендегі кестедегі Qolda және Qwen3-4B модельдерінің салыстырылуы тек **қазақ** тіліндегі бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

| Model | Mode | Avg | MMLU | Winogrande | HellaSwag | ARC | GSM8K | DROP |
|-------|------|-----|------|------------|-----------|-----|-------|------|
| Qwen3-4B | Direct | 52.00 | 42.43 | 56.88 | 42.04 | 64.77 | 73.62 | 32.27 |
| Qwen3-4B | Think | 57.73 | 52.98 | 51.27 | 41.86 | 79.65 | 64.82 | 55.81 |
| Qolda | Direct | 58.77 | 46.55 | 56.37 | 55.75 | 73.62 | 63.50 | 56.84 |
| Qolda | Think | **71.64** | **64.56** | **70.54** | **57.70** | **89.99** | **79.47** | **67.59** |

### Көру бенчмарктары
![Көру-тілдік бенчмарктарындағы модель өнімділігін салыстыру](assets/eval-results-vision.png)
*AI2D, MMStar, RealWorldQA және KazakhOCR сияқты көру-тілдік тапсырмаларындағы өнімділікті салыстыру.*

**Ескерту:** Төмендегі кестедегі Qolda және InternVL3.5-4B модельдерінің салыстырылуы тек **қазақ** тіліндегі көру-тілдік бенчмарктар нәтижелерін көрсетеді. Басқа модельдердің өнімділігі, сондай-ақ орыс және ағылшын тілдеріндегі көрсеткіштер кейінірек ұсынылады.

| Model | Mode | Avg | AI2D | MMStar | RealWorldQA | KazakhOCR |
|-------|------|--------|--------|----------|---------------|-------------|
| InternVL3.5-4B | Direct | 42.23 | 52.33 | 47.47 | 38.32 | 30.81 |
| InternVL3.5-4B | Think | 42.58 | 51.42 | 49.33 | 38.74 | 30.81 |
| Qolda | Direct | 59.39 | 66.06 | 55.47 | 54.97 | **61.06** |
| Qolda | Think | **60.44** | **67.62** | **56.53** | **57.07** | 60.54 |

## Модельді қолдану
Transformers арқылы инференсті іске қосу үшін InternVL ұсынған [нұсқаулықтарды](https://huggingface.co/OpenGVLab/InternVL3_5-4B#inference-with-transformers) орындаңыз.

Немесе, модельді OpenAI-үйлесімді сервер арқылы іске қосу үшін lmdeploy құралын пайдалануға болады:
```bash
# Quote the requirement so the shell does not treat ">" as an output redirect.
pip install "lmdeploy>=0.9.1"

lmdeploy serve api_server issai/Qolda --server-port 23333 --tp 1 --backend pytorch
```

**Ескерту:** Qolda-ның түпнұсқалық InternVL3.5-тен айырмашылығы, бұл модель API call жасаған кезде `extra_body` бөлігінде `enable_thinking` параметрінің нақты орнатылуын талап етеді. Тапсырманың күрделілігіне байланысты бос thinking жауабы қайтарылуы мүмкін.

Содан соң, стандартты API call жасаңыз:

```python
import base64
from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://0.0.0.0:23333/v1')

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

image_path = "./assets/eval-results-text.png"

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{
        'role': 'user',
        'content': [
            {
                'type': 'text',
                'text': 'Берілген диаграмманың сипаттамасын бер.'
            },
            {
                'type': 'image_url',
                'image_url': {
                    'url': f'data:image/png;base64,{encode_image(image_path)}',
                },
            }
        ],
    }],
    max_tokens=8192,
    temperature=0.6,
    top_p=0.95,
    extra_body={
        "top_k": 20,
        "enable_thinking": True
    },
)

print(response.choices[0].message.content)
```

## Лицензия
Бұл модель Apache License 2.0 бойынша лицензияланған.