---
language:
- kk
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# HordeVision: Open-Source Kazakh Vision-Language Model

**HordeVision** is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.

## Model Description

HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at:

- **Image Captioning**: Generating detailed, contextual descriptions in Kazakh
- **Visual Question Answering (VQA)**: Answering diverse questions about image content
- **OCR**: Extracting and reading Kazakh text from images
- **Visual Reasoning**: Making inferences about context, causality, and temporal states
- **Instruction Following**: Executing multi-step visual tasks based on user commands

## Key Features

- First open-source Kazakh vision-language model
- Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
- Two-stage training: Supervised Fine-Tuning (SFT) followed by reinforcement learning with GRPO (Group Relative Policy Optimization)
- Ranks #1 on every evaluation task among comparable multilingual models (see the table below)

## Model Performance Summary

| Model | Caption | VQA | OCR | Reasoning | Instruction Following | Avg. Rank |
|-------|---------|-----|-----|-----------|-----------------------|-----------|
| horde-vision | 83.5 (↑12.3%) | 68.1 (↑5.3%) | 64.7 (↑2.6%) | 77.4 (↑5.7%) | 70.5 (↑5.9%) | #1 |
| Qolda | 75.2 (↑8.7%) | 61.7 (↑3.0%) | 60.6 (↑2.0%) | 70.3 (↑2.9%) | 62.2 (↑2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (↑0.5%) | 53.6 (↑1.1%) | 59.3 (↑2.1%) | 55.5 (↑0.7%) | 49.5 (↑0.9%) | #3 |
| gemma-3-4b-it | 42.0 (↑0.1%) | 41.8 (↑0.4%) | 50.3 (↑2.3%) | 53.0 (↑0.6%) | 42.5 (↑0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (↑0.0%) | 41.6 (↑0.4%) | 51.0 (↑0.9%) | 44.6 (↑0.3%) | 37.7 (↑0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (↑0.1%) | 38.0 (↑0.3%) | 15.0 (↑0.1%) | 43.4 (↑0.3%) | 36.4 (↑0.3%) | #6 |
| InternVL3-8B | 26.1 (↑0.6%) | 29.0 (↑0.0%) | 29.1 (↑0.3%) | 27.3 (↑0.0%) | 25.7 (↑0.0%) | #7 |

**Comparison**: Outperforms Google Gemma 3-4B-IT, InternVL3-8B, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct, Meta Llama-3.2-11B-Vision, and ISSAI Qolda across all tasks.

## Dataset

The training dataset was collected using a synthetic data generation pipeline:

- **Size**: 45k training images, 5k validation images
- **Categories**: 21 main categories, 104 subcategories, ~2,600 keyword phrases
- **Coverage**: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
- **Quality**: Deduplicated with imagededup and filtered by aesthetic score (a minimal sketch of the deduplication step follows this list)
- **Annotation**: Labeled using GPT-4.1 with structured prompts for consistent quality
- **Split Strategy**: Entity-level stratification to ensure models are tested on completely unseen entities
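
The deduplication step can be reproduced roughly as follows. This is a minimal sketch: the card only names imagededup, so the perceptual-hash method and the directory name are assumptions, and the aesthetic-scoring filter is omitted because the card does not say which scorer was used.

```python
# Minimal deduplication sketch with imagededup's perceptual hashing (PHash).
# The image directory name is an illustrative assumption.
from imagededup.methods import PHash

phasher = PHash()
encodings = phasher.encode_images(image_dir="kk_images/")

# Filenames whose hash collides with an image that is already kept
to_remove = set(phasher.find_duplicates_to_remove(encoding_map=encodings))
kept = [name for name in encodings if name not in to_remove]
print(f"Kept {len(kept)} of {len(encodings)} images after deduplication")
```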

## Training Details

### Supervised Fine-Tuning (SFT)

- **Data**: 46k images
- **LoRA Rank**: 128
- **Epochs**: 1
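
In `peft` terms, the SFT adapter might be configured as in the sketch below. Only the rank (128) comes from the card; the alpha, dropout, and target modules are assumptions picked as common defaults for attention-projection LoRA.

```python
# Hypothetical SFT LoRA setup; only r=128 is stated on the model card.
from peft import LoraConfig, get_peft_model
from transformers import Qwen3VLForConditionalGeneration

base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", dtype="auto", device_map="auto"
)

sft_lora = LoraConfig(
    r=128,                 # LoRA rank from the card
    lora_alpha=256,        # assumption: alpha = 2 * r
    lora_dropout=0.05,     # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, sft_lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```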

### Reinforcement Learning (GRPO)

- **Data**: 5k images
- **LoRA Rank**: 64
- **Epochs**: 1
- **Judge**: GPT-4.1-mini with custom Kazakh evaluation prompts
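
GRPO samples a group of completions per prompt and normalizes each completion's reward against the group mean, so the judge only needs to return a scalar score per completion. A minimal sketch of such a judge reward is below; the prompt wording, 0-10 scale, and parsing are assumptions, since the card only states that GPT-4.1-mini with custom Kazakh prompts served as judge.

```python
# Hypothetical LLM-judge reward for GRPO; only the judge model name comes from the card.
from openai import OpenAI

client = OpenAI()

def judge_reward(question: str, candidate: str) -> float:
    """Score one sampled completion in [0, 1] using GPT-4.1-mini as judge."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {
                "role": "system",
                "content": "Rate the Kazakh answer from 0 to 10 for correctness "
                           "and fluency. Reply with the number only.",
            },
            {"role": "user", "content": f"Question: {question}\nAnswer: {candidate}"},
        ],
    )
    try:
        score = float(response.choices[0].message.content.strip())
    except ValueError:
        score = 0.0  # unparseable judgments earn zero reward
    return max(0.0, min(score / 10.0, 1.0))
```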

## How to Use

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "kz-transformers/horde-vision", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios (requires the flash-attn package).
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "kz-transformers/horde-vision",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            # Kazakh prompt: "Describe this image."
            {"type": "text", "text": "Бұл суретті сипаттаңыз."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Citation

```bibtex
```