---
language:
- kk
license: apache-2.0
base_model:
- Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# HordeVision: Open-Source Kazakh Vision-Language Model

**HordeVision** is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.

## Model Description

HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at:

- **Image Captioning**: Generating detailed, contextual descriptions in Kazakh
- **Visual Question Answering (VQA)**: Answering diverse questions about image content
- **OCR**: Extracting and reading Kazakh text from images
- **Visual Reasoning**: Making inferences about context, causality, and temporal states
- **Instruction Following**: Executing multi-step visual tasks based on user commands

## Key Features

- First open-source Kazakh vision-language model
- Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
- Two-stage training: Supervised Fine-Tuning (SFT) followed by reinforcement learning with GRPO (Group Relative Policy Optimization)
- Ranks #1 on every evaluation task among comparable multilingual models (see the table below)

## Model Performance Summary

| Model | Caption | VQA | OCR | Reasoning | Instruction Following | Avg. Rank |
|-------|---------|-----|-----|-----------|-----------------------|-----------|
| horde-vision | 83.5 (↑12.3%) | 68.1 (↑5.3%) | 64.7 (↑2.6%) | 77.4 (↑5.7%) | 70.5 (↑5.9%) | #1 |
| Qolda | 75.2 (↑8.7%) | 61.7 (↑3.0%) | 60.6 (↑2.0%) | 70.3 (↑2.9%) | 62.2 (↑2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (↑0.5%) | 53.6 (↑1.1%) | 59.3 (↑2.1%) | 55.5 (↑0.7%) | 49.5 (↑0.9%) | #3 |
| gemma-3-4b-it | 42.0 (↑0.1%) | 41.8 (↑0.4%) | 50.3 (↑2.3%) | 53.0 (↑0.6%) | 42.5 (↑0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (↑0.0%) | 41.6 (↑0.4%) | 51.0 (↑0.9%) | 44.6 (↑0.3%) | 37.7 (↑0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (↑0.1%) | 38.0 (↑0.3%) | 15.0 (↑0.1%) | 43.4 (↑0.3%) | 36.4 (↑0.3%) | #6 |
| InternVL3-8B | 26.1 (↑0.6%) | 29.0 (↑0.0%) | 29.1 (↑0.3%) | 27.3 (↑0.0%) | 25.7 (↑0.0%) | #7 |

**Comparison**: Outperforms Google Gemma 3-4B-IT, InternVL3-8B, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct, Meta Llama-3.2-11B-Vision, and ISSAI Qolda across all tasks.

## Dataset

The training dataset was collected using a synthetic data generation pipeline:

- **Size**: 45k training images, 5k validation images
- **Categories**: 21 main categories, 104 subcategories, ~2,600 keyword phrases
- **Coverage**: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
- **Quality**: Deduplicated with imagededup and filtered by aesthetic score (a minimal sketch of the deduplication step follows this list)
- **Annotation**: Labeled using GPT-4.1 with structured prompts for consistent quality
- **Split Strategy**: Entity-level stratification to ensure models are tested on completely unseen entities
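
The deduplication step can be reproduced roughly as follows. This is a minimal sketch: the card only names imagededup, so the perceptual-hash method and the directory name are assumptions, and the aesthetic-scoring filter is omitted because the card does not say which scorer was used.

```python
# Minimal deduplication sketch with imagededup's perceptual hashing (PHash).
# The image directory name is an illustrative assumption.
from imagededup.methods import PHash

phasher = PHash()
encodings = phasher.encode_images(image_dir="kk_images/")

# Filenames whose hash collides with an image that is already kept
to_remove = set(phasher.find_duplicates_to_remove(encoding_map=encodings))
kept = [name for name in encodings if name not in to_remove]
print(f"Kept {len(kept)} of {len(encodings)} images after deduplication")
```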

## Training Details

### Supervised Fine-Tuning (SFT)

- **Data**: 46k images
- **LoRA Rank**: 128
- **Epochs**: 1
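
In `peft` terms, the SFT adapter might be configured as in the sketch below. Only the rank (128) comes from the card; the alpha, dropout, and target modules are assumptions picked as common defaults for attention-projection LoRA.

```python
# Hypothetical SFT LoRA setup; only r=128 is stated on the model card.
from peft import LoraConfig, get_peft_model
from transformers import Qwen3VLForConditionalGeneration

base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", dtype="auto", device_map="auto"
)

sft_lora = LoraConfig(
    r=128,                 # LoRA rank from the card
    lora_alpha=256,        # assumption: alpha = 2 * r
    lora_dropout=0.05,     # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, sft_lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```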

### Reinforcement Learning (GRPO)

- **Data**: 5k images
- **LoRA Rank**: 64
- **Epochs**: 1
- **Judge**: GPT-4.1-mini with custom Kazakh evaluation prompts
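
GRPO samples a group of completions per prompt and normalizes each completion's reward against the group mean, so the judge only needs to return a scalar score per completion. A minimal sketch of such a judge reward is below; the prompt wording, 0-10 scale, and parsing are assumptions, since the card only states that GPT-4.1-mini with custom Kazakh prompts served as judge.

```python
# Hypothetical LLM-judge reward for GRPO; only the judge model name comes from the card.
from openai import OpenAI

client = OpenAI()

def judge_reward(question: str, candidate: str) -> float:
    """Score one sampled completion in [0, 1] using GPT-4.1-mini as judge."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {
                "role": "system",
                "content": "Rate the Kazakh answer from 0 to 10 for correctness "
                           "and fluency. Reply with the number only.",
            },
            {"role": "user", "content": f"Question: {question}\nAnswer: {candidate}"},
        ],
    )
    try:
        score = float(response.choices[0].message.content.strip())
    except ValueError:
        score = 0.0  # unparseable judgments earn zero reward
    return max(0.0, min(score / 10.0, 1.0))
```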

## How to Use

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# Default: load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "kz-transformers/horde-vision", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios (requires the flash-attn package).
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "kz-transformers/horde-vision",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            # Kazakh prompt: "Describe this image."
            {"type": "text", "text": "Бұл суретті сипаттаңыз."},
        ],
    }
]

# Prepare inputs for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Citation

```bibtex
```