---
language:
  - kk
license: apache-2.0
base_model:
  - Qwen/Qwen3-VL-8B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---

# HordeVision: Open-Source Kazakh Vision-Language Model

HordeVision is a vision-language model specifically trained for the Kazakh language, designed to handle OCR, image captioning, visual question answering (VQA), reasoning, and instruction-following tasks.

## Model Description

HordeVision is built to address the lack of vision-language models for low-resource languages like Kazakh. The model excels at:

- **Image Captioning**: Generating detailed, contextual descriptions in Kazakh
- **Visual Question Answering (VQA)**: Answering diverse questions about image content
- **OCR**: Extracting and reading Kazakh text from images
- **Visual Reasoning**: Making inferences about context, causality, and temporal states
- **Instruction Following**: Executing multi-step visual tasks based on user commands

## Key Features

- First open-source Kazakh vision-language model
- Trained on ~50k culturally relevant images covering daily life, education, work, culture, and heritage
- Two-stage training: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (GRPO)
- Ranks #1 across all evaluation tasks among comparable multilingual models

## Model Performance Summary

| Model | Caption | VQA | OCR | Reasoning | Instruction Following | Avg. Rank |
|---|---|---|---|---|---|---|
| horde-vision | 83.5 (↑12.3%) | 68.1 (↑5.3%) | 64.7 (↑2.6%) | 77.4 (↑5.7%) | 70.5 (↑5.9%) | #1 |
| Qolda | 75.2 (↑8.7%) | 61.7 (↑3.0%) | 60.6 (↑2.0%) | 70.3 (↑2.9%) | 62.2 (↑2.8%) | #2 |
| Qwen3-VL-8B-Instruct | 41.3 (↑0.5%) | 53.6 (↑1.1%) | 59.3 (↑2.1%) | 55.5 (↑0.7%) | 49.5 (↑0.9%) | #3 |
| gemma-3-4b-it | 42.0 (↑0.1%) | 41.8 (↑0.4%) | 50.3 (↑2.3%) | 53.0 (↑0.6%) | 42.5 (↑0.5%) | #4 |
| Qwen2.5-VL-7B-Instruct | 35.4 (↑0.0%) | 41.6 (↑0.4%) | 51.0 (↑0.9%) | 44.6 (↑0.3%) | 37.7 (↑0.3%) | #5 |
| Llama-3.2-11B-Vision | 36.2 (↑0.1%) | 38.0 (↑0.3%) | 15.0 (↑0.1%) | 43.4 (↑0.3%) | 36.4 (↑0.3%) | #6 |
| InternVL3-8B | 26.1 (↑0.6%) | 29.0 (↑0.0%) | 29.1 (↑0.3%) | 27.3 (↑0.0%) | 25.7 (↑0.0%) | #7 |

**Comparison:** Outperforms Google Gemma 3-4B-IT, InternVL3-8B, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct, and ISSAI Qolda across all tasks.

## Dataset

The training dataset was collected using a synthetic data generation pipeline:

- **Size**: 45k training images, 5k validation images
- **Categories**: 21 main categories, 104 subcategories, ~2,600 keyword phrases
- **Coverage**: Daily contexts, social life, education, work/economy, media/communications, culture and heritage
- **Quality**: Filtered with imagededup for deduplication, plus aesthetic scoring (see the sketch after this list)
- **Annotation**: Labeled using GPT-4.1 with structured prompts for consistent quality
- **Split Strategy**: Entity-level stratification to ensure models are tested on completely unseen entities
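
The filtering code is not published; the following is a minimal sketch of how near-duplicate removal with imagededup might look, assuming a perceptual-hash method and a hypothetical `raw_images/` directory.

```python
# Minimal sketch of near-duplicate filtering with imagededup.
# The hash method and directory name are assumptions; the actual
# pipeline is not published.
from imagededup.methods import PHash

phasher = PHash()

# Hash every image in the (hypothetical) raw collection directory.
encodings = phasher.encode_images(image_dir="raw_images/")

# Map each image to its near-duplicates, then keep one copy per pair.
duplicates = phasher.find_duplicates(encoding_map=encodings)
to_remove = {
    dup
    for image, dups in duplicates.items()
    for dup in dups
    if image < dup  # retain the lexicographically first file of each pair
}
kept = [name for name in encodings if name not in to_remove]
print(f"Kept {len(kept)} of {len(encodings)} images after deduplication")
```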

## Training Details

### Supervised Fine-Tuning (SFT)

- **Data**: 46k images
- **LoRA Rank**: 128 (a configuration sketch follows below)
- **Epochs**: 1
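
The fine-tuning code itself is not released. As an illustration of what a rank-128 LoRA setup could look like with the peft library, here is a hedged sketch; the target modules, alpha, and dropout values are assumptions, not published hyperparameters.

```python
# Hypothetical LoRA configuration for the SFT stage. Only the rank (128)
# comes from this model card; every other value is an illustrative guess.
from peft import LoraConfig, get_peft_model
from transformers import Qwen3VLForConditionalGeneration

base = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-8B-Instruct", dtype="auto", device_map="auto"
)

lora_config = LoraConfig(
    r=128,                 # LoRA rank stated in the model card
    lora_alpha=256,        # assumed; a common choice is 2 * r
    lora_dropout=0.05,     # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```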

### Reinforcement Learning (GRPO)

- **Data**: 5k images
- **LoRA Rank**: 64
- **Epochs**: 1
- **Judge**: GPT-4.1-mini with custom Kazakh evaluation prompts (a reward-function sketch follows below)
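
The custom Kazakh judge prompts are not published. The sketch below shows one way a GPT-4.1-mini judge could be turned into a scalar GRPO reward via the OpenAI Python client; the prompt wording and the 0-10 scale are assumptions.

```python
# Hypothetical judge-based reward for GRPO. GPT-4.1-mini as judge is stated
# in the model card; the prompt text and scoring scale are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_reward(question: str, answer: str) -> float:
    """Score a candidate Kazakh answer in [0, 1] with an LLM judge."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate the following Kazakh answer to the question on a 0-10 "
                "scale for correctness and fluency. Reply with the number only.\n\n"
                f"Question: {question}\nAnswer: {answer}"
            ),
        }],
    )
    score = float(response.choices[0].message.content.strip())
    return score / 10.0  # normalize to [0, 1] for use as a GRPO reward
```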

## How to Use

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

# default: Load the model on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "kz-transformers/horde-vision", dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     "kz-transformers/horde-vision",
#     dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

processor = AutoProcessor.from_pretrained("kz-transformers/horde-vision")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "ะ‘าฑะป ััƒั€ะตั‚ั‚ั– ัะธะฟะฐั‚ั‚ะฐาฃั‹ะท."},
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Citation