🦙 Finetuned Vision-Language Model

Developed by: ozertuu
License: apache-2.0

🔧 Installation

Before using the model, install the necessary dependencies:

pip install unsloth transformers accelerate bitsandbytes torch pillow
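As a quick sanity check after installation, the sketch below reports which of the packages above cannot be found in the current environment. The helper name `missing_modules` and the `REQUIRED_MODULES` list are illustrative and not part of any library; the only non-obvious detail is that `pillow` is imported as `PIL`.

```python
from importlib.util import find_spec

# Import names for the packages installed above (pillow is imported as PIL).
REQUIRED_MODULES = ["unsloth", "transformers", "accelerate", "bitsandbytes", "torch", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be located (nothing is imported)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing:", ", ".join(missing) if missing else "none")
```

If any names are printed, rerun the `pip install` command for those packages before loading the model.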

Load the model with 8-bit quantization and generate a description for an image:

from unsloth import FastVisionModel
from PIL import Image
import torch
from transformers import BitsAndBytesConfig

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 8-bit quantization; modules offloaded to the CPU are kept in FP32.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model, tokenizer = FastVisionModel.from_pretrained(
    "ozertuu/Llama-3.2-11B-VL-Turkish-Captioner",
    use_gradient_checkpointing="unsloth",
    device_map="auto",
    quantization_config=bnb_config,
)

# Switch the model to Unsloth's inference mode.
FastVisionModel.for_inference(model)

def predict_radiology_description(image, instruction):
    """Generate a description of a PIL image for the given text instruction."""
    try:
        messages = [{"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": instruction}
        ]}]
        input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

        inputs = tokenizer(
            image,
            input_text,
            add_special_tokens=False,
            return_tensors="pt",
        ).to(device)

        # Sampling must be enabled for temperature and min_p to take effect.
        output_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=True,
            temperature=1.5,
            min_p=0.1,
        )

        generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return generated_text.replace("assistant", "\n\nassistant").strip()
    except Exception as e:
        return f"Error: {str(e)}"


image_path = "1695678287384.jpg"
# Turkish prompt: "Describe this image in detail"
instruction = "Bu resmi detaylı bir şekilde açıkla"

image = Image.open(image_path).convert("RGB")
output = predict_radiology_description(image, instruction)
print(output)

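Example output (Turkish):

Görüntü, kısa koyu saçlı ve açık kahverengi gözlü, gözlük takan ve siyah bir ceket giyen bir adamın portresidir.
Koyu bir arka plana karşı duran büyük bir siyah harf G ile loş bir şekilde aydınlatılmış bir ortamda duruyor.
Adam, yüzünde kararlı bir ifade ile doğrudan kameraya bakıyor.
İfadesi, belirli bir görevi veya projeyi yerine getiriyor gibi bir hedefe odaklanmış olduğunu gösteriyor.

English translation:

The image is a portrait of a man with short dark hair and light brown eyes, wearing glasses and a black jacket.
He is standing in a dimly lit setting with a large black letter G set against a dark background.
The man looks directly at the camera with a determined expression on his face.
His expression suggests that he is focused on a goal, as if carrying out a particular task or project.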
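The usage code decodes the full generated sequence, so the returned text still contains the prompt followed by the role name `assistant`. The sketch below shows one way to keep only the reply. It assumes the decoded string has the shape `user ... assistant ... reply`; the helper name `extract_assistant_reply` and the sample string are hypothetical, not actual model output.

```python
def extract_assistant_reply(decoded: str, marker: str = "assistant") -> str:
    """Return the text after the last occurrence of `marker`, or the whole text if absent."""
    _, found, reply = decoded.rpartition(marker)
    return reply.strip() if found else decoded.strip()


# Hypothetical decoded string, shaped like a prompt followed by the reply.
decoded = "user\n\nBu resmi detaylı bir şekilde açıklaassistant\n\nGörüntü, bir adamın portresidir."
print(extract_assistant_reply(decoded))
```

Because this splits on the last `assistant` in the text, a reply that itself contains that word would be cut short. Slicing the prompt tokens off `output_ids` before decoding avoids that limitation.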

This model is a fine-tuned version of Llama 3.2 Vision-Instruct. It was trained with Unsloth and Hugging Face's TRL library, a combination that enables up to 2x faster training.

Format: Safetensors
Model size: 11B params
Tensor type: BF16
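The model size and tensor type give a rough idea of why the usage code loads the model in 8-bit. The sketch below estimates the memory for the weights alone, taking "11B params" at face value. It ignores activations, the KV cache, and any modules that stay unquantized, so treat the numbers as a lower bound rather than a measurement. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 11e9  # "11B params", taken at face value

print(weight_memory_gb(NUM_PARAMS, 2))  # BF16 uses 2 bytes per parameter -> 22.0
print(weight_memory_gb(NUM_PARAMS, 1))  # 8-bit uses 1 byte per parameter -> 11.0
```

On this estimate, 8-bit loading roughly halves the weight footprint compared with BF16.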

Collection including ozertuu/Llama-3.2-11B-VL-Turkish-Captioner: Turkish image description (all data has been translated into Turkish; 49 items; updated Dec 5, 2025)