InternVL2.5-1B – Stage 3 (Multi-turn)
1. Model Overview
This model is part of a Vision-Language AI system designed for chest X-ray analysis in Vietnamese clinical settings.
The full pipeline consists of 3 stages:
- Stage 1: Findings generation (image → radiology findings)
- Stage 2: Impression generation (image → clinical impression)
- Stage 3: Multi-turn conversation (findings + impression + dialogue)
This repository corresponds to:
- Stage: 3 (Multi-turn)
- Task: Multi-turn reasoning with findings and impression
- Domain: Vietnamese medical imaging (Chest X-ray)
The model supports multi-turn dialogue, where:
- Turn 1: Generate findings
- Turn 2: Generate clinical impression based on previous context
2. Installation
```shell
pip install torch torchvision transformers decord pillow
```
3. Inference
GPU with bfloat16 is recommended.
```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    # Convert to RGB, resize, and normalize with ImageNet statistics.
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])
    return transform

def dynamic_preprocess(image, image_size=448, max_num=12):
    # Simplified single-view preprocessing: the full InternVL recipe tiles the
    # image into up to `max_num` crops, but here a single resized view is used.
    resized = image.resize((image_size, image_size))
    return [resized]

def load_image(image_file, input_size=448):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size)
    images = dynamic_preprocess(image, image_size=input_size)
    pixel_values = [transform(img) for img in images]
    return torch.stack(pixel_values)  # shape: (num_views, 3, input_size, input_size)
```
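As a sanity check, the normalization that `build_transform` applies can be reproduced with plain NumPy (a minimal sketch independent of torchvision; the mid-gray dummy array stands in for a real chest X-ray):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

# Dummy 448x448 RGB image (uniform mid-gray) standing in for an X-ray.
img = np.full((448, 448, 3), 128, dtype=np.uint8)

x = img.astype(np.float32) / 255.0      # ToTensor: scale pixel values to [0, 1]
x = (x - IMAGENET_MEAN) / IMAGENET_STD  # Normalize: per-channel mean/std
x = x.transpose(2, 0, 1)                # HWC -> CHW layout, as ToTensor produces

print(x.shape)  # (3, 448, 448)
```

This is the tensor layout the model expects per view; `load_image` stacks one such tensor per preprocessed view.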
```python
path = "THP2903/InternVL2_5-1B_multi"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

pixel_values = load_image("your_image.jpg").to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)
```
```python
# Turn 1: Findings
# "X-ray image of a 48-year-old male patient, PA view. Describe the patient's findings."
question1 = "<image>\nẢnh chụp xray bệnh nhân nam, 48 tuổi PA. Mô tả thông tin bệnh nhân."
response1, history = model.chat(
    tokenizer,
    pixel_values,
    question1,
    generation_config,
    history=None,
    return_history=True
)
print("Turn 1:", response1)

# Turn 2: Impression
# "What is the diagnostic conclusion?"
question2 = "Kết luận bệnh gì?"
response2, history = model.chat(
    tokenizer,
    pixel_values,
    question2,
    generation_config,
    history=history,
    return_history=True
)
print("Turn 2:", response2)
```
4. Notes
- Input must be a chest X-ray image
- Turn 1 generates findings
- Turn 2 generates clinical impression using conversation history
- History must be passed between turns
- This model uses the original multi-turn interface of InternVL
- For best performance, consider using Qwen2-VL-7B
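In InternVL's chat interface, the `history` returned by `model.chat` is a list of (question, answer) pairs. The mock below (a stand-in for the real model call, for illustration only) shows how that context accumulates so that Turn 2 can condition on Turn 1's findings:

```python
def mock_chat(question, history=None):
    # Stand-in for model.chat(tokenizer, pixel_values, question, ...);
    # the real call also consumes the image tensor and generation config.
    answer = f"(generated answer to: {question})"
    new_history = (history or []) + [(question, answer)]
    return answer, new_history

findings, history = mock_chat("<image>\nDescribe the findings.")
impression, history = mock_chat("What is the impression?", history)
print(len(history))  # 2 turns recorded; turn 2 saw turn 1's findings
```

Dropping `history=history` on the second call would make the impression turn start from scratch, losing the findings context.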