LaVy-8B – Stage 3 (Multi-turn)

1. Model Overview

This model is part of a Vision-Language AI system designed for chest X-ray analysis in Vietnamese clinical settings.

The full pipeline consists of 3 stages:

  • Stage 1: Findings generation (image → radiology findings)
  • Stage 2: Impression generation (image → clinical impression)
  • Stage 3: Multi-turn conversation (findings + impression + dialogue)

This repository corresponds to:

  • Stage: 3 (Multi-turn)
  • Task: Multi-turn reasoning with findings and impression
  • Domain: Vietnamese medical imaging (Chest X-ray)

The model supports multi-turn dialogue, where:

  • Turn 1: Generate findings
  • Turn 2: Generate clinical impression based on previous context

2. Installation

pip install torch torchvision transformers pillow
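
After installing, a quick check that the packages can be imported may save a failed model download later. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_modules` and the `REQUIRED_MODULES` list are illustrative and not part of this repository. Note that the `pillow` package is imported under the name `PIL`.

```python
from importlib.util import find_spec

# Import names for the pip packages listed above.
# Assumption: pillow is checked under its import name, PIL.
REQUIRED_MODULES = ["torch", "torchvision", "transformers", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only confirms that the modules can be found; it does not check versions or CUDA availability.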

3. Inference

A GPU is recommended. The example loads the model in float16 and lets device_map="auto" choose where to place it.

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "THP2903/lavy-Instruct_multi_full",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained(
    "THP2903/lavy-Instruct_multi_full",
    trust_remote_code=True
)

image = Image.open("your_image.jpg").convert("RGB")

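# Turn 1: Findings
# Prompt in English: "X-ray image of a male patient, 48 years old, PA view.
# Describe the patient's information."
inputs = processor(
    images=image,
    text="Ảnh chụp xray bệnh nhân nam, 48 tuổi PA. Mô tả thông tin benh nhân.",
    return_tensors="pt"
).to(model.device)  # follow the device picked by device_map="auto"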

outputs = model.generate(
    **inputs,
    max_new_tokens=512
)

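# Assumption: generate() returns the prompt tokens followed by the new tokens,
# as Hugging Face causal LMs typically do, so drop the prompt before decoding.
prompt_length = inputs["input_ids"].shape[1]
response1 = processor.batch_decode(
    outputs[:, prompt_length:],
    skip_special_tokens=True
)[0].strip()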

print("Turn 1:", response1)

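# Turn 2: Impression (reuse the Turn 1 response as context)
# Follow-up question in English: "What is the diagnosis?"
inputs = processor(
    images=image,
    text=f"Previous findings: {response1}\nKết luận bệnh gì?",
    return_tensors="pt"
).to(model.device)  # follow the device picked by device_map="auto"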

outputs = model.generate(
    **inputs,
    max_new_tokens=512
)

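# Assumption: the output starts with the prompt tokens (see Turn 1), so skip them.
prompt_length = inputs["input_ids"].shape[1]
response2 = processor.batch_decode(
    outputs[:, prompt_length:],
    skip_special_tokens=True
)[0].strip()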

print("Turn 2:", response2)

4. Notes

  • Input must be a chest X-ray image
  • Turn 1 generates findings
  • Turn 2 generates clinical impression using previous findings as context
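  • The example simulates multi-turn dialogue through prompt concatenation: the Turn 1 output is inserted into the Turn 2 prompt, and no conversation state is carried between the two generate() calls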
  • For best performance, consider using Qwen2-VL-7B
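
The prompt concatenation described in the notes above can be generalized to any number of turns. The following is a minimal sketch, not this repository's implementation: `build_prompt`, `run_dialogue`, and the "Turn N question/answer" template are hypothetical, and the prompt format the model was actually trained on is not documented here. The `generate` argument stands in for a function that wraps the processor and `model.generate` calls from the inference example.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (question, answer)


def build_prompt(history: List[Turn], question: str) -> str:
    """Place earlier turns in front of the new question.

    Assumption: the "Turn N question/answer" labels are illustrative only.
    """
    lines = []
    for number, (past_question, past_answer) in enumerate(history, start=1):
        lines.append(f"Turn {number} question: {past_question}")
        lines.append(f"Turn {number} answer: {past_answer}")
    lines.append(question)
    return "\n".join(lines)


def run_dialogue(generate: Callable[[str], str], questions: List[str]) -> List[Turn]:
    """Ask each question in order, feeding all earlier turns back as context."""
    history: List[Turn] = []
    for question in questions:
        prompt = build_prompt(history, question)
        answer = generate(prompt).strip()
        history.append((question, answer))
    return history
```

Because all context lives in the prompt text, the prompt grows with every turn, so long dialogues may need truncation or a larger token budget.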