---
tags:
- vision-language
- multimodal
- image-question-answering
- biomedical
- transformers
- huggingface
- fastvision
license: openrail
language:
- en
datasets:
- axiong/pmc_oa_demo
library_name: transformers
model-index:
  - name: Medical Image QA Model (PMC-OA)
    results: []
---

# 🩺 Medical Image QA Model — Vision-Language Expert

This is a multimodal model fine-tuned for **image-based biomedical question answering and captioning**, trained on scientific figures from the [PMC Open Access subset](https://huggingface.co/datasets/axiong/pmc_oa_demo). The model takes a biomedical image and an optional question, then generates an expert-level description or answer.

---

## 🧠 Model Architecture

- **Base Model:** `FastVisionModel` (e.g., a BLIP, MiniGPT4, or Flamingo-style model)
- **Backbone:** Vision encoder + LLM (supports `apply_chat_template` for prompt formatting)
- **Trained for Tasks:**
  - Biomedical image captioning
  - Image-based question answering

---

## 🧬 Dataset

- **Name:** [axiong/pmc_oa_demo](https://huggingface.co/datasets/axiong/pmc_oa_demo)
- **Samples:** 100 (demo subset)
- **Fields:**
  - `image`: Biomedical figure (from scientific paper)
  - `caption`: Expert-written caption
  - `question`: (optional) User query about the image
  - `answer`: (optional) Expert response
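
For fine-tuning, each record can be mapped into the chat format that `apply_chat_template` expects. A minimal sketch (the helper name and the instruction string are illustrative, not part of the released code):

```python
def to_conversation(sample):
    """Map one PMC-OA record to chat messages; falls back to captioning
    when the optional question/answer fields are absent."""
    instruction = "You are an expert Doctor. Describe accurately what you see in this image."
    user_text = sample.get("question") or instruction
    target = sample.get("answer") or sample.get("caption", "")
    return [
        {"role": "user", "content": [
            {"type": "image", "image": sample["image"]},
            {"type": "text", "text": user_text},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": target},
        ]},
    ]
```

Records with only a `caption` become captioning examples; records carrying a `question`/`answer` pair become VQA examples.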

---

## 🧪 Example Usage

### 🔍 Visual Inference with Instruction & Optional Question

```python
from transformers import TextStreamer
from datasets import load_dataset
from unsloth import FastVisionModel
import matplotlib.pyplot as plt

# Load the demo dataset and the fine-tuned checkpoint
# (replace "path/to/your-checkpoint" with the actual model repo id).
dataset = load_dataset("axiong/pmc_oa_demo", split="train")
model, tokenizer = FastVisionModel.from_pretrained("path/to/your-checkpoint")

# Switch the model into inference mode
FastVisionModel.for_inference(model)

sample = dataset[10]
image = sample["image"]
caption = sample.get("caption", "")

# Display the image
plt.imshow(image)
plt.axis("off")
plt.title("Input Image")
plt.show()

instruction = "You are an expert Doctor. Describe accurately what you see in this image."
question = input("Please enter your question about the image (or press Enter to skip): ").strip()

# Build messages for the chat template
user_content = [
    {"type": "image", "image": image},
    {"type": "text", "text": instruction},
]
if question:
    user_content.append({"type": "text", "text": question})

messages = [{"role": "user", "content": user_content}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# The tokenizer doubles as a processor here: it takes both the image
# and the formatted prompt text
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Optional: display true caption for comparison
print("\nGround Truth Caption:\n", caption)