---
tags:
- vision-language
- multimodal
- image-question-answering
- biomedical
- transformers
- huggingface
- fastvision
license: openrail
language:
- en
datasets:
- axiong/pmc_oa_demo
library_name: transformers
model-index:
  - name: Medical Image QA Model (PMC-OA)
    results: []
---

# 🩺 Medical Image QA Model — Vision-Language Expert

This is a multimodal model fine-tuned for **image-based biomedical question answering and captioning**, trained on scientific figures from the [PMC Open Access subset](https://huggingface.co/datasets/axiong/pmc_oa_demo). The model takes a biomedical image and an optional question, then generates an expert-level description or answer.

---

## 🧠 Model Architecture

- **Base Model:** `FastVisionModel` (e.g., a BLIP, MiniGPT4, or Flamingo-style model)
- **Backbone:** Vision encoder + LLM (supports `apply_chat_template` for prompt formatting)
- **Trained for Tasks:**
  - Biomedical image captioning
  - Image-based question answering

---

## 🧬 Dataset

- **Name:** [axiong/pmc_oa_demo](https://huggingface.co/datasets/axiong/pmc_oa_demo)
- **Samples:** 100 (demo subset)
- **Fields:**
  - `image`: Biomedical figure (from scientific paper)
  - `caption`: Expert-written caption
  - `question`: (optional) User query about the image
  - `answer`: (optional) Expert response
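
For fine-tuning, each record can be mapped into the chat format that `apply_chat_template` expects. A minimal sketch (the helper name and the instruction string are illustrative, not part of the released code):

```python
def to_conversation(sample):
    """Map one PMC-OA record to chat messages; falls back to captioning
    when the optional question/answer fields are absent."""
    instruction = "You are an expert Doctor. Describe accurately what you see in this image."
    user_text = sample.get("question") or instruction
    target = sample.get("answer") or sample.get("caption", "")
    return [
        {"role": "user", "content": [
            {"type": "image", "image": sample["image"]},
            {"type": "text", "text": user_text},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": target},
        ]},
    ]
```

Records with only a `caption` become captioning examples; records carrying a `question`/`answer` pair become VQA examples.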

---

## 🧪 Example Usage

### 🔍 Visual Inference with Instruction & Optional Question

```python
from transformers import TextStreamer
from datasets import load_dataset
from unsloth import FastVisionModel
import matplotlib.pyplot as plt

# Load the demo dataset and the fine-tuned checkpoint
# (replace "path/to/your-checkpoint" with the actual model repo id).
dataset = load_dataset("axiong/pmc_oa_demo", split="train")
model, tokenizer = FastVisionModel.from_pretrained("path/to/your-checkpoint")

# Switch the model into inference mode
FastVisionModel.for_inference(model)

sample = dataset[10]
image = sample["image"]
caption = sample.get("caption", "")

# Display the image
plt.imshow(image)
plt.axis("off")
plt.title("Input Image")
plt.show()

instruction = "You are an expert Doctor. Describe accurately what you see in this image."
question = input("Please enter your question about the image (or press Enter to skip): ").strip()

# Build messages for the chat template
user_content = [
    {"type": "image", "image": image},
    {"type": "text", "text": instruction},
]
if question:
    user_content.append({"type": "text", "text": question})

messages = [{"role": "user", "content": user_content}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# The tokenizer doubles as a processor here: it takes both the image
# and the formatted prompt text
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Optional: display true caption for comparison
print("\nGround Truth Caption:\n", caption)