---
tags:
- vision-language
- multimodal
- image-question-answering
- biomedical
- transformers
- huggingface
- fastvision
license: openrail
language:
- en
datasets:
- axiong/pmc_oa_demo
library_name: transformers
model-index:
- name: Medical Image QA Model (PMC-OA)
  results: []
---
# 🩺 Medical Image QA Model — Vision-Language Expert

This is a multimodal model fine-tuned for **image-based biomedical question answering and captioning**, trained on scientific figures from the [PMC Open Access subset](https://huggingface.co/datasets/axiong/pmc_oa_demo). The model takes a biomedical image and an optional question, then generates an expert-level description or answer.
---

## 🧠 Model Architecture

- **Base Model:** `FastVisionModel` (e.g., a BLIP-, MiniGPT-4-, or Flamingo-style model)
- **Backbone:** Vision encoder + LLM (supports `apply_chat_template` for prompt formatting)
- **Trained for Tasks:**
  - Biomedical image captioning
  - Image-based question answering
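Because the backbone relies on `apply_chat_template`, prompts are built as chat messages whose `content` list mixes image and text parts. A minimal sketch of that message shape (the `build_messages` helper is illustrative, not part of the model's API; the `"type"`/`"image"`/`"text"` keys follow the common multimodal chat convention and may differ for other backbones):

```python
# Build a multimodal chat message: one image part plus one or more text parts.
# The "image" value would normally be a PIL.Image; a placeholder string is used here.
def build_messages(image, instruction, question=None):
    content = [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]
    if question:  # the question part is optional
        content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = build_messages("<PIL image>", "Describe this figure.", "Is there a lesion?")
```

The resulting `messages` list is what gets passed to `tokenizer.apply_chat_template(...)` in the usage example below.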
---
## 🧬 Dataset

- **Name:** [axiong/pmc_oa_demo](https://huggingface.co/datasets/axiong/pmc_oa_demo)
- **Samples:** 100 samples (demo)
- **Fields:**
  - `image`: Biomedical figure (from a scientific paper)
  - `caption`: Expert-written caption
  - `question`: (optional) User query about the image
  - `answer`: (optional) Expert response
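Since `question` and `answer` are optional, each row can serve either as a QA example or as a captioning example. A small sketch of that fallback logic (the field names come from the dataset card above; the `row_to_example` helper and the default instruction string are illustrative):

```python
# Map one dataset row to a (prompt, target) pair:
# rows with both question and answer become QA examples,
# the rest fall back to captioning with a fixed instruction.
def row_to_example(row):
    if row.get("question") and row.get("answer"):
        return row["question"], row["answer"]
    return "Describe accurately what you see in this image.", row["caption"]

# Example rows (images omitted for brevity)
qa_row = {"caption": "CT scan.", "question": "Which organ is shown?", "answer": "The liver."}
cap_row = {"caption": "Histology slide of liver tissue."}
```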
---
## 🧪 Example Usage

### 🔍 Visual Inference with Instruction & Optional Question
```python
from unsloth import FastVisionModel
from transformers import TextStreamer
import matplotlib.pyplot as plt

# Assumes `model` and `tokenizer` were loaded via FastVisionModel.from_pretrained(...)
# and `dataset` via datasets.load_dataset("axiong/pmc_oa_demo", split="train").
FastVisionModel.for_inference(model)  # switch the model to inference mode

sample = dataset[10]
image = sample["image"]
caption = sample.get("caption", "")

# Display the input image
plt.imshow(image)
plt.axis("off")
plt.title("Input Image")
plt.show()

instruction = "You are an expert Doctor. Describe accurately what you see in this image."
question = input("Please enter your question about the image (or press Enter to skip): ").strip()

# Build the multimodal message for the chat template
user_content = [
    {"type": "image", "image": image},
    {"type": "text", "text": instruction},
]
if question:
    user_content.append({"type": "text", "text": question})

messages = [{"role": "user", "content": user_content}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize the image and prompt together, then stream the generated tokens
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Optional: display the ground-truth caption for comparison
print("\nGround Truth Caption:\n", caption)
```