---
tags:
- vision-language
- multimodal
- image-question-answering
- biomedical
- transformers
- huggingface
- fastvision
license: openrail
language:
- en
datasets:
- axiong/pmc_oa_demo
library_name: transformers
model-index:
- name: Medical Image QA Model (PMC-OA)
  results: []
---

# 🩺 Medical Image QA Model — Vision-Language Expert

This is a multimodal model fine-tuned for **image-based biomedical question answering and captioning**, trained on scientific figures from the [PMC Open Access subset](https://huggingface.co/datasets/axiong/pmc_oa_demo). The model takes a biomedical image and an optional question, then generates an expert-level description or answer.

---

## 🧠 Model Architecture

- **Base Model:** `FastVisionModel` (e.g., a BLIP-, MiniGPT-4-, or Flamingo-style model)
- **Backbone:** Vision encoder + LLM (supports `apply_chat_template` for prompt formatting)
- **Trained for Tasks:**
  - Biomedical image captioning
  - Image-based question answering

---

## 🧬 Dataset

- **Name:** [axiong/pmc_oa_demo](https://huggingface.co/datasets/axiong/pmc_oa_demo)
- **Samples:** 100 (demo subset)
- **Fields:**
  - `image`: Biomedical figure from a scientific paper
  - `caption`: Expert-written caption
  - `question` (optional): User query about the image
  - `answer` (optional): Expert response

---

## 🧪 Example Usage

### 🔍 Visual Inference with Instruction & Optional Question

```python
from unsloth import FastVisionModel
from datasets import load_dataset
from transformers import TextStreamer
import matplotlib.pyplot as plt

# Assumes `model` and `tokenizer` were already loaded,
# e.g. via FastVisionModel.from_pretrained(...).
dataset = load_dataset("axiong/pmc_oa_demo", split="train")

# Switch the model into inference mode
FastVisionModel.for_inference(model)

sample = dataset[10]
image = sample["image"]
caption = sample.get("caption", "")

# Display the input image
plt.imshow(image)
plt.axis("off")
plt.title("Input Image")
plt.show()

instruction = "You are an expert Doctor. Describe accurately what you see in this image."
question = input("Please enter your question about the image (or press Enter to skip): ").strip()

# Build messages for the chat template
user_content = [
    {"type": "image", "image": image},
    {"type": "text", "text": instruction},
]
if question:
    user_content.append({"type": "text", "text": question})

messages = [{"role": "user", "content": user_content}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda")

# Stream generated tokens to stdout as they are produced
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Optional: display the ground-truth caption for comparison
print("\nGround Truth Caption:\n", caption)
```
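The optional-question logic above can be factored into a small, reusable helper. This is a minimal sketch with no model or GPU required; `build_user_messages` is a hypothetical helper name, not part of the model's API, and the image argument is a placeholder for a PIL image from the dataset.

```python
def build_user_messages(image, instruction, question=""):
    """Build a chat-template message list from an image, an instruction,
    and an optional follow-up question (skipped when empty)."""
    user_content = [
        {"type": "image", "image": image},
        {"type": "text", "text": instruction},
    ]
    question = question.strip()
    if question:
        user_content.append({"type": "text", "text": question})
    return [{"role": "user", "content": user_content}]


# Example: with a question, the content list has three entries
messages = build_user_messages(
    image="<PIL image placeholder>",
    instruction="You are an expert Doctor. Describe accurately what you see in this image.",
    question="Is there evidence of a fracture?",
)
print(len(messages[0]["content"]))  # 3
```

The returned list can be passed directly to `tokenizer.apply_chat_template(..., add_generation_prompt=True)` as in the inference snippet above.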