---
base_model: Qwen/Qwen2.5-VL-7B-Instruct
library_name: peft
---
# 🩺 PointDetectCount-Qwen2.5-VL-7B-LoRA


**Model:** `SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA`
**Base model:** [`Qwen/Qwen2.5-VL-7B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
**Library:** `peft` (LoRA)
**Paper:** [arXiv:2505.16647](https://doi.org/10.48550/arXiv.2505.16647)
**Code:** [GitHub - simula/PointDetectCount](https://github.com/simula/PointDetectCount)
**Dataset:** [`SimulaMet/MedMultiPoints`](https://huggingface.co/datasets/SimulaMet/MedMultiPoints)


---

## 📌 Model Summary


`PointDetectCount-Qwen2.5-VL-7B-LoRA` is a **multi-task medical vision-language model** fine-tuned with **LoRA** on top of **Qwen2.5-VL-7B-Instruct**, an instruction-following vision-language model. It performs **pointing (localization)**, **bounding box detection**, and **object counting** on medical images, taking natural language prompts as input and returning structured JSON outputs.


It is trained on the [MedMultiPoints dataset](https://huggingface.co/datasets/SimulaMet/MedMultiPoints), a multimodal collection of endoscopic and microscopic images with clinical annotations.
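
For example, a detection-and-counting query might produce a structured response like the one below. The field names follow the training format described under Training Details; the coordinates and count are made up for illustration.

```python
# Hypothetical structured response for a "detect and count polyps" prompt.
# Keys follow the training format {"bbox": [...], "count": n, "points": [...]};
# all values are illustrative, not real model output.
example_response = {
    "points": [[412, 310], [590, 233]],                    # (x, y) centre of each detected object
    "bbox": [[380, 280, 450, 345], [560, 200, 625, 270]],  # [x_min, y_min, x_max, y_max] per object
    "count": 2,
}
```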


---


## 🧠 Intended Uses


- **Medical image localization**: Predict spatial locations (points or bounding boxes) of anatomical and clinical findings.
- **Object counting**: Estimate the number of objects such as polyps, clusters, or cells in medical images.
- **Instruction-tuned VQA**: Answer natural language queries that require multimodal image understanding.


This model is designed for **research purposes**, particularly in **medical vision-language modeling**, and should not be used directly for clinical diagnosis.
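
For illustration, prompts for the three tasks might be phrased as follows; the wording is not prescriptive, and these examples are not drawn from the training data.

```python
# Example natural-language prompts for the three supported tasks (illustrative only).
example_prompts = [
    "Point to each polyp in the image and return the coordinates as JSON.",  # pointing
    "Return bounding boxes for every polyp in the image.",                   # detection
    "Count the number of sperm cells visible in this image.",                # counting
]
```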


---


## 🚀 How to Use


```python
import torch
from PIL import Image

from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

base_model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(base_model_id)
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "SimulaMet/PointDetectCount-Qwen2.5-VL-7B-LoRA")

image = Image.open("example.jpg").convert("RGB")
prompt = "Return bounding boxes for each polyp in the image and the total count."

# Wrap the prompt in the chat template so the processor inserts the image tokens.
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```
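
The answer is returned as JSON embedded in the generated text. Continuing from the snippet above, a minimal way to pull out and parse that JSON (real outputs can occasionally be malformed; see Limitations) is:

```python
import json
import re

decoded = processor.batch_decode(outputs, skip_special_tokens=True)[0]

# Extract the first {...} block from the generated text and parse it.
match = re.search(r"\{.*\}", decoded, flags=re.DOTALL)
if match is None:
    print("No JSON object found in the output.")
else:
    try:
        result = json.loads(match.group(0))
        print("count:", result.get("count"))
        print("bbox:", result.get("bbox"))
        print("points:", result.get("points"))
    except json.JSONDecodeError:
        print("Model returned malformed JSON:", match.group(0))
```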


---


## 📊 Training Details


- **Fine-tuning method:** [LoRA](https://arxiv.org/abs/2106.09685) (`rank=16`; see the configuration sketch after this list)
- **Frozen components:** Vision encoder (ViT)
- **Trained components:** LLM layers (excluding the final LM head)
- **Loss function:** Language modeling loss (cross-entropy over tokens)
- **Format:** Instruction → JSON response (`{"bbox": [...], "count": n, "points": [...]}`)
- **Hardware:** Single NVIDIA A100 (80 GB)
- **Epochs:** 5
- **Batch size:** 4 (with gradient accumulation)
- **Learning rate:** 2e-4
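
A minimal `peft` setup consistent with these details might look as follows. Only the rank and the frozen vision encoder are stated above; the LoRA alpha, dropout, and target modules below are assumptions for illustration, not released hyperparameters.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

# LoRA configuration matching the reported rank=16.
# lora_alpha, lora_dropout, and target_modules are assumed values, not taken from the paper.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed: LLM attention projections
    task_type="CAUSAL_LM",
)

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

# Keep the vision encoder (ViT) frozen, as described above.
for name, param in base.named_parameters():
    if "visual" in name:
        param.requires_grad = False

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```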


---


## 📁 Repository Structure


- `create_datasetJSON.py`: Converts raw annotations into instruction-response format
- `evaluate_qwen.py`: Parses model outputs and evaluates them against the ground truth
- `MedMultiPoints-images/`: Folder containing the training/validation images


---


## 🧪 Evaluation


Each model output is parsed to extract:
- Bounding box coordinates
- Point coordinates
- Object count


The parsed outputs are compared against the ground truth for each modality (GI tract, sperm, clusters, etc.). Detection is scored with precision and recall, counting with mean absolute error, and pointing with proximity-based scores.
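
As a rough illustration of the counting and pointing metrics (a sketch only, not the code in `evaluate_qwen.py`; the distance threshold and matching rule are assumptions):

```python
import math

def count_mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth object counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)

def point_hit_rate(pred_points, true_points, threshold=50.0):
    """Fraction of ground-truth points with a predicted point within `threshold` pixels
    (nearest-neighbour check; the paper's exact matching criterion may differ)."""
    if not true_points:
        return 0.0
    hits = sum(
        1 for tx, ty in true_points
        if any(math.hypot(px - tx, py - ty) <= threshold for px, py in pred_points)
    )
    return hits / len(true_points)
```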


---


## 🛑 Limitations


- Trained only on limited domains (GI endoscopy, microscopy).
- Not certified for real-world clinical use.
- The output format depends on correct JSON generation; parsing may fail on malformed outputs.


---


## 📚 Citation


```bibtex
@article{Gautam2025May,
  author  = {Gautam, Sushant and Riegler, Michael A. and Halvorsen, Pål},
  title   = {Point, Detect, Count: Multi-Task Medical Image Understanding with Instruction-Tuned Vision-Language Models},
  journal = {arXiv},
  year    = {2025},
  month   = {may},
  eprint  = {2505.16647},
  doi     = {10.48550/arXiv.2505.16647}
}
```


---


## 🤝 Acknowledgements


Developed by researchers at **SimulaMet**, **Simula Research Laboratory**, and **OsloMet**, as part of ongoing efforts to enhance **instruction-tuned medical VLMs** for robust multimodal reasoning.