MadriMed-VL-2B

A 2B-parameter multimodal medical vision-language model trained for medical image understanding, radiology report assistance, clinical visual question answering, and medical text reasoning.


📊 Benchmarks

Vision-Language (Medical Image Dataset evaluation)


| Benchmark | MadriMed-VL (2B) | MedGemma (4B) | Gap (pp) |
| --- | --- | --- | --- |
| SLAKE | 65.7% | 72.3% | -6.6 |
| VQA-RAD | 43.09% | 49.9% | -6.81 |
| MedXpertQA-MM | 21.05% | 18.8% | +2.25 |

Key Result: On MedXpertQA-MM, a USMLE-level multimodal benchmark, MadriMed-VL-2B outperforms Google's 4B MedGemma (21.05% vs. 18.8%) despite being half the size. This points to clinical reasoning transfer from focused radiology VQA fine-tuning.
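
The Gap column is a plain difference in percentage points (MadriMed-VL minus MedGemma). A minimal sketch that recomputes it from the scores in the table; the dictionary layout and the `gap_pp` helper are illustrative names, not part of the release.

```python
# Scores copied from the vision-language table, in percent:
# (MadriMed-VL 2B, MedGemma 4B)
SCORES = {
    "SLAKE": (65.7, 72.3),
    "VQA-RAD": (43.09, 49.9),
    "MedXpertQA-MM": (21.05, 18.8),
}


def gap_pp(ours: float, baseline: float) -> float:
    """Difference in percentage points, rounded to two decimals."""
    return round(ours - baseline, 2)


for name, (ours, baseline) in SCORES.items():
    print(f"{name}: {gap_pp(ours, baseline):+.2f} pp")
```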


Text-Only (Zero-Shot)

⚠️ Important: MadriMed-VL-2B was not trained on text-only medical datasets. These scores reflect zero-shot generalization from visual fine-tuning only.

| Benchmark | MadriMed-VL (2B) | OpenBioLLM (8B) | MedGemma (4B) | Meditron (2B) |
| --- | --- | --- | --- | --- |
| PubMedQA | 59.90% | 74.1% | 73.4% | 58.2% |
| USMLE-4 (MedQA) | 45.40% | 57.7% | 49.8% | 34.1% |
| MedXpertQA (Text) | 11.02% | 10.7% | 14.2% | N/A |

🎯 What This Model Does

MadriMed-VL-2B is optimized for medical images such as X-rays, CT scans, and MRI:

  • 🔬 Radiology VQA: answer questions about medical images
  • 🫁 Modality/Plane Detection: identify imaging type and orientation
  • 🏥 Organ/Position Recognition: localize anatomical structures
  • ⚕️ Binary Screening: yes/no presence/absence questions
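
The Quick Start further down runs a text-only prompt with the `images` argument commented out. For the image tasks listed here, a request might look like the sketch below. It assumes the processor accepts the standard Transformers chat format with an `{"type": "image", ...}` entry, as the Qwen3-VL base model does; `build_vqa_messages`, `ask`, and the file name `chest_xray.png` are hypothetical names used only for illustration.

```python
from typing import Any


def build_vqa_messages(image_path: str, question: str) -> list[dict[str, Any]]:
    """Build a single-turn chat with one image and one question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def ask(model, processor, image_path: str, question: str, max_new_tokens: int = 128) -> str:
    """Run one image question through an already loaded model and processor."""
    import torch  # imported lazily so build_vqa_messages has no heavy dependency

    messages = build_vqa_messages(image_path, question)
    # Assumption: the processor's chat template loads the image and tokenizes in one call.
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    with torch.inference_mode():
        generated = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = generated[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
```

With `model` and `processor` loaded as in the Quick Start, a modality question would be `ask(model, processor, "chest_xray.png", "What imaging modality is this?")`.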

What It Does NOT Do (Safely)

  • ❌ Diagnosis: do not use for patient diagnosis
  • ❌ Treatment Recommendations: not trained on clinical guidelines
  • ❌ Pathology Detection: abnormality recognition is limited

🚀 Quick Start

Installation

pip install transformers torch Pillow

Run the model directly

import torch
import re
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

MODEL_PATH = "madrisight/MadriMed-VL-2B"
print(f"Loading {MODEL_PATH} for MedQA inference...")

# -----------------------
# Load model + processor
# -----------------------
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    device_map="mps",  # Apple Silicon GPU; use "cuda" or "cpu" on other hardware
    trust_remote_code=True
)
model.eval()

# -----------------------
# Output cleaner
# -----------------------
def clean_output(text: str) -> str:
    text = re.sub(r"</?think>", "", text)
    text = re.sub(r"<\|?image\|?>", "", text)
    return text.strip()

prompt = (
    "You are a medical reasoning classification system.\n\n"
    "Your task is to classify each question into the SINGLE most appropriate underlying immune mechanism.\n\n"
    "RULES:\n"
    "- Output ONLY one standardized category label\n"
    "- Choose the most specific textbook mechanism\n"
    "- If multiple are possible, choose the most primary cause\n\n"
    "ALLOWED OUTPUT CATEGORIES (use exact wording):\n"
    "- CD4+ T-cell deficiency\n"
    "- CD8+ T-cell dysfunction\n"
    "- Neutropenia\n"
    "- Defective phagocytic oxidative burst\n"
    "- Humoral immunity deficiency (B-cell dysfunction)\n"
    "- Complement deficiency\n"
    "- Bone marrow suppression\n"
    "- Cytokine signaling defect\n"
    "- Hypersensitivity reaction Type I\n"
    "- Hypersensitivity reaction Type II\n"
    "- Hypersensitivity reaction Type III\n"
    "- Hypersensitivity reaction Type IV\n"
    "- General immunosuppression (only if no specific mechanism applies)\n\n"
    "OUTPUT FORMAT:\n"
    "Reasoning: <brief reasoning>\n"
    "Final Answer: <category>\n\n"
    "Question:\n"
    "A patient with HIV develops progressive loss of helper T-cell function leading to opportunistic infections."
)

messages = [
    {
        "role": "system",
        "content": "You are an expert medical AI. You must deeply analyze the question and must provide your final answer."
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": prompt}]
    }
]

# -----------------------
# Inference
# -----------------------
with torch.inference_mode():

    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = processor(
        text=text,
        # images=image,
        return_tensors="pt",
        padding=True,
        truncation=True,
    ).to(model.device)  # keep inputs on the same device as the model

    generated_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        repetition_penalty=1.05
    )

    output_text = processor.batch_decode(
        generated_ids[:, inputs.input_ids.shape[1]:],
        skip_special_tokens=True
    )[0]

output_text = clean_output(output_text)
print(output_text)

Example output:

Reasoning: The loss of helper T-cell function in HIV leads to impaired immune response, particularly affecting CD4+ T-cells which are crucial for orchestrating immune responses. This aligns with CD4+ T-cell deficiency as it directly impacts the ability of these cells to support other immune cells.
Final Answer: CD4+ T-cell deficiency
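
Because the prompt asks for a fixed `Final Answer: <category>` line, the label can be extracted and checked against the allowed list before it is used anywhere. The following is a minimal sketch, not part of the released code: `parse_final_answer` is a hypothetical helper, and the category list is copied from the prompt with the parenthetical instruction on the last entry dropped.

```python
import re
from typing import Optional

# Labels copied from the prompt's "ALLOWED OUTPUT CATEGORIES" list.
ALLOWED_CATEGORIES = [
    "CD4+ T-cell deficiency",
    "CD8+ T-cell dysfunction",
    "Neutropenia",
    "Defective phagocytic oxidative burst",
    "Humoral immunity deficiency (B-cell dysfunction)",
    "Complement deficiency",
    "Bone marrow suppression",
    "Cytokine signaling defect",
    "Hypersensitivity reaction Type I",
    "Hypersensitivity reaction Type II",
    "Hypersensitivity reaction Type III",
    "Hypersensitivity reaction Type IV",
    "General immunosuppression",
]


def parse_final_answer(output_text: str) -> Optional[str]:
    """Return the allowed category on the 'Final Answer:' line, or None."""
    match = re.search(r"^Final Answer:\s*(.+)$", output_text, flags=re.MULTILINE)
    if match is None:
        return None
    answer = match.group(1).strip()
    # Case-insensitive exact match; anything outside the list is rejected.
    for category in ALLOWED_CATEGORIES:
        if answer.lower() == category.lower():
            return category
    return None
```

Returning `None` for anything outside the list makes an off-format or invented label easy to catch instead of passing it along silently.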

⚠️ Limitations & Safety

Known Issues

| Issue | Severity | Details |
| --- | --- | --- |
| Output corruption | 🔴 High | Rare tokenization artifacts (non-ASCII) on very long sequences |
| Hallucination | 🟡 Medium | Invented findings |
| Positive bias | 🟡 Medium | Tendency to answer "yes" on presence questions |
| Pathology substitution | 🟡 Medium | Confuses similar conditions (subdural ↔ subarachnoid) |
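
One way to quantify the positive bias on a labeled yes/no set is to compare how often the model answers "yes" with how often "yes" is actually correct. The sketch below assumes predictions and labels have already been reduced to short yes/no strings; `yes_rate_report` is a hypothetical helper, not something shipped with the model.

```python
def yes_rate_report(predictions: list[str], labels: list[str]) -> dict[str, float]:
    """Compare the model's 'yes' rate with the ground-truth 'yes' rate."""
    if len(predictions) != len(labels) or not predictions:
        raise ValueError("predictions and labels must be non-empty and the same length")

    # Normalize case, whitespace, and a trailing period or exclamation mark.
    preds = [p.strip().lower().rstrip(".!") for p in predictions]
    golds = [g.strip().lower().rstrip(".!") for g in labels]
    n = len(preds)

    predicted_yes = sum(p == "yes" for p in preds) / n
    actual_yes = sum(g == "yes" for g in golds) / n

    # False-positive rate: share of true "no" cases that the model answered "yes".
    on_negatives = [p for p, g in zip(preds, golds) if g == "no"]
    fpr = sum(p == "yes" for p in on_negatives) / len(on_negatives) if on_negatives else 0.0

    return {
        "predicted_yes": predicted_yes,
        "actual_yes": actual_yes,
        "false_positive_rate": fpr,
    }
```

A `predicted_yes` value well above `actual_yes`, together with a high false-positive rate, would be consistent with the bias described in the table.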

Safety Guidelines

  • NOT for clinical diagnosis
  • NOT for treatment decisions
  • Use as educational aid only
  • Always verify binary (yes/no) answers, given the positive bias noted above
  • Check outputs for corruption (non-ASCII artifacts)
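
The corruption check can be automated with a simple heuristic that flags outputs containing an unusual share of unexpected non-ASCII or control characters. This is only a sketch under stated assumptions: the allow-list of legitimate symbols and the 5% threshold are illustrative choices, not values from the authors.

```python
import unicodedata

# Non-ASCII characters that are legitimate in medical text (illustrative list).
ALLOWED_NON_ASCII = set("°±µ≤≥×→↔αβγ")


def corruption_ratio(text: str) -> float:
    """Fraction of characters that look like decoding artifacts."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace control characters are fine
        category = unicodedata.category(ch)
        is_control_or_unassigned = category in {"Cc", "Cn", "Co"}
        is_unexpected_non_ascii = ord(ch) > 127 and ch not in ALLOWED_NON_ASCII
        if is_control_or_unassigned or is_unexpected_non_ascii:
            suspicious += 1
    return suspicious / len(text)


def looks_corrupted(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose share of suspicious characters exceeds the threshold."""
    return corruption_ratio(text) > threshold
```

Flagged outputs are best regenerated or discarded rather than repaired, since the surrounding text may be unreliable as well.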

🔬 Technical Details

Training Configuration

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen3-VL-2B-Thinking |
| Training data | medmax |
| Fine-tuning type | LoRA SFT |
| Precision | bfloat16 |
| Hardware | Single Apple M4 Pro (MPS backend), using a heavily optimized TRL (https://github.com/krrish-v/trlmps) |

🙏 Acknowledgments


📄 Citation

If you use this model in research, please cite:

@software{madrimedvl2b,
  title = {MadriMed-VL-2B: A Compact Multimodal Medical Vision-Language Model},
  author = {Madrisight},
  year = {2026},
  url = {https://huggingface.co/madrisight/MadriMed-VL-2B}
}

Disclaimer: This model is provided for research and educational purposes only. It is not FDA-approved, not clinically validated, and must not be used for patient care without expert human oversight. The authors assume no liability for clinical use.
