---
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- microsoft/Florence-2-base-ft
license: apache-2.0
tags:
- vision-language
- abnormality-grounding
- medical-imaging
- knowledge-distillation
- multimodal
model-index:
- name: AG-KD
  results:
  - task:
      type: Abnormality Grounding
      name: Grounding
    metrics:
    - name: none
      type: none
      value: null
---

# πŸš€ Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions

This repository provides the code and model weights for our paper:
**[Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions](https://arxiv.org/abs/2503.03278)**

πŸ§ͺ Explore our live demo on [Hugging Face Spaces](https://huggingface.co/spaces/Anonymous-AC/AG-KD-anonymous-Demo) to see the model in action!

## πŸ“Œ Overview

**AG-KD (Abnormality Grounding with Knowledge Descriptions)** is a compact 0.23B-parameter vision-language model for abnormality grounding in medical images. Despite its small size, it delivers performance **comparable to 7B state-of-the-art medical VLMs**. Our approach integrates **structured knowledge descriptions** into the prompt, strengthening the model's ability to localize abnormalities in medical images.

## πŸ’» How to Use

### Simple Example

For detailed examples, visit the [AG-KD GitHub Repository](https://github.com/LijunRio/AG-KD).

```python
import torch
import requests
from io import BytesIO
from PIL import Image
import numpy as np
import albumentations as A
from transformers import AutoModelForCausalLM, AutoProcessor


def apply_transform(image, size=512):
    # Resize the longest side to `size`, pad to a square, then resize to size x size.
    transform = A.Compose([
        A.LongestMaxSize(max_size=size),
        A.PadIfNeeded(min_height=size, min_width=size, border_mode=0, value=(0, 0, 0)),
        A.Resize(height=size, width=size),
    ])
    return transform(image=np.array(image))["image"]


def run_simple(image_url, target, definition, model, processor, device):
    # Knowledge-description prompt: the target abnormality plus its definition.
    prompt = f"Locate the phrases in the caption: {target} means {definition}."
    response = requests.get(image_url)
    image = Image.open(BytesIO(response.content)).convert("RGB")
    np_image = apply_transform(image)

    inputs = processor(text=[prompt], images=[np_image], return_tensors="pt", padding=True).to(device)
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
        output_scores=True,
        return_dict_in_generate=True,
    )

    # Convert per-step transition scores into a length-normalized sequence probability.
    transition_scores = model.compute_transition_scores(
        outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
    )
    generated_text = processor.batch_decode(outputs.sequences, skip_special_tokens=False)[0]

    output_len = np.sum(transition_scores.cpu().numpy() < 0, axis=1)
    length_penalty = model.generation_config.length_penalty
    score = transition_scores.cpu().sum(axis=1) / (output_len ** length_penalty)
    prob = np.exp(score.numpy())

    print(f"\n[IMAGE URL] {image_url}")
    print(f"[TARGET] {target}")
    print(f"[PROBABILITY] {prob[0] * 100:.2f}%")
    print(f"[GENERATED TEXT]\n{generated_text}")


if __name__ == "__main__":
    image_url = "https://huggingface.co/spaces/RioJune/AG-KD/resolve/main/examples/f1eb2216d773ced6330b1f31e18f04f8.png"
    target = "pulmonary fibrosis"
    definition = "Scarring of the lung tissue creating a dense fibrous appearance."

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = "RioJune/AG-KD"
    model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)
    processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

    run_simple(image_url, target, definition, model, processor, device)
```

## πŸ“– Citation

If you use our work, please cite:

```bibtex
@article{li2025enhancing,
  title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
  author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
  journal={arXiv preprint arXiv:2503.03278},
  year={2025}
}
```
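Since the model builds on Florence-2, the generated text typically encodes box coordinates as `<loc_*>` tokens. Below is a minimal sketch for recovering pixel-space boxes from such output; it assumes the standard Florence-2 convention of 1,000 coordinate bins and four tokens per box in `(x1, y1, x2, y2)` order, and `parse_loc_boxes` is a hypothetical helper, not part of the released code:

```python
import re


def parse_loc_boxes(text, image_size=512, bins=1000):
    # Assumption: Florence-2-style quantization, where each <loc_v> token maps
    # a coordinate back to roughly (v / bins) * image_size pixels, and every
    # consecutive group of four tokens forms one (x1, y1, x2, y2) box.
    vals = [int(v) for v in re.findall(r"<loc_(\d+)>", text)]
    boxes = []
    for i in range(0, len(vals) - 3, 4):
        x1, y1, x2, y2 = vals[i:i + 4]
        boxes.append(tuple(round(v / bins * image_size, 1) for v in (x1, y1, x2, y2)))
    return boxes


# Example on a Florence-2-style output string:
boxes = parse_loc_boxes("pulmonary fibrosis<loc_100><loc_200><loc_300><loc_400>")
print(boxes)  # one box in 512x512 pixel coordinates
```

Use `image_size=512` when the image was preprocessed with `apply_transform` as in the example above, since the boxes are predicted in the transformed image's coordinate frame.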