Revert previous PR
#3 opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,83 +1,120 @@

---
- pipeline_tag: zero-shot-object-detection
library_name: transformers
license: apache-2.0
---

- # Knowledge to Sight (K2Sight)
-
- This
-
- - **Project Page**: https://lijunrio.github.io/K2Sight/
- - **Code**: https://github.com/LijunRio/AG-KD
- - **Demo**: https://huggingface.co/spaces/RioJune/AG-KD

- ## Usage
-
-
- ```bash
- pip install transformers Pillow
- # For full project dependencies and further setup, refer to the official GitHub repository.
- ```
-

```python
import torch
from PIL import Image
-

- # Load model and processor
- model_id = "RioJune/AG-KD"
- model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
- processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

-
-
-

-
- inputs = processor(images=image, text=instruction, return_tensors="pt")

-
- with torch.no_grad():
-     generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

-
-
-
-

-
-
```

- For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/LijunRio/AG-KD).

- ## Citation

- If you

- ```
@article{li2025enhancing,
-
-
-
-
}
```
---
library_name: transformers
+ pipeline_tag: image-text-to-text
+ base_model:
+ - microsoft/Florence-2-base-ft
license: apache-2.0
+ tags:
+ - vision-language
+ - abnormality-grounding
+ - medical-imaging
+ - knowledge-distillation
+ - multimodal
+ model-index:
+ - name: AG-KD
+   results:
+   - task:
+       type: Abnormality Grounding
+       name: Grounding
+     metrics:
+     - name: none
+       type: none
+       value: null
---

+ # 🚀 Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions

+ This repository provides the code and model weights for our paper:
+ **[Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions](https://arxiv.org/abs/2503.03278)**

+ 🧪 Explore our live demo on [Hugging Face Spaces](https://huggingface.co/spaces/Anonymous-AC/AG-KD-anonymous-Demo) to see the model in action!

+ ## 📌 Overview

+ **AG-KD (Abnormality Grounding with Knowledge Descriptions)** is a compact 0.23B-parameter vision-language model designed for abnormality grounding in medical images. Despite its small size, it delivers performance **comparable to 7B state-of-the-art medical VLMs**. Our approach integrates **structured knowledge descriptions** into the prompt, enhancing the model’s ability to localize medical abnormalities in images; the sketch below shows the resulting prompt format.
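To make the prompt format concrete, here is a minimal sketch of how a knowledge description is combined with a target phrase. The template string is the one used in the usage example below; the `build_prompt` helper itself is illustrative only, not part of the released API.

```python
# Illustrative only: AG-KD's grounding prompt folds a knowledge description
# ("definition") into the target phrase, after a Florence-2-style task token.
def build_prompt(target: str, definition: str) -> str:
    return (
        "<CAPTION_TO_PHRASE_GROUNDING>"
        f"Locate the phrases in the caption: {target} means {definition}."
    )

# Finding taken from the usage example below.
print(build_prompt(
    "pulmonary fibrosis",
    "Scarring of the lung tissue creating a dense fibrous appearance.",
))
```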
+ ## 💻 How to Use
+
+ ### Simple Example
+
+ For detailed examples, visit: [AG-KD GitHub Repository](https://github.com/LijunRio/AG-KD)

```python
import torch
+ import requests
+ from io import BytesIO
from PIL import Image
+ import numpy as np
+ import albumentations as A
+ from transformers import AutoModelForCausalLM, AutoProcessor

+ def apply_transform(image, size=512):
+     transform = A.Compose([
+         A.LongestMaxSize(max_size=size),
+         A.PadIfNeeded(min_height=size, min_width=size, border_mode=0, value=(0, 0, 0)),
+         A.Resize(height=size, width=size)
+     ])
+     return transform(image=np.array(image))["image"]

+ def run_simple(image_url, target, definition, model, processor, device):
+     prompt = f"<CAPTION_TO_PHRASE_GROUNDING>Locate the phrases in the caption: {target} means {definition}."
+     response = requests.get(image_url)
+     image = Image.open(BytesIO(response.content)).convert("RGB")
+     np_image = apply_transform(image)

+     inputs = processor(text=[prompt], images=[np_image], return_tensors="pt", padding=True).to(device)

+     outputs = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
+         output_scores=True,
+         return_dict_in_generate=True
    )

+     transition_scores = model.compute_transition_scores(outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False)
+     generated_text = processor.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
+
+     output_len = np.sum(transition_scores.cpu().numpy() < 0, axis=1)
+     length_penalty = model.generation_config.length_penalty
+     score = transition_scores.cpu().sum(axis=1) / (output_len**length_penalty)
+     prob = np.exp(score.cpu().numpy())

+     print(f"\n[IMAGE URL] {image_url}")
+     print(f"[TARGET] {target}")
+     print(f"[PROBABILITY] {prob[0] * 100:.2f}%")
+     print(f"[GENERATED TEXT]\n{generated_text}")
+
+ if __name__ == "__main__":
+     image_url = "https://huggingface.co/spaces/RioJune/AG-KD/resolve/main/examples/f1eb2216d773ced6330b1f31e18f04f8.png"
+     target = "pulmonary fibrosis"
+     definition = "Scarring of the lung tissue creating a dense fibrous appearance."
+
+     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+     model_name = "RioJune/AG-KD"
+
+     model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)
+     processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
+
+     run_simple(image_url, target, definition, model, processor, device)
```
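The printed `[PROBABILITY]` is the beam's length-normalized log-score mapped back through `exp`, so it can be read as a rough confidence for the generated grounding. The generated text itself encodes box coordinates as Florence-2-style location tokens; since AG-KD builds on microsoft/Florence-2-base-ft, one plausible way to decode them is the Florence-2 processor's `post_process_generation` helper. The sketch below assumes AG-KD's custom processor keeps that API (an assumption, not something this card documents) and reuses `generated_text` and `np_image` from `run_simple`:

```python
# Sketch under one assumption: the AG-KD processor, loaded with
# trust_remote_code=True from a Florence-2 base, still exposes Florence-2's
# post_process_generation. `generated_text` and `np_image` are the variables
# computed inside run_simple above.
task = "<CAPTION_TO_PHRASE_GROUNDING>"
parsed = processor.post_process_generation(
    generated_text,
    task=task,
    image_size=(np_image.shape[1], np_image.shape[0]),  # (width, height)
)
# Florence-2 grounding output: {'bboxes': [[x1, y1, x2, y2], ...], 'labels': [...]}
for box, label in zip(parsed[task]["bboxes"], parsed[task]["labels"]):
    x1, y1, x2, y2 = box
    print(f"{label}: ({x1:.1f}, {y1:.1f}) to ({x2:.1f}, {y2:.1f})")
```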

+ ## 📖 Citation

+ If you use our work, please cite:

+ ```
@article{li2025enhancing,
+   title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
+   author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
+   journal={arXiv preprint arXiv:2503.03278},
+   year={2025}
}
```