Files changed (1)
  1. README.md +46 -83
README.md CHANGED
@@ -1,120 +1,83 @@
  ---
  library_name: transformers
- pipeline_tag: image-text-to-text
- base_model:
- - microsoft/Florence-2-base-ft
  license: apache-2.0
- tags:
- - vision-language
- - abnormality-grounding
- - medical-imaging
- - knowledge-distillation
- - multimodal
- model-index:
- - name: AG-KD
-   results:
-   - task:
-       type: Abnormality Grounding
-       name: Grounding
-     metrics:
-     - name: none
-       type: none
-       value: null
  ---

- # 🚀 Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions

- This repository provides the code and model weights for our paper:
- **[Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions](https://arxiv.org/abs/2503.03278)**

- 🧪 Explore our live demo on [Hugging Face Spaces](https://huggingface.co/spaces/Anonymous-AC/AG-KD-anonymous-Demo) to see the model in action!

- ## 📌 Overview

- **AG-KD (Abnormality Grounding with Knowledge Descriptions)** is a compact 0.23B vision-language model designed for abnormality grounding in medical images. Despite its small size, it delivers performance **comparable to 7B state-of-the-art medical VLMs**. Our approach integrates **structured knowledge descriptions** into prompts, enhancing the model’s ability to localize medical abnormalities in images.

- ## 💻 How to Use
-
- ### Simple Example
-
- For detailed examples, visit: [AG-KD GitHub Repository](https://github.com/LijunRio/AG-KD)

  ```python
-
  import torch
- import requests
- from io import BytesIO
  from PIL import Image
- import numpy as np
- import albumentations as A
- from transformers import AutoModelForCausalLM, AutoProcessor

- def apply_transform(image, size=512):
-     transform = A.Compose([
-         A.LongestMaxSize(max_size=size),
-         A.PadIfNeeded(min_height=size, min_width=size, border_mode=0, value=(0,0,0)),
-         A.Resize(height=size, width=size)
-     ])
-     return transform(image=np.array(image))["image"]

- def run_simple(image_url, target, definition, model, processor, device):
-     prompt = f"<CAPTION_TO_PHRASE_GROUNDING>Locate the phrases in the caption: {target} means {definition}."
-     response = requests.get(image_url)
-     image = Image.open(BytesIO(response.content)).convert("RGB")
-     np_image = apply_transform(image)

-     inputs = processor(text=[prompt], images=[np_image], return_tensors="pt", padding=True).to(device)

-     outputs = model.generate(
          input_ids=inputs["input_ids"],
          pixel_values=inputs["pixel_values"],
          max_new_tokens=1024,
          num_beams=3,
-         output_scores=True,
-         return_dict_in_generate=True
      )

-     transition_scores = model.compute_transition_scores(outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False)
-     generated_text = processor.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
-
-     output_len = np.sum(transition_scores.cpu().numpy() < 0, axis=1)
-     length_penalty = model.generation_config.length_penalty
-     score = transition_scores.cpu().sum(axis=1) / (output_len**length_penalty)
-     prob = np.exp(score.cpu().numpy())

-     print(f"\n[IMAGE URL] {image_url}")
-     print(f"[TARGET] {target}")
-     print(f"[PROBABILITY] {prob[0] * 100:.2f}%")
-     print(f"[GENERATED TEXT]\n{generated_text}")
-
- if __name__ == "__main__":
-     image_url = "https://huggingface.co/spaces/RioJune/AG-KD/resolve/main/examples/f1eb2216d773ced6330b1f31e18f04f8.png"
-     target = "pulmonary fibrosis"
-     definition = "Scarring of the lung tissue creating a dense fibrous appearance."
-
-     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-     model_name = "RioJune/AG-KD"
-
-     model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)
-     processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
-
-     run_simple(image_url, target, definition, model, processor, device)
  ```

- ## 📖 Citation

- If you use our work, please cite:

- ```
  @article{li2025enhancing,
-   title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
-   author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
-   journal={arXiv preprint arXiv:2503.03278},
-   year={2025}
- }
  ```
 
  ---
+ pipeline_tag: zero-shot-object-detection
  library_name: transformers
  license: apache-2.0
  ---

+ # Knowledge to Sight (K2Sight)

+ **Knowledge to Sight (K2Sight)** is a framework for abnormality grounding in medical images: localizing clinical findings from textual descriptions. Unlike generalist vision-language models (VLMs), which often struggle with domain-specific medical terminology, K2Sight introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location, distilled from domain ontologies.

+ These attribute descriptions guide region-text alignment during training, so compact models (0.23B and 2B parameters) can be trained data-efficiently on only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, K2Sight models perform on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in $mAP_{50}$.
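
+ For illustration, the earlier revision of this card built its grounding prompt by pairing a finding name with a short knowledge description. A minimal sketch of that template (the exact task token and wording may differ in the current release):

+ ```python
+ # Sketch only: prompt template taken from the previous revision of this model card.
+ target = "pulmonary fibrosis"
+ definition = "Scarring of the lung tissue creating a dense fibrous appearance."
+ prompt = (
+     "<CAPTION_TO_PHRASE_GROUNDING>"
+     f"Locate the phrases in the caption: {target} means {definition}."
+ )
+ ```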
 
+ - **Paper**: [Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding](https://huggingface.co/papers/2508.04572)
+ - **Project Page**: https://lijunrio.github.io/K2Sight/
+ - **Code**: https://github.com/LijunRio/AG-KD
+ - **Demo**: https://huggingface.co/spaces/RioJune/AG-KD

+ ## Usage

+ The model can be used directly for zero-shot abnormality grounding in medical images.

+ First, install the necessary dependencies:

+ ```bash
+ pip install transformers Pillow
+ # For full project dependencies and further setup, refer to the official GitHub repository.
+ ```

+ Here is a basic example of how to use the model for abnormality grounding:

  ```python
  import torch
  from PIL import Image
+ from transformers import AutoModel, AutoProcessor

+ # Load model and processor
+ model_id = "RioJune/AG-KD"
+ model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

+ # Example image: replace with the path to your own medical image
+ image = Image.open("path/to/your/medical_image.png").convert("RGB")

+ # Example instruction for abnormality grounding.
+ # The model expects instructions to start with a task token such as <OD> (object detection).
+ instruction = "<OD> Please localize the lesion."

+ # Prepare inputs
+ inputs = processor(images=image, text=instruction, return_tensors="pt")

+ # Generate output
+ with torch.no_grad():
+     generated_ids = model.generate(
          input_ids=inputs["input_ids"],
          pixel_values=inputs["pixel_values"],
          max_new_tokens=1024,
          num_beams=3,
      )

+ # Decode and print the result; keep special tokens, since they carry the box coordinates
+ output_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
+ print(f"Instruction: {instruction}")
+ print(f"Detected abnormality: {output_text}")

+ # The output text will contain bounding-box coordinates (e.g., <loc_000><loc_001><loc_002><loc_003>)
+ # and a description of the localized finding.
  ```
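
+ The `<loc_xxx>` tokens encode box corners as integer bins. Below is a minimal parsing sketch assuming the Florence-2-style convention of 1,000 bins normalized to the image size (the previous revision of this card listed microsoft/Florence-2-base-ft as the base model); if the bundled processor follows the Florence-2 API, its `post_process_generation` method may perform this mapping for you:

+ ```python
+ import re

+ def parse_loc_boxes(text, image_width, image_height):
+     # Assumes <loc_x1><loc_y1><loc_x2><loc_y2> quadruples with bins in [0, 999].
+     bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
+     boxes = []
+     for x1, y1, x2, y2 in zip(bins[0::4], bins[1::4], bins[2::4], bins[3::4]):
+         boxes.append((x1 / 1000 * image_width, y1 / 1000 * image_height,
+                       x2 / 1000 * image_width, y2 / 1000 * image_height))
+     return boxes

+ # Usage with the variables from the example above (image.size is (width, height)):
+ print(parse_loc_boxes(output_text, *image.size))
+ ```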

+ For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/LijunRio/AG-KD).

+ ## Citation

+ If you find our work helpful or inspiring, please cite our paper:

+ ```bibtex
  @article{li2025enhancing,
+   title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
+   author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
+   journal={arXiv preprint arXiv:2503.03278},
+   year={2025}
  }
  ```