---
pipeline_tag: zero-shot-object-detection
library_name: transformers
license: apache-2.0
---

# Knowledge to Sight (K2Sight)

**Knowledge to Sight (K2Sight)** is a novel framework for grounding abnormalities in medical images, i.e., localizing clinical findings from textual descriptions. Unlike generalist Vision-Language Models (VLMs), which often struggle with domain-specific medical terminology, K2Sight introduces structured semantic supervision: it decomposes clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location, distilled from domain ontologies.

These attribute descriptions guide region-text alignment during training, allowing compact models (0.23B and 2B parameters) to be trained with only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, K2Sight models perform on par with or better than 7B+ medical VLMs, with up to a 9.82% improvement in $mAP_{50}$.
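
For intuition, a decomposed clinical concept might look like the sketch below. This is purely illustrative: in K2Sight the attribute definitions are distilled from domain ontologies, and the finding and attribute values here are hypothetical.

```python
# Illustrative sketch only: a clinical concept decomposed into visual attributes.
# K2Sight distills the actual attribute definitions from domain ontologies;
# the finding and values below are hypothetical.
knowledge = {
    "finding": "pneumothorax",
    "shape": "crescentic lucency without lung markings",
    "density": "hyperlucent (air density)",
    "location": "pleural space, often apical on upright radiographs",
}

# One way such attributes could be rendered into a grounding prompt:
prompt = f"{knowledge['finding']}: " + "; ".join(
    value for key, value in knowledge.items() if key != "finding"
)
print(prompt)
# pneumothorax: crescentic lucency without lung markings; hyperlucent (air density); ...
```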

- **Paper**: [Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding](https://huggingface.co/papers/2508.04572)
- **Project Page**: https://lijunrio.github.io/K2Sight/
- **Code**: https://github.com/LijunRio/AG-KD
- **Demo**: https://huggingface.co/spaces/RioJune/AG-KD

## Usage

The model can be used directly for zero-shot abnormality grounding in medical images.

First, install the necessary dependencies:

```bash
pip install transformers Pillow
# For full project dependencies and further setup, refer to the official GitHub repository.
```
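
If you also need the training and evaluation code, a typical setup would be to clone the repository. The steps below are a sketch; whether a `requirements.txt` is provided is an assumption, so follow the repository README for the exact instructions.

```bash
# Sketch of a full setup; see the AG-KD repository README for the exact steps.
git clone https://github.com/LijunRio/AG-KD.git
cd AG-KD
pip install -r requirements.txt  # assumes the repository ships a requirements.txt
```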

Here's a basic example of how to use the model for abnormality grounding:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load model and processor (trust_remote_code is required for the custom model code)
model_id = "RioJune/AG-KD"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example image (replace with the path to your medical image)
image = Image.open("path/to/your/medical_image.png").convert("RGB")

# Example instruction for abnormality grounding.
# Instructions must start with a task token such as <OD> (object detection).
instruction = "<OD> Please localize the lesion. "

# Prepare inputs
inputs = processor(images=image, text=instruction, return_tensors="pt")

# Generate output
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

# Decode the result; keep special tokens, since they encode the box coordinates
output_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(f"Instruction: {instruction}")
print(f"Detected abnormality: {output_text}")

# output_text contains bounding-box location tokens (e.g., <loc_000><loc_001><loc_002><loc_003>)
# followed by a description of the localized finding.
```
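
To turn the location tokens into pixel coordinates, you can parse them from the decoded text. The sketch below assumes a Florence-2-style scheme in which each `<loc_k>` token is a bin index in [0, 999] over the normalized image, emitted in (x1, y1, x2, y2) order; `parse_loc_boxes` is a hypothetical helper, and the exact dequantization (e.g., whether bin centers are used) should be verified against the official repository.

```python
import re

def parse_loc_boxes(text, image_width, image_height, num_bins=1000):
    # Extract all <loc_k> bin indices from the decoded output.
    bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
    boxes = []
    # Group indices into (x1, y1, x2, y2) quadruples and rescale to pixels.
    for i in range(0, len(bins) - 3, 4):
        x1, y1, x2, y2 = bins[i:i + 4]
        boxes.append((
            x1 / num_bins * image_width,
            y1 / num_bins * image_height,
            x2 / num_bins * image_width,
            y2 / num_bins * image_height,
        ))
    return boxes

# image.size is (width, height) for a PIL image.
boxes = parse_loc_boxes(output_text, *image.size)
print(boxes)
```

If the processor follows the Florence-2 API, it may also expose a `post_process_generation` helper that performs this conversion; again, check the official code to confirm.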

For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/LijunRio/AG-KD).

## Citation

If you find our work helpful or inspiring, please cite our paper:

```bibtex
@article{li2025enhancing,
  title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
  author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
  journal={arXiv preprint arXiv:2503.03278},
  year={2025}
}
```