---
pipeline_tag: zero-shot-object-detection
library_name: transformers
license: apache-2.0
---

# Knowledge to Sight (K2Sight)

**Knowledge to Sight (K2Sight)** is a novel framework for grounding abnormalities in medical images, i.e., localizing clinical findings from textual descriptions. Unlike generalist Vision-Language Models (VLMs), which often struggle with domain-specific medical terminology, K2Sight introduces structured semantic supervision: it decomposes clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location, distilled from domain ontologies.

These attribute descriptions guide region-text alignment during training, allowing compact models (0.23B and 2B parameters) to be trained with only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, K2Sight models perform on par with or better than 7B+ medical VLMs, with up to a 9.82% improvement in $mAP_{50}$.
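
For intuition, a decomposed clinical concept might look like the sketch below. This is purely illustrative: in K2Sight the attribute definitions are distilled from domain ontologies, and the finding and attribute values here are hypothetical.

```python
# Illustrative sketch only: a clinical concept decomposed into visual attributes.
# K2Sight distills the actual attribute definitions from domain ontologies;
# the finding and values below are hypothetical.
knowledge = {
    "finding": "pneumothorax",
    "shape": "crescentic lucency without lung markings",
    "density": "hyperlucent (air density)",
    "location": "pleural space, often apical on upright radiographs",
}

# One way such attributes could be rendered into a grounding prompt:
prompt = f"{knowledge['finding']}: " + "; ".join(
    value for key, value in knowledge.items() if key != "finding"
)
print(prompt)
# pneumothorax: crescentic lucency without lung markings; hyperlucent (air density); ...
```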

- **Paper**: [Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding](https://huggingface.co/papers/2508.04572)
- **Project Page**: https://lijunrio.github.io/K2Sight/
- **Code**: https://github.com/LijunRio/AG-KD
- **Demo**: https://huggingface.co/spaces/RioJune/AG-KD

## Usage

The model can be used directly for zero-shot abnormality grounding in medical images.

First, install the necessary dependencies:

```bash
pip install transformers Pillow
# For full project dependencies and further setup, refer to the official GitHub repository.
```
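
If you also need the training and evaluation code, a typical setup would be to clone the repository. The steps below are a sketch; whether a `requirements.txt` is provided is an assumption, so follow the repository README for the exact instructions.

```bash
# Sketch of a full setup; see the AG-KD repository README for the exact steps.
git clone https://github.com/LijunRio/AG-KD.git
cd AG-KD
pip install -r requirements.txt  # assumes the repository ships a requirements.txt
```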

Here's a basic example of how to use the model for abnormality grounding:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load model and processor (trust_remote_code is required for the custom model code)
model_id = "RioJune/AG-KD"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example image (replace with the path to your medical image)
image = Image.open("path/to/your/medical_image.png").convert("RGB")

# Example instruction for abnormality grounding.
# Instructions must start with a task token such as <OD> (object detection).
instruction = "<OD> Please localize the lesion. "

# Prepare inputs
inputs = processor(images=image, text=instruction, return_tensors="pt")

# Generate output
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

# Decode the result; keep special tokens, since they encode the box coordinates
output_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(f"Instruction: {instruction}")
print(f"Detected abnormality: {output_text}")

# output_text contains bounding-box location tokens (e.g., <loc_000><loc_001><loc_002><loc_003>)
# followed by a description of the localized finding.
```
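
To turn the location tokens into pixel coordinates, you can parse them from the decoded text. The sketch below assumes a Florence-2-style scheme in which each `<loc_k>` token is a bin index in [0, 999] over the normalized image, emitted in (x1, y1, x2, y2) order; `parse_loc_boxes` is a hypothetical helper, and the exact dequantization (e.g., whether bin centers are used) should be verified against the official repository.

```python
import re

def parse_loc_boxes(text, image_width, image_height, num_bins=1000):
    # Extract all <loc_k> bin indices from the decoded output.
    bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
    boxes = []
    # Group indices into (x1, y1, x2, y2) quadruples and rescale to pixels.
    for i in range(0, len(bins) - 3, 4):
        x1, y1, x2, y2 = bins[i:i + 4]
        boxes.append((
            x1 / num_bins * image_width,
            y1 / num_bins * image_height,
            x2 / num_bins * image_width,
            y2 / num_bins * image_height,
        ))
    return boxes

# image.size is (width, height) for a PIL image.
boxes = parse_loc_boxes(output_text, *image.size)
print(boxes)
```

If the processor follows the Florence-2 API, it may also expose a `post_process_generation` helper that performs this conversion; again, check the official code to confirm.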

For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/LijunRio/AG-KD).

## Citation

If you find our work helpful or inspiring, please cite our paper:

```bibtex
@article{li2025enhancing,
  title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
  author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
  journal={arXiv preprint arXiv:2503.03278},
  year={2025}
}
```