Files changed (1)
  1. README.md +46 -83
README.md CHANGED
@@ -1,120 +1,83 @@
  ---
  library_name: transformers
- pipeline_tag: image-text-to-text
- base_model:
- - microsoft/Florence-2-base-ft
  license: apache-2.0
- tags:
- - vision-language
- - abnormality-grounding
- - medical-imaging
- - knowledge-distillation
- - multimodal
- model-index:
- - name: AG-KD
-   results:
-   - task:
-       type: Abnormality Grounding
-       name: Grounding
-     metrics:
-     - name: none
-       type: none
-       value: null
  ---

- # 🚀 Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions

- This repository provides the code and model weights for our paper:
- **[Enhancing Abnormality Grounding for Vision-Language Models with Knowledge Descriptions](https://arxiv.org/abs/2503.03278)**

- 🧪 Explore our live demo on [Hugging Face Spaces](https://huggingface.co/spaces/Anonymous-AC/AG-KD-anonymous-Demo) to see the model in action!

- ## 📌 Overview

- **AG-KD (Abnormality Grounding with Knowledge Descriptions)** is a compact 0.23B vision-language model designed for abnormality grounding in medical images. Despite its small size, it delivers performance **comparable to 7B state-of-the-art medical VLMs**. Our approach integrates **structured knowledge descriptions** into prompts, enhancing the model’s ability to localize medical abnormalities in images.

- ## 💻 How to Use
-
- ### Simple Example
-
- For detailed examples, visit: [AG-KD GitHub Repository](https://github.com/LijunRio/AG-KD)

  ```python
-
  import torch
- import requests
- from io import BytesIO
  from PIL import Image
- import numpy as np
- import albumentations as A
- from transformers import AutoModelForCausalLM, AutoProcessor

- def apply_transform(image, size=512):
-     transform = A.Compose([
-         A.LongestMaxSize(max_size=size),
-         A.PadIfNeeded(min_height=size, min_width=size, border_mode=0, value=(0,0,0)),
-         A.Resize(height=size, width=size)
-     ])
-     return transform(image=np.array(image))["image"]

- def run_simple(image_url, target, definition, model, processor, device):
-     prompt = f"<CAPTION_TO_PHRASE_GROUNDING>Locate the phrases in the caption: {target} means {definition}."
-     response = requests.get(image_url)
-     image = Image.open(BytesIO(response.content)).convert("RGB")
-     np_image = apply_transform(image)

-     inputs = processor(text=[prompt], images=[np_image], return_tensors="pt", padding=True).to(device)

-     outputs = model.generate(
          input_ids=inputs["input_ids"],
          pixel_values=inputs["pixel_values"],
          max_new_tokens=1024,
          num_beams=3,
-         output_scores=True,
-         return_dict_in_generate=True
      )

-     transition_scores = model.compute_transition_scores(outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False)
-     generated_text = processor.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
-
-     output_len = np.sum(transition_scores.cpu().numpy() < 0, axis=1)
-     length_penalty = model.generation_config.length_penalty
-     score = transition_scores.cpu().sum(axis=1) / (output_len**length_penalty)
-     prob = np.exp(score.cpu().numpy())

-     print(f"\n[IMAGE URL] {image_url}")
-     print(f"[TARGET] {target}")
-     print(f"[PROBABILITY] {prob[0] * 100:.2f}%")
-     print(f"[GENERATED TEXT]\n{generated_text}")
-
- if __name__ == "__main__":
-     image_url = "https://huggingface.co/spaces/RioJune/AG-KD/resolve/main/examples/f1eb2216d773ced6330b1f31e18f04f8.png"
-     target = "pulmonary fibrosis"
-     definition = "Scarring of the lung tissue creating a dense fibrous appearance."
-
-     device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-     model_name = "RioJune/AG-KD"
-
-     model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)
-     processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
-
-     run_simple(image_url, target, definition, model, processor, device)
  ```

- ## 📖 Citation

- If you use our work, please cite:

- ```
  @article{li2025enhancing,
-   title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
-   author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
-   journal={arXiv preprint arXiv:2503.03278},
-   year={2025}
- }
  ```
 
  ---
+ pipeline_tag: zero-shot-object-detection
  library_name: transformers
  license: apache-2.0
  ---

+ # Knowledge to Sight (K2Sight)

+ **Knowledge to Sight (K2Sight)** is a framework for abnormality grounding in medical images: localizing clinical findings from textual descriptions. Unlike generalist vision-language models (VLMs), which often struggle with domain-specific medical terminology, K2Sight introduces structured semantic supervision by decomposing clinical concepts into interpretable visual attributes, such as shape, density, and anatomical location, distilled from domain ontologies.

+ These attribute descriptions guide region-text alignment during training, so compact models (0.23B and 2B parameters) can be trained data-efficiently on only 1.5% of the data required by state-of-the-art medical VLMs. Despite their small size and limited training data, K2Sight models perform on par with or better than 7B+ medical VLMs, with up to 9.82% improvement in $mAP_{50}$.
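
+ For illustration, the earlier revision of this card built its grounding prompt by pairing a finding name with a short knowledge description. A minimal sketch of that template (the exact task token and wording may differ in the current release):

+ ```python
+ # Sketch only: prompt template taken from the previous revision of this model card.
+ target = "pulmonary fibrosis"
+ definition = "Scarring of the lung tissue creating a dense fibrous appearance."
+ prompt = (
+     "<CAPTION_TO_PHRASE_GROUNDING>"
+     f"Locate the phrases in the caption: {target} means {definition}."
+ )
+ ```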
 
+ - **Paper**: [Knowledge to Sight: Reasoning over Visual Attributes via Knowledge Decomposition for Abnormality Grounding](https://huggingface.co/papers/2508.04572)
+ - **Project Page**: https://lijunrio.github.io/K2Sight/
+ - **Code**: https://github.com/LijunRio/AG-KD
+ - **Demo**: https://huggingface.co/spaces/RioJune/AG-KD

+ ## Usage

+ The model can be used directly for zero-shot abnormality grounding in medical images.

+ First, install the necessary dependencies:

+ ```bash
+ pip install transformers Pillow
+ # For full project dependencies and further setup, refer to the official GitHub repository.
+ ```

+ Here is a basic example of how to use the model for abnormality grounding:

  ```python
  import torch
  from PIL import Image
+ from transformers import AutoModel, AutoProcessor

+ # Load model and processor
+ model_id = "RioJune/AG-KD"
+ model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

+ # Example image: replace with the path to your own medical image
+ image = Image.open("path/to/your/medical_image.png").convert("RGB")

+ # Example instruction for abnormality grounding.
+ # The model expects instructions to start with a task token such as <OD> (object detection).
+ instruction = "<OD> Please localize the lesion."

+ # Prepare inputs
+ inputs = processor(images=image, text=instruction, return_tensors="pt")

+ # Generate output
+ with torch.no_grad():
+     generated_ids = model.generate(
          input_ids=inputs["input_ids"],
          pixel_values=inputs["pixel_values"],
          max_new_tokens=1024,
          num_beams=3,
      )

+ # Decode and print the result; keep special tokens, since they carry the box coordinates
+ output_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
+ print(f"Instruction: {instruction}")
+ print(f"Detected abnormality: {output_text}")

+ # The output text will contain bounding-box coordinates (e.g., <loc_000><loc_001><loc_002><loc_003>)
+ # and a description of the localized finding.
  ```
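
+ The `<loc_xxx>` tokens encode box corners as integer bins. Below is a minimal parsing sketch assuming the Florence-2-style convention of 1,000 bins normalized to the image size (the previous revision of this card listed microsoft/Florence-2-base-ft as the base model); if the bundled processor follows the Florence-2 API, its `post_process_generation` method may perform this mapping for you:

+ ```python
+ import re

+ def parse_loc_boxes(text, image_width, image_height):
+     # Assumes <loc_x1><loc_y1><loc_x2><loc_y2> quadruples with bins in [0, 999].
+     bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
+     boxes = []
+     for x1, y1, x2, y2 in zip(bins[0::4], bins[1::4], bins[2::4], bins[3::4]):
+         boxes.append((x1 / 1000 * image_width, y1 / 1000 * image_height,
+                       x2 / 1000 * image_width, y2 / 1000 * image_height))
+     return boxes

+ # Usage with the variables from the example above (image.size is (width, height)):
+ print(parse_loc_boxes(output_text, *image.size))
+ ```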

+ For more advanced usage, including training and evaluation scripts, please refer to the [official GitHub repository](https://github.com/LijunRio/AG-KD).

+ ## Citation

+ If you find our work helpful or inspiring, please cite our paper:

+ ```bibtex
  @article{li2025enhancing,
+   title={Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions},
+   author={Li, J. and Liu, C. and Bai, W. and Arcucci, R. and Bercea, C. I. and Schnabel, J. A.},
+   journal={arXiv preprint arXiv:2503.03278},
+   year={2025}
  }
  ```