Commit 366408f (verified), committed by jiang-cc
1 Parent(s): f2bc369

Upload README.md with huggingface_hub

Files changed (1): README.md (+68 -55)
@@ -1,92 +1,105 @@
  ---
  library_name: transformers
+ license: apache-2.0
  tags:
  - anomaly-detection
  - vision-language-model
  - industrial-inspection
- - multimodal
- - in-context-learning
+ - comparison-aware
+ - qwen2.5-vl
+ pipeline_tag: image-text-to-text
+ language:
+ - en
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
  ---

- # AD-Copilot
+ # AD-Copilot: Comparison-Aware Anomaly Detection with Vision-Language Models

- A vision-language assistant for industrial anomaly detection via visual in-context comparison.
+ AD-Copilot extends Qwen2.5-VL-7B with a novel **comparison-aware visual encoder** that generates
+ special comparison tokens capturing differences between a reference image and a test image,
+ achieving **state-of-the-art results** on industrial anomaly detection benchmarks.

- ## Model Details
+ ## Key Innovation

- ### Model Description
+ - **ADCopilotCompareVisualEncoder**: Bidirectional cross-attention mechanism that compares reference and test images
+ - **100 comparison tokens** per image pair, injected into the language model sequence
+ - Achieves **78.74% accuracy** on OmniDiff benchmark (vs. 72.19% for base Qwen2.5-VL)

- - **Developed by:** Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng
- - **Model type:** Vision-Language Model (VLM)
- - **Language(s):** English and Chinese
- - **License:** Apache 2.0
- - **Finetuned from:** [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
+ ## Links

- ### Model Sources
+ | Resource | Link |
+ |----------|------|
+ | **Paper** | [arXiv:2603.13779](https://arxiv.org/abs/2603.13779v1) |
+ | **Code** | [GitHub](https://github.com/jam-cc/AD-Copilot) |
+ | **Demo** | [HuggingFace Space](https://huggingface.co/spaces/jiang-cc/AD-Copilot) |

- - **Repository:** [jam-cc/AD-Copilot](https://github.com/jam-cc/AD-Copilot)
- - **Paper:** [arXiv:2603.13779](https://arxiv.org/abs/2603.13779v1)
-
- ## Uses
-
- ### Direct Use
-
- AD-Copilot can be used for:
- - Industrial anomaly detection and localization
- - Natural language question answering about product defects
- - Visual comparison between normal reference images and query images
- - General visual question answering
-
- ## How to Get Started with the Model
+ ## Quick Start

  ```python
- from transformers import AutoModelForImageTextToText, AutoProcessor
+ import torch
+ from transformers import AutoModelForVision2Seq, AutoProcessor
  from qwen_vl_utils import process_vision_info

- model = AutoModelForImageTextToText.from_pretrained(
+ model = AutoModelForVision2Seq.from_pretrained(
      "jiang-cc/AD-Copilot",
-     torch_dtype="auto",
-     device_map="auto"
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ processor = AutoProcessor.from_pretrained(
+     "jiang-cc/AD-Copilot",
+     min_pixels=64 * 28 * 28,
+     max_pixels=1280 * 28 * 28,
+     trust_remote_code=True,
  )
- processor = AutoProcessor.from_pretrained("jiang-cc/AD-Copilot")

  messages = [
      {
          "role": "user",
          "content": [
-             {"type": "image", "image": "<path_to_reference_image>"},
-             {"type": "image", "image": "<path_to_query_image>"},
-             {"type": "text", "text": "The first image is a normal reference. Is there any anomaly in the second image? If so, describe it."},
+             {"type": "image", "image": "path/to/good_image.png"},
+             {"type": "image", "image": "path/to/test_image.png"},
+             {"type": "text", "text": "The first image is good. Is there any anomaly in the second image? A.yes, B.no. Please answer the letter only."},
          ],
      }
  ]

  text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
  image_inputs, video_inputs = process_vision_info(messages)
- inputs = processor(
-     text=[text],
-     images=image_inputs,
-     videos=video_inputs,
-     return_tensors="pt"
- ).to(model.device)
-
- output_ids = model.generate(**inputs, max_new_tokens=512)
- response = processor.batch_decode(
-     output_ids[:, inputs.input_ids.shape[1]:],
-     skip_special_tokens=True
- )[0]
- print(response)
+ inputs = processor(text=[text], images=[image_inputs], return_tensors="pt").to(model.device)
+
+ with torch.inference_mode():
+     output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+
+ trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
  ```

- ## Citation
+ ## Benchmark Results (OmniDiff)
+
+ | Model | Visited IAD | Avg ACC |
+ |-------|-------------|---------|
+ | MiniCPM-V2.6 | 0 | 67.90% |
+ | EIAD | 128k | 69.40% |
+ | Qwen2.5-VL | 0 | 72.19% |
+ | **AD-Copilot (Ours)** | **206k** | **78.74%** |
+
+ ## Architecture

- **BibTeX:**
+ - **Base Model**: Qwen2.5-VL-7B-Instruct (28 layers, 3584 hidden size)
+ - **Vision Encoder**: Qwen2.5-VL ViT (32 layers, 1280 hidden size)
+ - **Comparison Encoder**: Bidirectional cross-attention + query decoder (100 tokens)
+ - **Parameters**: ~8B total
+ - **Dtype**: bfloat16
+
+ ## Citation

  ```bibtex
- @article{jiang2026ad,
-   title = {AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison},
-   author = {Jiang, Xi and Guo, Yue and Li, Jian and Liu, Yong and Gao, Bin-Bin and Deng, Hanqiu and Liu, Jun and Zhao, Heng and Wang, Chengjie and Zheng, Feng},
-   journal = {arXiv preprint arXiv:2603.13779},
-   year = {2026}
+ @article{adcopilot2025,
+   title={AD-Copilot: Comparison-Aware Anomaly Detection with Vision-Language Models},
+   author={Jiang, Xi and others},
+   journal={arXiv preprint arXiv:2603.13779},
+   year={2025}
  }
- ```
+ ```
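
The Quick Start prompt added in this commit asks the model to reply with a letter only ("A.yes, B.no"), and the new benchmark table reports accuracy. The following is a minimal sketch of how such replies might be mapped to labels and scored. The helpers `parse_choice` and `accuracy` are hypothetical names introduced here for illustration; they are not part of the AD-Copilot repository or of `transformers`, and the parsing rule (and the choice to count unparseable replies as wrong) is an assumption rather than the authors' evaluation protocol.

```python
import re
from typing import Optional, Sequence

# Hypothetical helpers for illustration only; not part of AD-Copilot.
# Accepts replies such as "A", "b.", "(A)", "A.yes" or "B. no".
_REPLY = re.compile(r"^\(?([AB])\)?[.:)]?\s*(yes|no)?\.?$", re.IGNORECASE)


def parse_choice(reply: str) -> Optional[bool]:
    """Return True for "A" (anomaly), False for "B" (no anomaly), else None."""
    match = _REPLY.match(reply.strip())
    if match is None:
        return None  # free-form text is treated as unparseable
    is_anomaly = match.group(1).upper() == "A"
    word = match.group(2)
    # Reject contradictory replies such as "A.no" (assumed policy).
    if word is not None and (word.lower() == "yes") != is_anomaly:
        return None
    return is_anomaly


def accuracy(replies: Sequence[str], labels: Sequence[bool]) -> float:
    """Fraction of replies matching the labels; unparseable replies count as wrong."""
    if len(replies) != len(labels) or not labels:
        raise ValueError("replies and labels must be non-empty and equal in length")
    correct = sum(parse_choice(r) == y for r, y in zip(replies, labels))
    return correct / len(labels)
```

For example, `accuracy(["A", "B", "B.", "maybe"], [True, False, True, True])` evaluates to `0.5`: the first two replies match their labels, the third is parsed as "no anomaly" against a positive label, and the fourth cannot be parsed.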