jiang-cc committed on Apr 8

Commit 366408f (verified) · 1 Parent(s): f2bc369

Upload README.md with huggingface_hub

Files changed (1)

README.md +68 -55

@@ -1,92 +1,105 @@
 ---
 library_name: transformers
 tags:
   - anomaly-detection
   - vision-language-model
   - industrial-inspection
-  - multimodal
-  - in-context-learning
 ---
-# AD-Copilot
-A vision-language assistant for industrial anomaly detection via visual in-context comparison.
-## Model Details
-### Model Description
-- **Developed by:** Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng
-- **Model type:** Vision-Language Model (VLM)
-- **Language(s):** English and Chinese
-- **License:** Apache 2.0
-- **Finetuned from:** [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
-### Model Sources
-- **Repository:** [jam-cc/AD-Copilot](https://github.com/jam-cc/AD-Copilot)
-- **Paper:** [arXiv:2603.13779](https://arxiv.org/abs/2603.13779v1)
-## Uses
-### Direct Use
-AD-Copilot can be used for:
-- Industrial anomaly detection and localization
-- Natural language question answering about product defects
-- Visual comparison between normal reference images and query images
-- General visual question answering
-## How to Get Started with the Model
 ```python
-from transformers import AutoModelForImageTextToText, AutoProcessor
 from qwen_vl_utils import process_vision_info
-model = AutoModelForImageTextToText.from_pretrained(
     "jiang-cc/AD-Copilot",
-    torch_dtype="auto",
-    device_map="auto"
 )
-processor = AutoProcessor.from_pretrained("jiang-cc/AD-Copilot")
 messages = [
     {
         "role": "user",
         "content": [
-            {"type": "image", "image": "<path_to_reference_image>"},
-            {"type": "image", "image": "<path_to_query_image>"},
-            {"type": "text", "text": "The first image is a normal reference. Is there any anomaly in the second image? If so, describe it."},
         ],
     }
 ]
 text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 image_inputs, video_inputs = process_vision_info(messages)
-inputs = processor(
-    text=[text],
-    images=image_inputs,
-    videos=video_inputs,
-    return_tensors="pt"
-).to(model.device)
-output_ids = model.generate(**inputs, max_new_tokens=512)
-response = processor.batch_decode(
-    output_ids[:, inputs.input_ids.shape[1]:],
-    skip_special_tokens=True
-)[0]
-print(response)
 ```
-## Citation
-**BibTeX:**
 ```bibtex
-@article{jiang2026ad,
-  title   = {AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison},
-  author  = {Jiang, Xi and Guo, Yue and Li, Jian and Liu, Yong and Gao, Bin-Bin and Deng, Hanqiu and Liu, Jun and Zhao, Heng and Wang, Chengjie and Zheng, Feng},
-  journal = {arXiv preprint arXiv:2603.13779},
-  year    = {2026}
 }
-```

 ---
 library_name: transformers
+license: apache-2.0
 tags:
   - anomaly-detection
   - vision-language-model
   - industrial-inspection
+  - comparison-aware
+  - qwen2.5-vl
+pipeline_tag: image-text-to-text
+language:
+  - en
+base_model:
+  - Qwen/Qwen2.5-VL-7B-Instruct
 ---
+# AD-Copilot: Comparison-Aware Anomaly Detection with Vision-Language Models
+AD-Copilot extends Qwen2.5-VL-7B with a novel **comparison-aware visual encoder** that generates
+special comparison tokens capturing differences between a reference image and a test image,
+achieving **state-of-the-art results** on industrial anomaly detection benchmarks.
+## Key Innovation
+- **ADCopilotCompareVisualEncoder**: Bidirectional cross-attention mechanism that compares reference and test images
+- **100 comparison tokens** per image pair, injected into the language model sequence
+- Achieves **78.74% accuracy** on OmniDiff benchmark (vs. 72.19% for base Qwen2.5-VL)
+## Links
+| Resource | Link |
+|----------|------|
+| **Paper** | [arXiv:2603.13779](https://arxiv.org/abs/2603.13779v1) |
+| **Code** | [GitHub](https://github.com/jam-cc/AD-Copilot) |
+| **Demo** | [HuggingFace Space](https://huggingface.co/spaces/jiang-cc/AD-Copilot) |
+## Quick Start
 ```python
+import torch
+from transformers import AutoModelForVision2Seq, AutoProcessor
 from qwen_vl_utils import process_vision_info
+model = AutoModelForVision2Seq.from_pretrained(
     "jiang-cc/AD-Copilot",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
+)
+processor = AutoProcessor.from_pretrained(
+    "jiang-cc/AD-Copilot",
+    min_pixels=64 * 28 * 28,
+    max_pixels=1280 * 28 * 28,
+    trust_remote_code=True,
 )
 messages = [
     {
         "role": "user",
         "content": [
+            {"type": "image", "image": "path/to/good_image.png"},
+            {"type": "image", "image": "path/to/test_image.png"},
+            {"type": "text", "text": "The first image is good. Is there any anomaly in the second image? A.yes, B.no. Please answer the letter only."},
         ],
     }
 ]
 text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
 image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(text=[text], images=[image_inputs], return_tensors="pt").to(model.device)
+with torch.inference_mode():
+    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
+trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
+print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
 ```
+## Benchmark Results (OmniDiff)
+| Model | Visited IAD | Avg ACC |
+|-------|-------------|---------|
+| MiniCPM-V2.6 | 0 | 67.90% |
+| EIAD | 128k | 69.40% |
+| Qwen2.5-VL | 0 | 72.19% |
+| **AD-Copilot (Ours)** | **206k** | **78.74%** |
+## Architecture
+- **Base Model**: Qwen2.5-VL-7B-Instruct (28 layers, 3584 hidden size)
+- **Vision Encoder**: Qwen2.5-VL ViT (32 layers, 1280 hidden size)
+- **Comparison Encoder**: Bidirectional cross-attention + query decoder (100 tokens)
+- **Parameters**: ~8B total
+- **Dtype**: bfloat16
+## Citation
 ```bibtex
+@article{adcopilot2025,
+  title={AD-Copilot: Comparison-Aware Anomaly Detection with Vision-Language Models},
+  author={Jiang, Xi and others},
+  journal={arXiv preprint arXiv:2603.13779},
+  year={2025}
 }
+```
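
Note on the new Quick Start prompt: it asks the model to reply with a single letter ("A.yes, B.no"), so downstream code has to map the raw reply back to a choice before scoring. The sketch below is one possible way to do that; it is not part of the commit or the AD-Copilot repository. The helper names `parse_choice` and `accuracy`, and the sample replies, are hypothetical and used only for illustration.

```python
import re
from typing import List, Optional

# Hypothetical helper: accept replies such as "A", "B.", "(A)" or "A.yes".
# The letter must be followed by '.', ')', ':' or the end of the string, so
# free-form text like "A scratch is visible" is deliberately rejected.
_CHOICE = re.compile(r"\s*\(?([AB])\s*(?:[.):]|$)")


def parse_choice(response: str) -> Optional[str]:
    """Return 'A' (anomaly) or 'B' (no anomaly), or None if unparseable."""
    match = _CHOICE.match(response)
    return match.group(1) if match else None


def accuracy(responses: List[str], labels: List[str]) -> float:
    """Fraction of replies whose parsed letter equals the ground-truth letter.

    Unparseable replies count as wrong.
    """
    correct = sum(parse_choice(r) == y for r, y in zip(responses, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Made-up replies and labels, for illustration only.
    replies = ["A", "B.", "(A)", "I am not sure"]
    truth = ["A", "B", "B", "A"]
    print(accuracy(replies, truth))
```

Counting unparseable replies as errors is a conservative choice for this sketch; an evaluation script could instead log them separately.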