# VLM-OVD (inference-only)
This repository contains inference-only artifacts exported from a training checkpoint.
Note: Accuracy is currently limited by an open research question around class token mapping; further investigation is needed to improve it.
## Benchmark

### VRAM Usage (A100 80GB)
| Model | Model Weight | Peak VRAM | Max Batch Size |
|---|---|---|---|
| Grounding DINO Base | 0.891 GB | 2.491 GB | 16 |
| Grounding DINO Tiny | 0.645 GB | 2.150 GB | 16 |
| Qwen 2B OVD | 4.028 GB | 4.213 GB | 128 |
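The weight and peak-VRAM figures above can be reproduced with PyTorch's CUDA memory counters. The helper below is a sketch, not the benchmark harness used for this table; `make_inputs` is a hypothetical callable returning one batch of keyword arguments for the model.

```python
import torch


def model_weight_gb(model):
    """Total size of parameters and buffers in GB (the 'Model Weight' column)."""
    total = sum(p.numel() * p.element_size() for p in model.parameters())
    total += sum(b.numel() * b.element_size() for b in model.buffers())
    return total / 1024**3


def peak_vram_gb(model, make_inputs, device="cuda"):
    """Peak allocated CUDA memory for a single forward pass, in GB."""
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(**make_inputs())
    return torch.cuda.max_memory_allocated(device) / 1024**3
```

Peak VRAM depends on batch size and input resolution, so the numbers above are only comparable under identical inputs.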
### Latency (per image, average of 10 runs)
| Model | Latency |
|---|---|
| Grounding DINO Base | 113.28 ms |
| Grounding DINO Tiny | 99.79 ms |
| Qwen 2B OVD | 99.92 ms |
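A minimal wall-clock timing helper in the spirit of the table above (a sketch, not the authors' measurement setup): it runs a few untimed warmup iterations and synchronizes CUDA when available so GPU work is fully counted.

```python
import time

try:  # CUDA sync is needed for accurate GPU timings; degrade gracefully without torch
    import torch

    def _sync():
        if torch.cuda.is_available():
            torch.cuda.synchronize()
except ImportError:
    def _sync():
        pass


def avg_latency_ms(run_once, warmup=3, runs=10):
    """Average wall-clock latency of run_once() over `runs` calls, in ms."""
    for _ in range(warmup):
        run_once()
    _sync()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        _sync()
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies) * 1000.0
```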
### mAP (class-agnostic)
| Model | mAP | mAP (large) | mAP (medium) | mAP (small) |
|---|---|---|---|---|
| Grounding DINO Base | 0.5349 | 0.7293 | 0.5408 | 0.2984 |
| Grounding DINO Tiny | 0.4503 | 0.6474 | 0.4461 | 0.2221 |
| Qwen 2B OVD | 0.2401 | 0.4348 | 0.0321 | 0.0001 |
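Class-agnostic mAP treats every box as one "object" class, so only localization quality is scored. As a rough illustration of the metric, here is a pure-Python sketch using greedy score-ordered matching and step-wise precision-recall integration; the table above would have been produced with a COCO-style evaluator (101-point interpolation, multiple IoU thresholds), so numbers from this sketch will differ.

```python
def iou_xyxy(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0.0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def class_agnostic_ap(preds, gts, iou_thr=0.5):
    """preds: list of (score, box_xyxy); gts: list of box_xyxy.
    Greedy one-to-one matching in descending score order, then AP as
    the area under the stepwise precision-recall curve."""
    if not gts:
        return 0.0
    preds = sorted(preds, key=lambda p: -p[0])
    matched = [False] * len(gts)
    tp, fp = [], []
    for _, box in preds:
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gts):
            if not matched[j]:
                v = iou_xyxy(box, g)
                if v > best_iou:
                    best_iou, best_j = v, j
        if best_iou >= iou_thr:
            matched[best_j] = True
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    ap, ctp, cfp, prev_recall = 0.0, 0, 0, 0.0
    for t, f in zip(tp, fp):
        ctp += t; cfp += f
        recall = ctp / len(gts)
        ap += (ctp / (ctp + cfp)) * (recall - prev_recall)
        prev_recall = recall
    return ap
```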
## Quick start (single image)

```python
import torch
import requests
from io import BytesIO
from PIL import Image, ImageDraw
from transformers import AutoConfig, AutoModel, AutoProcessor

repo_id = "xpuenabler/OVD_MOSP_Qwen_no_category"
image_source = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
query = "person, dog"

# Load image
if image_source.startswith(("http://", "https://")):
    response = requests.get(image_source)
    pil = Image.open(BytesIO(response.content)).convert("RGB")
else:
    pil = Image.open(image_source).convert("RGB")

# Load config and model
cfg = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load processor from the base VLM
processor = AutoProcessor.from_pretrained(cfg.vlm_model_name, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": pil},
            {"type": "text", "text": query},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    return_dict=True,
    return_tensors="pt",
)
pixel_values = inputs["pixel_values"].to(device)
image_grid_thw = inputs["image_grid_thw"].to(device)
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

# Run inference
with torch.no_grad():
    outputs = model.forward_inference(
        pixel_values=pixel_values,
        input_ids=input_ids,
        attention_mask=attention_mask,
        image_grid_thw=image_grid_thw,
    )

pred_boxes = outputs.pred_boxes[0].float().cpu()                  # (Q, 4) cxcywh normalized
pred_scores = outputs.pred_objectness[0].sigmoid().float().cpu()  # (Q,)

# Top-k predictions
topk = 10
score_threshold = 0.5
order = torch.argsort(pred_scores, descending=True)[:topk]
print("Top boxes (cxcywh normalized):", pred_boxes[order])
print("Top scores:", pred_scores[order])

# Visualize and save
w, h = pil.size
vis = pil.copy()
draw = ImageDraw.Draw(vis)
min_box_size = 0.02  # Filter out degenerate boxes (normalized, 2%)
for idx in order:
    score = pred_scores[idx].item()
    if score < score_threshold:
        continue
    cx, cy, bw, bh = pred_boxes[idx].tolist()
    # Skip degenerate boxes (near-zero size)
    if bw < min_box_size or bh < min_box_size:
        continue
    # Convert cxcywh (normalized) to xyxy (pixel)
    x1 = (cx - bw / 2) * w
    y1 = (cy - bh / 2) * h
    x2 = (cx + bw / 2) * w
    y2 = (cy + bh / 2) * h
    draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
    draw.text((x1, max(0, y1 - 12)), f"{score:.2f}", fill="red")
vis.save("output.jpg")
print("Saved visualization to output.jpg")
```
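The snippet above does not deduplicate overlapping query outputs. If near-duplicate boxes show up, a standard greedy NMS pass can be applied afterward; the function below is a plain-PyTorch sketch, not part of this repository's API (`torchvision.ops.nms` is an equivalent off-the-shelf option).

```python
import torch


def nms_xyxy(boxes, scores, iou_thr=0.5):
    """Greedy NMS over xyxy boxes (N, 4) with scores (N,).
    Returns indices of kept boxes, highest score first."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0].item()
        keep.append(i)
        if order.numel() == 1:
            break
        rest = order[1:]
        # Intersection of the top box with all remaining boxes
        xx1 = torch.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = torch.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = torch.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = torch.minimum(boxes[i, 3], boxes[rest, 3])
        inter = (xx2 - xx1).clamp(min=0) * (yy2 - yy1).clamp(min=0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]  # drop boxes overlapping the kept one
    return torch.tensor(keep, dtype=torch.long)
```

Convert the normalized cxcywh predictions to pixel xyxy (as in the drawing loop) before running NMS.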
## Notes

- Box format: predictions are in `cxcywh` (center-x, center-y, width, height), normalized to [0, 1].
- Base VLM: check `config.json` for `vlm_model_name` (e.g., `Qwen/Qwen3-VL-2B-Instruct`).
- Head type: check `config.json` for `head_type` (`class_agnostic` or `ovd`).
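The box-format conversion described in the notes, as a standalone helper (same arithmetic as the drawing loop in the quick start):

```python
def cxcywh_to_xyxy(cx, cy, bw, bh, width, height):
    """Convert one normalized cxcywh box to pixel xyxy coordinates."""
    return (
        (cx - bw / 2) * width,
        (cy - bh / 2) * height,
        (cx + bw / 2) * width,
        (cy + bh / 2) * height,
    )
```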