
VLM-OVD (inference-only)

This repository contains inference-only artifacts exported from a training checkpoint.

Note: Current accuracy is limited; class token mapping is still an open research question, and further investigation is needed to improve it.

Benchmark

VRAM Usage (A100 80GB)

| Model | Model Weight | Peak VRAM | Max Batch Size |
|---|---|---|---|
| Grounding DINO Base | 0.891 GB | 2.491 GB | 16 |
| Grounding DINO Tiny | 0.645 GB | 2.150 GB | 16 |
| Qwen 2B OVD | 4.028 GB | 4.213 GB | 128 |
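
Peak-VRAM figures like those above can be reproduced with PyTorch's CUDA memory statistics. A minimal sketch; the helper names are ours and not part of this repo:

```python
def model_weight_gb(model):
    """Size of the model's parameters in GiB (weights only, no activations)."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**3

def measure_peak_gb(run_batch):
    """Peak CUDA memory (GiB) observed while `run_batch()` executes.

    `run_batch` is a zero-argument callable performing one forward pass;
    the model and inputs must already be on the GPU.
    """
    import torch  # local import keeps the helper above framework-free
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        run_batch()
    return torch.cuda.max_memory_allocated() / 1024**3
```

Note that `max_memory_allocated` reports tensors allocated by the caching allocator, so it can differ slightly from what `nvidia-smi` shows.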

Latency (per image, avg 10 runs)

| Model | Latency |
|---|---|
| Grounding DINO Base | 113.28 ms |
| Grounding DINO Tiny | 99.79 ms |
| Qwen 2B OVD | 99.92 ms |
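
A per-image latency measurement like the one above (average over repeated runs, after warmup) can be sketched as follows; on GPU, pass `sync=torch.cuda.synchronize` so asynchronous kernel launches are fully counted. The helper name and exact warmup count are illustrative assumptions:

```python
import time

def benchmark_latency(fn, warmup=3, runs=10, sync=None):
    """Average wall-clock latency of `fn` in milliseconds.

    `fn` is a zero-argument callable (one forward pass on one image);
    `sync` is an optional barrier, e.g. torch.cuda.synchronize on GPU.
    """
    for _ in range(warmup):          # warmup: JIT, allocator, caches
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
        if sync:
            sync()                   # wait for queued GPU work each run
    return (time.perf_counter() - start) / runs * 1000.0
```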

mAP (class-agnostic)

| Model | mAP | mAP (large) | mAP (medium) | mAP (small) |
|---|---|---|---|---|
| Grounding DINO Base | 0.5349 | 0.7293 | 0.5408 | 0.2984 |
| Grounding DINO Tiny | 0.4503 | 0.6474 | 0.4461 | 0.2221 |
| Qwen 2B OVD | 0.2401 | 0.4348 | 0.0321 | 0.0001 |
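
Class-agnostic mAP is conventionally computed by collapsing every label into a single category before running standard COCO evaluation, so only localization quality is scored; whether this repo's numbers were produced exactly this way is an assumption. A sketch of the remapping step, whose output would then be fed to `pycocotools` `COCOeval`:

```python
def make_class_agnostic(coco_gt_dict, detections):
    """Remap all category ids to 1 so COCOeval scores boxes, not labels.

    `coco_gt_dict` is a COCO-format ground-truth dict; `detections` is a
    list of COCO result dicts. Both are modified in place and returned.
    """
    for ann in coco_gt_dict["annotations"]:
        ann["category_id"] = 1
    coco_gt_dict["categories"] = [{"id": 1, "name": "object"}]
    for det in detections:
        det["category_id"] = 1
    return coco_gt_dict, detections
```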

Quick start (single image)

import torch
import requests
from io import BytesIO
from PIL import Image, ImageDraw
from transformers import AutoConfig, AutoModel, AutoProcessor

repo_id = "xpuenabler/OVD_MOSP_Qwen_no_category"
image_source = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
query = "person, dog"

# Load image
if image_source.startswith(("http://", "https://")):
    response = requests.get(image_source, timeout=30)
    response.raise_for_status()
    pil = Image.open(BytesIO(response.content)).convert("RGB")
else:
    pil = Image.open(image_source).convert("RGB")

# Load config and model
cfg = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load processor from the base VLM
processor = AutoProcessor.from_pretrained(cfg.vlm_model_name, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": pil},
            {"type": "text", "text": query},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    return_dict=True,
    return_tensors="pt",
)

pixel_values = inputs["pixel_values"].to(device)
image_grid_thw = inputs["image_grid_thw"].to(device)
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

# Run inference
with torch.no_grad():
    outputs = model.forward_inference(
        pixel_values=pixel_values,
        input_ids=input_ids,
        attention_mask=attention_mask,
        image_grid_thw=image_grid_thw,
    )

pred_boxes = outputs.pred_boxes[0].float().cpu()        # (Q, 4) cxcywh normalized
pred_scores = outputs.pred_objectness[0].sigmoid().float().cpu()  # (Q,)

# Top-k predictions
topk = 10
score_threshold = 0.5
order = torch.argsort(pred_scores, descending=True)[:topk]
print("Top boxes (cxcywh normalized):", pred_boxes[order])
print("Top scores:", pred_scores[order])

# Visualize and save
w, h = pil.size
vis = pil.copy()
draw = ImageDraw.Draw(vis)
min_box_size = 0.02  # Filter out degenerate boxes (normalized, 2%)

for idx in order:
    score = pred_scores[idx].item()
    if score < score_threshold:
        continue
    cx, cy, bw, bh = pred_boxes[idx].tolist()
    # Skip degenerate boxes (near-zero width or height)
    if bw < min_box_size or bh < min_box_size:
        continue
    # Convert cxcywh (normalized) to xyxy (pixel)
    x1 = (cx - bw / 2) * w
    y1 = (cy - bh / 2) * h
    x2 = (cx + bw / 2) * w
    y2 = (cy + bh / 2) * h
    draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
    draw.text((x1, max(0, y1 - 12)), f"{score:.2f}", fill="red")

vis.save("output.jpg")
print("Saved visualization to output.jpg")
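
The raw query outputs can contain overlapping duplicates, so a non-maximum-suppression pass over the pixel-space boxes is one way to thin them before visualization. This is a plain-Python sketch for illustration; `torchvision.ops.nms` is the usual production choice:

```python
def iou(a, b):
    """Intersection-over-union of two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps, repeat.

    `boxes` is a list of xyxy boxes, `scores` a parallel list of floats;
    returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```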

Notes

  • Box format: Predictions are in cxcywh (center-x, center-y, width, height), normalized to [0, 1].
  • Base VLM: Check config.json for vlm_model_name (e.g., Qwen/Qwen3-VL-2B-Instruct).
  • Head type: Check config.json for head_type (class_agnostic or ovd).
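
The cxcywh-to-xyxy conversion done inline in the quick-start loop can be factored into a small reusable helper (the name is ours, not from this repo):

```python
def cxcywh_to_xyxy_pixels(box, width, height):
    """Convert a [0, 1]-normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, bw, bh = box
    return ((cx - bw / 2) * width, (cy - bh / 2) * height,
            (cx + bw / 2) * width, (cy + bh / 2) * height)
```

For example, a centered box covering half the image in each dimension on a 100x200 image maps to (25.0, 50.0, 75.0, 150.0).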

