
VLM-OVD (inference-only)

This repository contains inference-only artifacts exported from a training checkpoint.

Note: Current accuracy is limited; class token mapping is still an open research question, and further investigation is needed to improve it.

Benchmark

VRAM Usage (A100 80GB)

| Model | Model Weight | Peak VRAM | Max Batch Size |
|---|---|---|---|
| Grounding DINO Base | 0.891 GB | 2.491 GB | 16 |
| Grounding DINO Tiny | 0.645 GB | 2.150 GB | 16 |
| Qwen 2B OVD | 4.028 GB | 4.213 GB | 128 |
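
Peak-VRAM figures like those above can be reproduced with PyTorch's CUDA memory statistics. A minimal sketch; the helper names are ours and not part of this repo:

```python
def model_weight_gb(model):
    """Size of the model's parameters in GiB (weights only, no activations)."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**3

def measure_peak_gb(run_batch):
    """Peak CUDA memory (GiB) observed while `run_batch()` executes.

    `run_batch` is a zero-argument callable performing one forward pass;
    the model and inputs must already be on the GPU.
    """
    import torch  # local import keeps the helper above framework-free
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        run_batch()
    return torch.cuda.max_memory_allocated() / 1024**3
```

Note that `max_memory_allocated` reports tensors allocated by the caching allocator, so it can differ slightly from what `nvidia-smi` shows.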

Latency (per image, avg 10 runs)

| Model | Latency |
|---|---|
| Grounding DINO Base | 113.28 ms |
| Grounding DINO Tiny | 99.79 ms |
| Qwen 2B OVD | 99.92 ms |
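
A per-image latency measurement like the one above (average over repeated runs, after warmup) can be sketched as follows; on GPU, pass `sync=torch.cuda.synchronize` so asynchronous kernel launches are fully counted. The helper name and exact warmup count are illustrative assumptions:

```python
import time

def benchmark_latency(fn, warmup=3, runs=10, sync=None):
    """Average wall-clock latency of `fn` in milliseconds.

    `fn` is a zero-argument callable (one forward pass on one image);
    `sync` is an optional barrier, e.g. torch.cuda.synchronize on GPU.
    """
    for _ in range(warmup):          # warmup: JIT, allocator, caches
        fn()
    if sync:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
        if sync:
            sync()                   # wait for queued GPU work each run
    return (time.perf_counter() - start) / runs * 1000.0
```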

mAP (class-agnostic)

| Model | mAP | mAP (large) | mAP (medium) | mAP (small) |
|---|---|---|---|---|
| Grounding DINO Base | 0.5349 | 0.7293 | 0.5408 | 0.2984 |
| Grounding DINO Tiny | 0.4503 | 0.6474 | 0.4461 | 0.2221 |
| Qwen 2B OVD | 0.2401 | 0.4348 | 0.0321 | 0.0001 |
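
Class-agnostic mAP is conventionally computed by collapsing every label into a single category before running standard COCO evaluation, so only localization quality is scored; whether this repo's numbers were produced exactly this way is an assumption. A sketch of the remapping step, whose output would then be fed to `pycocotools` `COCOeval`:

```python
def make_class_agnostic(coco_gt_dict, detections):
    """Remap all category ids to 1 so COCOeval scores boxes, not labels.

    `coco_gt_dict` is a COCO-format ground-truth dict; `detections` is a
    list of COCO result dicts. Both are modified in place and returned.
    """
    for ann in coco_gt_dict["annotations"]:
        ann["category_id"] = 1
    coco_gt_dict["categories"] = [{"id": 1, "name": "object"}]
    for det in detections:
        det["category_id"] = 1
    return coco_gt_dict, detections
```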

Quick start (single image)

import torch
import requests
from io import BytesIO
from PIL import Image, ImageDraw
from transformers import AutoConfig, AutoModel, AutoProcessor

repo_id = "xpuenabler/OVD_MOSP_Qwen_no_category"
image_source = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
query = "person, dog"

# Load image
if image_source.startswith(("http://", "https://")):
    response = requests.get(image_source, timeout=30)
    response.raise_for_status()
    pil = Image.open(BytesIO(response.content)).convert("RGB")
else:
    pil = Image.open(image_source).convert("RGB")

# Load config and model
cfg = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load processor from the base VLM
processor = AutoProcessor.from_pretrained(cfg.vlm_model_name, trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": pil},
            {"type": "text", "text": query},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=False,
    return_dict=True,
    return_tensors="pt",
)

pixel_values = inputs["pixel_values"].to(device)
image_grid_thw = inputs["image_grid_thw"].to(device)
input_ids = inputs["input_ids"].to(device)
attention_mask = inputs["attention_mask"].to(device)

# Run inference
with torch.no_grad():
    outputs = model.forward_inference(
        pixel_values=pixel_values,
        input_ids=input_ids,
        attention_mask=attention_mask,
        image_grid_thw=image_grid_thw,
    )

pred_boxes = outputs.pred_boxes[0].float().cpu()        # (Q, 4) cxcywh normalized
pred_scores = outputs.pred_objectness[0].sigmoid().float().cpu()  # (Q,)

# Top-k predictions
topk = 10
score_threshold = 0.5
order = torch.argsort(pred_scores, descending=True)[:topk]
print("Top boxes (cxcywh normalized):", pred_boxes[order])
print("Top scores:", pred_scores[order])

# Visualize and save
w, h = pil.size
vis = pil.copy()
draw = ImageDraw.Draw(vis)
min_box_size = 0.02  # Filter out degenerate boxes (normalized, 2%)

for idx in order:
    score = pred_scores[idx].item()
    if score < score_threshold:
        continue
    cx, cy, bw, bh = pred_boxes[idx].tolist()
    # Skip degenerate boxes (near-zero width or height)
    if bw < min_box_size or bh < min_box_size:
        continue
    # Convert cxcywh (normalized) to xyxy (pixel)
    x1 = (cx - bw / 2) * w
    y1 = (cy - bh / 2) * h
    x2 = (cx + bw / 2) * w
    y2 = (cy + bh / 2) * h
    draw.rectangle([x1, y1, x2, y2], outline="red", width=2)
    draw.text((x1, max(0, y1 - 12)), f"{score:.2f}", fill="red")

vis.save("output.jpg")
print("Saved visualization to output.jpg")
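
The raw query outputs can contain overlapping duplicates, so a non-maximum-suppression pass over the pixel-space boxes is one way to thin them before visualization. This is a plain-Python sketch for illustration; `torchvision.ops.nms` is the usual production choice:

```python
def iou(a, b):
    """Intersection-over-union of two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps, repeat.

    `boxes` is a list of xyxy boxes, `scores` a parallel list of floats;
    returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```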

Notes

  • Box format: Predictions are in cxcywh (center-x, center-y, width, height), normalized to [0, 1].
  • Base VLM: Check config.json for vlm_model_name (e.g., Qwen/Qwen3-VL-2B-Instruct).
  • Head type: Check config.json for head_type (class_agnostic or ovd).
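
The cxcywh-to-xyxy conversion done inline in the quick-start loop can be factored into a small reusable helper (the name is ours, not from this repo):

```python
def cxcywh_to_xyxy_pixels(box, width, height):
    """Convert a [0, 1]-normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, bw, bh = box
    return ((cx - bw / 2) * width, (cy - bh / 2) * height,
            (cx + bw / 2) * width, (cy + bh / 2) * height)
```

For example, a centered box covering half the image in each dimension on a 100x200 image maps to (25.0, 50.0, 75.0, 150.0).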

