# InternVL-OVD (inference-only)

This repository contains inference-only artifacts exported from a training checkpoint.
## Grounding Accuracy

| Dataset | Accuracy (%) |
|---|---|
| RefCOCO | 90.8 |
| RefCOCO+ | 87.1 |
| RefCOCOg | 81.8 |
## Inference Speed Comparison
| Model | Decoding | Latency (1 obj) | Latency (4 obj) |
|---|---|---|---|
| VLM only (MOSP) | Autoregressive | 1601.81 ms | 2487.41 ms |
| VLM only (SOSP-B) | Autoregressive | 857.70 ms | 1386.36 ms |
| VLM+DeTrHead | Single step | 80.67 ms | 245.52 ms |
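The autoregressive vs. single-step gap above can be measured with a plain wall-clock harness. The sketch below is a minimal, model-free illustration of that methodology: `benchmark` times any callable with warm-up runs excluded, and the two toy workloads (`autoregressive_decode`, `single_step_decode`) are hypothetical stand-ins for the real decoders, not part of this repository.

```python
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Return the mean wall-clock latency of fn(*args) in milliseconds."""
    for _ in range(warmup):           # warm-up iterations, excluded from timing
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1000.0

# Hypothetical stand-ins: an autoregressive decoder emits one token per loop
# step, while a single-step head produces its whole output in one call.
def autoregressive_decode(num_steps):
    out = []
    for i in range(num_steps):
        out.append(i)                 # one "token" per iteration
    return out

def single_step_decode(num_steps):
    return list(range(num_steps))     # everything in a single call

ar_ms = benchmark(autoregressive_decode, 100_000)
ss_ms = benchmark(single_step_decode, 100_000)
print(f"autoregressive: {ar_ms:.2f} ms, single-step: {ss_ms:.2f} ms")
```

For GPU models, the same loop additionally needs `torch.cuda.synchronize()` before reading the clock, since CUDA kernels launch asynchronously.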
## Peak VRAM Usage
| Model | 1 Object | 4 Objects |
|---|---|---|
| VLM only (MOSP) | 3.99 GB | 5.85 GB |
| VLM only (SOSP-B) | 2.34 GB | 3.31 GB |
| VLM+DeTrHead | 2.63 GB | 4.50 GB |
- Number of image tokens: 128 tokens per patch × 7 patches
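As a quick sanity check on the visual sequence length implied by the bullet above:

```python
tokens_per_patch = 128
num_patches = 7

# Total visual tokens fed to the VLM per image
visual_tokens = tokens_per_patch * num_patches
print(visual_tokens)  # → 896
```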
## Single Inference

```python
import torch
import requests
from io import BytesIO
from PIL import Image, ImageDraw
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "xpuenabler/OVD_SOSP-B_Internvl_model2"
image_source = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
query = "dog"

# Load image (URL or local path)
if image_source.startswith(("http://", "https://")):
    response = requests.get(image_source)
    pil = Image.open(BytesIO(response.content)).convert("RGB")
else:
    pil = Image.open(image_source).convert("RGB")

# Load model
cfg = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(cfg.vlm_model_name, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Run inference
outputs = model.infer_image(image=pil, query=query, tokenizer=tokenizer)
pred_boxes = outputs.pred_boxes[0].float().cpu()

# Visualize: boxes are normalized (x1, y1, x2, y2); scale to pixel coordinates
w, h = pil.size
draw = ImageDraw.Draw(pil)
x1, y1, x2, y2 = pred_boxes[0].tolist()
x1, y1, x2, y2 = x1 * w, y1 * h, x2 * w, y2 * h
draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
draw.text((x1, max(y1 - 20, 0)), query, fill="red")
pil.save("output.jpg")
print("Saved: output.jpg")
```
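The scaling step above implies that `pred_boxes` are normalized `(x1, y1, x2, y2)` coordinates in `[0, 1]`. Factoring that conversion into a small helper keeps the visualization code readable; `denormalize_box` is an illustrative helper, not part of the repository's API.

```python
def denormalize_box(box, width, height):
    """Scale a normalized (x1, y1, x2, y2) box to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (x1 * width, y1 * height, x2 * width, y2 * height)

# Example: a normalized box on a 640x480 image
print(denormalize_box((0.1, 0.2, 0.5, 0.8), 640, 480))  # → (64.0, 96.0, 320.0, 384.0)
```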
## Batch Inference

```python
import torch
import requests
from io import BytesIO
from PIL import Image, ImageDraw
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "xpuenabler/OVD_SOSP-B_Internvl_model2"
image_source = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
queries = ["person", "dog"]

# Load image (URL or local path)
if image_source.startswith(("http://", "https://")):
    response = requests.get(image_source)
    pil = Image.open(BytesIO(response.content)).convert("RGB")
else:
    pil = Image.open(image_source).convert("RGB")

# Load model
cfg = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(cfg.vlm_model_name, trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Run inference for all queries in one batch
outputs = model.infer_batch(image=pil, queries=queries, tokenizer=tokenizer)
pred_boxes = outputs.pred_boxes.float().cpu()

# Visualize: draw the top box for each query
w, h = pil.size
draw = ImageDraw.Draw(pil)
for boxes, query in zip(pred_boxes, queries):
    x1, y1, x2, y2 = boxes[0].tolist()
    x1, y1, x2, y2 = x1 * w, y1 * h, x2 * w, y2 * h
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(y1 - 20, 0)), query, fill="red")
pil.save("output.jpg")
print("Saved: output.jpg")
```