metadata
language:
  - en
license: mit
pipeline_tag: video-text-to-text

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM's intrinsic representations with geometric awareness, using purely 2D videos, to improve spatial intelligence.

Sample Usage

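To use this model, first clone the official repository: the sample imports custom modeling code (`models.qwen3vl_geo`, `utils.utils`) that is not part of `transformers`. Run the sample from the repository root so that these imports and the relative video path resolve.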
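
Because the sample depends on modules that live in the cloned repository, an `ImportError` usually means the script is not being run from the repository root. The following is a minimal sketch of a pre-flight check using only the standard library. `missing_modules` is a hypothetical helper name, not something the repository provides, and the module list simply mirrors the imports used in the sample.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be located on sys.path."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec imports parent packages first and raises if one
            # of them is missing or malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


# Module names taken from the imports in the sample; run from the repository root.
required = ["models.qwen3vl_geo", "utils.utils"]
print("missing:", missing_modules(required))
```

An empty list suggests that both imports in the sample should resolve. Note that `find_spec` imports the parent package (`models`, `utils`) as a side effect.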

import torch
from utils.utils import *
from transformers import AutoProcessor
from models.qwen3vl_geo import Qwen3VLForConditionalGeneration

device = 'cuda:0'
model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    geometry_encoder_path=None,
    metric_model_path=None,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    add_camera=False,
    add_scale=False,
    add_depth=False,
    distill_geometry_feature=False,
)
model.load_geometric_weights(model_id)
model.to(device)

num_frames = 32
processor = AutoProcessor.from_pretrained(model_id)
processor.video_processor.size = {"longest_edge": 384*num_frames*32*32, "shortest_edge": 4*num_frames*32*32}

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": './assets/scene0111_02.mp4',},
            {"type": "text", "text": (
                "Measuring distance from the nearest points, select the closest object (trash bin, door, table, refrigerator) to the tv. If multiple exist, use the nearest instance.\n"
                "Options:\n"
                "A. trash bin\n"
                "B. door\n"
                "C. table\n"
                "D. refrigerator\n"
                "Answer with the option's letter from the given choices directly."
            )},
        ],
    }
]

generation_kwargs = {
    'do_sample': True,
    'top_p': 0.8,
    'top_k': 20,
    'temperature': 0.7,
    'repetition_penalty': 1.0,
    'max_new_tokens': 32*1024,
}

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=num_frames,
    fps=None,
    enable_thinking=False,
).to(model.device)

# Generate under bf16 autocast with gradient tracking disabled.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        generated_ids = model.generate(**inputs, **generation_kwargs)
        output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].strip()
print(output_text)
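
Depending on what the custom `generate` returns, the decoded text may contain the prompt as well as the answer. Stock decoder-only models in `transformers` return the prompt followed by the continuation, and the usual remedy is to slice off the prompt length before decoding. The sketch below shows that slicing on plain Python lists. `trim_prompt_tokens` is a hypothetical helper, not part of the GeoVR repository, and whether GeoVR's `generate` needs it is an assumption to check against your own output.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence.

    Hypothetical helper: assumes generate() returned prompt + continuation
    for every row, which is the default for decoder-only models in
    transformers but is not confirmed for GeoVR's custom class.
    """
    return [out[len(prompt):] for prompt, out in zip(input_ids, generated_ids)]


# Toy ids standing in for inputs["input_ids"] and generated_ids.
prompt_ids = [[101, 7, 8, 9]]
full_ids = [[101, 7, 8, 9, 42, 43]]
answer_ids = trim_prompt_tokens(prompt_ids, full_ids)
print(answer_ids)  # [[42, 43]]
```

In the sample this would be applied as `trim_prompt_tokens(inputs["input_ids"], generated_ids)` before calling `processor.batch_decode`. The same slicing works row by row on tensors.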

Citation

If you find this work useful, please consider citing:

@article{wang2026learning,
  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
  author={Wang, Haibo and Huang, Lifu},
  journal={arXiv preprint arXiv:2606.05833},
  year={2026}
}