---
language:
- en
license: mit
pipeline_tag: video-text-to-text
---

# Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM's intrinsic representations with geometric awareness, using purely 2D videos, for spatial intelligence.

- **GitHub:** [https://github.com/WHB139426/GeoVR-MLLM](https://github.com/WHB139426/GeoVR-MLLM)
- **Paper:** [https://arxiv.org/abs/2606.05833](https://arxiv.org/abs/2606.05833)

## Sample Usage

To use this model, clone the [official repository](https://github.com/WHB139426/GeoVR-MLLM) first: the custom modeling files it contains are required to load the checkpoint.

```python
# The `utils` and `models` packages come from the cloned repository,
# so they must be importable (for example, by running from the repository root).
import torch
from utils.utils import *
from transformers import AutoProcessor
from models.qwen3vl_geo import Qwen3VLForConditionalGeneration

device = 'cuda:0'
model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"

# Load the base model; the geometry-related options are disabled here.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    geometry_encoder_path=None,
    metric_model_path=None,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    add_camera=False,
    add_scale=False,
    add_depth=False,
    distill_geometry_feature=False,
)
model.load_geometric_weights(model_id)
model.to(device)

# The video size bounds below scale with the number of sampled frames.
num_frames = 32
processor = AutoProcessor.from_pretrained(model_id)
processor.video_processor.size = {
    "longest_edge": 384*num_frames*32*32,
    "shortest_edge": 4*num_frames*32*32,
}

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": './assets/scene0111_02.mp4'},
            {"type": "text", "text": (
                "Measuring distance from the nearest points, select the closest object "
                "(trash bin, door, table, refrigerator) to the tv. "
                "If multiple exist, use the nearest instance. "
                "Options: A. trash bin B. door C. table D. refrigerator "
                "Answer with the option's letter from the given choices directly."
            )},
        ],
    }
]

generation_kwargs = {
    'do_sample': True,
    'top_p': 0.8,
    'top_k': 20,
    'temperature': 0.7,
    'repetition_penalty': 1.0,
    'max_new_tokens': 32*1024,
}

# Build model inputs from the chat messages and the sampled video frames.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=num_frames,
    fps=None,
    enable_thinking=False,
).to(model.device)

# torch.autocast replaces the deprecated torch.cuda.amp.autocast.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        generated_ids = model.generate(**inputs, **generation_kwargs)

output_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0].strip()
print(output_text)
```

## Citation

If you find this work useful, please consider citing:

```bibtex
@article{wang2026learning,
  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
  author={Wang, Haibo and Huang, Lifu},
  journal={arXiv preprint arXiv:2606.05833},
  year={2026}
}
```
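
## Note on the video size bounds

The `processor.video_processor.size` values in Sample Usage are written as products that scale with `num_frames`. The sketch below only reproduces that arithmetic so the resulting numbers are easy to inspect. It assumes that 384 and 4 act as per-frame upper and lower multipliers and that `32*32` is the pixel area of one unit; the helper `video_pixel_budget` and its parameter names are hypothetical and not part of the repository.

```python
def video_pixel_budget(num_frames, max_units_per_frame=384,
                       min_units_per_frame=4, unit_side=32):
    """Recompute the size dict used in Sample Usage (hypothetical helper)."""
    # Assumed meaning: one unit covers a unit_side x unit_side pixel area.
    unit_area = unit_side * unit_side
    return {
        "longest_edge": max_units_per_frame * num_frames * unit_area,
        "shortest_edge": min_units_per_frame * num_frames * unit_area,
    }


budget = video_pixel_budget(num_frames=32)
print(budget)  # {'longest_edge': 12582912, 'shortest_edge': 131072}
```

Because both bounds are proportional to `num_frames`, changing the frame count in Sample Usage changes both values by the same factor.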