---
language:
- en
license: mit
pipeline_tag: video-text-to-text
---

# Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

This is a model checkpoint of GeoVR, a paradigm that restructures an MLLM's intrinsic representations with geometric awareness, using purely 2D videos, for spatial intelligence.

- **GitHub:** [https://github.com/WHB139426/GeoVR-MLLM](https://github.com/WHB139426/GeoVR-MLLM)
- **Paper:** [https://arxiv.org/abs/2606.05833](https://arxiv.org/abs/2606.05833)

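## Sample Usage

To use this model, first clone the [official repository](https://github.com/WHB139426/GeoVR-MLLM), which provides the custom modeling files. The example imports `utils.utils` and `models.qwen3vl_geo` from that repository, so run it from the repository root (or otherwise make the root importable).
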
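
If the script lives outside the clone, one option is to put the clone on `sys.path` before importing. The sketch below is a minimal helper for that; `add_repo_to_path` is a hypothetical name, not part of the repository, and the `./GeoVR-MLLM` path in the usage comment assumes the default directory name that `git clone` would create.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir):
    """Put a local clone at the front of sys.path so its top-level packages import."""
    repo = Path(repo_dir).expanduser().resolve()
    if not repo.is_dir():
        raise FileNotFoundError(f"No checkout found at {repo}")
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo


# Hypothetical usage, assuming the clone sits next to the script:
# add_repo_to_path("./GeoVR-MLLM")
# from models.qwen3vl_geo import Qwen3VLForConditionalGeneration
```

Relative paths such as `./assets/scene0111_02.mp4` still resolve against the current working directory, so they would need adjusting when running from elsewhere.
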
```python
import torch
from utils.utils import *  # helper module from the cloned repository
from transformers import AutoProcessor
from models.qwen3vl_geo import Qwen3VLForConditionalGeneration  # custom modeling file from the cloned repository

device = "cuda:0"
model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"

# All optional geometry arguments are set to None/False here.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    geometry_encoder_path=None,
    metric_model_path=None,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    add_camera=False,
    add_scale=False,
    add_depth=False,
    distill_geometry_feature=False,
)
# Load the geometric weights from the same checkpoint.
model.load_geometric_weights(model_id)
model.to(device)

num_frames = 32
processor = AutoProcessor.from_pretrained(model_id)
# Set the video pixel budget, which scales with the number of sampled frames.
processor.video_processor.size = {
    "longest_edge": 384 * num_frames * 32 * 32,
    "shortest_edge": 4 * num_frames * 32 * 32,
}

# Multiple-choice question; explicit "\n" escapes keep the string literal valid.
question = (
    "Measuring distance from the nearest points, select the closest object "
    "(trash bin, door, table, refrigerator) to the tv. "
    "If multiple exist, use the nearest instance.\n"
    "Options:\n"
    "A. trash bin\n"
    "B. door\n"
    "C. table\n"
    "D. refrigerator\n"
    "Answer with the option's letter from the given choices directly."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "./assets/scene0111_02.mp4"},
            {"type": "text", "text": question},
        ],
    }
]

generation_kwargs = {
    "do_sample": True,
    "top_p": 0.8,
    "top_k": 20,
    "temperature": 0.7,
    "repetition_penalty": 1.0,
    "max_new_tokens": 32 * 1024,
}

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=num_frames,
    fps=None,
    enable_thinking=False,
).to(model.device)

# torch.autocast replaces the deprecated torch.cuda.amp.autocast.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16), torch.inference_mode():
    generated_ids = model.generate(**inputs, **generation_kwargs)

output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0].strip()
print(output_text)
```
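
The `processor.video_processor.size` assignment in the example reads as a budget in pixels: a per-frame count (384 or 4) times `num_frames` times a 32x32 block. The sketch below reproduces that arithmetic in plain Python. The helper name `video_pixel_budget` and its parameter names are hypothetical, and reading 384 and 4 as per-frame upper and lower bounds on 32x32 blocks is an assumption drawn from the expression itself, not from the repository.

```python
def video_pixel_budget(num_frames, max_blocks_per_frame=384, min_blocks_per_frame=4, block_side=32):
    """Reproduce the size dict from the usage example for a given frame count."""
    block_pixels = block_side * block_side  # 32 * 32 in the example
    return {
        "longest_edge": max_blocks_per_frame * num_frames * block_pixels,
        "shortest_edge": min_blocks_per_frame * num_frames * block_pixels,
    }


budget = video_pixel_budget(32)
print(budget)  # {'longest_edge': 12582912, 'shortest_edge': 131072}
```

With `num_frames = 32` this gives the same values the example assigns. If `num_frames` changes, recomputing the dict this way keeps the per-frame bounds unchanged.
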
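
The question in the example follows a fixed multiple-choice layout: the question, an `Options:` list with lettered choices, and a closing instruction. A small helper can build that layout for other questions and pull the letter out of a short reply. `build_mc_prompt` and `parse_choice` are hypothetical names, not part of the repository, and `parse_choice` assumes it receives only the model's reply text rather than the full prompt.

```python
import re
import string

INSTRUCTION = "Answer with the option's letter from the given choices directly."


def build_mc_prompt(question, options):
    """Format a question and its options in the layout used by the usage example."""
    letters = string.ascii_uppercase[:len(options)]
    lines = [question, "Options:"]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append(INSTRUCTION)
    return "\n".join(lines)


def parse_choice(reply, num_options):
    """Return the first standalone option letter in the reply, or None."""
    letters = string.ascii_uppercase[:num_options]
    match = re.search(rf"\b([{letters}])\b", reply.strip())
    return match.group(1) if match else None


prompt = build_mc_prompt(
    "Measuring distance from the nearest points, select the closest object "
    "(trash bin, door, table, refrigerator) to the tv. "
    "If multiple exist, use the nearest instance.",
    ["trash bin", "door", "table", "refrigerator"],
)
print(prompt)
print(parse_choice("C. table", 4))  # C
```

The parser is deliberately simple: it looks for a single capital letter within the valid range, so free-form replies that open with a standalone "A" would be misread.
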
## Citation

If you find this work useful, please consider citing:

```bibtex
@article{wang2026learning,
  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
  author={Wang, Haibo and Huang, Lifu},
  journal={arXiv preprint arXiv:2606.05833},
  year={2026}
}
```