
	---
	language:
	- en
	license: mit
	pipeline_tag: video-text-to-text
	---

	# Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

	This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM's intrinsic representations with geometric awareness, using purely 2D videos, to improve spatial intelligence.

	- GitHub: [https://github.com/WHB139426/GeoVR-MLLM](https://github.com/WHB139426/GeoVR-MLLM)
	- Paper: [https://arxiv.org/abs/2606.05833](https://arxiv.org/abs/2606.05833)

	## Sample Usage

	To use this model, you need to clone the [official repository](https://github.com/WHB139426/GeoVR-MLLM) to access the custom modeling files.
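If the inference script is not launched from the root of the cloned repository, imports such as `models.qwen3vl_geo` and `utils.utils` will not resolve. A minimal sketch of one workaround follows; the clone location `GeoVR-MLLM` and the helper name `add_repo_to_path` are assumptions made for illustration, not part of the official repository.

```python
import sys
from pathlib import Path

# Assumed location of the cloned repository; adjust to your own checkout.
REPO_DIR = Path("GeoVR-MLLM")


def add_repo_to_path(repo_dir):
    """Put the cloned repository first on sys.path so its packages are importable."""
    repo = Path(repo_dir).resolve()
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo


repo_root = add_repo_to_path(REPO_DIR)
```

Relative paths such as `./assets/scene0111_02.mp4` are still resolved against the current working directory, so running the sample from the repository root remains the simplest option.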

	```python
	import torch
	from utils.utils import *
	from transformers import AutoProcessor
	from models.qwen3vl_geo import Qwen3VLForConditionalGeneration

	device = 'cuda:0'
	model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"

	model = Qwen3VLForConditionalGeneration.from_pretrained(
	    model_id,
	    geometry_encoder_path=None,
	    metric_model_path=None,
	    dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",
	    add_camera=False,
	    add_scale=False,
	    add_depth=False,
	    distill_geometry_feature=False,
	)
	model.load_geometric_weights(model_id)
	model.to(device)

	num_frames = 32
	processor = AutoProcessor.from_pretrained(model_id)
	processor.video_processor.size = {"longest_edge": 384*num_frames*32*32, "shortest_edge": 4*num_frames*32*32}

	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "video", "video": "./assets/scene0111_02.mp4"},
	            {
	                "type": "text",
	                "text": (
	                    "Measuring distance from the nearest points, select the closest object (trash bin, door, table, refrigerator) to the tv. If multiple exist, use the nearest instance.\n"
	                    "Options:\n"
	                    "A. trash bin\n"
	                    "B. door\n"
	                    "C. table\n"
	                    "D. refrigerator\n"
	                    "Answer with the option's letter from the given choices directly."
	                ),
	            },
	        ],
	    }
	]

	generation_kwargs = {
	    'do_sample': True,
	    'top_p': 0.8,
	    'top_k': 20,
	    'temperature': 0.7,
	    'repetition_penalty': 1.0,
	    'max_new_tokens': 32*1024,
	}

	inputs = processor.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_dict=True,
	    return_tensors="pt",
	    num_frames=num_frames,
	    fps=None,
	    enable_thinking=False,
	).to(model.device)

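	# Run generation under bfloat16 autocast with gradient tracking disabled.
	with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
	    with torch.inference_mode():
	        # Unpack the processor outputs and the sampling settings as keyword arguments.
	        generated_ids = model.generate(**inputs, **generation_kwargs)
	output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].strip()
	print(output_text)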
	```
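Because the prompt asks for a bare option letter, downstream evaluation usually needs to turn the decoded string into a single choice. The sketch below is one hedged way to do that; `extract_option_letter` is a hypothetical helper, not part of the GeoVR repository, and it assumes `output_text` contains only the model's reply. If the decoded string also includes the prompt, strip the prompt first, since the option list itself contains the letters A to D.

```python
import re


def extract_option_letter(output_text, valid_letters="ABCD"):
    """Return the first standalone uppercase option letter in the reply, or None."""
    # \b on both sides keeps letters inside longer words from matching.
    pattern = rf"\b([{valid_letters}])\b"
    match = re.search(pattern, output_text.strip())
    return match.group(1) if match else None


# Example replies in a few common shapes.
for reply in ["C", "B.", "The answer is D", "no idea"]:
    print(repr(reply), "->", extract_option_letter(reply))
```

Matching only uppercase letters is a deliberate simplification: it avoids treating the article "a" as option A, at the cost of missing lowercase replies.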

	## Citation

	If you find this work useful, please consider citing:

	```bibtex
	@article{wang2026learning,
	  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
	  author={Wang, Haibo and Huang, Lifu},
	  journal={arXiv preprint arXiv:2606.05833},
	  year={2026}
	}
	```