How to use with vLLM

Install from pip and serve the model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/VideoITG-8B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/VideoITG-8B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
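The same request can be sent from Python. The sketch below mirrors the curl call using only the standard library. It assumes the vLLM server started above is listening on `localhost:8000` and accepts this model; the helper names `build_payload` and `chat` are illustrative and not part of vLLM or the VideoITG codebase.

```python
import json
import urllib.request

# Values copied from the curl example above; adjust them for your deployment.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/VideoITG-8B"


def build_payload(prompt: str, image_url: str, model: str = MODEL) -> dict:
    """Build an OpenAI-style chat request with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload: dict, url: str = BASE_URL, timeout: float = 60.0) -> str:
    """POST the payload to the server and return the first choice's text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Response layout assumed to follow the OpenAI chat completions schema.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat(build_payload("Describe this image in one sentence.", "<image URL>"))` would return the model's reply as a string. `build_payload` can be inspected on its own without any network access.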
Use Docker
docker model run hf.co/nvidia/VideoITG-8B

VideoITG-8B

[🌐Homepage] [💻GitHub] [📜Tech Report] [🤗VideoITG-40K]

Introduction

VideoITG-8B is a multimodal video understanding model trained with instructed temporal grounding. It enhances Video Large Language Models by selecting frames according to the user's instruction, so that frame sampling stays aligned with what is being asked in complex real-world videos. Please see our tech report for more details.

Model Details

  • Model name: VideoITG-8B
  • Architecture: Customized Eagle-8B base model, fine-tuned with Instructed Temporal Grounding
  • Model type: Multimodal Large Language Model with Video Understanding
  • Languages: English (primary), with partial multilingual support

Model Performance

| Model       | Base Model      | Frames | LongVideoBench | MLVU         | VideoMME     | CG-Bench     |
|-------------|-----------------|--------|----------------|--------------|--------------|--------------|
| VideoITG-7B | InternVL2.5-8B  | 32     | 61.9 (+2.9%)   | 75.0 (+7.8%) | 67.3 (+4.0%) | 46.7 (+7.0%) |
| VideoITG-7B | InternVL2.5-26B | 32     | 63.0 (+1.0%)   | 78.9 (+6.1%) | 69.9 (+2.5%) | 48.7 (+6.0%) |
| VideoITG-7B | LLaVA-Video-7B  | 32     | 61.6 (+3.6%)   | 74.6 (+8.6%) | 66.1 (+3.0%) | 42.8 (+9.0%) |
| VideoITG-7B | LLaVA-Video-7B  | 64     | 60.9 (+7.4%)   | 76.3 (+7.6%) | 66.4 (+1.9%) | 42.9 (+8.1%) |

Key Features

  • Instructed Temporal Grounding: Intelligently selects video frames based on user instructions
  • Plug-and-Play: Seamlessly integrates with existing video language models
  • Superior Temporal Understanding: Excels in tasks requiring precise temporal grounding
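
To make the idea of instruction-guided frame selection concrete, here is a toy sketch. It is not the VideoITG method: it assumes that some model has already produced one relevance score per frame for the given instruction (the scores below are made up), and it only shows how such scores could drive selection compared with uniform sampling. The function names `select_frames` and `uniform_frames` are hypothetical.

```python
from typing import List, Sequence


def select_frames(scores: Sequence[float], num_frames: int) -> List[int]:
    """Return indices of the highest-scoring frames, restored to temporal order."""
    if num_frames <= 0:
        return []
    # Rank frame indices by relevance, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Keep the top entries, then sort so the downstream model sees them in order.
    return sorted(ranked[:num_frames])


def uniform_frames(total: int, num_frames: int) -> List[int]:
    """Baseline: evenly spaced frame indices that ignore the instruction."""
    if total <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total)
    step = total / num_frames
    # Take the frame at the middle of each equal-width segment.
    return [int(step * i + step / 2) for i in range(num_frames)]


# Hypothetical per-frame relevance scores for a 6-frame clip.
scores = [0.1, 0.9, 0.8, 0.2, 0.05, 0.7]
print(select_frames(scores, 3))        # frames chosen by relevance
print(uniform_frames(len(scores), 3))  # frames chosen by even spacing
```

The selected indices would then be passed to any downstream video language model in place of uniformly sampled frames, which is the sense in which a frame selector can be plug-and-play.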

License

Citation

If you find this project useful, please cite our work:

@article{wang2025videoitg,
  title     = {VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding},
  author    = {Shihao Wang and Guo Chen and De-An Huang and Zhiqi Li and Minghan Li and Guilin Liu and Jose M. Alvarez and Lei Zhang and Zhiding Yu},
  journal   = {arXiv preprint arXiv:2507.13353},
  year      = {2025}
}

Acknowledgement

  • Eagle: The codebase we built upon
  • LMMs-Eval: Many thanks to the LMMs-Lab for the easy-to-use evaluation tools
  • LLaVA-OneVision and LLaVA-Video: We train our models with data from these great open-source projects

Model size: 8B params (Safetensors, tensor type BF16)