---
license: apache-2.0
language:
- en
- zh
pipeline_tag: video-text-to-text
tags:
- video-understanding
- multimodal
- long-video
- agent
library_name: transformers
---

# InternVideo3-8B-Instruct

## Introduction

InternVideo3 is a multimodal large language model designed for long-horizon video understanding and agentic reasoning. It introduces **Multimodal Contextual Reasoning (MCR)**, an efficient formulation that unifies perception, planning, tool use, self-reflection, and memory within a single shared context, enabling recursive multi-step reasoning over long videos.

### Key Features

- **M²LA (Multimodal Multi-head Latent Attention):** A KV-cache-efficient attention architecture that reduces memory footprint via low-rank latent factorization, enabling long-context reasoning (up to 256K tokens) without dropping tokens.
- **Long-Video Understanding:** Trained with a short-to-long curriculum (up to 2048 frames at 4fps), supporting hour-long video comprehension.
- **Agentic Video Reasoning:** Built-in support for recursive perception-action loops with tool use (temporal grounding, ASR, web search, video segmentation) and self-verification.
- **Advanced Post-Training:** Combines rule-based group sequence policy optimization (R-GSPO) and on-policy distillation from Qwen3-235B for improved temporal reasoning.

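The frame budget and the sampling rate together determine how much video fits in a single pass. The sketch below only works through that arithmetic with the 2048-frame and 4 fps figures from the list above; the function names are hypothetical, and it says nothing about how the model itself samples frames at inference time.

```python
def covered_seconds(num_frames: int, fps: float) -> float:
    """Seconds of video spanned by `num_frames` frames sampled at `fps`."""
    return num_frames / fps


def max_fps_for(duration_s: float, frame_budget: int) -> float:
    """Highest sampling rate that keeps a clip of `duration_s` within the budget."""
    return frame_budget / duration_s


MAX_FRAMES = 2048  # frame budget quoted in the feature list

print(covered_seconds(MAX_FRAMES, 4))           # 512.0 seconds at 4 fps
print(covered_seconds(MAX_FRAMES, 1))           # 2048.0 seconds at 1 fps
print(round(max_fps_for(3600, MAX_FRAMES), 3))  # 0.569 fps for a one-hour clip
```
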
### Architecture

| Component | Details |
|-----------|---------|
| Vision Encoder | 27-layer ViT, hidden_size=1152, patch_size=16, temporal_patch_size=2 |
| Language Model | 36-layer, hidden_size=4096, 32 attention heads |
| KV Latent Rank | 896 per layer |
| Max Context | 262,144 tokens |
| Precision | BFloat16 |

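The table's figures allow a back-of-the-envelope estimate of the cache size at full context. This is a rough sketch, not a description of the model's actual cache layout: it assumes a single rank-896 latent vector per token per layer is the only cached state, and it compares against a hypothetical baseline that stores full-width keys and values with no grouped-query sharing.

```python
# Figures taken from the Architecture table.
NUM_LAYERS = 36
HIDDEN_SIZE = 4096
KV_LATENT_RANK = 896
MAX_CONTEXT = 262_144
BYTES_PER_VALUE = 2  # BFloat16

GIB = 1024 ** 3


def latent_cache_bytes(tokens: int) -> int:
    """ASSUMPTION: one rank-896 latent per token per layer is all that is cached."""
    return tokens * NUM_LAYERS * KV_LATENT_RANK * BYTES_PER_VALUE


def full_kv_cache_bytes(tokens: int) -> int:
    """Hypothetical baseline: full-width K and V per token per layer."""
    return tokens * NUM_LAYERS * 2 * HIDDEN_SIZE * BYTES_PER_VALUE


latent = latent_cache_bytes(MAX_CONTEXT)
baseline = full_kv_cache_bytes(MAX_CONTEXT)
print(f"latent cache:   {latent / GIB:.2f} GiB")   # 15.75 GiB
print(f"full K/V cache: {baseline / GIB:.2f} GiB") # 144.00 GiB
print(f"ratio:          {baseline / latent:.2f}x") # 9.14x
```
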
## Quickstart

### Requirements

```bash
pip install "transformers>=4.57.3" torch qwen-vl-utils
```

### Basic Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "OpenGVLab/InternVideo3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)
```

### Text-only Conversation

```python
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Please introduce yourself."}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, images=None, videos=None, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

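With `enable_thinking=True`, the decoded string may contain the model's reasoning ahead of the final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning is wrapped in `<think>...</think>` tags that survive decoding, which this card does not confirm; `split_thinking` and its tag arguments are hypothetical names.

```python
def split_thinking(decoded: str, open_tag: str = "<think>", close_tag: str = "</think>"):
    """Return (reasoning, answer) from a decoded generation.

    ASSUMPTION: reasoning, if present, ends with `close_tag`. When the tag is
    absent the whole string is treated as the answer.
    """
    head, sep, tail = decoded.partition(close_tag)
    if not sep:
        return "", decoded.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_thinking("<think>Greet the user.</think>\nHello!")
print(answer)  # Hello!
```
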
### Video Understanding

```python
video_path = "your_video.mp4"
fps = 1

# Per-frame pixel bounds (both equal 128 * 32 * 32 = 131,072 pixels here).
min_pixels = 128 * 32 * 32
max_pixels = 128 * 32 * 32
# Assumed frame-count bounds: illustrative values, adjust to your video and memory budget.
min_frames = 4
max_frames = 512

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_path, "fps": fps},
            {"type": "text", "text": "Please describe this video in detail."},
        ],
    }
]

# Total pixel budget across all sampled frames.
processor.video_processor.size = {
    "longest_edge": max_pixels * max_frames,
    "shortest_edge": min_pixels * min_frames,
}

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    fps=fps,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

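The `size` setting above multiplies a per-frame pixel count by a frame count, so the budget grows linearly with the number of frames. The sketch below works through that arithmetic and adds a rough visual-token estimate. The token estimate rests on assumptions this card does not state: that each visual token covers a 32×32 pixel area (patch_size 16 with 2×2 merging, suggested by the `128 * 32 * 32` form) and that frames are grouped in pairs per temporal_patch_size=2. All function names are hypothetical.

```python
import math

PATCH_SIZE = 16          # from the Architecture table
TEMPORAL_PATCH_SIZE = 2  # from the Architecture table
SPATIAL_MERGE = 2        # ASSUMPTION: 2x2 patch merging, not stated in this card


def pixel_budget(pixels_per_frame: int, num_frames: int) -> int:
    """Total pixels across frames, as used for `longest_edge` / `shortest_edge`."""
    return pixels_per_frame * num_frames


def estimate_visual_tokens(pixels_per_frame: int, num_frames: int) -> int:
    """Rough token count under the assumptions listed above."""
    pixels_per_token = (PATCH_SIZE * SPATIAL_MERGE) ** 2  # 32 * 32 = 1024
    tokens_per_step = pixels_per_frame // pixels_per_token
    temporal_steps = math.ceil(num_frames / TEMPORAL_PATCH_SIZE)
    return tokens_per_step * temporal_steps


per_frame = 128 * 32 * 32
print(pixel_budget(per_frame, 2048))            # 268435456
print(estimate_visual_tokens(per_frame, 2048))  # 131072
```
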
### Image Understanding

```python
from qwen_vl_utils import process_vision_info

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg"},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
# Load and pre-resize the image with qwen-vl-utils. Assumption: patch size 16
# (per the Architecture table) and a qwen-vl-utils release that accepts
# `image_patch_size`.
images, _ = process_vision_info(messages, image_patch_size=16)

inputs = processor(text=text, images=images, videos=None, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

## Training Pipeline

1. **Continued Pretraining (CPT):** Recovers language ability and aligns vision features after M²LA conversion, using a mixture of text, image-text pairs, and video captions.
2. **Short-to-Long SFT:** Two-stage curriculum: Stage 1 at 2fps/512 frames (32K tokens), Stage 2 at 4fps/2048 frames (256K tokens).
3. **R-GSPO:** Rule-based reinforcement learning on temporal grounding (IoU reward) and video QA (correctness reward) to improve temporal reasoning.
4. **On-Policy Distillation:** Transfers capabilities from Qwen3-235B on samples where the student underperforms, using reverse-KL on student-sampled trajectories.

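Two quantities named in the steps above have standard textbook definitions: temporal IoU between a predicted and a reference time span, and reverse KL divergence, KL(student ‖ teacher). The sketch below implements those generic definitions in plain Python for illustration. It is not the project's training code, and the function names, the use of IoU directly as the reward, and the `eps` smoothing are assumptions.

```python
import math


def temporal_iou(pred: tuple, target: tuple) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    (ps, pe), (ts, te) = pred, target
    inter = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0


def reverse_kl(student_probs, teacher_probs, eps: float = 1e-12) -> float:
    """KL(student || teacher) over one next-token distribution.

    The expectation is taken under the student, matching the on-policy setting
    in which the student samples the trajectories.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0
    )


# A predicted span of 0-10 s against a reference of 5-15 s overlaps by 5 s out of 15 s.
print(round(temporal_iou((0, 10), (5, 15)), 3))  # 0.333
# Identical distributions give zero divergence.
print(reverse_kl([0.5, 0.5], [0.5, 0.5]))        # 0.0
```
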
## Citation

```bibtex
@article{internvideo3,
  title={InternVideo3: Multimodal Contextual Reasoning via Efficient Long-Horizon Agents},
  author={InternVideo Team},
  year={2025}
}
```

## License

This project is released under the Apache 2.0 License.