Video-Text-to-Text
Transformers
Safetensors
English
Chinese
internvideo3
text-generation
video-understanding
multimodal
long-video
agent
custom_code
Instructions to use yanziang/InternVideo3-8B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use yanziang/InternVideo3-8B-Instruct with Transformers:
# Load model directly from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("yanziang/InternVideo3-8B-Instruct", trust_remote_code=True, dtype="auto") - Notebooks
- Google Colab
- Kaggle
File size: 5,341 Bytes
b51432a | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 | ---
license: apache-2.0
language:
- en
- zh
pipeline_tag: video-text-to-text
tags:
- video-understanding
- multimodal
- long-video
- agent
library_name: transformers
---
# InternVideo3-8B-Instruct
## Introduction
InternVideo3 is a multimodal large language model designed for long-horizon video understanding and agentic reasoning. It introduces **Multimodal Contextual Reasoning (MCR)**, an efficient formulation that unifies perception, planning, tool use, self-reflection, and memory within a single shared context, enabling recursive multi-step reasoning over long videos.
### Key Features
- **M²LA (Multimodal Multi-head Latent Attention):** A KV-cache-efficient attention architecture that reduces memory footprint via low-rank latent factorization, enabling long-context reasoning (up to 256K tokens) without dropping tokens.
- **Long-Video Understanding:** Trained with a short-to-long curriculum (up to 2048 frames at 4fps), supporting hour-long video comprehension.
- **Agentic Video Reasoning:** Built-in support for recursive perception-action loops with tool use (temporal grounding, ASR, web search, video segmentation) and self-verification.
- **Advanced Post-Training:** Combines rule-based group sequence policy optimization (R-GSPO) and on-policy distillation from Qwen3-235B for improved temporal reasoning.
### Architecture
| Component | Details |
|-----------|---------|
| Vision Encoder | 27-layer ViT, hidden_size=1152, patch_size=16, temporal_patch_size=2 |
| Language Model | 36-layer, hidden_size=4096, 32 attention heads |
| KV Latent Rank | 896 per layer |
| Max Context | 262,144 tokens |
| Precision | BFloat16 |
## Quickstart
### Requirements
```bash
pip install transformers>=4.57.3 torch qwen-vl-utils
```
### Basic Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
model_path = "OpenGVLab/InternVideo3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_path,
dtype=torch.bfloat16,
attn_implementation="sdpa",
device_map="auto",
trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
model_path,
trust_remote_code=True,
)
```
### Text-only Conversation
```python
messages = [
{
"role": "user",
"content": [{"type": "text", "text": "Please introduce yourself."}],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, images=None, videos=None, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
### Video Understanding
```python
video_path = "your_video.mp4"
fps = 1
min_pixels = 128 * 32 * 32
max_pixels = 128 * 32 * 32
messages = [
{
"role": "user",
"content": [
{"type": "video", "video": video_path, "fps": fps},
{"type": "text", "text": "Please describe this video in detail."},
],
}
]
processor.video_processor.size = {
"longest_edge": max_pixels * max_frames,
"shortest_edge": min_pixels * min_frames,
}
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
fps=fps,
return_tensors="pt",
)
inputs = inputs.to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
### Image Understanding
```python
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "your_image.jpg"},
{"type": "text", "text": "Please describe this image in detail."},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, images=images, videos=None, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
## Training Pipeline
1. **Continued Pretraining (CPT):** Recovers language ability and aligns vision features after M²LA conversion, using a mixture of text, image-text pairs, and video captions.
2. **Short-to-Long SFT:** Two-stage curriculum — Stage 1 at 2fps/512 frames (32K tokens), Stage 2 at 4fps/2048 frames (256K tokens).
3. **R-GSPO:** Rule-based reinforcement learning on temporal grounding (IoU reward) and video QA (correctness reward) to improve temporal reasoning.
4. **On-Policy Distillation:** Transfers capabilities from Qwen3-235B on samples where the student underperforms, using reverse-KL on student-sampled trajectories.
## Citation
```bibtex
@article{internvideo3,
title={InternVideo3: Multimodal Contextual Reasoning via Efficient Long-Horizon Agents},
author={InternVideo Team},
year={2025}
}
```
## License
This project is released under the Apache 2.0 License.
|