---
license: apache-2.0
language:
- en
- zh
pipeline_tag: video-text-to-text
tags:
- video-understanding
- multimodal
- long-video
- agent
library_name: transformers
---

# InternVideo3-8B-Instruct

## Introduction

InternVideo3 is a multimodal large language model designed for long-horizon video understanding and agentic reasoning. It introduces **Multimodal Contextual Reasoning (MCR)**, an efficient formulation that unifies perception, planning, tool use, self-reflection, and memory within a single shared context, enabling recursive multi-step reasoning over long videos.

### Key Features

- **M²LA (Multimodal Multi-head Latent Attention):** A KV-cache-efficient attention architecture that reduces memory footprint via low-rank latent factorization, enabling long-context reasoning (up to 256K tokens) without dropping tokens.
- **Long-Video Understanding:** Trained with a short-to-long curriculum (up to 2048 frames at 4 fps), supporting hour-long video comprehension.
- **Agentic Video Reasoning:** Built-in support for recursive perception-action loops with tool use (temporal grounding, ASR, web search, video segmentation) and self-verification.
- **Advanced Post-Training:** Combines rule-based group sequence policy optimization (R-GSPO) and on-policy distillation from Qwen3-235B for improved temporal reasoning.

### Architecture

| Component | Details |
|-----------|---------|
| Vision Encoder | 27-layer ViT, hidden_size=1152, patch_size=16, temporal_patch_size=2 |
| Language Model | 36-layer, hidden_size=4096, 32 attention heads |
| KV Latent Rank | 896 per layer |
| Max Context | 262,144 tokens |
| Precision | BFloat16 |
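
To make the effect of the latent rank concrete, the sketch below estimates cache sizes at the maximum context from the numbers in the table. It rests on two assumptions that this card does not confirm: the baseline caches a full key and a full value vector (hidden_size each) per token per layer, and M²LA stores a single 896-dimensional latent per token per layer. If the original model uses grouped-query attention, the baseline would be smaller than shown.

```python
# Back-of-the-envelope KV-cache sizing from the architecture table.
# The cache layouts below are assumptions, not the confirmed M2LA design.
NUM_LAYERS = 36
HIDDEN_SIZE = 4096
LATENT_RANK = 896
MAX_CONTEXT = 262_144
BYTES_PER_VALUE = 2  # BFloat16

GIB = 1024 ** 3


def full_kv_cache_bytes(tokens: int) -> int:
    """Assumed baseline: one full key and one full value vector per token, per layer."""
    return 2 * NUM_LAYERS * HIDDEN_SIZE * tokens * BYTES_PER_VALUE


def latent_cache_bytes(tokens: int) -> int:
    """Assumed latent layout: one shared latent vector per token, per layer."""
    return NUM_LAYERS * LATENT_RANK * tokens * BYTES_PER_VALUE


full_gib = full_kv_cache_bytes(MAX_CONTEXT) / GIB
latent_gib = latent_cache_bytes(MAX_CONTEXT) / GIB
ratio = full_gib / latent_gib

print(f"full KV cache: {full_gib:.2f} GiB")
print(f"latent cache:  {latent_gib:.2f} GiB")
print(f"reduction:     {ratio:.2f}x")
```

Under these assumptions, the cache for a single 262,144-token sequence shrinks from 144 GiB to 15.75 GiB, a reduction of about 9.1x.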

## Quickstart

### Requirements

```bash
# Quote the version specifier so the shell does not treat ">=" as a redirect.
pip install "transformers>=4.57.3" torch qwen-vl-utils
```

### Basic Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "OpenGVLab/InternVideo3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
    trust_remote_code=True,
)

processor = AutoProcessor.from_pretrained(
    model_path,
    trust_remote_code=True,
)
```

### Text-only Conversation

```python
messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Please introduce yourself."}],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, images=None, videos=None, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
# Drop the prompt tokens so that only the newly generated text is decoded.
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
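
Because `enable_thinking=True` is passed to the chat template, the decoded text may contain the model's reasoning ahead of the final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning is wrapped in `<think>...</think>` tags, which this card does not document; `split_thinking` is a hypothetical name and is not part of the model's API.

```python
def split_thinking(decoded: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Return (reasoning, answer). The tag format is an assumption, not a documented contract."""
    reasoning, sep, answer = decoded.partition(close_tag)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", decoded.strip()
    # Remove the opening tag (if present) from the reasoning part.
    return reasoning.replace("<think>", "", 1).strip(), answer.strip()


example = "<think>The user wants an intro.</think>\nHello, I am a video model."
reasoning, answer = split_thinking(example)
print(answer)
```

If the model uses a different delimiter, pass it through `close_tag` and adjust the opening-tag cleanup accordingly.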

### Video Understanding

```python
video_path = "your_video.mp4"

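fps = 1
min_pixels = 128 * 32 * 32
max_pixels = 128 * 32 * 32

# Frame bounds used for the pixel budget below. These values are assumptions
# (the original snippet left them undefined): 2048 matches the maximum frame
# count used in training; lower it if you run out of memory.
min_frames = 4
max_frames = 2048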

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_path, "fps": fps},
            {"type": "text", "text": "Please describe this video in detail."},
        ],
    }
]

processor.video_processor.size = {
    "longest_edge": max_pixels * max_frames,
    "shortest_edge": min_pixels * min_frames,
}

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    fps=fps,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
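
For long videos, a fixed `fps` can exceed the frame budget. The following sketch picks a sampling rate that keeps the frame count within a limit. The defaults (2048 frames, 4 fps) are taken from the training limits listed under Key Features; whether inference should follow exactly these limits is an assumption, and `choose_fps` is a hypothetical helper, not part of the processor's API.

```python
def choose_fps(duration_s: float, max_frames: int = 2048, max_fps: float = 4.0) -> float:
    """Highest sampling rate, capped at max_fps, with duration_s * fps <= max_frames."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return min(max_fps, max_frames / duration_s)


for minutes in (1, 10, 60):
    rate = choose_fps(minutes * 60)
    print(f"{minutes:>2} min video -> fps={rate:.3f}")
```

The result can be passed as the `fps` value in the message and in `apply_chat_template`. With these defaults, any video longer than 512 seconds is sampled below 4 fps.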

### Image Understanding

```python
from PIL import Image

image_path = "your_image.jpg"
# Load the image explicitly; the processor call below takes PIL images.
images = [Image.open(image_path).convert("RGB")]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Please describe this image in detail."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=text, images=images, videos=None, do_resize=False, return_tensors="pt")
inputs = inputs.to(model.device)

output = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
generated_ids = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

## Training Pipeline

1. **Continued Pretraining (CPT):** Recovers language ability and aligns vision features after M²LA conversion, using a mixture of text, image-text pairs, and video captions.
2. **Short-to-Long SFT:** Two-stage curriculum: Stage 1 at 2 fps / 512 frames (32K tokens), Stage 2 at 4 fps / 2048 frames (256K tokens).
3. **R-GSPO:** Rule-based reinforcement learning on temporal grounding (IoU reward) and video QA (correctness reward) to improve temporal reasoning.
4. **On-Policy Distillation:** Transfers capabilities from Qwen3-235B on samples where the student underperforms, using reverse-KL on student-sampled trajectories.
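
The two training signals named in steps 3 and 4 can be illustrated with a short sketch: a temporal IoU between a predicted and a ground-truth segment, and a Monte Carlo estimate of the reverse KL divergence from per-token log-probabilities of student-sampled tokens. The function names are hypothetical, and the card does not specify the actual reward shaping or loss weighting, so this only shows the underlying quantities.

```python
def temporal_iou(pred: tuple[float, float], target: tuple[float, float]) -> float:
    """IoU of two (start, end) intervals in seconds, assuming start <= end for both."""
    (ps, pe), (ts, te) = pred, target
    intersection = max(0.0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - intersection
    return intersection / union if union > 0 else 0.0


def reverse_kl_estimate(student_logprobs: list[float], teacher_logprobs: list[float]) -> float:
    """Estimate KL(student || teacher) from tokens sampled by the student.

    Each list holds log p(token) for the same sampled tokens, scored by the
    student and the teacher respectively.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    if not student_logprobs:
        raise ValueError("at least one token is required")
    diffs = [s - t for s, t in zip(student_logprobs, teacher_logprobs)]
    return sum(diffs) / len(diffs)


# Prediction overlaps the ground truth for 5 s out of a 15 s union.
reward = temporal_iou((10.0, 20.0), (15.0, 25.0))
kl = reverse_kl_estimate([-0.1, -0.5], [-0.3, -0.4])
print(f"IoU reward: {reward:.3f}, reverse-KL estimate: {kl:.3f}")
```

Because the expectation is taken over the student's own samples, the estimate penalizes tokens the student favors but the teacher considers unlikely, which is the mode-seeking behavior usually associated with reverse KL.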

## Citation

```bibtex
@article{internvideo3,
  title={InternVideo3: Multimodal Contextual Reasoning via Efficient Long-Horizon Agents},
  author={InternVideo Team},
  year={2025}
}
```

## License

This project is released under the Apache 2.0 License.