---
language:
- en
base_model:
- mistralai/Mistral-7B-Instruct-v0.2
pipeline_tag: video-text-to-text
library_name: transformers
---
# Infinite-VideoChat2
HuggingFace-compatible model files for VideoChat2-Infinity (Mistral-7B), a long video understanding model that extends VideoChat2-HD with infinite-length video processing via continuous long-term attention.
## Model Architecture
- **Vision Encoder** – UMT-L (ViT, 1024-dim, 24 layers)
- **Q-Former** – bridges visual features to the LLM with 32 + 64 query tokens
- **LLM** – Mistral-7B-Instruct-v0.2
- **Long-Term Attention** – Gibbs-sampling-based continuous attention over basis functions for unbounded temporal context
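The key idea behind continuous long-term attention is that past video features are compressed onto a fixed number of basis functions, so memory cost depends on the basis size rather than the video length. A minimal NumPy sketch of this compression step with a Gaussian basis — function names, signatures, and defaults here are illustrative, not the API of `basis_functions.py`:

```python
import numpy as np

def gaussian_basis(t, num_basis=16, sigma=0.05):
    """Evaluate a bank of Gaussian radial basis functions at positions t in [0, 1].

    Returns an array of shape (len(t), num_basis). Illustrative sketch only,
    not the repo's basis_functions.py implementation.
    """
    centers = np.linspace(0.0, 1.0, num_basis)  # evenly spaced centers mu_j
    diff = t[:, None] - centers[None, :]        # (T, num_basis)
    return np.exp(-0.5 * (diff / sigma) ** 2)

def compress_to_continuous(values, num_basis=16):
    """Fit coefficients B so that psi(t) @ B approximates a discrete feature
    sequence `values` of shape (T, d). Memory scales with num_basis, not T."""
    T = values.shape[0]
    t = np.linspace(0.0, 1.0, T)
    psi = gaussian_basis(t, num_basis)          # (T, num_basis)
    # Least-squares fit of the basis expansion to the discrete sequence
    coeff, *_ = np.linalg.lstsq(psi, values, rcond=None)
    return coeff                                # (num_basis, d)

# A length-1000 sequence of 8-dim features compresses to 16 coefficient rows.
seq = np.random.default_rng(0).normal(size=(1000, 8))
coeff = compress_to_continuous(seq, num_basis=16)
print(coeff.shape)  # (16, 8)
```

Because the compressed representation has fixed size, new video segments can be folded in indefinitely, which is what enables the "infinite-length" temporal context.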
## File Overview
| File | Description |
|---|---|
| `config.json` | HuggingFace AutoConfig configuration |
| `configuration_videochat2.py` | Custom `PretrainedConfig` subclass (`Config`) |
| `videochat2_it_hd_mistral.py` | Main model class (`VideoChat2_it_hd_mistral`) |
| `blip2.py` | BLIP-2 base class for vision-language bridging |
| `vit.py` | Vision Transformer (UMT-L) implementation |
| `Qformer.py` | Q-Former module |
| `basis_functions.py` | Basis functions (Power, Sine, Cosine, Gaussian, Rectangular) for long-term attention |
| `long_term_attention_gibbs.py` | Long-term attention with Gibbs sampling |
| `model-*.safetensors` | Model weights (sharded, 4 parts) |
| `.gitignore` | Ignores model weight files |
## Usage
The model is registered with the HuggingFace `auto_map`, so it can be loaded directly:

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("Rihong/VideoChat2_Infinity_Mistral_7B_hf", trust_remote_code=True)
model = AutoModel.from_pretrained("Rihong/VideoChat2_Infinity_Mistral_7B_hf", trust_remote_code=True)
```
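Video-language models of this kind typically encode a fixed number of frames sampled from the input video. The actual preprocessing lives in the remote code, but a hypothetical uniform frame-index sampler (not part of this repo) illustrates the common approach:

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` frame indices spread evenly over a video with
    `total_frames` frames. Hypothetical helper -- not part of this repo."""
    if total_frames <= num_frames:
        return list(range(total_frames))
    # Take the midpoint of each of num_frames equal segments.
    step = total_frames / num_frames
    return [int(step * (i + 0.5)) for i in range(num_frames)]

print(uniform_frame_indices(100, 8))  # [6, 18, 31, 43, 56, 68, 81, 93]
```

The sampled frames would then be resized to the model's crop resolution and passed through the vision encoder.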
To upload the model files to the HuggingFace Hub, run:

```shell
hf upload Rihong/VideoChat2_Infinity_Mistral_7B_hf ./lmms_eval/baselines/infty_videochat2/
```
## Key Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `num_basis` | 256 | Number of basis functions for long-term attention |
| `tau` | 0.75 | Temperature for Gibbs sampling |
| `alpha` | 0.75 | Mixing coefficient |
| `sticky` | `true` | Enable sticky memories |
| `hd_num` | 6 | Number of high-definition crops |
| `local_size` | 224 | Local crop resolution |
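Defaults like these would typically surface as fields in `config.json` and can be overridden before loading the model. A hypothetical fragment, with field names taken from the table above rather than copied from the actual file:

```json
{
  "num_basis": 256,
  "tau": 0.75,
  "alpha": 0.75,
  "sticky": true,
  "hd_num": 6,
  "local_size": 224
}
```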