Infinite-VideoChat2

HuggingFace-compatible model files for VideoChat2-Infinity (Mistral-7B), a long video understanding model that extends VideoChat2-HD with infinite-length video processing via continuous long-term attention.

Model Architecture

  • Vision Encoder – UMT-L (ViT, 1024-dim, 24 layers)
  • Q-Former – bridges visual features to the LLM with 32 + 64 query tokens
  • LLM – Mistral-7B-Instruct-v0.2
  • Long-Term Attention – Gibbs-sampling-based continuous attention over basis functions for unbounded temporal context
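
The long-term attention entry is terse, so here is a rough, self-contained illustration of the general idea behind a Gibbs (Boltzmann) density over continuous time built from basis functions. It is not the code in long_term_attention_gibbs.py. The function names, the evenly spaced Gaussian basis, the grid discretization, and the way tau enters as a temperature are all assumptions made for illustration. In schemes of this kind, time is mapped to the interval [0, 1], so memory is bounded by the number of basis functions rather than by the video length.

```python
import math

def gaussian_basis(t, centers, width):
    """Evaluate unnormalized Gaussian basis functions psi_j(t) at time t in [0, 1]."""
    return [math.exp(-0.5 * ((t - mu) / width) ** 2) for mu in centers]

def gibbs_density(scores, tau):
    """Normalize exp(score / tau) over a time grid (a discretized Gibbs density)."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / tau) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def attend(coeffs, centers, width, grid, tau):
    """Score each grid time by sum_j coeffs[j] * psi_j(t), then normalize."""
    scores = [
        sum(c * p for c, p in zip(coeffs, gaussian_basis(t, centers, width)))
        for t in grid
    ]
    return gibbs_density(scores, tau)

# Toy setup: 8 basis functions (the model card lists 256 as the default).
num_basis = 8
centers = [j / (num_basis - 1) for j in range(num_basis)]
grid = [i / 99 for i in range(100)]

# Hypothetical coefficients: pretend the query matches content near t = 5/7.
coeffs = [0.0] * num_basis
coeffs[5] = 1.0

density = attend(coeffs, centers, width=0.1, grid=grid, tau=0.75)
peak_time = grid[max(range(len(grid)), key=density.__getitem__)]
```

In this toy setup, lowering tau concentrates the density around the best-matching time, while raising it spreads attention more evenly across the video.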

File Overview

| File | Description |
| --- | --- |
| config.json | HuggingFace AutoConfig configuration |
| configuration_videochat2.py | Custom PretrainedConfig subclass (Config) |
| videochat2_it_hd_mistral.py | Main model class (VideoChat2_it_hd_mistral) |
| blip2.py | BLIP-2 base class for vision-language bridging |
| vit.py | Vision Transformer (UMT-L) implementation |
| Qformer.py | Q-Former module |
| basis_functions.py | Basis functions (Power, Sine, Cosine, Gaussian, Rectangular) for long-term attention |
| long_term_attention_gibbs.py | Long-term attention with Gibbs sampling |
| model-*.safetensors | Model weights (sharded, 4 parts) |
| .gitignore | Ignores model weight files |
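
Since the weights are split across four shards, it can help to see which shard holds which tensors. Sharded safetensors checkpoints saved by transformers normally ship with a model.safetensors.index.json file whose weight_map maps tensor names to shard files. That file is not listed above, so its presence in this repository is an assumption, and the tensor names in the example are hypothetical.

```python
import json
from collections import Counter

def tensors_per_shard(index):
    """Count how many tensors each shard file holds, given a parsed index dict."""
    return Counter(index["weight_map"].values())

# Hypothetical miniature index; real tensor names and sizes will differ.
example_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "example.layer0.weight": "model-00001-of-00004.safetensors",
        "example.layer1.weight": "model-00001-of-00004.safetensors",
        "example.layer2.weight": "model-00002-of-00004.safetensors",
    },
}

counts = tensors_per_shard(example_index)
for shard, n in sorted(counts.items()):
    print(f"{shard}: {n} tensors")
```

Against a local checkout, the same function could be fed the result of json.load on the index file, if that file is present.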

Usage

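The custom classes are registered through the HuggingFace auto_map mechanism, so the model can be loaded with the Auto classes as long as trust_remote_code=True is set:

from transformers import AutoConfig, AutoModel

repo_id = "Rihong/VideoChat2_Infinity_Mistral_7B_hf"

# trust_remote_code=True is needed because the model classes live in the repo, not in transformers.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, config=config, trust_remote_code=True)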

To upload the model files to the HuggingFace Hub, run:

hf upload Rihong/VideoChat2_Infinity_Mistral_7B_hf ./lmms_eval/baselines/infty_videochat2/

Key Hyperparameters

| Parameter | Default | Description |
| --- | --- | --- |
| num_basis | 256 | Number of basis functions for long-term attention |
| tau | 0.75 | Temperature for Gibbs sampling |
| alpha | 0.75 | Mixing coefficient |
| sticky | true | Enable sticky memories |
| hd_num | 6 | Number of high-definition crops |
| local_size | 224 | Local crop resolution |
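
To check which values a given checkout actually uses, one option is to search config.json for these keys. Where they sit inside the file is not stated here, so this minimal sketch walks nested dictionaries and lists instead of assuming a fixed path. The helper name find_keys and the nested layout of example_config are hypothetical.

```python
import json

def find_keys(node, wanted, found=None):
    """Recursively collect the first value seen for each wanted key."""
    if found is None:
        found = {}
    if isinstance(node, dict):
        for key, value in node.items():
            if key in wanted and key not in found:
                found[key] = value
            find_keys(value, wanted, found)
    elif isinstance(node, list):
        for item in node:
            find_keys(item, wanted, found)
    return found

WANTED = {"num_basis", "tau", "alpha", "sticky", "hd_num", "local_size"}

# Hypothetical layout for illustration; the real config.json may be organized differently.
example_config = json.loads(
    '{"model_config": {"num_basis": 256, "tau": 0.75, '
    '"vision": {"local_size": 224}}, "hd_num": 6}'
)

found = find_keys(example_config, WANTED)
missing = sorted(WANTED - found.keys())
```

For a real checkout, replace example_config with the result of json.load on the local config.json. Any key that ends up in missing is either absent from the file or defaulted in code.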
