VideoMamba-Small — Driver Behavior (4-class, from scratch)

VideoMamba-Small trained from random initialization (no pretraining) on the mango driver-behavior dataset (4 classes), for comparison against Video Swin-T / SiftFormer on the same task.

Results (this checkpoint, epoch 97)

Protocol	Split	Top-1
single-clip val	`test_val_seed42` (31,993)	98.69%
12-view test (4 seg × 3 crop)	`test/reorganized` (20,000)	99.67%

Per-class top-1 (single-clip val): 정상 (normal) 96.00% / 졸음 (drowsiness) 98.68% / 주의분산 (distraction) 98.84% / 폭행 (assault) 99.93%.
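
The 12-view protocol in the table evaluates each video as 4 temporal segments × 3 spatial crops. This card does not say how the 12 outputs are combined. A common choice, assumed in the sketch below, is to average the per-view softmax probabilities and take the argmax. The helper names `enumerate_views` and `fuse_views` are illustrative and not part of this repo.

```python
import math
from itertools import product


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def enumerate_views(num_segments=4, num_crops=3):
    """All (segment, crop) pairs; 4 x 3 gives the 12 views."""
    return list(product(range(num_segments), range(num_crops)))


def fuse_views(view_logits):
    """Average per-view softmax scores (assumed fusion rule) and return (pred, mean_probs)."""
    probs = [softmax(logits) for logits in view_logits]
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return mean.index(max(mean)), mean
```

In practice `view_logits` would hold one 4-element logit vector per view, produced by running the model on each (segment, crop) clip of the same video.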

Note: this val split (31,993 samples) is not the one produced by the GitHub siftformer repo's split_dataset(seed=42, val_ratio=0.15), which yields 34,418 val samples. For an exact 1:1 comparison against Video Swin-T, evaluate this model on that repo's split (see "Fair comparison" below).

Files

pytorch_model.bin — inference weights (model state_dict, 416 tensors)
videomamba.py — model definition (OpenGVLab VideoMamba, video_sm)
checkpoint-best.pth (training checkpoint with optimizer state) is not included; ask if you need it.
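
Before loading, it can help to confirm that the downloaded file really is the bare 416-tensor inference state_dict and not a full training checkpoint. A minimal sketch, assuming only that the object returned by `torch.load` is a plain mapping from parameter names to tensors. `check_state_dict` is a hypothetical helper, and the wrapper keys it looks for (`model`, `optimizer`) are a guess at what a training checkpoint might contain.

```python
def check_state_dict(sd, expected_tensors=416):
    """Return sd unchanged if it looks like the bare inference state_dict, else raise."""
    if not hasattr(sd, "keys"):
        raise TypeError(f"expected a mapping, got {type(sd).__name__}")
    # Assumption: a training checkpoint nests its contents under keys like these.
    wrapper_keys = {"model", "optimizer"} & set(sd.keys())
    if wrapper_keys:
        raise ValueError(f"looks like a training checkpoint (found {sorted(wrapper_keys)})")
    if len(sd) != expected_tensors:
        raise ValueError(f"expected {expected_tensors} tensors, found {len(sd)}")
    return sd
```

It would typically wrap the load call, for example `sd = check_state_dict(torch.load("pytorch_model.bin", map_location="cpu"))`.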

Config (must match to load)

model        = videomamba_small   # patch16, embed_dim 384, depth 24, rms_norm
num_classes  = 4
num_frames   = 8
tubelet_size = 1
img_size     = 224
sampling     = uniform 8 frames (Kinetics_sparse), single clip
norm         = ImageNet mean[0.485,0.456,0.406] std[0.229,0.224,0.225]

Label map: 0 정상 (normal) / 1 졸음 (drowsiness) / 2 주의분산 (distraction) / 3 폭행 (assault), matching the GitHub repo.
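
The config and label map above can be written down as plain Python for preprocessing. This is a sketch under assumptions, not the training code: it picks the center frame of each of 8 equal segments, whereas the actual Kinetics_sparse sampler may choose offsets differently. It also assumes 8-bit pixels are scaled to [0, 1] before the mean/std normalization. The function names are illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
ID2LABEL = {0: "정상", 1: "졸음", 2: "주의분산", 3: "폭행"}  # normal, drowsiness, distraction, assault


def uniform_frame_indices(total_frames, num_frames=8):
    """Center frame index of each of num_frames equal segments (assumed sampling rule)."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    seg = total_frames / num_frames
    # Short videos (total_frames < num_frames) simply repeat frames.
    return [min(total_frames - 1, int(seg * i + seg / 2)) for i in range(num_frames)]


def normalize_pixel(rgb):
    """Normalize one 8-bit RGB pixel: scale to [0, 1], then (v - mean) / std per channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))
```

For an 80-frame video, `uniform_frame_indices(80)` selects frames 5, 15, ..., 75. A real pipeline would apply the same normalization to whole frames with tensor operations and not per pixel.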

Dependencies — IMPORTANT (bimamba fork, NOT standard mamba-ssm)

videomamba.py calls Mamba(..., bimamba=True). The bimamba argument exists only in OpenGVLab's mamba fork — standard mamba-ssm (e.g. 2.3.0 from PyPI) does NOT have it and will raise TypeError: unexpected keyword argument 'bimamba'.
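
To find out which variant is installed before building the model, one option is to inspect the constructor signature without instantiating it. A minimal sketch, assuming the class is importable as `mamba_ssm.modules.mamba_simple.Mamba` in both the standard package and the fork. `accepts_kwarg` and `mamba_supports_bimamba` are hypothetical helper names.

```python
import inspect


def accepts_kwarg(cls, name):
    """True if cls.__init__ declares `name` as an explicit parameter."""
    return name in inspect.signature(cls.__init__).parameters


def mamba_supports_bimamba():
    """True/False for the installed Mamba class, or None if mamba_ssm cannot be imported."""
    try:
        from mamba_ssm.modules.mamba_simple import Mamba
    except ImportError:
        return None
    return accepts_kwarg(Mamba, "bimamba")
```

A result of `False` suggests the standard PyPI package is shadowing the fork, and `None` means no usable mamba_ssm build was found.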

Install the bundled forks from the OpenGVLab repo (CUDA build, Linux/WSL — native Windows build is very hard):

git clone https://github.com/OpenGVLab/VideoMamba
pip install -e VideoMamba/causal-conv1d   # bundled fork
pip install -e VideoMamba/mamba           # bundled fork (provides bimamba)
pip install timm==0.4.12 einops

If building the fork is impractical (e.g. native Windows), evaluate VideoMamba on a Linux box instead and only share the val split file list across machines — see "Fair comparison" below.

Load & single-clip inference

import torch
from videomamba import videomamba_small  # this repo's file

model = videomamba_small(num_classes=4, num_frames=8, img_size=224)
sd = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(sd, strict=True)
model.eval().cuda()

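# x: (B, C=3, T=8, H=224, W=224), ImageNet-normalized, uniform 8 frames, center crop.
# Random placeholder so the snippet runs; replace it with a real preprocessed clip.
x = torch.randn(1, 3, 8, 224, 224, device="cuda")
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    logits = model(x)          # (B, 4)
pred = logits.argmax(-1)       # class index per clip (see label map)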

Fair comparison — recommended split-of-labor

VideoMamba needs the bimamba fork (easy on Linux), Swin needs none (easy on Windows). So don't force both into one OS — share the val file list instead:

On the machine that has Swin's split, dump the exact seed-42/0.15 val file list:

import json
from src.data.dataset import DriverBehaviorDataset, split_dataset, DatasetConfig
ds = DriverBehaviorDataset(DatasetConfig(...))
_, val, _ = split_dataset(ds, train_ratio=0.7, val_ratio=0.15, seed=42)
paths = [ds.samples[i][0] for i in val.indices]
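# Explicit UTF-8 so non-ASCII paths are written the same way on Windows and Linux.
with open("val_seed42_filelist.json", "w", encoding="utf-8") as f:
    json.dump(paths, f, ensure_ascii=False)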

Share val_seed42_filelist.json. Evaluate VideoMamba on a Linux box over exactly those files (same mango videos), single-clip argmax top-1.
Swin's number on that same list is its existing benchmark value.

Both models then sit on the identical val instances → true 1:1.
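
A file list dumped on Windows will usually contain drive prefixes and backslashes that do not exist on the Linux box, so the paths have to be matched by a machine-independent key. The sketch below assumes each video is uniquely identified by its last two path components (class directory and file name). That layout is an assumption about the dataset, so adjust `keep_parts` if it differs. All function names are illustrative.

```python
import json
import unicodedata
from pathlib import PureWindowsPath


def relative_key(path, keep_parts=2):
    """Machine-independent key: last `keep_parts` path components joined with '/'."""
    # PureWindowsPath splits on both '\\' and '/'; NFC guards against differing Unicode forms.
    parts = PureWindowsPath(unicodedata.normalize("NFC", path)).parts
    return "/".join(parts[-keep_parts:])


def load_filelist(json_path):
    """Read the shared val list and convert every path to its key."""
    with open(json_path, encoding="utf-8") as f:
        return [relative_key(p) for p in json.load(f)]


def top1_on_filelist(keys, predictions, labels):
    """Top-1 accuracy over exactly the listed files; predictions/labels map key -> class id."""
    if not keys:
        raise ValueError("empty file list")
    missing = [k for k in keys if k not in predictions or k not in labels]
    if missing:
        raise KeyError(f"{len(missing)} listed files lack a prediction or label, e.g. {missing[0]}")
    return sum(predictions[k] == labels[k] for k in keys) / len(keys)
```

Raising on missing files, instead of silently skipping them, keeps the denominator identical for both models.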
