VideoMamba-Small — Driver Behavior (4-class, from scratch)
VideoMamba-Small trained from random init (no pretrain) on the mango driver-behavior dataset (4 classes), for comparison against Video Swin-T / SiftFormer on the same task.
Results (this checkpoint, epoch 97)
| Protocol | Split | Top-1 |
|---|---|---|
| single-clip val | test_val_seed42 (31,993) | 98.69% |
| 12-view test (4 seg × 3 crop) | test/reorganized (20,000) | 99.67% |
Per-class top-1 (single-clip val): 정상 (normal) 96.00% / 졸음 (drowsy) 98.68% / 주의분산 (distracted) 98.84% / 폭행 (assault) 99.93%.
Note: this val split differs from the one produced by the GitHub siftformer repo's split_dataset(seed=42, val_ratio=0.15) (34,418 val). For an exact 1:1 comparison with Video Swin-T, load this model inside that repo and evaluate on its split (see "Fair comparison" below).
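The 12-view number combines predictions from 4 temporal segments × 3 spatial crops per video. How the views were merged is not stated here; a common choice, assumed in the sketch below, is to average the per-view softmax probabilities and take the argmax. `aggregate_views` is a hypothetical helper, written in plain Python so it runs without a GPU.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_logits):
    """Average per-view class probabilities; return (pred, mean_probs).

    view_logits: one list of class logits per view, e.g. 12 views x 4 classes.
    Averaging probabilities is an assumption, not a confirmed detail of the
    evaluation behind the table above.
    """
    num_views = len(view_logits)
    num_classes = len(view_logits[0])
    mean_probs = [0.0] * num_classes
    for logits in view_logits:
        for c, p in enumerate(softmax(logits)):
            mean_probs[c] += p / num_views
    pred = max(range(num_classes), key=mean_probs.__getitem__)
    return pred, mean_probs
```

With model outputs of shape (12, 4) for one video, `aggregate_views(logits.float().cpu().tolist())` would yield the video-level prediction under this assumption.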
Files
- pytorch_model.bin: inference weights (model state_dict, 416 tensors)
- videomamba.py: model definition (OpenGVLab VideoMamba, video_sm)
- checkpoint-best.pth is not included (training ckpt w/ optimizer); ask if needed.
Config (must match to load)
model = videomamba_small # patch16, embed_dim 384, depth 24, rms_norm
num_classes = 4
num_frames = 8
tubelet_size = 1
img_size = 224
sampling = uniform 8 frames (Kinetics_sparse), single clip
norm = ImageNet mean[0.485,0.456,0.406] std[0.229,0.224,0.225]
Label map (matches the GitHub repo): 0 정상 (normal) / 1 졸음 (drowsy) / 2 주의분산 (distracted) / 3 폭행 (assault).
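To make the "uniform 8 frames" entry in the config concrete, here is a minimal sketch of one way to pick evenly spaced frame indices. It takes the middle frame of each of 8 equal segments; that rule is an assumption for illustration, and the exact index rule of the Kinetics_sparse loader used in training may differ. `uniform_frame_indices` is a hypothetical helper name.

```python
def uniform_frame_indices(total_frames, num_frames=8):
    """Pick num_frames indices spread evenly over [0, total_frames - 1].

    The video is split into num_frames equal segments and the middle frame
    of each segment is taken (assumed rule, not the training loader's code).
    Videos shorter than num_frames simply repeat frames.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    seg = total_frames / num_frames
    return [min(int(seg * i + seg / 2), total_frames - 1)
            for i in range(num_frames)]
```

For an 80-frame video this gives one index per 10-frame segment; whichever rule is used, it should be the same at evaluation time as at training time.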
Dependencies — IMPORTANT (bimamba fork, NOT standard mamba-ssm)
videomamba.py calls Mamba(..., bimamba=True). The bimamba argument exists
only in OpenGVLab's mamba fork — standard mamba-ssm (e.g. 2.3.0 from PyPI) does
NOT have it and will raise TypeError: unexpected keyword argument 'bimamba'.
Install the bundled forks from the OpenGVLab repo (CUDA build, Linux/WSL — native Windows build is very hard):
git clone https://github.com/OpenGVLab/VideoMamba
pip install -e VideoMamba/causal-conv1d # bundled fork
pip install -e VideoMamba/mamba # bundled fork (provides bimamba)
pip install timm==0.4.12 einops
If building the fork is impractical (e.g. native Windows), evaluate VideoMamba on a Linux box instead and only share the val split file list across machines — see "Fair comparison" below.
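Before loading the model, it can help to confirm which Mamba build is actually importable. The sketch below inspects the constructor signature instead of waiting for the TypeError; `accepts_kwarg` and `has_bimamba_fork` are hypothetical helper names, and the check assumes both the fork and the PyPI package expose `Mamba` from the `mamba_ssm` package.

```python
import inspect


def accepts_kwarg(callable_obj, name):
    """Return True if callable_obj explicitly declares a parameter `name`."""
    return name in inspect.signature(callable_obj).parameters


def has_bimamba_fork():
    """Best-effort check that the importable Mamba class accepts `bimamba`."""
    try:
        # Assumption: both the fork and the PyPI build install as `mamba_ssm`.
        from mamba_ssm import Mamba
    except ImportError:
        return False
    return accepts_kwarg(Mamba, "bimamba")
```

If `has_bimamba_fork()` returns False, the environment most likely has the standard package (or none) installed, and videomamba.py would be expected to fail as described above.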
Load & single-clip inference
import torch
from videomamba import videomamba_small # this repo's file
model = videomamba_small(num_classes=4, num_frames=8, img_size=224)
sd = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(sd, strict=True)
model.eval().cuda()
# x: (B, C=3, T=8, H=224, W=224), ImageNet-normalized, uniform 8 frames, center crop;
#    it must already be on the same CUDA device as the model
with torch.no_grad(), torch.amp.autocast("cuda", dtype=torch.bfloat16):
    logits = model(x)  # (B, 4)
pred = logits.argmax(-1)
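Once predictions have been collected over a whole split, overall and per-class top-1 can be tallied as in the sketch below. This is an illustrative helper, not the script that produced the reported numbers; `per_class_accuracy` is a hypothetical name, and the label order follows the label map given in the Config section.

```python
from collections import Counter

# Index -> class name: normal, drowsy, distracted, assault.
LABELS = ["정상", "졸음", "주의분산", "폭행"]


def per_class_accuracy(preds, targets, labels=LABELS):
    """Return ({label: accuracy}, overall accuracy), both as fractions.

    preds and targets are equal-length sequences of integer class indices.
    Classes that never occur in targets are left out of the per-class dict.
    """
    total, correct = Counter(), Counter()
    for p, t in zip(preds, targets):
        total[t] += 1
        correct[t] += int(p == t)
    per_class = {labels[c]: correct[c] / total[c] for c in sorted(total)}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return per_class, overall
```

In an evaluation loop, `pred.cpu().tolist()` from each batch can be appended to a running list of predictions before calling the helper once at the end.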
Fair comparison — recommended split-of-labor
VideoMamba needs the bimamba fork (easy on Linux); Swin has no such dependency (easy on Windows). Rather than forcing both onto one OS, share the val file list instead:
- On the machine that has Swin's split, dump the exact seed-42/0.15 val file list:
    import json
    from src.data.dataset import DriverBehaviorDataset, split_dataset, DatasetConfig

    ds = DriverBehaviorDataset(DatasetConfig(...))
    _, val, _ = split_dataset(ds, train_ratio=0.7, val_ratio=0.15, seed=42)
    paths = [ds.samples[i][0] for i in val.indices]
    # UTF-8 keeps any non-ASCII path characters intact regardless of the OS default encoding
    with open("val_seed42_filelist.json", "w", encoding="utf-8") as f:
        json.dump(paths, f, ensure_ascii=False)
- Share val_seed42_filelist.json. Evaluate VideoMamba on a Linux box over exactly those files (same mango videos), using single-clip argmax top-1.
- Swin's number on that same list is its existing benchmark value.
Both models then sit on the identical val instances → true 1:1.
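One practical wrinkle with sharing a file list across machines is that absolute paths differ between Windows and Linux. A minimal sketch of one way to match entries is shown below, assuming the last two path components (class folder and file name) identify a video uniquely; that assumption, the `depth` parameter, and the helper names are all hypothetical and should be checked against the actual dataset layout.

```python
import hashlib
import json
import unicodedata
from pathlib import PurePosixPath


def video_key(path, depth=2):
    """Reduce a path to its last `depth` components, joined with '/'.

    Backslashes are converted first so Windows and POSIX paths compare equal,
    and NFC normalization guards against differently composed Unicode names.
    """
    normalized = unicodedata.normalize("NFC", path.replace("\\", "/"))
    return "/".join(PurePosixPath(normalized).parts[-depth:])


def filelist_fingerprint(paths, depth=2):
    """Order-independent SHA-256 hex digest over the normalized keys."""
    keys = sorted(video_key(p, depth) for p in paths)
    blob = json.dumps(keys, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Computing `filelist_fingerprint` on both machines and comparing the digests is a quick way to confirm that the two evaluations really cover the same val instances.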