SigMamba is a unified architecture that couples a SigLIP2 vision encoder with a trainable Mamba state space model, which performs temporal reasoning with O(N) complexity in sequence length.
Trained under the Multiple Instance Learning (MIL) paradigm with the Robust Temporal Feature Magnitude (RTFM) loss, SigMamba achieves 89.82% frame-level AUC on the UCF-Crime benchmark while processing over 1000 frames per second on a single GPU.
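
To make the MIL training signal concrete, the following is a minimal sketch of the magnitude-separation idea behind an RTFM-style loss: each video is treated as a bag of snippet features, the k largest feature magnitudes are averaged per bag, and a hinge term pushes abnormal bags above normal ones by a margin. This is an illustration under stated assumptions, not SigMamba's training code. The function names, the values of `k` and `margin`, and the toy feature vectors are all hypothetical, and the classifier loss and any regularizers of a full objective are omitted.

```python
import math


def l2_norm(vec):
    """Euclidean magnitude of one snippet feature vector."""
    return math.sqrt(sum(v * v for v in vec))


def topk_mean_magnitude(snippet_features, k):
    """Mean of the k largest feature magnitudes in one video (one MIL bag)."""
    magnitudes = sorted((l2_norm(f) for f in snippet_features), reverse=True)
    k = min(k, len(magnitudes))  # guard against bags shorter than k
    return sum(magnitudes[:k]) / k


def magnitude_margin_loss(abnormal_bag, normal_bag, k, margin):
    """Hinge loss on the gap between abnormal and normal top-k magnitudes.

    Only video-level labels are used: which bag is abnormal, not which snippet.
    """
    separation = (topk_mean_magnitude(abnormal_bag, k)
                  - topk_mean_magnitude(normal_bag, k))
    return max(0.0, margin - separation)


# Toy bags of 2-D snippet features (hypothetical values).
abnormal_bag = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]  # magnitudes 5, 1, 2
normal_bag = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]]    # magnitudes 1, 1, 0.5

loss_tight = magnitude_margin_loss(abnormal_bag, normal_bag, k=2, margin=4.0)
loss_loose = magnitude_margin_loss(abnormal_bag, normal_bag, k=2, margin=2.0)
print(loss_tight, loss_loose)
```

With these toy values the top-2 means are 3.5 for the abnormal bag and 1.0 for the normal bag, so the loss is positive while the margin exceeds the 2.5 gap and drops to zero once the gap is at least the margin.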