arxiv:2602.09146

SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Published on Feb 9
· Submitted by Noam Rotstein on Feb 16

Abstract

Temporal statistics in semantic feature space provide a scalable approach for motion-centric video understanding, outperforming existing RGB, flow, and text-supervised methods.

AI-generated summary

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
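The core recipe is simple enough to sketch. Below is a minimal, hypothetical NumPy implementation of the idea described above: extract per-frame embeddings with a frozen semantic encoder, then summarize each feature dimension by its temporal moments (mean, standard deviation, and the standardized third moment named in the title). The function name, the choice of moment orders, and the normalization details are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def semantic_moments(frame_features: np.ndarray,
                     orders=(1, 2, 3),
                     eps: float = 1e-8) -> np.ndarray:
    """Training-free motion descriptor from per-frame semantic features.

    frame_features: (T, D) array of embeddings for T frames, e.g. from a
    frozen image encoder applied frame by frame (an assumption here).
    Returns a (len(orders) * D,) vector of per-dimension temporal moments.
    """
    mu = frame_features.mean(axis=0)        # 1st moment: static appearance/context
    centered = frame_features - mu
    sigma = centered.std(axis=0) + eps      # guard against constant dimensions
    parts = []
    for k in orders:
        if k == 1:
            parts.append(mu)
        elif k == 2:
            parts.append(sigma)             # temporal spread ~ amount of change
        else:
            # Standardized k-th moment; k = 3 gives temporal skewness,
            # which captures asymmetric motion over time.
            parts.append(((centered / sigma) ** k).mean(axis=0))
    return np.concatenate(parts)
```

Because the descriptor is computed in closed form over frozen features, no training or fine-tuning is involved; the semantic grounding comes entirely from the pre-trained encoder.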

Community

Paper submitter

Modern video representations like VideoMAE and V-JEPA2 are biased toward appearance rather than motion. We introduce SemanticMoments, a training-free representation for semantic motion similarity.
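The similarity side is equally lightweight. Below is a hypothetical, self-contained retrieval sketch: summarize each video's frame features with their temporal mean, standard deviation, and skewness, then rank a gallery by cosine similarity to a query. The descriptor layout, similarity metric, and random placeholder data are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def moment_descriptor(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Concatenate temporal mean, std, and skewness of (T, D) frame features."""
    mu = x.mean(axis=0)
    c = x - mu
    sd = c.std(axis=0) + eps
    skew = ((c / sd) ** 3).mean(axis=0)     # standardized third moment
    return np.concatenate([mu, sd, skew])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Hypothetical retrieval: rank gallery videos by motion similarity to a query.
# Random arrays stand in for (T, D) per-frame embeddings from a frozen encoder.
rng = np.random.default_rng(0)
query = rng.normal(size=(16, 64))
gallery = [rng.normal(size=(16, 64)) for _ in range(5)]

q = moment_descriptor(query)
scores = [cosine(q, moment_descriptor(g)) for g in gallery]
ranking = np.argsort(scores)[::-1]  # most motion-similar videos first
print(ranking)
```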


