STATERA Model Zoo
This repository hosts the official pre-trained checkpoints for STATERA (Spatio-Temporal Analysis of Tensor Embeddings for Rigid-body Asymmetry).
STATERA tracks the hidden Center of Mass (CoM) of off-balance objects directly from raw video. Rather than relying on an object's surface geometry or shape, it reasons over the video's temporal dynamics to infer momentum and gravity, using a modified Meta V-JEPA vision model.
Primary Model Checkpoints
We provide two primary versions of the model in the root directory.
Important Note on File Size & PyTorch Hub: Each STATERA checkpoint is roughly 1.25 GB. Because we unfroze the final two transformer blocks of the V-JEPA backbone during training to adapt its latent space to Newtonian physics, the base weights themselves differ from Meta's stock release. To ensure seamless out-of-the-box inference, our checkpoints save the entire integrated state dictionary (the full ViT-Large backbone + custom decoder).
(Note: your local inference script must still connect to PyTorch Hub to fetch Meta's underlying Python class definitions and build the architecture in memory before our modified 1.25 GB weights are loaded into it.)
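The load order described above (instantiate the architecture from Hub-fetched class definitions, then overwrite every parameter with the integrated state dictionary) can be sketched as follows. This is a stdlib-only illustration with hypothetical names and placeholder values; the real script uses `torch.hub.load(...)` to build the model and `model.load_state_dict(checkpoint)` to load the weights.

```python
# Sketch of the load order, using plain dicts in place of real torch modules.
# All names and values here are illustrative stand-ins, not the actual API.

def build_architecture():
    """Stand-in for fetching Meta's class definitions from PyTorch Hub
    and instantiating the ViT-Large backbone + custom decoder."""
    return {
        "backbone.block_22.weight": "stock",       # the final two blocks ship
        "backbone.block_23.weight": "stock",       # with Meta's original weights
        "decoder.head.weight": "random-init",
    }

def load_checkpoint(model, checkpoint):
    """Stand-in for model.load_state_dict(checkpoint): the integrated state
    dict overwrites every parameter, including the unfrozen backbone blocks."""
    model.update(checkpoint)
    return model

# What a STATERA-50K-*.pth checkpoint conceptually contains: the full
# integrated state dict, backbone and decoder together.
checkpoint = {
    "backbone.block_22.weight": "statera-finetuned",
    "backbone.block_23.weight": "statera-finetuned",
    "decoder.head.weight": "statera-trained",
}

model = load_checkpoint(build_architecture(), checkpoint)
print(model["backbone.block_22.weight"])  # statera-finetuned, not stock
```

The key point is simply ordering: the architecture graph must exist in memory before the 1.25 GB state dict can be loaded into it.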
1. STATERA-50K-Crescent.pth (The True Physics Tracker)
- Training Constraint: Trained with phase-aware spatial targets (a crescent shape matching the rotational angle, which then decays to a high-frequency point).
- Behavior: Achieves the highest physical disentanglement (34.44% Physics Capture Ratio) by actively hunting the offset mass. However, its strict precision makes it subject to Visual-Kinematic Aliasing (Bimodal Splits) during bounces. This is due to a simulator bias where the heavy face of the object falls closer to the floor most of the time, causing the model to occasionally guess the bottom of the object instead of the true mass.
2. STATERA-50K-Sigma.pth (The Quantitative SOTA)
- Training Constraint: Trained with phase-agnostic Isotropic Gaussian targets to cure the gravitational settling bias.
- Behavior: Highly robust with exceptionally low Kinematic Jitter. While it achieves the lowest (best) mathematical composite score (KECS), Expected Spatial Dispersion (ESD) analysis reveals this is driven by "Expectation Collapse": the model predicts high-entropy probability mass near the geometric center to game the metric. Its prediction heatmaps are broad and diffuse, so its actual point localization is inaccurate.
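The "Expectation Collapse" failure mode can be illustrated with a toy heatmap: a broad distribution parked at the geometric center has a plausible-looking expected point but an enormous spread, while a confident peak localizes exactly. A minimal pure-Python sketch on a 5x5 grid (the metric formulas below are illustrative, not the paper's exact KECS/ESD definitions):

```python
import math

def expected_point(heatmap):
    """Soft-argmax: probability-weighted mean (x, y) coordinate."""
    total = sum(sum(row) for row in heatmap)
    ex = sum(p * x for row in heatmap for x, p in enumerate(row)) / total
    ey = sum(p * y for y, row in enumerate(heatmap) for p in row) / total
    return ex, ey

def dispersion(heatmap):
    """ESD-style spread: expected distance from the heatmap's own mean."""
    ex, ey = expected_point(heatmap)
    total = sum(sum(row) for row in heatmap)
    return sum(p * math.hypot(x - ex, y - ey)
               for y, row in enumerate(heatmap)
               for x, p in enumerate(row)) / total

# 5x5 grid; suppose the true CoM sits at (4, 2), off the geometric center (2, 2).
peaked = [[0.0] * 5 for _ in range(5)]
peaked[2][4] = 1.0                      # confident, correct prediction
diffuse = [[1.0] * 5 for _ in range(5)] # hedges around the geometric center

print(expected_point(peaked), dispersion(peaked))    # (4.0, 2.0) 0.0
print(expected_point(diffuse), dispersion(diffuse))  # (2.0, 2.0) ~1.87
```

The diffuse map's expectation lands at the geometric center with near-zero jitter across frames, which flatters expectation-based composites, but its dispersion exposes that no confident point prediction was ever made.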
Ablations Folder (/ablations)
In addition to the main models, we provide our baseline comparison models inside the /ablations folder for research reproducibility. These correspond directly to the ablation studies detailed in our paper:
- STATERA-1K-DINOv2.pth (1.25 GB): Evaluates a purely spatial foundation model, equipped with the exact same 1D temporal mixer and 2.5D decoder as STATERA. Proves that temporal convolutions applied post-extraction cannot recover lost intra-frame momentum, causing the model to collapse to the geometric centroid.
- STATERA-1K-ResNet3D.pth (146 MB): Evaluates a standard 3D-CNN temporal baseline trained end-to-end. (Note the smaller file size, as it does not utilize the ViT-Large backbone.) Tests temporal extraction without V-JEPA's latent tubelet priors, resulting in wild overshooting artifacts.
- STATERA-1K-Anchor.pth (1.25 GB): The standard low-data baseline trained on 1,000 sequences to demonstrate spatial overfitting and temporal starvation.
- STATERA-1K-Frozen-Anchor.pth (1.25 GB): An architecture ablation where the final two transformer blocks of the V-JEPA backbone were kept strictly frozen, preventing the latent space from adapting to kinematics.
- STATERA-1K-Standard-Sigma.pth (1.25 GB): Baseline target dynamics testing standard Gaussian smoothing without applying the variance-decay curriculum.
- STATERA-1K-Static-Dot.pth (1.25 GB): Baseline target dynamics testing a static, non-decaying coordinate dot, leading to severe gradient instability during sub-pixel continuous coordinate extraction.
Architecture Details
- Base Model: Meta V-JEPA 2.1 (ViT-Large, initialized via PyTorch Hub: vjepa2_1_vit_large_384).
- Partial Fine-Tuning: The final two transformer blocks of the V-JEPA backbone were unfrozen during training to adapt the visual latent space to Newtonian mechanics.
- Time Processing: A 1D convolution mixes the 8 temporal chunks, allowing the model to "see" velocity rather than isolated frames.
- Output: A 2.5M-parameter custom decoder produces a smooth, continuous 2D heatmap over the hidden mass location, alongside a depth estimator that keeps the model grounded in 3D space.
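As a toy illustration of why convolving along the time axis exposes velocity: a difference-style kernel applied across the temporal chunks turns a sequence of positions into finite differences. A minimal sketch with a single scalar feature per chunk (hypothetical values; the real mixer operates on high-dimensional latent tokens, and its learned kernels are not literal difference filters):

```python
def conv1d_time(sequence, kernel):
    """Valid-mode 1D convolution (cross-correlation) along the time axis."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

# One scalar feature per temporal chunk: the object's x-position across the
# 8 chunks, moving at a constant 3 units per chunk.
positions = [0, 3, 6, 9, 12, 15, 18, 21]

# A central-difference kernel: the output approximates velocity (scaled by 2).
velocity = conv1d_time(positions, [-1, 0, 1])
print(velocity)  # [6, 6, 6, 6, 6, 6] -> the constant velocity is now explicit
```

A purely spatial backbone sees each chunk independently; the temporal mix is what makes quantities like this available to the downstream decoder.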
Quick Start & Inference
These files are designed to be loaded directly into the STATERA PyTorch code. For full installation instructions, testing tools, and a demo you can run locally in your browser, please visit the Official GitHub Repository.