---
license: apache-2.0
language:
- en
pipeline_tag: video-classification
library_name: pytorch
base_model: facebook/vjepa2
datasets:
- Animesh-null/HiddenMass-50K
tags:
- vision
- video
- kinematics
- v-jepa
- physics
- sim-to-real
---
# STATERA Model Zoo
This repository hosts the official pre-trained checkpoints for **STATERA** (**S**patio-**T**emporal **A**nalysis of **T**ensor **E**mbeddings for **R**igid-body **A**symmetry).
STATERA tracks the hidden Center of Mass (CoM) of off-balance objects directly from raw video. Instead of relying on the object's surface appearance or shape, it observes the object's motion over time to infer momentum and gravity, using a modified Meta V-JEPA vision model.
## Primary Model Checkpoints
We provide two primary versions of the model in the root directory.
> **Important Note on File Size & PyTorch Hub:** Each STATERA checkpoint is roughly **1.25 GB**. Because we unfroze the final two transformer blocks of the V-JEPA backbone during training to adapt its latent space to Newtonian physics, the base weights themselves were modified. To ensure seamless out-of-the-box inference, our checkpoints therefore save the entire integrated state dictionary (the full ViT-Large backbone plus the custom decoder).
>
> *(Note: Your local inference script must still connect to PyTorch Hub to fetch Meta's underlying Python class definitions to build the architecture graph in memory before our modified 1.25 GB weights are loaded into it).*
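The two-step load described above can be sketched as follows. This is a minimal illustration, not the repository's actual loader: the Hub repo path and the internal layout of the checkpoint dictionary are assumptions, while the entry-point name `vjepa2_1_vit_large_384` follows the Architecture Details section below.

```python
import torch


def load_statera(checkpoint_path: str, device: str = "cpu"):
    """Sketch: build the architecture graph from Meta's PyTorch Hub
    class definitions, then load the integrated STATERA state dict
    (modified backbone + custom decoder) over it.

    NOTE: the Hub repo path and the checkpoint's key layout below are
    placeholders/assumptions, not confirmed by the model card.
    """
    # Fetch the Python class definitions needed to build the graph.
    backbone = torch.hub.load(
        "facebookresearch/vjepa2",       # assumed repo path
        "vjepa2_1_vit_large_384",        # entry point named in this card
    )
    # Load the full integrated state dictionary saved by STATERA.
    state = torch.load(checkpoint_path, map_location=device)
    backbone.load_state_dict(state, strict=False)
    return backbone.to(device).eval()
```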
### 1. `STATERA-50K-Crescent.pth`
* **Training Constraint:** Trained with phase-aware spatial targets (using the `--target_type crescent` flag matching the rotational angle, which then decays to a high-frequency point).
* **Behavior:** Achieves the highest physical disentanglement (**40.96% Physics Capture Ratio**) by actively hunting the offset mass, and secures the highest overall **Unified HiddenMass Score (HMS)**. However, its strict precision makes it susceptible to Visual-Kinematic Aliasing (bimodal splits) during complex bounces. This stems from a "settling-state dataset bias": during simulated resting phases the heavy face of the object naturally falls closer to the floor most of the time, so the model occasionally guesses the gravitational bottom instead of the true mass mid-air.
### 2. `STATERA-50K-Sigma.pth`
* **Training Constraint:** Trained with phase-agnostic Isotropic Gaussian targets (using the `--target_type dot` flag alongside curriculum label smoothing) to cure the gravitational settling bias.
* **Behavior:** Highly robust, with exceptionally low spatial error (N-CoME) and Normalized Kinematic Jitter. However, spatial entropy analysis reveals this robustness is driven by **"Expectation Collapse"**: the model predicts high-entropy probability masses near the geometric center to minimize Euclidean risk, scoring well on standard metrics without committing to a direction. Its prediction heatmap is very diffuse, making it less physically accurate at capturing the true offset-mass direction than the Crescent model.
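For intuition, a phase-agnostic isotropic Gaussian target of the kind the `--target_type dot` flag describes can be sketched as below. The grid size and standard deviation are illustrative values, not the training configuration:

```python
import numpy as np


def gaussian_target(cx: float, cy: float, size: int = 64, sigma: float = 4.0):
    """Isotropic 2D Gaussian heatmap centered on the CoM label.

    Phase-agnostic: the target shape is the same regardless of the
    object's rotational phase. Values are normalized to sum to 1 so
    the target is a probability mass over the grid.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    heat = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return heat / heat.sum()


target = gaussian_target(40, 20)  # peak at column 40, row 20
```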
---
## Ablations Folder (`/ablations`)
In addition to the main models, we provide our baseline comparison models inside the `/ablations` folder for research reproducibility. These correspond directly to the baseline and ablation studies detailed in our paper (Section 4.4, Table 1 & Table 2, and Appendix E):
* **`STATERA-1K-DINOv2.pth` (1.25 GB):** Evaluates a purely spatial foundation model (DINOv2). Equipped with the exact same 1D temporal mixer and 2.5D decoder as STATERA. Proves that temporal convolutions applied post-extraction cannot recover lost intra-frame momentum, causing the model to collapse to the visual/geometric centroid.
* **`STATERA-1K-VideoMAE.pth` (1.25 GB):** Evaluates VideoMAE v2 as an alternative temporal foundation. Despite possessing identical 1D sequence mixing, it forces the network to memorize visual surface textures due to its pixel-level reconstruction objective, strongly suggesting that true kinematic extraction requires predictive latent physics (like V-JEPA) rather than just spatio-temporal attention.
* **`STATERA-1K-ResNet3D.pth` (146 MB):** Evaluates a standard 3D-CNN temporal baseline trained end-to-end. *(Note the smaller file size, as it does not utilize the ViT-Large backbone).* Lacks V-JEPA's latent tubelet priors, resulting in wild overshooting artifacts and poor physical disentanglement.
* **`STATERA-1K-No-Z-Depth.pth` (1.25 GB):** Ablates the 1D Z-Depth regularizer. Demonstrates that without absolute 3D depth supervision, the dynamically-cropped 2D network loses physical scale constraints, causing predictions to severely overshoot the object's physical bounds and driving up Euclidean error.
* **`STATERA-1K-Frozen-Anchor.pth` (1.25 GB):** An architecture ablation where the final two transformer blocks of the V-JEPA backbone were kept strictly frozen. Performance degrades significantly, mirroring DINOv2 and demonstrating the necessity of fine-tuning the temporal representations specifically for kinematic tasks.
* **`STATERA-1K-Anchor.pth` (1.25 GB):** The standard low-data baseline trained on only 1,000 sequences to demonstrate spatial overfitting and temporal starvation.
* **`STATERA-1K-Standard-Sigma.pth` (1.25 GB):** Baseline target dynamics testing standard Gaussian smoothing without applying the variance-decay curriculum.
* **`STATERA-1K-Static-Dot.pth` (1.25 GB):** Baseline target dynamics testing a static, non-decaying coordinate dot, leading to severe gradient instability during continuous sub-pixel coordinate extraction.
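The variance-decay curriculum contrasted by the last two ablations can be sketched as a simple schedule that anneals the target's spread from a broad blob toward a near-point target. The endpoints and the linear decay are illustrative assumptions, not the published hyperparameters:

```python
def sigma_schedule(epoch: int, total_epochs: int,
                   sigma_start: float = 8.0, sigma_end: float = 1.0) -> float:
    """Linearly decay the Gaussian target's standard deviation.

    Early epochs supervise with a broad target (easy, stable gradients);
    later epochs sharpen toward a near-point target, avoiding the
    gradient instability a static dot exhibits from epoch 0.
    """
    t = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    return sigma_start + t * (sigma_end - sigma_start)
```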
## Architecture Details
- **Base Model:** Meta V-JEPA 2.1 (ViT-Large [16-frame sequence compressed to T=8], initialized via PyTorch Hub: `vjepa2_1_vit_large_384`).
- **Partial Fine-Tuning:** The final two transformer blocks of the V-JEPA backbone were unfrozen during training (utilizing `--finetune_blocks 2` and gradient accumulation) to adapt the visual latent space to Newtonian mechanics.
- **Time Processing:** Uses a 1D Convolution to mix the temporal tubelets together in the high-dimensional latent space, allowing the model to smoothly preserve momentum.
- **Output:** A ~2.5M-parameter Spatial Preservation Decoder outputs a smooth, continuous 2D heatmap (extracted via a Temperature-Scaled Soft-Argmax) estimating the hidden mass location, alongside a 1D Z-Depth estimator that acts as a perspective-invariant geometric regularizer.
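The coordinate extraction described above can be sketched as a minimal temperature-scaled soft-argmax. The temperature value here is an illustrative hyperparameter, not the trained setting:

```python
import torch
import torch.nn.functional as F


def soft_argmax_2d(heatmap: torch.Tensor, temperature: float = 0.5):
    """Turn a 2D heatmap (H, W) into a continuous (x, y) coordinate.

    The heatmap is softmaxed into a probability mass (lower temperature
    yields a sharper distribution), then the coordinate is the expected
    grid index, which keeps the operation fully differentiable and
    enables sub-pixel localization.
    """
    h, w = heatmap.shape
    probs = F.softmax(heatmap.flatten() / temperature, dim=0).view(h, w)
    ys = torch.arange(h, dtype=probs.dtype)
    xs = torch.arange(w, dtype=probs.dtype)
    y = (probs.sum(dim=1) * ys).sum()  # expected row index
    x = (probs.sum(dim=0) * xs).sum()  # expected column index
    return x, y
```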
## Quick Start & Inference
These files are designed to be loaded directly into the STATERA PyTorch code. For full installation instructions, testing tools, and a demo you can run locally with a GUI, please visit the [Official GitHub Repository](https://github.com/Animesh-Varma/STATERA/). |