VideoVAE+ 16z: Tactile Finetune
A visuo-tactile finetune of VideoVAE+
(the sota-4-16z, 16-latent-channel, text-free variant). Starting from the
released sota-4-16z.ckpt, the autoencoder was finetuned to encode/decode
tactile sensor video (left/right tactile streams) alongside RGB view frames.
Model
- Architecture: AutoencoderKL2plus1D_1dcnn (factorized 2+1D KL autoencoder, 1D temporal CNN)
- Latent channels (z_channels): 16
- Spatial compression: 8× (ch_mult=[1,2,4,4], 3 downsamples)
- Temporal compression: 4× (16 frames → 4 latent timesteps)
- Text conditioning: none (caption_guide: False)
- Base checkpoint: sota-4-16z.ckpt from VideoVerses/VideoVAEPlus
- Finetune objective: LPIPSWithDiscriminator3D (KL weight 1e-6, disc weight 0.5), base LR 5e-5
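The compression factors above determine the latent size for a given clip. The sketch below only works through that arithmetic. The function name `latent_shape`, the (channels, time, height, width) ordering, and the requirement that sizes divide evenly are assumptions made for illustration; the actual model may pad or handle edge cases differently.

```python
def latent_shape(frames, height, width,
                 ch_mult=(1, 2, 4, 4), temporal_factor=4, z_channels=16):
    """Estimate the latent shape (C, T, H, W) for one input clip.

    Hypothetical helper: the axis order and the strict divisibility
    check are assumptions, not taken from the model code.
    """
    # Each entry in ch_mult after the first corresponds to one 2x downsample,
    # so ch_mult=[1,2,4,4] gives 3 downsamples and a factor of 2**3 = 8.
    spatial_factor = 2 ** (len(ch_mult) - 1)

    if frames % temporal_factor:
        raise ValueError(f"frames must be a multiple of {temporal_factor}")
    if height % spatial_factor or width % spatial_factor:
        raise ValueError(f"height and width must be multiples of {spatial_factor}")

    return (
        z_channels,
        frames // temporal_factor,
        height // spatial_factor,
        width // spatial_factor,
    )


# A 16-frame 256x256 clip maps to 4 latent timesteps of 32x32 with 16 channels.
print(latent_shape(16, 256, 256))  # (16, 4, 32, 32)
```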
Files
- videovae_plus_16z_tactile.ckpt: final finetuned weights (Lightning checkpoint, ~5 GB).
- Source code, configs, and inference scripts mirrored from the working repo.
Usage
# reconstruct a video with the finetuned autoencoder
python inference_video.py \
--config configs/inference/config_16z_infer_noloss.yaml \
--ckpt videovae_plus_16z_tactile.ckpt \
--input examples/videos/gt/0510_episode_000_tactile_left.mp4
See the included configs/train/config_16z_tactile.yaml for the exact finetuning
recipe. Load the checkpoint into the AutoencoderKL2plus1D_1dcnn model defined in
src/models/autoencoder2plus1d_1dcnn.py.
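Because the weights ship as a Lightning checkpoint, the model parameters normally sit under a top-level "state_dict" key. The following is a minimal sketch of pulling them out and inspecting them before loading into the model. The helper names are hypothetical, the `weights_only=False` argument assumes a PyTorch version that accepts it, and the model constructor arguments are not shown because they come from the config.

```python
from collections import Counter


def extract_state_dict(checkpoint):
    """Return the weight mapping from a Lightning-style checkpoint dict.

    Falls back to the object itself if there is no "state_dict" key,
    which covers checkpoints saved as a bare state dict.
    """
    if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
        return checkpoint["state_dict"]
    return checkpoint


def summarize_prefixes(state_dict):
    """Count entries per top-level module name (text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in state_dict)


def load_checkpoint_weights(path):
    """Read a checkpoint from disk on CPU and return its weights.

    torch is imported lazily so the helpers above work without it.
    weights_only=False unpickles arbitrary objects, so use it only
    on checkpoints from a trusted source.
    """
    import torch

    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
    return extract_state_dict(checkpoint)
```

With an instantiated AutoencoderKL2plus1D_1dcnn, the returned mapping could then be passed to the standard `model.load_state_dict(...)`. Checking `summarize_prefixes` first is one way to spot unexpected key prefixes before loading.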
License & attribution
Code and base model from VideoVerses/VideoVAEPlus (Apache-2.0). This repository adds tactile-finetuned weights and the corresponding training/inference configs.