VideoVAE+ 16z — Tactile Finetune

A visuo-tactile finetune of VideoVAE+ (the sota-4-16z variant: 16 latent channels, no text conditioning). Starting from the released sota-4-16z.ckpt, the autoencoder was finetuned to encode and decode tactile sensor video (left and right tactile streams) alongside RGB view frames.

Model

Architecture: AutoencoderKL2plus1D_1dcnn (factorized 2+1D KL autoencoder, 1D temporal CNN)
Latent channels (z_channels): 16
Spatial compression: 8× (ch_mult=[1,2,4,4], 3 downsamples)
Temporal compression: 4× (16 frames → 4 latent timesteps)
Text conditioning: none (caption_guide: False)
Base checkpoint: sota-4-16z.ckpt from VideoVerses/VideoVAEPlus
Finetune objective: LPIPSWithDiscriminator3D (KL weight 1e-6, discriminator weight 0.5), base learning rate 5e-5
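
The compression factors listed above determine the latent tensor shape for a given clip. The following is a minimal sketch of that arithmetic, not code from the repository: the helper name `latent_shape` and the 256×256 input resolution are hypothetical, and it assumes input dimensions divide evenly by the compression factors (how the model itself handles other sizes is not covered here).

```python
from typing import NamedTuple


class LatentShape(NamedTuple):
    channels: int
    frames: int
    height: int
    width: int


Z_CHANNELS = 16
CH_MULT = [1, 2, 4, 4]
# Four resolution levels means three downsampling steps, each halving H and W.
SPATIAL_FACTOR = 2 ** (len(CH_MULT) - 1)
TEMPORAL_FACTOR = 4


def latent_shape(frames: int, height: int, width: int) -> LatentShape:
    """Latent shape for one clip (hypothetical helper, divisible inputs only)."""
    if frames % TEMPORAL_FACTOR or height % SPATIAL_FACTOR or width % SPATIAL_FACTOR:
        raise ValueError("this sketch only handles evenly divisible inputs")
    return LatentShape(
        channels=Z_CHANNELS,
        frames=frames // TEMPORAL_FACTOR,
        height=height // SPATIAL_FACTOR,
        width=width // SPATIAL_FACTOR,
    )


# A 16-frame clip at an assumed 256x256 resolution.
print(latent_shape(16, 256, 256))
# LatentShape(channels=16, frames=4, height=32, width=32)
```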

Files

videovae_plus_16z_tactile.ckpt — final finetuned weights (Lightning checkpoint, ~5 GB).
Source code, configs, and inference scripts mirrored from the working repo.

Usage

# reconstruct a video with the finetuned autoencoder
python inference_video.py \
    --config configs/inference/config_16z_infer_noloss.yaml \
    --ckpt   videovae_plus_16z_tactile.ckpt \
    --input  examples/videos/gt/0510_episode_000_tactile_left.mp4
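
To reconstruct several clips (for example both tactile streams of an episode), the same command can be generated once per file. The sketch below only reuses the three flags shown above. It assumes `inference_video.py` takes a single video per call, and the helper names are hypothetical. As written it prints the commands as a dry run instead of executing them.

```python
import shlex
import sys
from pathlib import Path

CONFIG = "configs/inference/config_16z_infer_noloss.yaml"
CKPT = "videovae_plus_16z_tactile.ckpt"


def build_command(video: Path) -> list[str]:
    """Build the inference command for one video (hypothetical helper)."""
    return [
        sys.executable, "inference_video.py",
        "--config", CONFIG,
        "--ckpt", CKPT,
        "--input", str(video),
    ]


def commands_for_directory(video_dir: str, pattern: str = "*.mp4") -> list[list[str]]:
    """One command per matching file, in sorted order."""
    return [build_command(p) for p in sorted(Path(video_dir).glob(pattern))]


if __name__ == "__main__":
    for cmd in commands_for_directory("examples/videos/gt"):
        # Dry run: print each command. To execute instead, pass cmd to
        # subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```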

See the included configs/train/config_16z_tactile.yaml for the exact finetuning recipe. Load the checkpoint into the AutoencoderKL2plus1D_1dcnn model defined in src/models/autoencoder2plus1d_1dcnn.py.
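
Before loading the weights into the model, it can help to inspect what the checkpoint contains. The sketch below relies on the Lightning convention of storing weights under a `state_dict` key. The helper names are hypothetical, and the top-level key prefixes in this particular checkpoint are not documented here, so the code only counts whatever it finds.

```python
from collections import Counter


def extract_state_dict(checkpoint: dict) -> dict:
    """Return the weights from a Lightning-style checkpoint dict.

    Falls back to the dict itself if there is no 'state_dict' key.
    """
    return checkpoint.get("state_dict", checkpoint)


def count_by_prefix(state_dict: dict) -> Counter:
    """Count parameters by the first component of their dotted key."""
    return Counter(key.split(".", 1)[0] for key in state_dict)


def summarize(path: str) -> Counter:
    """Load a checkpoint on CPU and summarize its top-level modules."""
    import torch  # imported lazily so the helpers above work without PyTorch

    # weights_only=False unpickles arbitrary objects; use it only on
    # checkpoint files you trust.
    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
    return count_by_prefix(extract_state_dict(checkpoint))
```

Calling `summarize("videovae_plus_16z_tactile.ckpt")` lists the top-level module names and how many tensors each holds. The extracted state dict can then be passed to `load_state_dict` on an AutoencoderKL2plus1D_1dcnn instance. Building that instance from the YAML config is left to the repository's own code.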

License & attribution

Code and base model from VideoVerses/VideoVAEPlus (Apache-2.0). This repository adds tactile-finetuned weights and the corresponding training/inference configs.
