VideoVAE+ 16z: Tactile Finetune

A visuo-tactile finetune of VideoVAE+ (the sota-4-16z, 16-latent-channel, text-free variant). Starting from the released sota-4-16z.ckpt, the autoencoder was finetuned to encode/decode tactile sensor video (left/right tactile streams) alongside RGB view frames.

Model

  • Architecture: AutoencoderKL2plus1D_1dcnn (factorized 2+1D KL autoencoder, 1D temporal CNN)
  • Latent channels (z_channels): 16
  • Spatial compression: 8× (ch_mult=[1,2,4,4], 3 downsamples)
  • Temporal compression: 4× (16 frames → 4 latent timesteps)
  • Text conditioning: none (caption_guide: False)
  • Base checkpoint: sota-4-16z.ckpt from VideoVerses/VideoVAEPlus
  • Finetune objective: LPIPSWithDiscriminator3D (KL weight 1e-6, disc weight 0.5), base LR 5e-5
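As a quick sanity check on the compression factors listed above, the following sketch computes the latent tensor shape for a given clip. It is not taken from the repository: the helper name latent_shape, the 256×256 example resolution, and the requirement that every dimension divide evenly are assumptions made for illustration. The actual model may pad or treat boundary frames differently.

```python
from typing import Sequence, Tuple


def latent_shape(
    frames: int,
    height: int,
    width: int,
    z_channels: int = 16,
    ch_mult: Sequence[int] = (1, 2, 4, 4),
    temporal_factor: int = 4,
) -> Tuple[int, int, int, int]:
    """Return (channels, latent_frames, latent_height, latent_width).

    Hypothetical helper: assumes each dimension divides evenly by its
    compression factor, which the real model may not require.
    """
    # Four ch_mult stages imply three 2x spatial downsamples, i.e. a factor of 8.
    spatial_factor = 2 ** (len(ch_mult) - 1)
    if frames % temporal_factor or height % spatial_factor or width % spatial_factor:
        raise ValueError("dimensions must be divisible by the compression factors")
    return (
        z_channels,
        frames // temporal_factor,
        height // spatial_factor,
        width // spatial_factor,
    )


# A 16-frame clip at an assumed 256x256 resolution.
print(latent_shape(16, 256, 256))  # (16, 4, 32, 32)
```

With these factors, each latent element summarizes a 4×8×8 block of input pixels per channel.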

Files

  • videovae_plus_16z_tactile.ckpt: final finetuned weights (Lightning checkpoint, ~5 GB).
  • Source code, configs, and inference scripts mirrored from the working repo.

Usage

# reconstruct a video with the finetuned autoencoder
python inference_video.py \
    --config configs/inference/config_16z_infer_noloss.yaml \
    --ckpt   videovae_plus_16z_tactile.ckpt \
    --input  examples/videos/gt/0510_episode_000_tactile_left.mp4

See the included configs/train/config_16z_tactile.yaml for the exact finetuning recipe. Load the checkpoint into the AutoencoderKL2plus1D_1dcnn model defined in src/models/autoencoder2plus1d_1dcnn.py.
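A minimal sketch of loading the weights manually, assuming the file follows the usual Lightning layout in which the weights sit under a top-level "state_dict" key. The helper names extract_state_dict and load_tactile_weights are hypothetical and not part of the repository. Constructing the model itself is left to the repository's own config-driven code, since its constructor arguments are not documented here.

```python
from typing import Any, Dict, Mapping


def extract_state_dict(checkpoint: Mapping[str, Any]) -> Dict[str, Any]:
    """Return the weight mapping from a loaded checkpoint dictionary.

    Lightning checkpoints keep weights under "state_dict"; if that key is
    absent, the mapping is assumed to already be a plain state dict.
    """
    weights = checkpoint.get("state_dict", checkpoint)
    return dict(weights)


def load_tactile_weights(model: Any, ckpt_path: str, strict: bool = False) -> Any:
    """Load checkpoint weights into an already-constructed torch module.

    strict=False is an assumption: the checkpoint may carry loss or
    discriminator weights that an inference-only model does not define.
    """
    import torch  # imported here so the helper above stays dependency-free

    # weights_only=False unpickles arbitrary objects; use only with trusted files.
    checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
    return model.load_state_dict(extract_state_dict(checkpoint), strict=strict)
```

The value returned by load_state_dict lists missing_keys and unexpected_keys, which is worth inspecting to confirm that all autoencoder weights were matched.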

License & attribution

Code and base model from VideoVerses/VideoVAEPlus (Apache-2.0). This repository adds tactile-finetuned weights and the corresponding training/inference configs.
