license: apache-2.0
library_name: diffusers
tags:
  - robotics
  - world-model
  - diffusion
  - step-distillation
  - lingbot-va
pipeline_tag: robotics

Flash-WAM — RoboTwin (distilled)

Single-step distilled checkpoint for Flash-WAM: Modality-Aware Distillation for World Action Models, applied to LingBot-VA and evaluated on RoboTwin 2.0. Flash-WAM distills each modality with a consistency function matched to its noise regime: linear-gradient-scaling for the action stream and variance-preserving for the video stream. This compresses inference to a single step per modality, giving up to a 23× speedup while preserving teacher-level task success.

This repository contains the complete model (distilled transformer + encoders):

| Component | Description |
| --- | --- |
| transformer/ | Distilled Flash-WAM student |
| vae/ | VAE (from the LingBot-VA teacher) |
| text_encoder/ | UMT5-XXL text encoder (from the teacher) |
| tokenizer/ | T5 tokenizer |

Links

Usage

For environment setup and evaluation, follow the Flash-WAM repository and LingBot-VA. Point the inference server at this checkpoint directory.
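
Before pointing the inference server at a local copy, it can help to confirm that the directory has the four subfolders listed above. The following is a minimal sketch, not part of the Flash-WAM or LingBot-VA codebases: `missing_components` and `fetch_checkpoint` are hypothetical helper names, and the Hub repository id is left as a parameter because this card does not state it. The download helper assumes the `huggingface_hub` package is installed.

```python
from pathlib import Path

# Subfolders listed in the component table of this model card.
EXPECTED_COMPONENTS = ("transformer", "vae", "text_encoder", "tokenizer")


def missing_components(checkpoint_dir):
    """Return the expected subfolders that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_COMPONENTS if not (root / name).is_dir()]


def fetch_checkpoint(repo_id, local_dir):
    """Download a Hub repository snapshot into local_dir and return its path.

    Hypothetical helper: repo_id must be supplied by the caller.
    huggingface_hub is imported lazily so the layout check above
    works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

An empty list from `missing_components(path)` means the directory matches the layout described in this card. How the path is then passed to the inference server is defined by the Flash-WAM and LingBot-VA repositories, not by this sketch.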

Citation

@misc{akbari2026flashwammodalityawaredistillationworld,
      title={Flash-WAM: Modality-Aware Distillation for World Action Models}, 
      author={Arman Akbari and Ci Zhang and Arash Akbari and Lin Zhao and Yixiao Chen and Weiwei Chen and Xuan Zhang and Geng Yuan and Yanzhi Wang},
      year={2026},
      eprint={2606.05254},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.05254}, 
}

License: Apache-2.0.