TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL

     

Reinforcement learning for temporal-order fidelity in autoregressive video generation.

TempAct fine-tunes streaming / causal video diffusion models so that the generated video respects the temporal order of actions described in the prompt (e.g. "first pour the sauce, then add the cheese, then the pepperoni"). It is a planner–executor RL framework that jointly optimizes

  • an LLM planner (a LoRA-tuned Qwen3 that decomposes the global instruction into span-aware step prompts), and
  • an AR diffusion executor (the video generator that realizes each step under its own generated history),

trained against multimodal reward models (a Qwen3-VL video reward + an LLM judge for action decomposition).
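To make "span-aware step prompts" concrete, here is a minimal sketch of how a planner's output could be mapped onto the chunks that the executor generates. The StepPrompt class, its fields, and prompts_per_chunk are hypothetical names for illustration; the actual plan format used by TempAct is not shown in this README and may differ.

```python
from dataclasses import dataclass


@dataclass
class StepPrompt:
    """One planner step: a prompt plus the chunk span it covers (hypothetical format)."""
    text: str
    start_chunk: int  # inclusive
    end_chunk: int    # exclusive


def prompts_per_chunk(steps, num_chunks):
    """Return the step prompt that conditions each chunk, in generation order."""
    schedule = []
    for chunk in range(num_chunks):
        for step in steps:
            if step.start_chunk <= chunk < step.end_chunk:
                schedule.append(step.text)
                break
        else:
            # No step covers this chunk, so the plan is incomplete.
            raise ValueError(f"chunk {chunk} is not covered by any step")
    return schedule


# Illustrative plan for the pizza example; the spans are made up.
steps = [
    StepPrompt("pour the sauce", 0, 2),
    StepPrompt("add the cheese", 2, 4),
    StepPrompt("add the pepperoni", 4, 6),
]
schedule = prompts_per_chunk(steps, num_chunks=6)
```

In contrast, single-prompt generation would amount to a schedule in which every chunk receives the same global instruction.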

Figure 1. Overview and Motivation of TempAct. Single-prompt AR generation conditions every chunk on the same global instruction, while step-prompt generation provides explicit stage-wise conditions but still relies on a fixed executor. TempAct introduces a planner–executor RL framework that jointly optimizes temporal decomposition and prompt-transition execution, producing more faithful event progression under temporally complex instructions.

Two base video backbones are supported:

| Backbone | Pipeline | Training entry | Config |
| --- | --- | --- | --- |
| Self-Forcing | self_forcing/causal_pipeline.py | scripts/train_flow_grpo_llm_diffusion_mix_acc.py | config/self_forcing.py:self_forcing_llm_diffusion_rl_mix |
| LongLive | self_forcing/interactive_causal_pipeline.py | scripts/train_longlive_flow_grpo_llm_diffusion_mix_acc.py | config/longlive.py:longlive_llm_diffusion_rl_mix |

Repository layout

config/                 ml_collections training configs (self_forcing.py, longlive.py, base.py)
flow_grpo/              RL machinery + reward models
  ├── rewards.py            reward registry / multi-reward aggregation
  ├── qwen3vl_video_reward.py   Qwen3-VL video reward (vLLM OpenAI client)
  ├── llm_judge_score.py        action-decomposition LLM judge reward
  ├── video_pickscore_scorer.py PickScore-based local reward
  ├── stat_tracking.py, ema.py, diffusers_patch/ ...
self_forcing/           video backbones (Wan2.1), causal pipelines, LLM policy
scripts/                training launchers + accelerate config
  ├── run_multi_node_vllm_reward.sh   multi-node launcher (entry point)
  ├── train_flow_grpo_llm_diffusion_mix_acc.py        (Self-Forcing)
  └── train_longlive_flow_grpo_llm_diffusion_mix_acc.py (LongLive)
tools/                  inference + evaluation
  ├── inference_unified.py        unified inference for both backbones
  ├── eval_qwen3*.py, eval_gemini*.py, metric/ ...
dataset/                temporal-order prompt sets (train / eval)
ckpt/                   released LoRA checkpoints (download from HuggingFace)
inference_scripts_self_forcing.sh / inference_scripts_longlive.sh   inference entry points

Released checkpoints

All TempAct LoRA weights are released at huggingface.co/jing1119/TempAct. Download them into a top-level ckpt/ directory (the configs and inference scripts reference these paths directly):

pip install -U "huggingface_hub[cli]"
huggingface-cli download jing1119/TempAct --local-dir ckpt

| Path | Role | Description |
| --- | --- | --- |
| ckpt/pre_llm_policy_lora/ | Planner cold-start | LLM planner LoRA cold-started on the plan format. Completes the planning task on its own; used as the baseline and as the warm-start for RL training. |
| ckpt/self_forcing/ | Self-Forcing Planner + Executor | TempAct-finetuned weights for the Self-Forcing backbone. Contains lora/ (diffusion executor) and llm_lora/ (LLM planner). |
| ckpt/longlive/ | LongLive Planner + Executor | TempAct-finetuned weights for the LongLive backbone. Contains lora/ (diffusion executor) and llm_lora/ (LLM planner). |

ckpt/ is git-ignored: the weights live on HuggingFace, not in this repo.


Environment setup

The RL machinery is based on the Flow-GRPO codebase; the video backbones build on Self-Forcing and LongLive (Wan2.1-T2V-1.3B).

git clone https://github.com/jingw193/TempAct.git
cd TempAct

conda create -n tempact python=3.10 -y
conda activate tempact
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
pip install -e .            # see setup.py / requirements.txt

Paths: The released TempAct LoRA checkpoints are referenced via the local ckpt/ directory (see Released checkpoints). Base-model paths (Self-Forcing / Wan2.1 / LongLive / Qwen3) still use /path/to/... placeholders in the configs and scripts; replace them with your local download paths, or set ROOT_DIR in the launcher script. No private absolute paths are committed.

Model checkpoints to download

| Checkpoint | Used as | Source |
| --- | --- | --- |
| Self-Forcing DMD (self_forcing_dmd.pt) | base generator | gdhe17/Self-Forcing |
| Wan2.1-T2V-1.3B | VAE / text encoder backbone | Wan-AI/Wan2.1-T2V-1.3B |
| LongLive-1.3B (longlive_base.pt, lora.pt) | LongLive backbone | Efficient-Large-Model/LongLive-1.3B |
| Qwen3-1.7B | LLM prompt-rewrite policy | Qwen/Qwen3-1.7B |

Point the corresponding config.self_model_path / config.llm_model_path / *.yaml generator_ckpt fields at the downloaded paths.


Reward servers (required for training)

Rewards are served over OpenAI-compatible HTTP endpoints (vLLM). Launch them before training and expose their addresses via environment variables.

# Qwen3-VL video reward server (one per node)
vllm serve Qwen/Qwen3-VL-8B-Instruct \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 1 --max-model-len 32768 \
    --limit-mm-per-prompt image=16 --trust-remote-code

# LLM judge server (action-decomposition reward)
vllm serve Qwen/Qwen3-8B --host 0.0.0.0 --port 7000

Environment variables read by the reward modules:

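| Variable | Default | Consumed by |
| --- | --- | --- |
| VLLM_BASE_URL, VLLM_MODEL, VLLM_API_KEY | http://127.0.0.1:8000, Qwen/Qwen3-VL-8B-Instruct, EMPTY | flow_grpo/qwen3vl_video_reward.py |
| LLM_BASE_URL, LLM_MODEL, LLM_API_KEY | http://127.0.0.1:8000, Qwen/Qwen3-8B, EMPTY | flow_grpo/llm_judge_score.py |
| GEMINI_APP_ID, GEMINI_APP_KEY, GEMINI_BASE_URL | (empty) | optional Gemini reward / eval (flow_grpo/gemini_reward.py, tools/api_file/*) |

Note that the LLM_BASE_URL default points at port 8000, whereas the example judge server above listens on port 7000; set LLM_BASE_URL to match the port you actually serve on.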

The launcher scripts/run_multi_node_vllm_reward.sh assigns one reward server per node by MACHINE_RANK; edit the VLLM_SERVERS / LLM_SERVERS arrays with your own host:port entries.
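The sketch below illustrates how the environment variables and the per-node server assignment could fit together. It is not the code from flow_grpo/ or the launcher: reward_endpoints and server_for_node are hypothetical helpers, the defaults are copied from the table above, and wrapping with a modulo when there are fewer servers than nodes is an assumption.

```python
import os


def reward_endpoints(env=None):
    """Collect reward endpoint settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "video_reward": {
            "base_url": env.get("VLLM_BASE_URL", "http://127.0.0.1:8000"),
            "model": env.get("VLLM_MODEL", "Qwen/Qwen3-VL-8B-Instruct"),
            "api_key": env.get("VLLM_API_KEY", "EMPTY"),
        },
        "llm_judge": {
            "base_url": env.get("LLM_BASE_URL", "http://127.0.0.1:8000"),
            "model": env.get("LLM_MODEL", "Qwen/Qwen3-8B"),
            "api_key": env.get("LLM_API_KEY", "EMPTY"),
        },
    }


def server_for_node(servers, machine_rank):
    """Pick one host:port entry for a node (modulo wrap-around is an assumption)."""
    if not servers:
        raise ValueError("server list is empty")
    return servers[machine_rank % len(servers)]


# Example: node 1 of a two-node job picks the second entry (placeholder hosts).
vllm_servers = ["10.0.0.1:8000", "10.0.0.2:8000"]
chosen = server_for_node(vllm_servers, machine_rank=1)
```

Passing a plain dict as env makes the helper easy to check without touching the real process environment.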


Training

Distributed training uses accelerate. The multi-node launcher reads cluster topology from environment variables (CHIEF_IP, INDEX, HOST_NUM, HOST_GPU_NUM, …) and is invoked per node.

export WANDB_API_KEY=xxx
export WANDB_ENTITY=xxx

# Self-Forcing (default): edit ROOT_DIR / server arrays first
bash scripts/run_multi_node_vllm_reward.sh

The launcher runs, by default:

accelerate launch \
  --config_file ${ROOT_DIR}/scripts/accelerate_files/accelerate_config.yaml \
  --machine_rank ${MACHINE_RANK} --main_process_ip ${MASTER_IP} \
  --main_process_port ${MASTER_PORT} --num_processes ${TOTAL_PROCS} \
  --num_machines ${NUM_NODES} \
  ${ROOT_DIR}/scripts/train_flow_grpo_llm_diffusion_mix_acc.py \
  --config config/self_forcing.py:self_forcing_llm_diffusion_rl_mix

To train the LongLive backbone instead, comment the Self-Forcing block in the launcher and uncomment the LongLive block (it calls train_longlive_flow_grpo_llm_diffusion_mix_acc.py with config/longlive.py:longlive_llm_diffusion_rl_mix).

Reward weights, learning rates, frame counts, KL/clip coefficients, etc. live in the config entry functions (config/self_forcing.py, config/longlive.py).
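For intuition about what the reward weights feed into, the following is a minimal sketch of weighted multi-reward aggregation followed by GRPO-style group normalization (rewards standardized within the group of samples drawn for one prompt). This is the general technique, not the repository's implementation; the reward names, weights, and scores are made up, and the exact normalization in flow_grpo/ may differ.

```python
from statistics import mean, pstdev


def aggregate_reward(scores, weights):
    """Weighted sum of named reward scores for one generated sample."""
    return sum(weights[name] * scores[name] for name in weights)


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's sample group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical weights and per-sample scores for four samples of one prompt.
weights = {"video_reward": 1.0, "llm_judge": 0.5}
group_scores = [
    {"video_reward": 0.8, "llm_judge": 0.4},
    {"video_reward": 0.6, "llm_judge": 0.2},
    {"video_reward": 0.9, "llm_judge": 0.8},
    {"video_reward": 0.3, "llm_judge": 0.6},
]
rewards = [aggregate_reward(s, weights) for s in group_scores]
advantages = group_advantages(rewards)
```

Samples that score above their group's mean receive positive advantages and the rest negative ones, so the policy update depends on relative rather than absolute reward.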

Both configs warm-start the LLM planner from the released cold-start baseline (config.train.llm_lora_path = "ckpt/pre_llm_policy_lora"); download it first (see Released checkpoints).


Inference

Both backbones share tools/inference_unified.py (--mode self_forcing | longlive), which auto-detects diffusion / LLM LoRA folders inside the checkpoint directory and partitions prompts round-robin across GPUs.

# Self-Forcing: uses the released ckpt/self_forcing weights
bash inference_scripts_self_forcing.sh

# LongLive: uses the released ckpt/longlive weights
bash inference_scripts_longlive.sh
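The round-robin prompt partitioning mentioned above can be sketched as a strided slice: rank r takes prompts r, r + world_size, r + 2 * world_size, and so on. partition_round_robin is a hypothetical helper for illustration and is not taken from tools/inference_unified.py.

```python
def partition_round_robin(items, rank, world_size):
    """Return the items assigned to one rank under round-robin partitioning."""
    if world_size <= 0 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return items[rank::world_size]


# Ten placeholder prompts spread over four GPUs.
prompts = [f"prompt_{i}" for i in range(10)]
shards = [partition_round_robin(prompts, rank, 4) for rank in range(4)]
```

Every prompt lands on exactly one rank, and shard sizes differ by at most one, which keeps the GPUs roughly balanced when the prompt count is not a multiple of the GPU count.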

Direct invocation:

torchrun --nproc_per_node=8 tools/inference_unified.py \
    --mode self_forcing \
    --config_path self_forcing/config/self_forcing_dmd.yaml \
    --model_path  /path/to/self_forcing_dmd.pt \
    --lora_path   ckpt/self_forcing \
    --prompt_path dataset/temporal_eval_combined_100.csv \
    --output_file /path/to/output_dir \
    --sample_frames 36 --gap_frame 12

--lora_path is the checkpoint directory; the script auto-detects lora/ (diffusion executor) and llm_lora/ (LLM planner) inside it. When llm_lora/ is absent it falls back to the cold-start planner baseline ckpt/pre_llm_policy_lora (override with --llm_fallback_lora_path). Key flags: --sample_frames, --gap_frame, --fps, --seed, --dtype.
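The lookup described above can be sketched as follows. resolve_lora_dirs is a hypothetical helper, not the script's actual code; in particular, returning None when lora/ is missing is an assumption, since the README only describes the fallback for llm_lora/.

```python
from pathlib import Path


def resolve_lora_dirs(lora_path, llm_fallback="ckpt/pre_llm_policy_lora"):
    """Locate the executor and planner LoRA folders inside a checkpoint directory."""
    root = Path(lora_path)
    diffusion = root / "lora"
    llm = root / "llm_lora"
    return {
        # Assumption: a missing lora/ folder means no executor LoRA is applied.
        "diffusion_lora": diffusion if diffusion.is_dir() else None,
        # Documented behavior: fall back to the cold-start planner baseline.
        "llm_lora": llm if llm.is_dir() else Path(llm_fallback),
    }


resolved = resolve_lora_dirs("ckpt/self_forcing")
```

The llm_fallback argument plays the role of --llm_fallback_lora_path in this sketch.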


Datasets

Temporal-order prompt sets live under dataset/temporal_order/ (training) and dataset/temporal_eval_*.csv (evaluation). Training CSVs have a single prompt column; evaluation CSVs add a category column.
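A minimal sketch of reading either CSV layout with the standard library is shown below. Only the prompt and category column names come from the description above; the load_prompts helper, the sample rows, and the category values are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_prompts(csv_file):
    """Group prompts by category; files without a category column group under None."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row.get("category")].append(row["prompt"])
    return dict(groups)


# In-memory stand-in for an evaluation CSV (prompts with commas must be quoted).
eval_csv = io.StringIO(
    "prompt,category\n"
    '"A chef slices a tomato, then plates it",cooking\n'
    '"A dog crouches, then pounces on a ball",animal\n'
    '"A cook tears lettuce, then tosses the salad",cooking\n'
)
eval_groups = load_prompts(eval_csv)
```

With a real file, open it with newline="" as the csv module recommends, for example open(path, newline="", encoding="utf-8").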


Evaluation

tools/ provides reward / preference scorers over generated videos:

  • tools/eval_qwen3_multi.py: Qwen3-VL scoring
  • tools/eval_gemini.py: Gemini scoring
  • tools/eval_pickscore.py, tools/eval_video_align.py: PickScore / VideoAlign
  • tools/metric/: computes the final Temporal-Following score

Qualitative Comparison

Side-by-side video comparisons (Single Prompt vs. Step Prompt vs. TempAct (Ours)) for both backbones are available on the project page. Single-prompt generation blends actions across chunks; step prompts improve stage clarity but still miss state transitions; TempAct correctly realizes the intended event progression.

Self-Forcing backbone

| Prompt | Single Prompt | Step Prompt | TempAct (Ours) |
| --- | --- | --- | --- |
| Ex.1 Chef places a tomato, slices it into rounds, then arranges the slices on a plate | video | video | video |
| Ex.2 Dog crouches, pounces to grab the ball, then returns it to its owner's feet | video | video | video |
| Ex.3 Squirrel examines an acorn, digs a hole, then buries the acorn | video | video | video |

LongLive backbone

| Prompt | Single Prompt | Step Prompt | TempAct (Ours) |
| --- | --- | --- | --- |
| Ex.1 Woman opens a jewelry box, holds up a pearl necklace, then fastens it on | video | video | video |
| Ex.2 Tears lettuce into a bowl, slices cucumber over it, then drizzles oil and tosses | video | video | video |
| Ex.3 Places a laptop on the desk, picks up a notebook, then opens it to a blank page | video | video | video |

Acknowledgements

This project builds on Flow-GRPO, Self-Forcing, LongLive, and Wan2.1. We thank the authors for releasing their code and models.

Citation

If you find TempAct useful, please cite our work:

@article{wang2026tempact,
  title     = {TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation
               via Planner-Executor RL},
  author    = {Wang, Jing and Zhou, Xiangxin and Liang, Jiajun and Liu, Kaiqi
               and Pang, Wanyuan and Xie, Zhenyu and Pang, Tianyu and Liang, Xiaodan},
  journal   = {arXiv preprint arXiv:2606.28016},
  year      = {2026}
}

License

Released under the Apache License 2.0; see LICENSE.
