Papers
arxiv:2602.01624

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Published on Feb 2
· Submitted by
Minh-Quan Le
on Feb 3
Authors:
,
,
,
,
,

Abstract

PISCES is an annotation-free text-to-video generation method that uses dual optimal transport-aligned rewards to improve visual quality and semantic alignment without human preference annotations.

AI-generated summary

Text-to-video (T2V) generation aims to synthesize videos with high visual quality and temporal consistency that are semantically aligned with input text. Reward-based post-training has emerged as a promising direction to improve the quality and semantic alignment of generated videos. However, recent methods either rely on large-scale human preference annotations or operate on misaligned embeddings from pre-trained vision-language models, leading to limited scalability or suboptimal supervision. We present PISCES, an annotation-free post-training algorithm that addresses these limitations via a novel Dual Optimal Transport (OT)-aligned Rewards module. To align reward signals with human judgment, PISCES uses OT to bridge text and video embeddings at both distributional and discrete token levels, enabling reward supervision to fulfill two objectives: (i) a Distributional OT-aligned Quality Reward that captures overall visual quality and temporal coherence; and (ii) a Discrete Token-level OT-aligned Semantic Reward that enforces semantic, spatio-temporal correspondence between text and video tokens. To our knowledge, PISCES is the first to improve annotation-free reward supervision in generative post-training through the lens of OT. Experiments on both short- and long-video generation show that PISCES outperforms both annotation-based and annotation-free methods on VBench across Quality and Semantic scores, with human preference studies further validating its effectiveness. We show that the Dual OT-aligned Rewards module is compatible with multiple optimization paradigms, including direct backpropagation and reinforcement learning fine-tuning.

Community

Paper submitter

PISCES is an annotation-free post-training framework for text-to-video models.
We tackle a key bottleneck in VLM-based rewards, text/video embedding misalignment, by using Optimal Transport (OT) to align text embeddings to the video space.
This yields dual OT-aligned rewards that mimic how humans judge T2V: a distributional quality reward and a token-level semantic reward, improving fidelity and prompt faithfulness across short and long video generators.

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.01624 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.01624 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.01624 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.