CodeSight: On-Policy Self-Distillation for Video Question Answering with Code-Privileged Supervision

Research Proposal

Abstract

Video Question Answering (VideoQA) requires models to understand complex visual scenes involving multiple objects, their states, quantities, and temporal dynamics. Current approaches either rely on sparse outcome-level rewards (e.g., GRPO) that provide no per-token learning signal, or require expensive human annotations for dense supervision. We propose CodeSight, a framework that applies On-Policy Self-Distillation (OPSD) to VideoQA by exploiting a novel form of privileged information: the source code that generated the video. Using programmatic video generation tools (e.g., HyperFrames), we construct videos whose HTML/animation source code contains precise, machine-readable ground truth about every object's identity, state, position, count, and temporal behavior. A single model serves as both teacher (conditioned on video + source code) and student (conditioned on video only), with the teacher providing dense token-level supervision on the student's own on-policy rollouts. This eliminates the need for external teacher models and human annotation, and it addresses the reward sparsity problem in video RL. We further introduce an automated QA generation pipeline that extracts structured facts from code to produce diverse, verifiable question-answer pairs. Experiments are designed along three compositional reasoning dimensions (object counting, state tracking, and temporal event ordering) across synthetic videos of varying complexity.

1. Introduction

1.1 Problem Statement

Video Language Models (Video-LLMs) have made rapid progress, yet they frequently hallucinate about basic visual facts: miscounting objects, misidentifying states, and confusing temporal order. Benchmarks like VidHalluc (CVPR 2025) confirm that temporal hallucinations remain pervasive. A deeper issue, highlighted by ViUniT (CVPR 2025), is that models can produce correct answers for wrong reasons --- a problem that outcome-level evaluation cannot detect.

Training these models faces a fundamental tension:

Supervised Fine-Tuning (SFT) provides dense token-level signal but suffers from distribution mismatch: training on teacher-generated sequences diverges from the model's own inference behavior.
Reinforcement Learning (GRPO/PPO) trains on the model's own outputs (on-policy) but provides only sparse, outcome-level rewards. For video reasoning, this is especially problematic: a correct final answer receives uniform credit across all tokens, while an incorrect answer receives uniform penalty, regardless of which reasoning steps were right or wrong.

OPSD (2025) resolves this tension for mathematical reasoning by using the same model as both teacher (with privileged access to reference solutions) and student (with only the problem), providing dense token-level KL supervision on the student's own rollouts. However, OPSD has only been demonstrated on text-only math benchmarks. Extending it to multimodal video understanding requires a suitable form of privileged information.

1.2 Key Insight

We observe that programmatically generated videos carry a unique duality: the rendered video is a visual signal, while the source code that produced it is a complete, structured, machine-readable description of every visual element. For a video defined in HyperFrames:

<div id="stage" data-composition-id="scene-1" data-width="1920" data-height="1080">
  <div id="red-ball" data-start="0" data-duration="5" data-track-index="0"
       style="position:absolute; left:100px; top:200px; width:50px; height:50px;
              background:red; border-radius:50%;">
  </div>
  <div id="blue-cube" data-start="2" data-duration="3" data-track-index="1"
       style="position:absolute; left:400px; top:300px; width:60px; height:60px;
              background:blue;">
  </div>
  <img id="cat-photo" data-start="1" data-duration="4" data-track-index="2"
       src="cat.png" style="position:absolute; left:600px; top:100px;"/>
</div>

The code tells us exactly:

What objects exist: red-ball, blue-cube, cat-photo (count = 3)
Object properties: color, shape, size, position
Temporal dynamics: red-ball appears at t=0 for 5s, blue-cube at t=2 for 3s, cat at t=1 for 4s
Spatial layout: red-ball at (100,200), blue-cube at (400,300), cat at (600,100)
Animation states: any GSAP/CSS animation keyframes define state transitions

For the scene properties it encodes, this description is richer than human annotation: it is exhaustive, exact, and obtained at near-zero marginal cost once a scene template exists.

1.3 Contributions

CodeSight Framework: The first application of on-policy self-distillation to multimodal video understanding, using video generation source code as privileged information.
Code-to-QA Pipeline: An automated pipeline that parses HyperFrames HTML to extract structured scene graphs and generates diverse, verifiable QA pairs across counting, state, and temporal dimensions.
Dense Token-Level Video Supervision: We aim to demonstrate that per-token KL divergence from a code-privileged teacher provides a stronger learning signal than sparse GRPO rewards for video reasoning.
CodeSight-Bench: A benchmark of programmatically generated videos with automatically derived ground-truth annotations for object count, state, and temporal event ordering.

2. Related Work

2.1 Video Question Answering

VideoQA has evolved from simple recognition to complex compositional reasoning. Recent benchmarks push toward fine-grained understanding:

TUNA (ACL 2025) evaluates fine-grained temporal understanding on dense dynamic videos.
MotionBench (CVPR 2025) benchmarks fine-grained motion understanding.
VELOCITI (CVPR 2025) tests compositional reasoning with strict entailment.
VidHalluc (CVPR 2025) specifically evaluates temporal hallucinations in Video-LLMs.
Video Thinking Test (ICCV 2025) provides a holistic benchmark for advanced video reasoning.

However, none of these benchmarks provide code-level ground truth that enables automatic, dense supervision during training.

2.2 Object State Understanding in Video

Understanding object states and their changes is crucial for video reasoning:

MOSCATO (ICCV 2025) predicts multiple object state changes through actions.
SAGE (NeurIPS 2025, Oral) provides a unified framework for generalizable object state recognition via state-action graph embedding.
ObjChangeVR (EACL 2026) reasons about object state changes from continuous egocentric VR views.
Compositional 4D Dynamic Scenes (ICLR 2025) uses physics priors for video QA.

These works focus on recognizing states from video. Our approach is complementary: we generate videos with known states and use the generation code as the supervision signal.

2.3 Video Generation Evaluation

Evaluating generated video quality requires structured ground truth:

T2V-CompBench (CVPR 2025) evaluates compositional text-to-video generation across numeracy, spatial relations, and actions.
Neuro-Symbolic Evaluation of T2V (CVPR 2025) uses formal verification to check object constraints.
ETVA (ICCV 2025) evaluates text-to-video alignment via fine-grained QA generation.
TC-Bench (ACL 2025 Findings) benchmarks temporal compositionality in conditional video generation.

These works evaluate generated videos against text prompts. We flip the direction: we use the generation code itself as structured ground truth for training.

2.4 Reinforcement Learning for Video-Language Models

Recent work applies RL to improve video reasoning:

DeepVideo-R1 (NeurIPS 2025) uses difficulty-aware regressive GRPO for video reasoning.
Video-RTS (EMNLP 2025) rethinks RL and test-time scaling for video reasoning.
Video-as-Answer (CVPR 2026) applies Joint-GRPO for video event prediction.

All of these methods use sparse outcome-level rewards and therefore face the credit assignment problem. GRPO in particular exhibits reward diversity collapse (OPSD paper, Fig. 3): more than 50% of batches show zero reward variance, so those batches contribute no learning signal. We expect this problem to be amplified in video QA, where reasoning chains are longer and more complex.
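
Zero reward variance halts learning because of how group-relative advantages are computed. The sketch below assumes the commonly used normalisation (reward minus group mean, divided by group standard deviation); the function name `group_advantages` and the `eps` constant are illustrative and not taken from any of the cited implementations.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# All rollouts in the group are correct: every advantage is zero.
uniform_group = group_advantages([1.0, 1.0, 1.0, 1.0])

# One correct rollout out of four: the group carries a usable signal.
mixed_group = group_advantages([1.0, 0.0, 0.0, 0.0])
```

When every rollout in a group receives the same binary reward, each advantage is zero and the group contributes no gradient, regardless of how the individual reasoning chains differ.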

2.5 Knowledge Distillation for Video Understanding

Agent-of-Thoughts Distillation (CVPR 2025) distills reasoning from large to small Video-LLMs.
VITED (CVPR 2025) distills temporal evidence from video.
Visual Program Distillation (EMNLP 2025 Findings) distills visual programs with augmentation.

These approaches require a separate, larger teacher model. OPSD eliminates this requirement by using the same model with asymmetric conditioning.

2.6 On-Policy Self-Distillation

OPSD (2025) introduces self-distillation where a single LLM acts as both teacher (conditioned on reference solution) and student (conditioned on problem only), minimizing per-token KL divergence on the student's own rollouts:

$\mathcal{L}_{\text{OPSD}}(\theta) = \mathbb{E}_{(x,y^*)\sim\mathcal{S}} \mathbb{E}_{\hat{y}\sim p_S(\cdot|x)} \sum_{n=1}^{|\hat{y}|} D_{\text{KL}}(p_T(\cdot|x,y^*,\hat{y}_{<n}) \| p_S(\cdot|x,\hat{y}_{<n}))$

OPSD achieves +5.7 points over GRPO on AIME25 for Qwen3-1.7B while using 128x fewer generation tokens. However, it has only been applied to text-only mathematical reasoning.

2.7 Visual Unit Testing

ViUniT (CVPR 2025) generates visual unit tests to verify program correctness in visual programming, reducing "correct for wrong reasons" by 40%. Our work shares the philosophy of using programmatic verification but differs fundamentally: ViUniT tests programs after generation, while CodeSight uses code as privileged information during training.

3. Method

3.1 Overview

CodeSight consists of three components:

Code-Grounded Video-QA Construction (Section 3.2): Automatically generate videos and QA pairs from HTML code.
On-Policy Self-Distillation with Code Privilege (Section 3.3): Train a Video-LLM using code-privileged teacher and video-only student.
Automated Difficulty Curriculum (Section 3.4): Scale scene complexity progressively.

3.2 Code-Grounded Video-QA Construction

3.2.1 Video Generation via HyperFrames

We use HyperFrames, an open-source HTML-to-video rendering framework designed for AI agents. Videos are defined as HTML compositions with semantic data attributes:

HTML Code  ──→  HyperFrames Renderer  ──→  MP4 Video
    │                                          │
    └── Structured Scene Description    Visual Signal
        (privileged information)        (input at inference)

HyperFrames supports multiple animation runtimes (GSAP, Anime.js, Lottie, Three.js, CSS Animations, WAAPI), enabling diverse visual content. Rendering is deterministic: identical code always produces identical video.

3.2.2 Scene Graph Extraction from Code

We parse the HTML composition into a structured Scene Graph $G = (O, R, E)$:

Objects $O = \{o_1, \dots, o_N\}$: Each HTML element with data-start and data-duration attributes. Properties extracted include:
- Identity: element id and tag type
- Visual attributes: color, size, position, opacity (from style)
- Temporal span: [data-start, data-start + data-duration]
- Layer: data-track-index
Relations $R$: Spatial relations computed from CSS positions (left-of, above, overlapping), temporal relations (before, during, after), and layer ordering (in-front-of, behind).
Events $E$: State changes extracted from animation definitions:
- GSAP timelines: keyframe sequences with property targets
- CSS animations: @keyframes rule parsing
- WAAPI: element.animate() parameter extraction

Formally, for each object $o_i$, we extract:

$o_i = (\text{id}_i, \text{type}_i, \text{attrs}_i, [t_i^{\text{start}}, t_i^{\text{end}}], \text{layer}_i, \text{anims}_i)$
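
A minimal sketch of the object-extraction step, assuming only Python's standard `html.parser` module and the data attributes shown in Section 1.2. The names `SceneObject`, `SceneParser`, and `parse_style` are hypothetical, and animation parsing (the $\text{anims}_i$ field) is left out.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class SceneObject:
    id: str
    tag: str
    start: float   # t_start in seconds
    end: float     # t_end = data-start + data-duration
    layer: int     # data-track-index
    style: dict = field(default_factory=dict)


def parse_style(style):
    """Split an inline CSS string into a {property: value} dict."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {key.strip(): value.strip() for key, value in pairs}


class SceneParser(HTMLParser):
    """Collects every element that carries data-start and data-duration."""

    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-start" not in a or "data-duration" not in a:
            return  # e.g. the stage container has no temporal span
        start = float(a["data-start"])
        self.objects.append(SceneObject(
            id=a.get("id", ""),
            tag=tag,
            start=start,
            end=start + float(a["data-duration"]),
            layer=int(a.get("data-track-index", 0)),
            style=parse_style(a.get("style") or ""),
        ))


SCENE_HTML = """
<div id="stage" data-composition-id="scene-1" data-width="1920" data-height="1080">
  <div id="red-ball" data-start="0" data-duration="5" data-track-index="0"
       style="position:absolute; left:100px; top:200px; background:red;"></div>
  <div id="blue-cube" data-start="2" data-duration="3" data-track-index="1"
       style="position:absolute; left:400px; top:300px; background:blue;"></div>
  <img id="cat-photo" data-start="1" data-duration="4" data-track-index="2"
       src="cat.png" style="position:absolute; left:600px; top:100px;"/>
</div>
"""

parser = SceneParser()
parser.feed(SCENE_HTML)
for obj in parser.objects:
    print(obj.id, obj.tag, obj.start, obj.end, obj.layer, obj.style.get("left"))
```

Self-closing elements such as `<img .../>` are handled because `HTMLParser` routes them through `handle_starttag` by default. A production parser would also need to resolve stylesheet rules and scripted animations, which inline attributes alone do not capture.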

3.2.3 QA Generation from Scene Graph

Given scene graph $G$, we generate QA pairs across three compositional dimensions:

Counting Questions: "How many [type/color] objects are visible at time $t$?"

Answer derived as the size of the filtered set $\left|\{o_i \mid t_i^{\text{start}} \leq t \leq t_i^{\text{end}} \wedge \text{match}(o_i, \text{query})\}\right|$
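
The counting rule can be sketched in a few lines. The `Obj` record and the `count_visible` helper are hypothetical names, and matching is reduced to an optional color query for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    id: str
    color: str
    start: float  # t_start in seconds
    end: float    # t_end in seconds


def count_visible(objects, t, color=None):
    """Count objects whose span contains t and that match the optional color query."""
    return sum(
        1 for o in objects
        if o.start <= t <= o.end and (color is None or o.color == color)
    )


scene = [
    Obj("red-ball", "red", 0.0, 5.0),
    Obj("blue-cube", "blue", 2.0, 5.0),
    Obj("cat-photo", "n/a", 1.0, 5.0),
]

visible_at_1_5 = count_visible(scene, 1.5)             # red-ball and cat-photo
blue_at_3 = count_visible(scene, 3.0, color="blue")    # blue-cube only
```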

State Questions: "What is the [property] of [object] at time $t$?"

Answer derived by evaluating animation state at timestamp $t$:
- For GSAP: timeline.seek(t) and read property
- For CSS: compute interpolated keyframe value
- For position: read transform/left/top at $t$
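
For the interpolated case, a minimal sketch under the assumption of linear easing is given here. Real runtimes may apply non-linear easing (GSAP tweens, for instance, are eased unless `ease: "none"` is set), so an actual pipeline would need to read the value back from the runtime or reimplement its easing curve. The helper name `tween_value` is hypothetical.

```python
def tween_value(t, t0, duration, v_from, v_to):
    """Value of a linearly eased tween at time t, clamped to its endpoints."""
    if duration <= 0:
        return v_to
    progress = min(max((t - t0) / duration, 0.0), 1.0)
    return v_from + progress * (v_to - v_from)


# Example: `left` moves from 100px to 800px over 3s, starting at t=1s.
left_before = tween_value(0.5, t0=1.0, duration=3.0, v_from=100.0, v_to=800.0)
left_midway = tween_value(2.5, t0=1.0, duration=3.0, v_from=100.0, v_to=800.0)
left_after = tween_value(9.0, t0=1.0, duration=3.0, v_from=100.0, v_to=800.0)
```

Halfway through the tween (t=2.5s) the interpolated value is 450px; before the tween starts and after it ends, the value is clamped to 100px and 800px respectively.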

Temporal Questions: "Does [object A] appear before or after [object B]?"

Answer derived by comparing $t_A^{\text{start}}$ vs $t_B^{\text{start}}$

Composite Questions: "After the red ball moves to the right, how many blue objects remain visible?"

Answer derived by combining temporal events with counting queries
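
One way to resolve such a composite question is to convert the event into a timestamp and then reuse the counting rule. The sketch assumes that "after the red ball moves" refers to the instant the movement finishes; `Obj`, `Move`, and `count_after` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    id: str
    color: str
    start: float
    end: float


@dataclass(frozen=True)
class Move:
    target: str
    start: float
    duration: float

    @property
    def end(self):
        return self.start + self.duration


def count_after(objects, event, color):
    """Count objects of `color` visible at the instant `event` finishes."""
    t = event.end
    return sum(1 for o in objects if o.color == color and o.start <= t <= o.end)


scene = [Obj("red-ball", "red", 0.0, 5.0), Obj("blue-cube", "blue", 2.0, 5.0)]
red_ball_move = Move(target="red-ball", start=1.0, duration=3.0)  # ends at t=4s
answer = count_after(scene, red_ball_move, color="blue")
```

Other readings of "after" (for example, any time later than the event) would need a different resolution rule, which is one reason each generated question should record the timestamp it was evaluated at.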

We use template-based generation with LLM-assisted paraphrasing to ensure linguistic diversity while maintaining answer correctness. Each QA pair $(q, a)$ has a provenance trace linking it to specific code elements, enabling automatic verification.

3.2.4 Code Context Formatting

The privileged information $y^*$ provided to the teacher is formatted as a structured code summary:

[Scene Summary]
Objects: 3 elements (red-ball, blue-cube, cat-photo)
Timeline:
  - red-ball: t=0s to t=5s, position (100,200), red circle 50px
  - blue-cube: t=2s to t=5s, position (400,300), blue square 60px
  - cat-photo: t=1s to t=5s, position (600,100), image
Animations:
  - red-ball: moves from (100,200) to (800,200) over 3s starting at t=1s
Spatial: red-ball LEFT-OF blue-cube, blue-cube LEFT-OF cat-photo (at initial positions)

This structured summary is more effective than raw HTML because it: (a) removes syntactic noise, (b) pre-computes derived facts (spatial relations, temporal overlaps), and (c) fits naturally into the model's context window.
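
A sketch of how such a summary might be assembled from extracted spans and positions. It assumes spatial relations are computed from the initial CSS positions only and covers just the object, timeline, and spatial lines; `left_of_facts` and `format_summary` are hypothetical helpers.

```python
def left_of_facts(positions):
    """Order object ids by x coordinate and emit adjacent LEFT-OF facts."""
    ordered = sorted(positions, key=lambda oid: positions[oid][0])
    return [
        f"{a} LEFT-OF {b}"
        for a, b in zip(ordered, ordered[1:])
        if positions[a][0] < positions[b][0]  # skip ties in x
    ]


def format_summary(spans, positions):
    """Render the teacher's privileged context as plain text."""
    lines = [
        "[Scene Summary]",
        f"Objects: {len(spans)} elements ({', '.join(spans)})",
        "Timeline:",
    ]
    for oid, (start, end) in spans.items():
        x, y = positions[oid]
        lines.append(f"  - {oid}: t={start:g}s to t={end:g}s, position ({x},{y})")
    lines.append("Spatial: " + ", ".join(left_of_facts(positions)))
    return "\n".join(lines)


spans = {"red-ball": (0, 5), "blue-cube": (2, 5), "cat-photo": (1, 5)}
positions = {"red-ball": (100, 200), "blue-cube": (400, 300), "cat-photo": (600, 100)}
summary = format_summary(spans, positions)
print(summary)
```

Because animated objects can change their relative order over time, a fuller version would attach a timestamp to each spatial fact.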

3.3 On-Policy Self-Distillation with Code Privilege

3.3.1 Problem Formulation

Given a video $v$ and question $q$, the goal is to train a Video-LLM $\pi_\theta$ to produce correct answer $a$. We define two policies sharing parameters $\theta$:

Student Policy (inference-time condition): $p_S(\cdot \mid v, q) = \pi_\theta(\cdot \mid [v, q])$

Teacher Policy (training-time, with code privilege): $p_T(\cdot \mid v, q, y^*) = \pi_\theta(\cdot \mid [v, q, y^*])$

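where $y^*$ is the structured code context from Section 3.2.4. Both policies are the same network and differ only in input conditioning. During training, the teacher is evaluated at the frozen initial parameters $\theta_0$ (Section 3.3.4), so gradients flow only through the student.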

3.3.2 Training Objective

Following OPSD, we minimize per-token forward KL divergence on student-generated rollouts:

$\mathcal{L}_{\text{CodeSight}}(\theta) = \mathbb{E}_{(v,q,y^*)\sim\mathcal{D}} \; \mathbb{E}_{\hat{a}\sim p_S(\cdot|v,q)} \left[ \frac{1}{|\hat{a}|} \sum_{n=1}^{|\hat{a}|} D_{\text{KL}}\big(p_T(\cdot|v,q,y^*,\hat{a}_{<n}) \;\|\; p_S(\cdot|v,q,\hat{a}_{<n})\big) \right]$

Key properties:

On-policy: $\hat{a}$ is sampled from the student $p_S$, ensuring training distribution matches inference.
Dense supervision: Every token position $n$ receives a gradient signal, unlike GRPO's outcome-level reward.
No external teacher: The same model serves both roles; the teacher uses a frozen copy of the initial parameters $\theta_0$ for stability.
Full-vocabulary signal: KL is computed over the entire vocabulary at each position, not just the sampled token.
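
To make the objective concrete, the following sketch computes the length-normalised per-token forward KL for one rollout, given teacher and student logits at each position. It uses plain Python lists instead of a tensor library; `softmax`, `forward_kl`, and `sequence_loss` are illustrative names, and the logits are made-up values.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p_teacher, p_student, eps=1e-12):
    """D_KL(p_T || p_S) over the full vocabulary at one position."""
    return sum(
        pt * math.log((pt + eps) / (ps + eps))
        for pt, ps in zip(p_teacher, p_student)
    )


def sequence_loss(teacher_logits, student_logits):
    """Average of the per-position forward KL terms over the rollout."""
    kls = [
        forward_kl(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(kls) / len(kls)


# Two positions, vocabulary of three tokens (hypothetical logits).
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
student_logits = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
loss = sequence_loss(teacher_logits, student_logits)
```

Only the first position contributes here, because the two distributions already agree at the second position. The loss is zero exactly when teacher and student agree at every position.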

3.3.3 Per-Token Pointwise Clipping

Following OPSD's finding that stylistic tokens can dominate the KL loss, we apply pointwise clipping:

$D_{\text{clip}}(p_T \| p_S) = \frac{1}{|\hat{a}|} \sum_{n=1}^{|\hat{a}|} \sum_{w \in \mathcal{V}} \min\big(\ell_{n,w}, \tau\big)$

where $\ell_{n,w} = p_T(w|\cdot) \log \frac{p_T(w|\cdot)}{p_S(w|\cdot)}$ and $\tau$ is a clipping threshold. This is particularly important for video QA where formatting tokens (e.g., "The answer is:") may have high KL but carry no reasoning signal.
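
A small numeric sketch of the clipping at a single position, with hypothetical probabilities and threshold. The helper names are illustrative.

```python
import math


def clipped_kl_terms(p_teacher, p_student, tau, eps=1e-12):
    """Per-vocabulary-entry terms l_{n,w}, each capped at tau."""
    return [
        min(pt * math.log((pt + eps) / (ps + eps)), tau)
        for pt, ps in zip(p_teacher, p_student)
    ]


def clipped_position_kl(p_teacher, p_student, tau):
    return sum(clipped_kl_terms(p_teacher, p_student, tau))


# One token dominates the divergence (hypothetical distributions).
p_teacher = [0.90, 0.05, 0.05]
p_student = [0.10, 0.45, 0.45]

unclipped = clipped_position_kl(p_teacher, p_student, tau=float("inf"))
clipped = clipped_position_kl(p_teacher, p_student, tau=0.5)
```

The dominant entry contributes $0.9 \ln 9 \approx 1.98$ without clipping and exactly $\tau = 0.5$ with it. The negative terms are untouched, since `min` only caps from above.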

3.3.4 Training Algorithm

Algorithm 1: CodeSight Training

Input: Dataset D = {(v_i, q_i, y*_i)}, model π_θ, frozen teacher π_θ₀
Output: Updated model π_θ

1: θ₀ ← θ                          // Freeze teacher parameters
2: for each minibatch {(v, q, y*)} ⊂ D do
3:   // On-policy rollout (student generates)
4:   â ~ p_S(·|v, q; θ)             // Student sees video + question only
5:
6:   // Compute per-token teacher distribution (frozen)
7:   for n = 1 to |â| do
8:     p_T^n ← π_θ₀(·|v, q, y*, â_{<n})   // Teacher sees video + question + code
9:     p_S^n ← π_θ(·|v, q, â_{<n})         // Student sees video + question
10:  end for
11:
12:  // Compute clipped forward KL loss
13:  L ← (1/|â|) Σ_n Σ_w min(p_T^n(w) log(p_T^n(w)/p_S^n(w)), τ)
14:
15:  // Update student only (teacher frozen)
16:  θ ← θ - η ∇_θ L
17: end for
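
The loop structure of Algorithm 1 can be exercised on a toy problem without any video model. In this sketch, which is an illustration under strong simplifying assumptions and not the proposed implementation, the "model" is a table of bigram logits, the privileged context $y^*$ is simulated by a logit bonus on the target token, and clipping is omitted. All names and constants are hypothetical.

```python
import math
import random

VOCAB, BOS = 3, 3      # three answer tokens plus a start-of-sequence row
TARGET = [2, 0, 1]     # stand-in for the answer implied by the code context y*


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_probs(frozen, ctx, n, bonus=4.0):
    """Frozen weights plus a logit bonus that simulates conditioning on y*."""
    logits = list(frozen[ctx])
    logits[TARGET[n]] += bonus
    return softmax(logits)


def train(steps=300, lr=1.0, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * VOCAB for _ in range(VOCAB + 1)]  # student logits per previous token
    frozen = [row[:] for row in theta]                 # theta_0: the teacher never updates
    for _step in range(steps):
        # On-policy rollout: the student samples its own answer token by token.
        prev, contexts = BOS, []
        for _pos in TARGET:
            contexts.append(prev)
            prev = rng.choices(range(VOCAB), weights=softmax(theta[prev]))[0]
        # Forward KL at every position; the rollout is treated as fixed data.
        grad = [[0.0] * VOCAB for _ in theta]
        for n, ctx in enumerate(contexts):
            p_s = softmax(theta[ctx])
            p_t = teacher_probs(frozen, ctx, n)
            for w in range(VOCAB):
                # d KL(p_T || p_S) / d student logit = p_S - p_T
                grad[ctx][w] += (p_s[w] - p_t[w]) / len(contexts)
        for row, g_row in zip(theta, grad):
            for w in range(VOCAB):
                row[w] -= lr * g_row[w]
    return theta


theta = train()
first_token = softmax(theta[BOS])
```

After training, the student's first-token distribution concentrates on the token favoured by the privileged teacher, even though the student never sees the bonus. In a real Video-LLM, the teacher and student forward passes would each run once over the whole rollout instead of token by token.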

3.3.5 Interpretation as Dense-Reward Policy Gradient

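The OPSD loss is closely related to a policy gradient. Replacing the full-vocabulary divergence with its sampled-token estimate on the student's rollout (the reverse-KL direction) gives the following objective, in which $A_n$ is treated as a constant: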

$\mathcal{L}(\theta) = -\mathbb{E}\left[\frac{1}{|\hat{a}|}\sum_{n=1}^{|\hat{a}|} A_n \cdot \log p_S(\hat{a}_n | v, q, \hat{a}_{<n})\right]$

where the per-token advantage is:

$A_n = \log p_T(\hat{a}_n | v, q, y^*, \hat{a}_{<n}) - \log p_S(\hat{a}_n | v, q, \hat{a}_{<n})$

This means: at each token, the teacher (who can see the code) provides a signal about whether the student's token choice aligns with code-informed reasoning. Tokens where the code-informed teacher strongly disagrees with the student receive larger gradients --- this is automatic token-level credit assignment.

For video QA, this is powerful: if the student says "there are 3 objects" but the code shows 4, the advantage for the count token will be strongly negative, directly correcting the counting error without needing outcome-level reward.
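
The counting example can be worked through numerically. The token probabilities below are invented for illustration, and `token_advantages` is a hypothetical helper.

```python
import math


def token_advantages(teacher_logprobs, student_logprobs):
    """A_n = log p_T(a_n | ...) - log p_S(a_n | ...) for each sampled token."""
    return [lt - ls for lt, ls in zip(teacher_logprobs, student_logprobs)]


# Sampled answer "there are 3 objects"; the code context says there are 4.
tokens = ["there", "are", "3", "objects"]
p_teacher = [0.90, 0.95, 0.02, 0.90]   # hypothetical teacher probabilities
p_student = [0.88, 0.94, 0.60, 0.91]   # hypothetical student probabilities

advantages = token_advantages(
    [math.log(p) for p in p_teacher],
    [math.log(p) for p in p_student],
)
most_penalised = tokens[min(range(len(tokens)), key=advantages.__getitem__)]
```

With these numbers the advantage of the count token is $\ln(0.02/0.60) \approx -3.4$, while the other tokens stay within a few hundredths of zero, so nearly all of the corrective gradient lands on the count.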

3.4 Automated Difficulty Curriculum

We design a 3-level curriculum based on scene complexity:

| Level | Objects | Animations | Temporal Events | Question Types |
|---|---|---|---|---|
| L1: Simple | 1-3 | Static or single | 1-2 appear/disappear | Single-hop counting, attribute |
| L2: Medium | 4-8 | Linear motion, fade | 3-5 state changes | Multi-hop, temporal ordering |
| L3: Complex | 9-15 | Chained animations, interactions | 6+ overlapping events | Composite reasoning |

Scene generation is fully automated: we write code templates with parameterized object counts, animation types, and timing patterns, then sample concrete scenes from these templates.
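
A sketch of sampling concrete scenes from level parameters. The object ranges follow the curriculum table; the palette, the shape list, the 8-second default duration, and the upper bound of 10 events for L3 (the table only says "6+") are assumptions made for illustration.

```python
import random

LEVELS = {
    "L1": {"objects": (1, 3), "events": (1, 2)},
    "L2": {"objects": (4, 8), "events": (3, 5)},
    "L3": {"objects": (9, 15), "events": (6, 10)},  # "6+" capped at 10 (assumption)
}
COLORS = ["red", "blue", "green", "yellow"]  # hypothetical palette
SHAPES = ["circle", "square"]                # hypothetical shape set


def sample_scene(level, rng, duration=8.0):
    """Draw one scene specification for the given curriculum level."""
    spec = LEVELS[level]
    objects = []
    for i in range(rng.randint(*spec["objects"])):
        start = round(rng.uniform(0.0, duration / 2), 1)
        objects.append({
            "id": f"obj-{i}",
            "color": rng.choice(COLORS),
            "shape": rng.choice(SHAPES),
            "start": start,
            "duration": round(rng.uniform(1.0, duration - start), 1),
        })
    return {
        "level": level,
        "n_events": rng.randint(*spec["events"]),
        "objects": objects,
    }


scene = sample_scene("L3", random.Random(0))
print(scene["level"], len(scene["objects"]), scene["n_events"])
```

Passing an explicit `random.Random` instance keeps sampling reproducible, which complements the deterministic rendering described in Section 3.2.1: a seed identifies a scene, its video, and its QA pairs.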

3.5 Comparison with Alternative Approaches

| Property | SFT | GRPO (DeepVideo-R1) | External Distillation | CodeSight (Ours) |
|---|---|---|---|---|
| On-policy data | No | Yes | No | Yes |
| Dense token signal | Yes | No (sparse reward) | Yes | Yes |
| Low sampling cost | Yes | No (G rollouts) | Yes | Yes (1 rollout) |
| No external teacher | Yes | Yes | No | Yes |
| Multimodal | N/A | Video | Video | Video |
| Annotation-free | No | Needs reward labels | Needs teacher outputs | Yes (from code) |

4. Experimental Design

4.1 CodeSight-Bench Construction

We plan to construct a benchmark spanning the three difficulty levels, with the following target statistics:

| Split | Videos | QA Pairs | Avg Objects/Video | Avg Duration |
|---|---|---|---|---|
| Train | 10,000 | 100,000 | 5.2 | 8.3s |
| Val | 1,000 | 10,000 | 5.4 | 8.5s |
| Test | 2,000 | 20,000 | 5.3 | 8.4s |

Question type distribution (balanced across splits):

Counting: 25% ("How many red objects at t=3s?")
State identification: 25% ("What color is the large circle?")
Temporal ordering: 25% ("Which appears first, A or B?")
Composite: 25% ("After A disappears, how many objects remain?")

Video diversity: We use HyperFrames' supported animation runtimes (GSAP, CSS, WAAPI) and built-in catalog blocks (shader transitions, text overlays, data charts) to ensure visual diversity.

4.2 Baselines

SFT: Standard supervised fine-tuning on (video, question, answer) triples.
GRPO: Group Relative Policy Optimization following DeepVideo-R1's approach, with binary correctness reward.
Offline Distillation: Teacher generates answers conditioned on code; student trains on teacher's sequences (standard KD, off-policy).
GRPO + Code Reward: GRPO with an additional reward term based on code-verified answer correctness (giving GRPO access to code information, but only as outcome reward).
ViUniT-style Selection: Generate multiple programs, use visual unit tests for selection (adapted to VideoQA).

4.3 Models

Primary: Qwen2.5-VL-7B (strong open-source Video-LLM)
Scaling study: Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-VL-72B
Cross-architecture: LLaVA-Video-7B, InternVL2.5-8B

4.4 Evaluation Metrics

Accuracy: Exact match for counting/state; semantic match for temporal.
Reasoning Correctness: Following ViUniT, we evaluate whether models are "right for the right reasons" via:
- Generate answer on original video
- Generate answer on perturbed video (e.g., remove one object, change a color)
- A model that reasons correctly should change its answer whenever the perturbation changes the code-derived ground truth, and keep it otherwise
Token Efficiency: Total generation tokens consumed during training (OPSD's key advantage).
Per-Dimension Breakdown: Separate accuracy for counting, state, temporal, and composite questions.
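
One possible way to score the perturbation test is a paired accuracy over original/perturbed video pairs whose ground-truth answers differ. This scoring rule, the record fields, and the name `paired_accuracy` are assumptions for illustration; the proposal does not fix a formula.

```python
def paired_accuracy(records):
    """Fraction of answer-changing pairs answered correctly on both videos."""
    pairs = [r for r in records if r["gold_orig"] != r["gold_pert"]]
    if not pairs:
        return 0.0
    hits = sum(
        r["pred_orig"] == r["gold_orig"] and r["pred_pert"] == r["gold_pert"]
        for r in pairs
    )
    return hits / len(pairs)


records = [
    # Correct on both videos: the answer tracks the visual change.
    {"gold_orig": "4", "pred_orig": "4", "gold_pert": "3", "pred_pert": "3"},
    # Correct on the original only: the answer ignores the removed object.
    {"gold_orig": "2", "pred_orig": "2", "gold_pert": "1", "pred_pert": "2"},
    # Perturbation did not change the ground truth: excluded from the score.
    {"gold_orig": "red", "pred_orig": "red", "gold_pert": "red", "pred_pert": "red"},
]
score = paired_accuracy(records)
```

The second record would count as correct under plain accuracy on the original video, which is the failure mode this metric is meant to expose.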

4.5 Ablation Studies

Privileged information format: Raw HTML vs. structured summary vs. scene graph JSON
Divergence function: Forward KL vs. reverse KL vs. JSD (following OPSD Table 3)
Clipping threshold $\tau$: Impact on training stability
Generation length: Short (256) vs. medium (512) vs. long (1024) student rollouts
Curriculum: With vs. without difficulty progression
Number of QA pairs per video: 5, 10, 20 --- effect of question diversity

4.6 Analysis

Per-token advantage visualization: Which tokens receive the strongest teacher correction? We hypothesize counting tokens and temporal markers will show highest KL.
Reward diversity collapse analysis: Compare GRPO's reward variance across batches vs. CodeSight's KL variance, demonstrating that dense supervision avoids collapse.
Transfer to real video: Fine-tune on CodeSight-Bench, evaluate on natural video benchmarks (TUNA, MotionBench) to test whether improved compositional reasoning transfers.
Scaling laws: How does performance scale with (a) number of training videos, (b) scene complexity, (c) model size?

5. Expected Contributions and Impact

5.1 Technical Contributions

First multimodal OPSD: Extending on-policy self-distillation from text-only math to vision-language video understanding.
Code-as-privileged-information paradigm: Demonstrating that video generation source code is a powerful, zero-cost form of privileged information for training.
Automatic dense supervision for video: Solving the sparse reward problem in video RL without human annotation.
CodeSight-Bench: A new benchmark with machine-verifiable ground truth derived from generation code.

5.2 Broader Impact

Data flywheel: Code → video → QA generation is fully automatic, enabling unlimited training data at near-zero marginal cost.
Sim-to-real transfer: If CodeSight-trained models show improved compositional reasoning on real video benchmarks, this validates synthetic-to-real transfer for video understanding.
Beyond HyperFrames: The framework generalizes to any programmatic video generation tool (Manim, Remotion, Blender scripting, game engines) --- any system where code produces video.

6. Timeline

| Phase | Duration | Deliverable |
|---|---|---|
| Phase 1: Data Pipeline | 4 weeks | Code-to-QA pipeline, CodeSight-Bench v1 (5K videos) |
| Phase 2: OPSD Implementation | 3 weeks | CodeSight training loop on Qwen2.5-VL-7B |
| Phase 3: Baseline Experiments | 3 weeks | SFT, GRPO, offline distillation comparisons |
| Phase 4: Ablations & Analysis | 3 weeks | All ablation studies, per-token analysis |
| Phase 5: Scaling & Transfer | 3 weeks | Multi-model, real-video transfer experiments |
| Phase 6: Paper Writing | 4 weeks | Full paper submission |

7. Key References

OPSD: Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs (arXiv 2601.18734, 2025)
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware GRPO (NeurIPS 2025)
Video-RTS: Rethinking RL and Test-Time Scaling for Video Reasoning (EMNLP 2025)
ViUniT: Visual Unit Tests for More Robust Visual Programming (CVPR 2025)
MOSCATO: Predicting Multiple Object State Change Through Actions (ICCV 2025)
SAGE: Generalizable Object State Recognition with State-Action Graph Embedding (NeurIPS 2025 Oral)
T2V-CompBench: Compositional Text-to-Video Generation Benchmark (CVPR 2025)
Neuro-Symbolic Evaluation of T2V Models using Formal Verification (CVPR 2025)
VidHalluc: Evaluating Temporal Hallucinations in Video-LLMs (CVPR 2025)
TUNA: Fine-grained Temporal Understanding on Dense Dynamic Videos (ACL 2025)
MotionBench: Fine-grained Video Motion Understanding for VLMs (CVPR 2025)
Agent-of-Thoughts Distillation for Video-LLM Reasoning (CVPR 2025)
HyperFrames: HTML-to-Video Rendering Framework (HeyGen, 2025)
Imagine While Reasoning in Space: Multimodal Visualization-of-Thought (ICML 2025)
Rendering-Aware RL for Vector Graphics Generation (NeurIPS 2025)