CodeSight: On-Policy Self-Distillation for Video Question Answering with Code-Privileged Supervision
Research Proposal
Abstract
Video Question Answering (VideoQA) requires models to understand complex visual scenes involving multiple objects, their states, quantities, and temporal dynamics. Current approaches either rely on sparse outcome-level rewards (e.g., GRPO) that provide no per-token learning signal, or require expensive human annotations for dense supervision. We propose CodeSight, a framework that applies On-Policy Self-Distillation (OPSD) to VideoQA by exploiting a novel form of privileged information: the source code that generated the video. Using programmatic video generation tools (e.g., HyperFrames), we construct videos whose HTML/animation source code contains precise, machine-readable ground truth about every object's identity, state, position, count, and temporal behavior. A single model serves as both teacher (conditioned on video + source code) and student (conditioned on video only), with the teacher providing dense token-level supervision on the student's own on-policy rollouts. This eliminates the need for external teacher models and human annotation, and it addresses the reward sparsity problem in video RL. We further introduce an automated QA generation pipeline that extracts structured facts from code to produce diverse, verifiable question-answer pairs. Experiments are designed around three compositional reasoning dimensions (object counting, state tracking, and temporal event ordering) across synthetic videos of varying complexity.
1. Introduction
1.1 Problem Statement
Video Language Models (Video-LLMs) have made rapid progress, yet they frequently hallucinate about basic visual facts: miscounting objects, misidentifying states, and confusing temporal order. Benchmarks like VidHalluc (CVPR 2025) confirm that temporal hallucinations remain pervasive. A deeper issue, highlighted by ViUniT (CVPR 2025), is that models can produce correct answers for wrong reasons --- a problem that outcome-level evaluation cannot detect.
Training these models faces a fundamental tension:
- Supervised Fine-Tuning (SFT) provides dense token-level signal but suffers from distribution mismatch: the model trains on teacher-generated sequences, which differ from the sequences it produces itself at inference time.
- Reinforcement Learning (GRPO/PPO) trains on the model's own outputs (on-policy) but provides only sparse, outcome-level rewards. For video reasoning, this is especially problematic: a correct final answer receives uniform credit across all tokens, while an incorrect answer receives uniform penalty, regardless of which reasoning steps were right or wrong.
OPSD (2025) resolves this tension for mathematical reasoning by using the same model as both teacher (with privileged access to reference solutions) and student (with only the problem), providing dense token-level KL supervision on the student's own rollouts. However, OPSD has only been demonstrated on text-only math benchmarks. Extending it to multimodal video understanding requires a suitable form of privileged information.
1.2 Key Insight
We observe that programmatically generated videos carry a unique duality: the rendered video is a visual signal, while the source code that produced it is a complete, structured, machine-readable description of every visual element. For a video defined in HyperFrames:
    <div id="stage" data-composition-id="scene-1" data-width="1920" data-height="1080">
      <div id="red-ball" data-start="0" data-duration="5" data-track-index="0"
           style="position:absolute; left:100px; top:200px; width:50px; height:50px;
                  background:red; border-radius:50%;">
      </div>
      <div id="blue-cube" data-start="2" data-duration="3" data-track-index="1"
           style="position:absolute; left:400px; top:300px; width:60px; height:60px;
                  background:blue;">
      </div>
      <img id="cat-photo" data-start="1" data-duration="4" data-track-index="2"
           src="/Wendy-Fly/ICML-2027/resolve/main/cat.png" style="position:absolute; left:600px; top:100px;"/>
    </div>
The code tells us exactly:
- What objects exist: red-ball, blue-cube, cat-photo (count = 3)
- Object properties: color, shape, size, position
- Temporal dynamics: red-ball appears at t=0 for 5s, blue-cube at t=2 for 3s, cat at t=1 for 4s
- Spatial layout: red-ball at (100,200), blue-cube at (400,300), cat at (600,100)
- Animation states: any GSAP/CSS animation keyframes define state transitions
For the attributes the code specifies, this description is richer than typical human annotation: it is exhaustive, precise, and obtained at zero marginal cost.
1.3 Contributions
- CodeSight Framework: To our knowledge, the first application of on-policy self-distillation to multimodal video understanding, using video generation source code as privileged information.
- Code-to-QA Pipeline: An automated pipeline that parses HyperFrames HTML to extract structured scene graphs and generates diverse, verifiable QA pairs across counting, state, and temporal dimensions.
- Dense Token-Level Video Supervision: We aim to show that per-token KL divergence from a code-privileged teacher provides a stronger learning signal than sparse GRPO rewards for video reasoning (Section 4).
- CodeSight-Bench: A benchmark of programmatically generated videos with automatically derived ground-truth annotations for object count, state, and temporal event ordering.
2. Related Work
2.1 Video Question Answering
VideoQA has evolved from simple recognition to complex compositional reasoning. Recent benchmarks push toward fine-grained understanding:
- TUNA (ACL 2025) evaluates fine-grained temporal understanding on dense dynamic videos.
- MotionBench (CVPR 2025) benchmarks fine-grained motion understanding.
- VELOCITI (CVPR 2025) tests compositional reasoning with strict entailment.
- VidHalluc (CVPR 2025) specifically evaluates temporal hallucinations in Video-LLMs.
- Video Thinking Test (ICCV 2025) provides a holistic benchmark for advanced video reasoning.
However, none of these benchmarks provide code-level ground truth that enables automatic, dense supervision during training.
2.2 Object State Understanding in Video
Understanding object states and their changes is crucial for video reasoning:
- MOSCATO (ICCV 2025) predicts multiple object state changes through actions.
- SAGE (NeurIPS 2025, Oral) provides a unified framework for generalizable object state recognition via state-action graph embedding.
- ObjChangeVR (EACL 2026) reasons about object state changes from continuous egocentric VR views.
- Compositional 4D Dynamic Scenes (ICLR 2025) uses physics priors for video QA.
These works focus on recognizing states from video. Our approach is complementary: we generate videos with known states and use the generation code as the supervision signal.
2.3 Video Generation Evaluation
Evaluating generated video quality requires structured ground truth:
- T2V-CompBench (CVPR 2025) evaluates compositional text-to-video generation across numeracy, spatial relations, and actions.
- Neuro-Symbolic Evaluation of T2V (CVPR 2025) uses formal verification to check object constraints.
- ETVA (ICCV 2025) evaluates text-to-video alignment via fine-grained QA generation.
- TC-Bench (ACL 2025 Findings) benchmarks temporal compositionality in conditional video generation.
These works evaluate generated videos against text prompts. We flip the direction: we use the generation code itself as structured ground truth for training.
2.4 Reinforcement Learning for Video-Language Models
Recent work applies RL to improve video reasoning:
- DeepVideo-R1 (NeurIPS 2025) uses difficulty-aware regressive GRPO for video reasoning.
- Video-RTS (EMNLP 2025) rethinks RL and test-time scaling for video reasoning.
- Video-as-Answer (CVPR 2026) applies Joint-GRPO for video event prediction.
All of these methods use sparse outcome-level rewards and therefore suffer from the credit assignment problem. GRPO in particular exhibits reward diversity collapse (OPSD paper, Fig. 3), where more than 50% of batches show zero reward variance. Because GRPO normalizes rewards within each group, a group with identical rewards yields zero advantages and contributes no gradient, so learning stalls. This problem is amplified in VideoQA, where reasoning chains are longer and more complex.
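The collapse can be illustrated with a few lines of Python. This is a minimal sketch of group-normalized advantages as commonly used in GRPO; the function names, the epsilon value, and the reward groups are illustrative assumptions and are not taken from any of the cited implementations.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def zero_variance_fraction(groups):
    """Fraction of rollout groups whose rewards are all identical."""
    return sum(1 for g in groups if len(set(g)) == 1) / len(groups)

# Hypothetical binary correctness rewards for three groups of four rollouts.
groups = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 1]]
for g in groups:
    print(g, [round(a, 2) for a in group_advantages(g)])
print("zero-variance fraction:", zero_variance_fraction(groups))
```

Groups that are all correct or all wrong produce advantages of exactly zero for every rollout, and within a mixed group every token of a rollout still shares the same scalar advantage.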
2.5 Knowledge Distillation for Video Understanding
- Agent-of-Thoughts Distillation (CVPR 2025) distills reasoning from large to small Video-LLMs.
- VITED (CVPR 2025) distills temporal evidence from video.
- Visual Program Distillation (EMNLP 2025 Findings) distills visual programs with augmentation.
These approaches require a separate, larger teacher model. OPSD eliminates this requirement by using the same model with asymmetric conditioning.
2.6 On-Policy Self-Distillation
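OPSD (2025) introduces self-distillation where a single LLM acts as both teacher (conditioned on the reference solution $y^*$) and student (conditioned on the problem $x$ only), minimizing per-token KL divergence on the student's own rollouts $\hat{y}$:
$$\mathcal{L}_{\text{OPSD}}(\theta) = \mathbb{E}_{\hat{y} \sim p_S(\cdot \mid x)} \left[ \frac{1}{|\hat{y}|} \sum_{n=1}^{|\hat{y}|} D_{\mathrm{KL}}\!\left( p_T(\cdot \mid x, y^*, \hat{y}_{<n}) \,\|\, p_S(\cdot \mid x, \hat{y}_{<n}) \right) \right]$$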
OPSD achieves +5.7 points over GRPO on AIME25 for Qwen3-1.7B while using 128x fewer generation tokens. However, it has only been applied to text-only mathematical reasoning.
2.7 Visual Unit Testing
ViUniT (CVPR 2025) generates visual unit tests to verify program correctness in visual programming, reducing "correct for wrong reasons" by 40%. Our work shares the philosophy of using programmatic verification but differs fundamentally: ViUniT tests programs after generation, while CodeSight uses code as privileged information during training.
3. Method
3.1 Overview
CodeSight consists of three components:
- Code-Grounded Video-QA Construction (Section 3.2): Automatically generate videos and QA pairs from HTML code.
- On-Policy Self-Distillation with Code Privilege (Section 3.3): Train a Video-LLM using code-privileged teacher and video-only student.
- Automated Difficulty Curriculum (Section 3.4): Scale scene complexity progressively.
3.2 Code-Grounded Video-QA Construction
3.2.1 Video Generation via HyperFrames
We use HyperFrames, an open-source HTML-to-video rendering framework designed for AI agents. Videos are defined as HTML compositions with semantic data attributes:
    HTML Code --> HyperFrames Renderer --> MP4 Video
        |                                      |
        +--> Structured Scene Description      +--> Visual Signal
             (privileged information)               (input at inference)
HyperFrames supports multiple animation runtimes (GSAP, Anime.js, Lottie, Three.js, CSS Animations, WAAPI), enabling diverse visual content. Rendering is deterministic: identical code always produces identical video.
3.2.2 Scene Graph Extraction from Code
We parse the HTML composition into a structured Scene Graph $G = (O, R, E)$:
Objects $O = \{o_1, \dots, o_N\}$: Each HTML element with `data-start` and `data-duration` attributes. Properties extracted include:
- Identity: element `id` and tag type
- Visual attributes: color, size, position, opacity (from `style`)
- Temporal span: `[data-start, data-start + data-duration]`
- Layer: `data-track-index`
Relations $R$: Spatial relations computed from CSS positions (left-of, above, overlapping), temporal relations (before, during, after), and layer ordering (in-front-of, behind).
Events $E$: State changes extracted from animation definitions:
- GSAP timelines: keyframe sequences with property targets
- CSS animations: `@keyframes` rule parsing
- WAAPI: `element.animate()` parameter extraction
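Formally, for each object $o_i$, we extract:
$$o_i = \left(\text{id}_i,\ \text{tag}_i,\ \mathbf{a}_i,\ t_i^{\text{start}},\ t_i^{\text{end}},\ z_i\right), \qquad t_i^{\text{end}} = t_i^{\text{start}} + d_i$$
where $\mathbf{a}_i$ collects the visual attributes parsed from `style`, $d_i$ is the value of `data-duration`, and $z_i$ is the `data-track-index` layer.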
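The object-extraction step can be sketched with Python's standard `html.parser` module. This is a minimal illustration under stated assumptions, not the proposal's pipeline: `SceneObject`, `SceneParser`, and `parse_style` are hypothetical names, only inline `style` attributes are read, and animation definitions (GSAP, `@keyframes`, WAAPI) are ignored.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

@dataclass
class SceneObject:
    id: str
    tag: str
    start: float
    end: float
    track: int
    style: dict = field(default_factory=dict)

def parse_style(style):
    """Turn 'left:100px; top:200px' into {'left': '100px', 'top': '200px'}."""
    pairs = (item.split(":", 1) for item in style.split(";") if ":" in item)
    return {k.strip(): v.strip() for k, v in pairs}

class SceneParser(HTMLParser):
    """Collects every element that carries data-start and data-duration."""

    def __init__(self):
        super().__init__()
        self.objects = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "data-start" not in a or "data-duration" not in a:
            return  # e.g. the stage container, which has no temporal span
        start = float(a["data-start"])
        self.objects.append(SceneObject(
            id=a.get("id", ""),
            tag=tag,
            start=start,
            end=start + float(a["data-duration"]),
            track=int(a.get("data-track-index", 0)),
            style=parse_style(a.get("style", "")),
        ))

html = """
<div id="stage" data-composition-id="scene-1" data-width="1920" data-height="1080">
  <div id="red-ball" data-start="0" data-duration="5" data-track-index="0"
       style="position:absolute; left:100px; top:200px; background:red; border-radius:50%;"></div>
  <div id="blue-cube" data-start="2" data-duration="3" data-track-index="1"
       style="position:absolute; left:400px; top:300px; background:blue;"></div>
  <img id="cat-photo" data-start="1" data-duration="4" data-track-index="2"
       style="position:absolute; left:600px; top:100px;"/>
</div>
"""
parser = SceneParser()
parser.feed(html)
for obj in parser.objects:
    print(obj.id, obj.tag, obj.start, obj.end, obj.style.get("left"))
```

Self-closing elements such as the `<img/>` are routed through `handle_starttag` by `HTMLParser`'s default `handle_startendtag`, so they need no special handling here.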
3.2.3 QA Generation from Scene Graph
Given scene graph $G$, we generate QA pairs across three compositional dimensions:
Counting Questions: "How many [type/color] objects are visible at time $t$?"
- Answer derived by filtering $\{o_i \mid t_i^{\text{start}} \leq t \leq t_i^{\text{end}} \wedge \text{match}(o_i, \text{query})\}$
State Questions: "What is the [property] of [object] at time $t$?"
- Answer derived by evaluating animation state at timestamp $t$:
  - For GSAP: `timeline.seek(t)` and read property
  - For CSS: compute interpolated keyframe value
  - For position: read transform/left/top at $t$
Temporal Questions: "Does [object A] appear before or after [object B]?"
- Answer derived by comparing $t_A^{\text{start}}$ vs $t_B^{\text{start}}$
Composite Questions: "After the red ball moves to the right, how many blue objects remain visible?"
- Answer derived by combining temporal events with counting queries
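The answer derivations above can be sketched as small pure functions. This is an illustrative sketch: the dictionary layout and function names are assumptions, and `position_at` assumes linear easing. Real GSAP tweens default to a non-linear ease, so a faithful pipeline would query the animation runtime (the `timeline.seek(t)` route) rather than interpolate by hand.

```python
def count_visible(objects, t, match=lambda o: True):
    """Count objects whose span [start, end] contains t and that satisfy the query."""
    return sum(1 for o in objects if o["start"] <= t <= o["end"] and match(o))

def position_at(x0, x1, move_start, move_duration, t):
    """x-coordinate at time t for a move from x0 to x1 (assumes linear easing)."""
    if move_duration <= 0:
        return float(x1)
    progress = min(max((t - move_start) / move_duration, 0.0), 1.0)
    return x0 + (x1 - x0) * progress

def appearance_order(objects, id_a, id_b):
    """Return 'before', 'after', or 'same time' by comparing start times."""
    start = {o["id"]: o["start"] for o in objects}
    if start[id_a] == start[id_b]:
        return "same time"
    return "before" if start[id_a] < start[id_b] else "after"

# Scene from the HTML example in Section 1.2 (end = data-start + data-duration).
objects = [
    {"id": "red-ball", "color": "red", "start": 0, "end": 5},
    {"id": "blue-cube", "color": "blue", "start": 2, "end": 5},
    {"id": "cat-photo", "color": None, "start": 1, "end": 5},
]
is_blue = lambda o: o["color"] == "blue"

print(count_visible(objects, 1.5))                        # counting question
print(position_at(100, 800, 1, 3, 2.5))                   # state question
print(appearance_order(objects, "red-ball", "blue-cube")) # temporal question
# Composite question: count blue objects once the red ball's move (t=1s to t=4s) has ended.
move_end = 1 + 3
print(count_visible(objects, move_end, is_blue))
```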
We use template-based generation with LLM-assisted paraphrasing to ensure linguistic diversity while maintaining answer correctness. Each QA pair $(q, a)$ has a provenance trace linking it to specific code elements, enabling automatic verification.
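A provenance trace makes each pair re-checkable against the code. The sketch below shows one possible shape for it on temporal-ordering questions; `QAPair`, `ORDER_TEMPLATES`, and `verify` are hypothetical names, and the fixed template list merely stands in for the LLM-assisted paraphrasing step.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    kind: str
    provenance: tuple  # ids of the code elements the answer was derived from

ORDER_TEMPLATES = [
    "Does {a} appear before or after {b}?",
    "Relative to {b}, does {a} show up before or after?",
]

def make_ordering_qa(starts, a, b, rng):
    """starts maps element id -> data-start; ties are skipped as ambiguous."""
    if starts[a] == starts[b]:
        return None
    answer = "before" if starts[a] < starts[b] else "after"
    question = rng.choice(ORDER_TEMPLATES).format(a=a, b=b)
    return QAPair(question, answer, "temporal", (a, b))

def verify(qa, starts):
    """Re-derive the answer from the provenance ids and compare."""
    a, b = qa.provenance
    expected = "before" if starts[a] < starts[b] else "after"
    return qa.answer == expected

starts = {"red-ball": 0.0, "blue-cube": 2.0, "cat-photo": 1.0}
rng = random.Random(0)
qa = make_ordering_qa(starts, "blue-cube", "cat-photo", rng)
print(qa.question, "->", qa.answer, "| verified:", verify(qa, starts))
```

Because the answer is computed before the template is chosen, paraphrasing can change the wording of the question without touching the verified answer.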
3.2.4 Code Context Formatting
The privileged information $y^*$ provided to the teacher is formatted as a structured code summary:
[Scene Summary]
Objects: 3 elements (red-ball, blue-cube, cat-photo)
Timeline:
- red-ball: t=0s to t=5s, position (100,200), red circle 50px
- blue-cube: t=2s to t=5s, position (400,300), blue square 60px
- cat-photo: t=1s to t=5s, position (600,100), image
Animations:
- red-ball: moves from (100,200) to (800,200) over 3s starting at t=1s
Spatial: red-ball LEFT-OF blue-cube, blue-cube LEFT-OF cat-photo
This structured summary is more effective than raw HTML because it: (a) removes syntactic noise, (b) pre-computes derived facts (spatial relations, temporal overlaps), and (c) fits naturally into the model's context window.
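A formatter along these lines could produce the summary from extracted objects. It is a sketch under assumptions: the field names (`x`, `y`, `desc`) are hypothetical, the Animations section is omitted, and spatial relations are computed from static positions only, so animated objects would need time-indexed relations.

```python
def spatial_relations(objects):
    """LEFT-OF relations between horizontal neighbours, from static x offsets."""
    ordered = sorted(objects, key=lambda o: o["x"])
    return [
        f'{a["id"]} LEFT-OF {b["id"]}'
        for a, b in zip(ordered, ordered[1:])
        if a["x"] < b["x"]
    ]

def format_summary(objects):
    """Render the scene as the plain-text context given to the teacher."""
    ids = ", ".join(o["id"] for o in objects)
    lines = [
        "[Scene Summary]",
        f"Objects: {len(objects)} elements ({ids})",
        "Timeline:",
    ]
    for o in objects:
        lines.append(
            f'- {o["id"]}: t={o["start"]}s to t={o["end"]}s, '
            f'position ({o["x"]},{o["y"]}), {o["desc"]}'
        )
    lines.append("Spatial: " + ", ".join(spatial_relations(objects)))
    return "\n".join(lines)

objects = [
    {"id": "red-ball", "start": 0, "end": 5, "x": 100, "y": 200, "desc": "red circle 50px"},
    {"id": "blue-cube", "start": 2, "end": 5, "x": 400, "y": 300, "desc": "blue square 60px"},
    {"id": "cat-photo", "start": 1, "end": 5, "x": 600, "y": 100, "desc": "image"},
]
summary = format_summary(objects)
print(summary)
```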
3.3 On-Policy Self-Distillation with Code Privilege
3.3.1 Problem Formulation
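Given a video $v$ and question $q$, the goal is to train a Video-LLM $\pi_\theta$ to produce the correct answer $a$. We define two policies derived from the same model:
Student Policy (inference-time condition):
$$p_S(\cdot \mid v, q) = \pi_\theta(\cdot \mid v, q)$$
Teacher Policy (training-time, with code privilege):
$$p_T(\cdot \mid v, q, y^*) = \pi_{\theta_0}(\cdot \mid v, q, y^*)$$
where $y^*$ is the structured code context from Section 3.2.4 and $\theta_0$ denotes the model's parameters at the start of training, which stay frozen for the teacher (Algorithm 1). Both policies start from identical parameters; they differ in input conditioning, and only the student's parameters $\theta$ are updated.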
3.3.2 Training Objective
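Following OPSD, we minimize the per-token forward KL divergence from teacher to student on student-generated rollouts:
$$\mathcal{L}(\theta) = \mathbb{E}_{(v, q, y^*) \sim D}\; \mathbb{E}_{\hat{a} \sim p_S(\cdot \mid v, q)} \left[ \frac{1}{|\hat{a}|} \sum_{n=1}^{|\hat{a}|} D_{\mathrm{KL}}\!\left( p_T(\cdot \mid v, q, y^*, \hat{a}_{<n}) \,\|\, p_S(\cdot \mid v, q, \hat{a}_{<n}) \right) \right]$$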
Key properties:
- On-policy: $\hat{a}$ is sampled from the student $p_S$, ensuring training distribution matches inference.
- Dense supervision: Every token position $n$ receives a gradient signal, unlike GRPO's outcome-level reward.
- No external teacher: The same model serves both roles; the teacher is a copy frozen at the initial parameters $\theta_0$ for stability, while only the student's $\theta$ is updated.
- Full-vocabulary signal: KL is computed over the entire vocabulary at each position, not just the sampled token.
3.3.3 Per-Token Pointwise Clipping
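Following OPSD's finding that stylistic tokens can dominate the KL loss, we apply pointwise clipping:
$$\mathcal{L}_{\text{clip}}(\theta) = \mathbb{E}_{\hat{a} \sim p_S} \left[ \frac{1}{|\hat{a}|} \sum_{n=1}^{|\hat{a}|} \sum_{w \in \mathcal{V}} \min\left(\ell_{n,w},\ \tau\right) \right]$$
where $\ell_{n,w} = p_T(w|\cdot) \log \frac{p_T(w|\cdot)}{p_S(w|\cdot)}$, $\mathcal{V}$ is the vocabulary, and $\tau$ is a clipping threshold. This is particularly important for video QA where formatting tokens (e.g., "The answer is:") may have high KL but carry no reasoning signal.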
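The effect of pointwise clipping can be checked numerically. The sketch below evaluates the clipped term for two hypothetical positions over a three-word vocabulary; the distributions and the value of the threshold are made up for illustration, and student probabilities are assumed to be strictly positive (as softmax outputs are).

```python
import math

def clipped_forward_kl(p_teacher, p_student, tau):
    """Sum over the vocabulary of min(p_T(w) * log(p_T(w) / p_S(w)), tau) at one position."""
    total = 0.0
    for pt, ps in zip(p_teacher, p_student):
        if pt == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        total += min(pt * math.log(pt / ps), tau)
    return total

def sequence_loss(teacher_dists, student_dists, tau):
    """Average of the clipped per-position terms over the rollout length."""
    per_token = [
        clipped_forward_kl(pt, ps, tau)
        for pt, ps in zip(teacher_dists, student_dists)
    ]
    return sum(per_token) / len(per_token)

# Position 0: a stylistic token where teacher and student disagree sharply.
# Position 1: a content token with a moderate disagreement.
teacher = [[0.98, 0.01, 0.01], [0.10, 0.80, 0.10]]
student = [[0.02, 0.49, 0.49], [0.30, 0.40, 0.30]]

for tau in (float("inf"), 1.0):
    terms = [clipped_forward_kl(t, s, tau) for t, s in zip(teacher, student)]
    print("tau =", tau, [round(x, 3) for x in terms], round(sequence_loss(teacher, student, tau), 3))
```

With the threshold applied, the first position's contribution is capped while the second position is left unchanged, so the stylistic mismatch no longer dominates the average.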
3.3.4 Training Algorithm
Algorithm 1: CodeSight Training
    Input: Dataset D = {(v_i, q_i, y*_i)}, model π_θ, frozen teacher π_θ₀
    Output: Updated model π_θ

    θ₀ ← θ                                      // Freeze teacher parameters
    for each minibatch {(v, q, y*)} ∈ D do
        // On-policy rollout (student generates)
        â ~ p_S(·|v, q; θ)                      // Student sees video + question only

        // Compute per-token teacher distribution (frozen)
        for n = 1 to |â| do
            p_T^n ← π_θ₀(·|v, q, y*, â_{<n})    // Teacher sees video + question + code
            p_S^n ← π_θ(·|v, q, â_{<n})         // Student sees video + question
        end for

        // Compute clipped forward KL loss
        L ← (1/|â|) Σ_n Σ_w min(p_T^n(w) log(p_T^n(w)/p_S^n(w)), τ)

        // Update student only (teacher frozen)
        θ ← θ - η ∇_θ L
    end for
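The control flow of Algorithm 1 can be exercised end to end on a toy problem. The sketch below is not a Video-LLM training loop: the student is a tabular softmax policy with one logit vector per prefix, the frozen code-privileged teacher is replaced by a fixed lookup table (`teacher_dist`, a hypothetical stand-in), and the gradient of the clipped forward KL with respect to the logits is written out by hand.

```python
import math
import random

V, T = 3, 2  # toy vocabulary size and rollout length

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    return [x / sum(e) for x in e]

def teacher_dist(prefix):
    """Fixed table standing in for the frozen teacher conditioned on video, question, code."""
    favoured = (len(prefix) + sum(prefix)) % V
    return [0.8 if w == favoured else 0.1 for w in range(V)]

def train(steps=300, lr=0.5, tau=1.0, seed=0):
    rng = random.Random(seed)
    logits = {}   # student parameters: one logit vector per visited prefix
    history = []
    for _ in range(steps):
        # 1) On-policy rollout: the student samples its own answer tokens.
        rollout = []
        for _n in range(T):
            ps = softmax(logits.setdefault(tuple(rollout), [0.0] * V))
            rollout.append(rng.choices(range(V), weights=ps)[0])
        # 2) Clipped forward KL at every position of that rollout.
        loss = 0.0
        for n in range(T):
            z = logits[tuple(rollout[:n])]
            ps, pt = softmax(z), teacher_dist(tuple(rollout[:n]))
            terms = [a * math.log(a / b) for a, b in zip(pt, ps)]
            loss += sum(min(term, tau) for term in terms) / T
            # Clipped terms are constants, so they contribute no gradient.
            keep = [1.0 if term < tau else 0.0 for term in terms]
            mass = sum(k * p for k, p in zip(keep, pt))
            # 3) Gradient step on the student only; the teacher table never changes.
            for w in range(V):
                z[w] -= lr * (ps[w] * mass - keep[w] * pt[w]) / T
        history.append(loss)
    return logits, history

logits, history = train()
print("first / last loss:", round(history[0], 4), round(history[-1], 4))
print("student at empty prefix:", [round(p, 3) for p in softmax(logits[()])])
print("prefixes visited:", sorted(logits))
```

Only prefixes that the student actually samples receive updates, which is the on-policy property the algorithm relies on; a real implementation would compute both distributions with forward passes of the model and let automatic differentiation supply the gradient.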
3.3.5 Interpretation as Dense-Reward Policy Gradient
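The OPSD loss admits a closely related policy gradient formulation on the sampled tokens:
$$\nabla_\theta \mathcal{L}(\theta) = -\,\mathbb{E}_{\hat{a} \sim p_S(\cdot \mid v, q)} \left[ \frac{1}{|\hat{a}|} \sum_{n=1}^{|\hat{a}|} A_n \, \nabla_\theta \log p_S(\hat{a}_n \mid v, q, \hat{a}_{<n}) \right]$$
where the per-token advantage is:
$$A_n = \log p_T(\hat{a}_n \mid v, q, y^*, \hat{a}_{<n}) - \log p_S(\hat{a}_n \mid v, q, \hat{a}_{<n})$$
With $A_n$ treated as a constant, this expression is the sampled-token estimate of the gradient of the per-token reverse KL $D_{\mathrm{KL}}(p_S \,\|\, p_T)$.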
This means: at each token, the teacher (who can see the code) provides a signal about whether the student's token choice aligns with code-informed reasoning. Tokens where the code-informed teacher strongly disagrees with the student receive larger gradients --- this is automatic token-level credit assignment.
For video QA, this is powerful: if the student says "there are 3 objects" but the code shows 4, the advantage for the count token will be strongly negative, directly correcting the counting error without needing outcome-level reward.
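The counting example can be made concrete with a few hypothetical numbers. The sketch assumes the advantage takes the common log-ratio form, teacher log-probability minus student log-probability of the sampled token; the probabilities below are invented for illustration.

```python
import math

def token_advantages(teacher_probs, student_probs):
    """Log-ratio advantage per sampled token: log p_T(token) - log p_S(token)."""
    return [math.log(pt) - math.log(ps) for pt, ps in zip(teacher_probs, student_probs)]

# Probability each policy assigns to the tokens the student actually sampled.
tokens = ["there", "are", "3", "objects"]
student_probs = [0.90, 0.95, 0.60, 0.97]
teacher_probs = [0.88, 0.95, 0.05, 0.96]  # the code says 4, so "3" is unlikely for the teacher

advantages = token_advantages(teacher_probs, student_probs)
for tok, adv in zip(tokens, advantages):
    print(f"{tok:>8}  {adv:+.3f}")
```

Only the count token receives a large negative advantage; the surrounding tokens stay close to zero, which is the token-level credit assignment that a single outcome reward cannot provide.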
3.4 Automated Difficulty Curriculum
We design a 3-level curriculum based on scene complexity:
| Level | Objects | Animations | Temporal Events | Question Types |
|---|---|---|---|---|
| L1: Simple | 1-3 | Static or single | 1-2 appear/disappear | Single-hop counting, attribute |
| L2: Medium | 4-8 | Linear motion, fade | 3-5 state changes | Multi-hop, temporal ordering |
| L3: Complex | 9-15 | Chained animations, interactions | 6+ overlapping events | Composite reasoning |
Scene generation is fully automated: we write code templates with parameterized object counts, animation types, and timing patterns, then sample concrete scenes from these templates.
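Template-based scene sampling might look like the following. This is a rough sketch: the object-count ranges come from the curriculum table, while the animation labels, colors, canvas grid, and the 8-second clip length are illustrative assumptions. The emitted markup reuses the data attributes from the example in Section 1.2 and omits the animation scripts themselves.

```python
import random

# Object-count ranges follow the curriculum table; animation pools are placeholders.
LEVELS = {
    "L1": {"objects": (1, 3), "animations": ["static"]},
    "L2": {"objects": (4, 8), "animations": ["linear-motion", "fade"]},
    "L3": {"objects": (9, 15), "animations": ["chained", "interaction"]},
}
COLORS = ["red", "blue", "green", "yellow"]

def sample_scene(level, rng, total_duration=8.0):
    """Sample one concrete scene (a list of object specs) for a curriculum level."""
    lo, hi = LEVELS[level]["objects"]
    scene = []
    for i in range(rng.randint(lo, hi)):
        start = round(rng.uniform(0, total_duration - 1), 1)
        scene.append({
            "id": f"obj-{i}",
            "color": rng.choice(COLORS),
            "start": start,
            "duration": round(rng.uniform(1, total_duration - start), 1),
            "x": rng.randrange(0, 1800, 50),
            "y": rng.randrange(0, 1000, 50),
            "animation": rng.choice(LEVELS[level]["animations"]),
        })
    return scene

def to_html(scene):
    """Emit markup with one element per object; animation code is not generated here."""
    rows = [
        f'<div id="{o["id"]}" data-start="{o["start"]}" data-duration="{o["duration"]}" '
        f'data-track-index="{i}" style="position:absolute; left:{o["x"]}px; top:{o["y"]}px; '
        f'width:50px; height:50px; background:{o["color"]};"></div>'
        for i, o in enumerate(scene)
    ]
    return '<div id="stage" data-width="1920" data-height="1080">\n' + "\n".join(rows) + "\n</div>"

rng = random.Random(0)
scene = sample_scene("L2", rng)
print(len(scene), "objects")
print(to_html(scene))
```

Seeding the generator keeps scene sampling reproducible, which matches the deterministic rendering property noted in Section 3.2.1.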
3.5 Comparison with Alternative Approaches
| Property | SFT | GRPO (DeepVideo-R1) | External Distillation | CodeSight (Ours) |
|---|---|---|---|---|
| On-policy data | No | Yes | No | Yes |
| Dense token signal | Yes | No (sparse reward) | Yes | Yes |
| Low sampling cost | Yes | No (G rollouts) | Yes | Yes (1 rollout) |
| No external teacher | Yes | Yes | No | Yes |
| Multimodal | N/A | Video | Video | Video |
| Annotation-free | No | Needs reward labels | Needs teacher outputs | Yes (from code) |
4. Experimental Design
4.1 CodeSight-Bench Construction
We will construct a benchmark spanning the three difficulty levels, with the following target statistics:
| Split | Videos | QA Pairs | Avg Objects/Video | Avg Duration |
|---|---|---|---|---|
| Train | 10,000 | 100,000 | 5.2 | 8.3s |
| Val | 1,000 | 10,000 | 5.4 | 8.5s |
| Test | 2,000 | 20,000 | 5.3 | 8.4s |
Question type distribution (balanced across splits):
- Counting: 25% ("How many red objects at t=3s?")
- State identification: 25% ("What color is the large circle?")
- Temporal ordering: 25% ("Which appears first, A or B?")
- Composite: 25% ("After A disappears, how many objects remain?")
Video diversity: We use HyperFrames' supported animation runtimes (GSAP, CSS, WAAPI) and built-in catalog blocks (shader transitions, text overlays, data charts) to ensure visual diversity.
4.2 Baselines
- SFT: Standard supervised fine-tuning on (video, question, answer) triples.
- GRPO: Group Relative Policy Optimization following DeepVideo-R1's approach, with binary correctness reward.
- Offline Distillation: Teacher generates answers conditioned on code; student trains on teacher's sequences (standard KD, off-policy).
- GRPO + Code Reward: GRPO with an additional reward term based on code-verified answer correctness (giving GRPO access to code information, but only as outcome reward).
- ViUniT-style Selection: Generate multiple programs, use visual unit tests for selection (adapted to VideoQA).
4.3 Models
- Primary: Qwen2.5-VL-7B (strong open-source Video-LLM)
- Scaling study: Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-VL-72B
- Cross-architecture: LLaVA-Video-7B, InternVL2.5-8B
4.4 Evaluation Metrics
- Accuracy: Exact match for counting/state; semantic match for temporal.
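- Reasoning Correctness: Following ViUniT, we evaluate whether models are "right for the right reasons" via a perturbation test:
  - Generate an answer on the original video.
  - Generate an answer on a perturbed video (e.g., remove one object, change a color).
  - Whenever the perturbation changes the code-derived ground truth, correct reasoning should change the answer accordingly; an unchanged answer indicates that the model is not grounding its answer in the video.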
- Token Efficiency: Total generation tokens consumed during training (OPSD's key advantage).
- Per-Dimension Breakdown: Separate accuracy for counting, state, temporal, and composite questions.
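One possible way to operationalize the accuracy and reasoning-correctness metrics is sketched below. The record fields and the decision to score only perturbations that change the ground truth are assumptions made for illustration, not a fixed protocol.

```python
def normalize(text):
    return text.strip().lower()

def accuracy(predictions, answers):
    """Exact-match accuracy after light normalization."""
    hits = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return hits / len(answers)

def reasoning_correctness(records):
    """Fraction of (original, perturbed) pairs answered correctly on both videos.

    Only pairs whose code-derived answers differ are scored, because only
    those perturbations should change the model's answer.
    """
    informative = [r for r in records if r["answer"] != r["perturbed_answer"]]
    if not informative:
        return 0.0
    both = sum(
        r["prediction"] == r["answer"]
        and r["perturbed_prediction"] == r["perturbed_answer"]
        for r in informative
    )
    return both / len(informative)

records = [
    # Tracks the removed object: right on both videos.
    {"answer": "3", "prediction": "3", "perturbed_answer": "2", "perturbed_prediction": "2"},
    # Right on the original but insensitive to the perturbation.
    {"answer": "3", "prediction": "3", "perturbed_answer": "2", "perturbed_prediction": "3"},
    # Perturbation does not affect this question, so the pair is not scored.
    {"answer": "red", "prediction": "red", "perturbed_answer": "red", "perturbed_prediction": "red"},
]
print("accuracy:", accuracy([r["prediction"] for r in records], [r["answer"] for r in records]))
print("reasoning correctness:", reasoning_correctness(records))
```

The second record shows why the two metrics are reported separately: plain accuracy on the original videos is perfect here, while the perturbation test exposes an answer that did not depend on the video.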
4.5 Ablation Studies
- Privileged information format: Raw HTML vs. structured summary vs. scene graph JSON
- Divergence function: Forward KL vs. reverse KL vs. JSD (following OPSD Table 3)
- Clipping threshold $\tau$: Impact on training stability
- Generation length: Short (256 tokens) vs. medium (512 tokens) vs. long (1024 tokens) student rollouts
- Curriculum: With vs. without difficulty progression
- Number of QA pairs per video: 5, 10, 20 --- effect of question diversity
4.6 Analysis
- Per-token advantage visualization: Which tokens receive the strongest teacher correction? We hypothesize counting tokens and temporal markers will show highest KL.
- Reward diversity collapse analysis: Compare GRPO's reward variance across batches vs. CodeSight's KL variance, demonstrating that dense supervision avoids collapse.
- Transfer to real video: Fine-tune on CodeSight-Bench, evaluate on natural video benchmarks (TUNA, MotionBench) to test whether improved compositional reasoning transfers.
- Scaling laws: How does performance scale with (a) number of training videos, (b) scene complexity, (c) model size?
5. Expected Contributions and Impact
5.1 Technical Contributions
- First multimodal OPSD: Extending on-policy self-distillation from text-only math to vision-language video understanding.
- Code-as-privileged-information paradigm: Demonstrating that video generation source code is a powerful, zero-cost form of privileged information for training.
- Automatic dense supervision for video: Solving the sparse reward problem in video RL without human annotation.
- CodeSight-Bench: A new benchmark with machine-verifiable ground truth derived from generation code.
5.2 Broader Impact
- Data flywheel: Code → video → QA generation is fully automatic, enabling unlimited training data at near-zero marginal cost.
- Sim-to-real transfer: If CodeSight-trained models show improved compositional reasoning on real video benchmarks, this validates synthetic-to-real transfer for video understanding.
- Beyond HyperFrames: The framework generalizes to any programmatic video generation tool (Manim, Remotion, Blender scripting, game engines) --- any system where code produces video.
6. Timeline
| Phase | Duration | Deliverable |
|---|---|---|
| Phase 1: Data Pipeline | 4 weeks | Code-to-QA pipeline, CodeSight-Bench v1 (5K videos) |
| Phase 2: OPSD Implementation | 3 weeks | CodeSight training loop on Qwen2.5-VL-7B |
| Phase 3: Baseline Experiments | 3 weeks | SFT, GRPO, offline distillation comparisons |
| Phase 4: Ablations & Analysis | 3 weeks | All ablation studies, per-token analysis |
| Phase 5: Scaling & Transfer | 3 weeks | Multi-model, real-video transfer experiments |
| Phase 6: Paper Writing | 4 weeks | Full paper submission |
7. Key References
- OPSD: Self-Distilled Reasoner: On-Policy Self-Distillation for LLMs (arXiv 2601.18734, 2025)
- DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware GRPO (NeurIPS 2025)
- Video-RTS: Rethinking RL and Test-Time Scaling for Video Reasoning (EMNLP 2025)
- ViUniT: Visual Unit Tests for More Robust Visual Programming (CVPR 2025)
- MOSCATO: Predicting Multiple Object State Change Through Actions (ICCV 2025)
- SAGE: Generalizable Object State Recognition with State-Action Graph Embedding (NeurIPS 2025 Oral)
- T2V-CompBench: Compositional Text-to-Video Generation Benchmark (CVPR 2025)
- Neuro-Symbolic Evaluation of T2V Models using Formal Verification (CVPR 2025)
- VidHalluc: Evaluating Temporal Hallucinations in Video-LLMs (CVPR 2025)
- TUNA: Fine-grained Temporal Understanding on Dense Dynamic Videos (ACL 2025)
- MotionBench: Fine-grained Video Motion Understanding for VLMs (CVPR 2025)
- Agent-of-Thoughts Distillation for Video-LLM Reasoning (CVPR 2025)
- HyperFrames: HTML-to-Video Rendering Framework (HeyGen, 2025)
- Imagine While Reasoning in Space: Multimodal Visualization-of-Thought (ICML 2025)
- Rendering-Aware RL for Vector Graphics Generation (NeurIPS 2025)