# CodeSight: On-Policy Self-Distillation for Video Question Answering with Code-Privileged Supervision

## Research Proposal

---

## Abstract

Video Question Answering (VideoQA) requires models to understand complex visual scenes involving multiple objects, their states, quantities, and temporal dynamics. Current approaches either rely on sparse outcome-level rewards (e.g., GRPO) that provide no per-token learning signal, or require expensive human annotations for dense supervision. We propose **CodeSight**, a framework that applies On-Policy Self-Distillation (OPSD) to VideoQA by exploiting a novel form of privileged information: **the source code that generated the video**. Using programmatic video generation tools (e.g., HyperFrames), we construct videos whose HTML/animation source code contains precise, machine-readable ground truth about every object's identity, state, position, count, and temporal behavior. A single model serves as both teacher (conditioned on video + source code) and student (conditioned on video only), with the teacher providing dense token-level supervision on the student's own on-policy rollouts. This eliminates the need for external teacher models and human annotation, and it addresses the reward sparsity problem in video RL. We further introduce an automated QA generation pipeline that extracts structured facts from code to produce diverse, verifiable question-answer pairs. Experiments are designed around three compositional reasoning dimensions (object counting, state tracking, and temporal event ordering) across synthetic videos of varying complexity.

---

## 1. Introduction

### 1.1 Problem Statement

Video Language Models (Video-LLMs) have made rapid progress, yet they frequently hallucinate about basic visual facts: miscounting objects, misidentifying states, and confusing temporal order. Benchmarks like VidHalluc (CVPR 2025) confirm that temporal hallucinations remain pervasive.
A deeper issue, highlighted by ViUniT (CVPR 2025), is that models can produce **correct answers for wrong reasons**, a problem that outcome-level evaluation cannot detect.

Training these models faces a fundamental tension:

- **Supervised Fine-Tuning (SFT)** provides dense token-level signal but suffers from distribution mismatch: the teacher-generated sequences used for training diverge from what the model itself produces at inference time.
- **Reinforcement Learning (GRPO/PPO)** trains on the model's own outputs (on-policy) but provides only sparse, outcome-level rewards. For video reasoning, this is especially problematic: a correct final answer receives uniform credit across all tokens, while an incorrect answer receives uniform penalty, regardless of which reasoning steps were right or wrong.

OPSD (2025) resolves this tension for mathematical reasoning by using the same model as both teacher (with privileged access to reference solutions) and student (with only the problem), providing dense token-level KL supervision on the student's own rollouts. However, OPSD has only been demonstrated on text-only math benchmarks. Extending it to multimodal video understanding requires a suitable form of privileged information.

### 1.2 Key Insight

We observe that **programmatically generated videos carry a unique duality**: the rendered video is a visual signal, while the source code that produced it is a complete, structured, machine-readable description of every visual element.

For a video defined in HyperFrames:

```html

```

The code tells us *exactly*:

- **What objects exist**: red-ball, blue-cube, cat-photo (count = 3)
- **Object properties**: color, shape, size, position
- **Temporal dynamics**: red-ball appears at t=0 for 5s, blue-cube at t=2 for 3s, cat at t=1 for 4s
- **Spatial layout**: red-ball at (100,200), blue-cube at (400,300), cat at (600,100)
- **Animation states**: any GSAP/CSS animation keyframes define state transitions

This is **strictly richer** than any human annotation: it is exhaustive, precise, and obtained at zero marginal cost.

### 1.3 Contributions

1. **CodeSight Framework**: The first application of on-policy self-distillation to multimodal video understanding, using video generation source code as privileged information.
2. **Code-to-QA Pipeline**: An automated pipeline that parses HyperFrames HTML to extract structured scene graphs and generates diverse, verifiable QA pairs across counting, state, and temporal dimensions.
3. **Dense Token-Level Video Supervision**: We demonstrate that per-token KL divergence from a code-privileged teacher provides stronger learning signals than sparse GRPO rewards for video reasoning.
4. **CodeSight-Bench**: A benchmark of programmatically generated videos with automatically derived ground-truth annotations for object count, state, and temporal event ordering.

---

## 2. Related Work

### 2.1 Video Question Answering

VideoQA has evolved from simple recognition to complex compositional reasoning. Recent benchmarks push toward fine-grained understanding:

- **TUNA** (ACL 2025) evaluates fine-grained temporal understanding on dense dynamic videos.
- **MotionBench** (CVPR 2025) benchmarks fine-grained motion understanding.
- **VELOCITI** (CVPR 2025) tests compositional reasoning with strict entailment.
- **VidHalluc** (CVPR 2025) specifically evaluates temporal hallucinations in Video-LLMs.
- **Video Thinking Test** (ICCV 2025) provides a holistic benchmark for advanced video reasoning.

However, none of these benchmarks provide *code-level* ground truth that enables automatic, dense supervision during training.

### 2.2 Object State Understanding in Video

Understanding object states and their changes is crucial for video reasoning:

- **MOSCATO** (ICCV 2025) predicts multiple object state changes through actions.
- **SAGE** (NeurIPS 2025, Oral) provides a unified framework for generalizable object state recognition via state-action graph embedding.
- **ObjChangeVR** (EACL 2026) reasons about object state changes from continuous egocentric VR views.
- **Compositional 4D Dynamic Scenes** (ICLR 2025) uses physics priors for video QA.

These works focus on *recognizing* states from video. Our approach is complementary: we *generate* videos with known states and use the generation code as the supervision signal.

### 2.3 Video Generation Evaluation

Evaluating generated video quality requires structured ground truth:

- **T2V-CompBench** (CVPR 2025) evaluates compositional text-to-video generation across numeracy, spatial relations, and actions.
- **Neuro-Symbolic Evaluation of T2V** (CVPR 2025) uses formal verification to check object constraints.
- **ETVA** (ICCV 2025) evaluates text-to-video alignment via fine-grained QA generation.
- **TC-Bench** (ACL 2025 Findings) benchmarks temporal compositionality in conditional video generation.

These works evaluate *generated* videos against text prompts. We flip the direction: we use the *generation code itself* as structured ground truth for training.

### 2.4 Reinforcement Learning for Video-Language Models

Recent work applies RL to improve video reasoning:

- **DeepVideo-R1** (NeurIPS 2025) uses difficulty-aware regressive GRPO for video reasoning.
- **Video-RTS** (EMNLP 2025) rethinks RL and test-time scaling for video reasoning.
- **Video-as-Answer** (CVPR 2026) applies Joint-GRPO for video event prediction.

All of these methods rely on **sparse outcome-level rewards** and therefore suffer from the credit assignment problem.
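
To make the credit assignment problem concrete, the following is a minimal sketch that assumes the standard group-normalized advantage used in GRPO-style training (each rollout's reward minus the group mean, divided by the group standard deviation). The function names, the binary rewards, and the token counts are illustrative assumptions and are not taken from any of the cited papers.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage per rollout: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def per_token_advantages(rewards, token_counts, eps=1e-6):
    """Broadcast each rollout's scalar advantage to every one of its tokens."""
    advs = group_advantages(rewards, eps)
    return [[a] * n for a, n in zip(advs, token_counts)]


# Hypothetical group of 4 rollouts for one video question (1.0 = correct answer).
token_counts = [3, 2, 4, 3]

# Mixed outcomes: every token of a rollout shares one advantage value,
# no matter which reasoning steps inside the rollout were right or wrong.
mixed = per_token_advantages([1.0, 0.0, 0.0, 1.0], token_counts)

# Identical outcomes: zero reward variance, so every advantage is zero
# and this group contributes no gradient signal.
collapsed = per_token_advantages([0.0, 0.0, 0.0, 0.0], token_counts)
```

In this sketch the learning signal has one degree of freedom per rollout rather than one per token, and it vanishes entirely whenever all rollouts in a group receive the same reward.
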
GRPO in particular exhibits reward diversity collapse (demonstrated in Fig. 3 of the OPSD paper), where more than 50% of batches show zero reward variance, which halts learning. This problem is amplified in video QA, where reasoning chains are longer and more complex.

### 2.5 Knowledge Distillation for Video Understanding

- **Agent-of-Thoughts Distillation** (CVPR 2025) distills reasoning from large to small Video-LLMs.
- **VITED** (CVPR 2025) distills temporal evidence from video.
- **Visual Program Distillation** (EMNLP 2025 Findings) distills visual programs with augmentation.

These approaches require a separate, larger teacher model. OPSD eliminates this requirement by using the same model with asymmetric conditioning.

### 2.6 On-Policy Self-Distillation

**OPSD** (2025) introduces self-distillation where a single LLM acts as both teacher (conditioned on the reference solution) and student (conditioned on the problem only), minimizing per-token KL divergence on the student's own rollouts:

$$\mathcal{L}_{\text{OPSD}}(\theta) = \mathbb{E}_{(x,y^*)\sim\mathcal{S}} \mathbb{E}_{\hat{y}\sim p_S(\cdot|x)} \sum_{n=1}^{|\hat{y}|} D_{\text{KL}}(p_T(\cdot|x,y^*,\hat{y}_{