Video Captioning, Multimodal Large Language Models (MLLMs), Video Understanding, Vision-Language Models, Reinforcement Learning, Dataset Construction