See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding Paper • 2605.18018 • Published May 18 • 33