| [12.04.2024] Q: I tried to run this model for video captioning, but it only returns a separate caption for each frame. In the original paper, the model handles video by taking multiple frames as input. Is this supported on Hugging Face as well? | |
| A: For video captioning I'd recommend taking a look at the GIT checkpoints fine-tuned on video datasets, like https://huggingface.co/microsoft/git-base-vatex |
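
As a rough sketch of what that could look like with the linked checkpoint: the idea is to pass several frames from one clip in a single call, so the model generates one caption for the whole clip instead of one per frame. The helper names `evenly_spaced_indices` and `caption_video` are made up for this illustration, the even spacing of frames is an assumption (other sampling schemes may work as well or better), and decoding the video into frames (for example with PyAV or OpenCV) is left to the caller.

```python
def evenly_spaced_indices(num_samples, total_frames):
    """Pick `num_samples` frame indices spread evenly across a clip.

    Hypothetical helper for this sketch; not part of any library.
    """
    if total_frames < 1 or num_samples < 1:
        raise ValueError("need at least one frame and one sample")
    if num_samples == 1:
        return [0]
    # Integer arithmetic keeps the first index at 0 and the last at total_frames - 1.
    return [(i * (total_frames - 1)) // (num_samples - 1) for i in range(num_samples)]


def caption_video(frames, checkpoint="microsoft/git-base-vatex"):
    """Generate one caption for a whole clip.

    `frames` is assumed to be a list of already decoded RGB frames
    (PIL images or HxWx3 uint8 arrays) covering the full video.
    """
    # Imported lazily so the sampling helper can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # The video checkpoints expect a fixed number of frames, stored in the config.
    num_frames = model.config.num_image_with_embedding
    picked = [frames[i] for i in evenly_spaced_indices(num_frames, len(frames))]

    # All sampled frames go through the processor together, not one by one.
    pixel_values = processor(images=picked, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `caption_video(frames)` downloads the checkpoint on first use. The `max_length=50` cap is an arbitrary choice for this sketch.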