Question about the evaluation metrics for captioning benchmarks
#3
by ygyjrc - opened
Hi, thanks for releasing Marlin-2B and the evaluation results. I have a question regarding the metric used in the leaderboard figures.
In the captioning plots, the y-axis is labeled “VideoEvalV2 mean / 10” for benchmarks such as DREAM-1K and CaReBench. The reported scores do not match those on the official leaderboards, which use Recall/Precision/F1 metrics.
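To be concrete about the metric I have in mind: as I understand it, F1 is the harmonic mean of precision and recall, so it sits on a 0 to 1 scale and would not be directly comparable to a mean score out of 10. Below is a minimal sketch with hypothetical numbers. The function name and the values are my own and are not taken from either benchmark's evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only (not from any leaderboard).
example_f1 = f1_score(precision=0.5, recall=0.25)
print(f"F1 = {example_f1:.3f}")
```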
Could you clarify what exactly “VideoEvalV2” is, and how its scores relate to the official Recall/Precision/F1 numbers?
I’m very interested in video captioning tasks, so I’d really appreciate any clarification.