Update README.md

README.md CHANGED

@@ -2,6 +2,8 @@
 license: apache-2.0
 datasets:
 - allenai/Molmo2-VideoPoint
+- allenai/pixmo-points
+- allenai/pixmo-cap
 language:
 - en
 base_model:
@@ -29,7 +31,7 @@ You can find all models in the Molmo2 family [here](https://huggingface.co/colle
 **Learn more** about the Molmo2 family [in our announcement blog post](https://allenai.org/blog/molmo2).
 
 Molmo2-VideoPoint-4B is based on [Qwen3-4B-Instruct](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) and uses [SigLIP 2](https://huggingface.co/google/siglip-so400m-patch14-384) as vision backbone.
-
+**Different from the general checkpoints, Molmo2-VideoPoint-4B is finetuned on the Molmo2-VideoPoint data only, after pre-training on pixmo-cap, pixmo-points, and Tulu data. It is meant to be used for video pointing and counting only.**
 
 Ai2 is committed to open science. The Molmo2 datasets are available [here](https://huggingface.co/collections/allenai/molmo2-data).
 All other artifacts used in creating Molmo2 (training code, evaluations, intermediate checkpoints) will be made available at a later date, furthering our commitment to open-source AI development and reproducibility.
@@ -148,32 +150,24 @@ print(points)
 
 ## Evaluations
 
-We report the
-For details on the evals, refer to
-
-| Model |
-
-| GPT-5 |
-| GPT-5 mini |
-| Gemini 3 Pro |
-| Gemini 2.5 Pro |
-| Gemini 2.5 Flash |
-| Claude Sonnet 4.5 |
-
-
-
-
-
-
-| Eagle2.5-8B | 60.7 |
-| PLM-3B | 53.9 |
-| PLM-8B | 56.2 |
-| LLaVA-Video-7B | 52.7 |
-| VideoChat-Flash-7B | 56.1 |
-| **Molmo2-4B (this model)** | 62.8 |
-| Molmo2-8B | 63.1 |
-| Molmo2-7B | 59.7 |
+We report accuracy and close accuracy on Molmo2-VideoCountEval here.
+For details on the evals, refer to our [technical report](https://allenai.org/papers/molmo2).
+
+| Model                                 | Accuracy    | Close Acc.  |
+|---------------------------------------|-------------|-------------|
+| GPT-5                                 | 35.8        | 50.3        |
+| GPT-5 mini                            | 29.8        | 49.3        |
+| Gemini 3 Pro                          | **37.1**    | 53.1        |
+| Gemini 2.5 Pro                        | 35.8        | **56.5**    |
+| Gemini 2.5 Flash                      | 31.9        | 48.2        |
+| Claude Sonnet 4.5                     | 27.2        | 45.1        |
+| Qwen3-VL-4B                           | 25.3        | 44.3        |
+| Qwen3-VL-8B                           | 29.6        | 47.7        |
+| Molmo2-4B                             | 34.3        | <u>56.1</u> |
+| Molmo2-8B                             | 35.5        | 53.3        |
+| Molmo2-7B                             | 33.2        | 50.5        |
+| **Molmo2-VideoPoint-4B (this model)** | <u>36.8</u> | **56.5**    |
+
 
 ## License and Use
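The updated table reports two counting metrics, and the card itself never defines "close accuracy". As a rough illustration only: the sketch below assumes exact-match accuracy plus a close accuracy that accepts a predicted count within ±1 of the ground truth. The function name and the tolerance are hypothetical, not Ai2's definition — see the technical report for how the metric is actually computed.

```python
def counting_metrics(preds, targets, tolerance=1):
    """Return (exact accuracy, close accuracy) over paired count lists.

    NOTE: "close accuracy" here is an assumed within-tolerance match;
    the real Molmo2-VideoCountEval definition may differ.
    """
    assert len(preds) == len(targets) and targets, "need paired, non-empty lists"
    n = len(targets)
    # Exact accuracy: predicted count equals the ground-truth count.
    exact = sum(p == t for p, t in zip(preds, targets)) / n
    # Assumed close accuracy: prediction within +/- tolerance of the truth.
    close = sum(abs(p - t) <= tolerance for p, t in zip(preds, targets)) / n
    return exact, close

print(counting_metrics([3, 7, 12, 5], [3, 8, 10, 5]))  # → (0.5, 0.75)
```

Under this reading, close accuracy is always at least the exact accuracy, which matches the pattern in the table (every model's Close Acc. exceeds its Accuracy).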